Are you passionate about scaling cutting-edge AI across distributed systems? Together with our partner, a prominent online fashion and beauty retailer in Europe, we’re looking for an experienced ML Engineer - Distributed Training Specialist to develop and optimize large language models (LLMs) tailored to the fashion industry.
Working with massive-scale data, we’re creating LLMs designed to entertain and inspire customers, shaping the future of AI in fashion. Join us and make an impact!
Responsibilities:
- Implement and optimize distributed training pipelines for large-scale multimodal models
- Set up and maintain training infrastructure across multiple nodes/GPUs
- Develop and optimize data loading pipelines for multimodal inputs
- Monitor and improve training efficiency and resource utilization
- Implement checkpointing and fault tolerance mechanisms
Requirements:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field
- 5+ years of experience in ML engineering
- Proven track record with large-scale model training and optimization
- Experience with multimodal data processing and training
- Proficiency in Python and PyTorch
- Proficiency with cloud technologies
- Strong understanding of distributed systems and parallel computing
- Preferred but not required: strong experience with distributed training frameworks (DeepSpeed, FSDP, Megatron)