Annapurna Labs, an Amazon company, is seeking a Senior Machine Learning Engineer to join its AWS Neuron Distributed Training team. This role focuses on developing and optimizing distributed training solutions for AWS's custom ML accelerators, Trainium and Inferentia.
The position involves working with cutting-edge ML technologies, including Large Language Models (LLMs) like GPT and Llama, as well as other ML model families such as Stable Diffusion and Vision Transformers. You'll be collaborating with chip architects, compiler engineers, and runtime engineers to build and optimize distributed training solutions.
As part of AWS, you'll be working with the world's most comprehensive cloud platform, helping to pioneer new innovations in cloud computing. The team maintains a strong culture of mentorship and knowledge-sharing, with opportunities for career growth and development.
The role offers competitive compensation ranging from $151,300 to $261,500, depending on location, plus equity and comprehensive benefits. You'll be part of a diverse, inclusive environment that values work-life harmony and embraces unique perspectives.
Key technical aspects include working with PyTorch, JAX, XLA, and distributed training libraries like FSDP, DeepSpeed, and NeMo. You'll be responsible for optimizing performance on AWS custom silicon and ensuring efficient model training at scale.
The ideal candidate should have strong software development skills, deep technical expertise in machine learning, and the ability to work effectively in cross-functional teams. This is an opportunity to shape the future of ML infrastructure at AWS while working with some of the most advanced AI/ML technologies available.
Join a team that's dedicated to innovation, values continuous learning, and offers the chance to work on challenging problems at global scale. Your work will directly impact AWS customers' ability to train and deploy large-scale machine learning models efficiently.