Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

Annapurna Labs designs silicon and software that accelerates innovation for AWS cloud solutions.
$151,300 - $261,500
Machine Learning
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

Annapurna Labs, an Amazon company, is seeking a Senior Machine Learning Engineer to join its AWS Neuron Distributed Training team. This role focuses on developing and optimizing distributed training solutions for AWS's custom ML accelerators, Trainium and Inferentia.

The position involves working with cutting-edge ML technologies, including Large Language Models (LLMs) like GPT and Llama, as well as other ML model families such as Stable Diffusion and Vision Transformers. You'll be collaborating with chip architects, compiler engineers, and runtime engineers to build and optimize distributed training solutions.

As part of AWS, you'll be working with the world's most comprehensive cloud platform and helping to pioneer innovations in cloud computing. The team maintains a strong culture of mentorship and knowledge-sharing, with opportunities for career growth and development.

The role offers competitive compensation ranging from $151,300 to $261,500 based on location, plus equity and comprehensive benefits. You'll be part of a diverse, inclusive environment that values work-life harmony and embraces unique perspectives.

Key technical aspects include working with PyTorch, JAX, XLA, and distributed training libraries like FSDP, DeepSpeed, and NeMo. You'll be responsible for optimizing performance on AWS custom silicon and ensuring efficient model training at scale.

The ideal candidate should have strong software development skills, deep technical expertise in machine learning, and the ability to work effectively in cross-functional teams. This is an opportunity to shape the future of ML infrastructure at AWS while working with some of the most advanced AI/ML technologies available.

Join a team that's dedicated to innovation, values continuous learning, and offers the chance to work on challenging problems at global scale. Your work will directly impact AWS customers' ability to train and deploy large-scale machine learning models efficiently.

Last updated 3 months ago

Responsibilities For Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Lead efforts to build distributed training support into PyTorch and JAX using XLA
  • Optimize models for peak performance on AWS custom silicon
  • Work with chip architects, compiler engineers and runtime engineers
  • Create, build and tune distributed training solutions with Trainium instances
  • Develop and enable performance tuning of ML model families including LLMs

Requirements For Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Python
  • Bachelor's degree in computer science or equivalent
  • 5+ years of non-internship professional software development experience
  • 5+ years of programming experience
  • 5+ years of leading design or architecture experience
  • 5+ years of full software development life cycle experience
  • Experience as a mentor, tech lead or leading an engineering team
  • Experience in machine learning, data mining, statistics or natural language processing

Benefits For Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Medical insurance
  • 401(k)
  • Full range of medical benefits
  • Financial benefits
  • Work-life harmony
  • Mentorship and career growth opportunities
