Software Engineer- AI/ML, AWS Neuron Machine Learning Distributed Training, ML Accuracy

Amazon is a global technology company providing cloud computing, e-commerce, AI, and digital streaming services.
$129,300 - $223,600
Machine Learning
Mid-Level Software Engineer
In-Person
5,000+ Employees
3+ years of experience
AI · Enterprise SaaS

Description For Software Engineer- AI/ML, AWS Neuron Machine Learning Distributed Training, ML Accuracy

Join the AWS Neuron team as a Software Engineer focused on AI/ML distributed training. This role is part of the Machine Learning Applications (ML Apps) team, working on AWS's cloud-scale machine learning accelerators, Inferentia and Trainium. You'll be responsible for developing and optimizing distributed training solutions for massive-scale language models, vision transformers, and other ML models.

The position is within Annapurna Labs, acquired by AWS in 2015, which serves as AWS's infrastructure provider. You'll work alongside chip architects, compiler engineers, and runtime engineers to create cutting-edge distributed training solutions for Trn2 and Trn1 systems. The role requires expertise in both software development and machine learning, particularly with distributed training libraries such as FSDP and DeepSpeed.

AWS offers an inclusive team culture with ten employee-led affinity groups and various learning experiences. The team values work-life balance, offering flexible working hours and supporting professional growth through mentorship and knowledge sharing. You'll be part of a diverse team working on revolutionary cloud infrastructure products that impact millions of users worldwide.

This is an opportunity to work with cutting-edge ML technology, contribute to high-impact projects, and shape the future of cloud-based machine learning infrastructure. The role combines technical depth in ML systems with the scale and impact of AWS's cloud platform, making it ideal for engineers passionate about both software development and machine learning.

Responsibilities For Software Engineer- AI/ML, AWS Neuron Machine Learning Distributed Training, ML Accuracy

  • Build distributed training support into PyTorch, TensorFlow, and JAX
  • Develop and maintain Neuron compiler and runtime stacks
  • Tune ML models for highest performance
  • Work with chip architects and compiler engineers
  • Enable and performance-tune various ML model families, including LLMs

Requirements For Software Engineer- AI/ML, AWS Neuron Machine Learning Distributed Training, ML Accuracy

  • Python
  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship design or architecture experience
  • Experience programming with at least one software programming language
  • Deep Learning industry experience

Benefits For Software Engineer- AI/ML, AWS Neuron Machine Learning Distributed Training, ML Accuracy

  • Work-life balance
  • Flexible working hours
  • Mentorship & Career Growth
  • Medical benefits
  • 401k
