Software Engineer- AI/ML, AWS Neuron Distributed Training

AWS infrastructure provider specializing in silicon engineering, hardware design, software, and operations.
$129,300 - $223,600
Machine Learning
Mid-Level Software Engineer
In-Person
5,000+ Employees
3+ years of experience
AI · Enterprise SaaS

Description For Software Engineer- AI/ML, AWS Neuron Distributed Training

AWS Neuron is seeking a Software Engineer to join its Machine Learning Applications team, focusing on distributed training solutions. The role involves working with AWS's innovative ML accelerators, Inferentia and Trainium, and their corresponding servers (Inf1 and Trn1). You'll be responsible for developing and optimizing distributed training support for various ML models, including large language models like GPT-2/3 and vision transformers.

The position sits within Annapurna Labs, an AWS infrastructure provider acquired in 2015, which has delivered numerous successful products including AWS Nitro, ENA, EFA, Graviton, and F1 EC2 Instances. You'll work alongside chip architects, compiler engineers, and runtime engineers to create and optimize distributed training solutions using technologies like FSDP and DeepSpeed.

The role combines deep technical expertise in both software development and machine learning, with a focus on performance optimization and scalability. You'll be part of a team that values work-life balance, mentorship, and career growth, with opportunities to work on cutting-edge ML infrastructure that impacts millions of users worldwide.

Amazon offers a comprehensive benefits package and a culture that embraces diversity through various employee-led affinity groups. The company's 16 Leadership Principles emphasize seeking diverse perspectives, continuous learning, and earning trust. The team supports flexible working hours and maintains a balanced approach to professional and personal life.

This is an excellent opportunity for someone passionate about ML infrastructure, distributed systems, and high-performance computing, with the chance to work on technology that powers some of the most advanced ML applications in the cloud computing industry.

Last updated 2 days ago

Responsibilities For Software Engineer- AI/ML, AWS Neuron Distributed Training

  • Build distributed training support into PyTorch and TensorFlow using XLA
  • Develop and tune ML models for highest performance on AWS Trainium and Inferentia silicon
  • Work with chip architects, compiler engineers and runtime engineers
  • Create and optimize distributed training solutions with Trn1
  • Enable and performance tune various ML model families including LLMs and vision models

Requirements For Software Engineer- AI/ML, AWS Neuron Distributed Training

Python
  • 3+ years of non-internship professional software development experience
  • 3+ years of system design and architecture experience
  • Experience programming in at least one programming language
  • Deep Learning industry experience
  • Experience with full software development life cycle
  • Bachelor's degree in computer science or equivalent (preferred)
  • Experience with PyTorch/JAX/TensorFlow (preferred)

Benefits For Software Engineer- AI/ML, AWS Neuron Distributed Training

Medical Insurance
Mental Health Assistance
  • Work-life balance
  • Mentorship opportunities
  • Career growth opportunities
  • Medical benefits
  • Employee-led affinity groups
  • Flexible working hours
