AWS Utility Computing (UC) is at the forefront of cloud innovation, providing groundbreaking products and services that distinguish AWS in the industry. This senior role within the Machine Learning Applications (ML Apps) team for AWS Neuron focuses on developing and optimizing distributed training solutions for cutting-edge AI models.
The position involves working with AWS's custom silicon accelerators (Inferentia and Trainium) and their corresponding server implementations (Inf1 and Trn1). You'll be responsible for creating high-performance distributed training solutions for large-scale models such as GPT-2/3, Stable Diffusion, and Vision Transformers.
As a senior engineer, you'll collaborate with chip architects, compiler engineers, and runtime specialists across teams to build and enhance distributed training capabilities in major frameworks such as PyTorch, TensorFlow, and JAX. The role requires deep expertise in both software development and machine learning, with a focus on distributed training technologies such as FSDP and DeepSpeed.
The team culture emphasizes knowledge-sharing and mentorship, with senior members actively participating in code reviews and one-on-one mentoring. AWS values diverse experiences and backgrounds, fostering an inclusive environment through employee-led affinity groups and ongoing learning opportunities.
Career growth is strongly supported, with resources for knowledge-sharing and professional development. The company emphasizes work-life harmony and provides comprehensive benefits including competitive base pay, equity compensation, and various medical and financial benefits.
This position offers an opportunity to work on breakthrough AI/ML technologies while being part of Amazon's larger mission to be Earth's Best Employer. The role combines technical leadership with hands-on development, making it ideal for experienced engineers passionate about advancing the field of machine learning infrastructure.