Join the Machine Learning (ML) Infrastructure team at AWS as a Software Development Engineer and build the tools that keep AWS ML and High Performance Computing (HPC) technologies running at peak performance. This role is part of Annapurna Labs, a key AWS subsidiary that develops software and hardware solutions for ML on EC2.
As a member of our team, you'll work on critical infrastructure that monitors and optimizes large-scale testing workloads. Your responsibilities will include developing automated CI/CD pipelines, managing large-scale clusters, and creating monitoring systems using AWS Managed Grafana and Athena. You'll build solutions that detect and prevent performance regressions before they impact customers.
The position requires expertise in Python, TypeScript, and infrastructure as code (IaC) using the AWS CDK. You'll work with technologies including SLURM, Active Directory, and various AWS services. Compensation ranges from $129,300 to $223,600, depending on location and experience, plus comprehensive benefits.
This is an excellent opportunity for engineers passionate about ML infrastructure who want to make a significant impact on AWS's ML and HPC capabilities. You'll work with technologies like Trainium, Neuron, and Elastic Fabric Adapter (EFA), helping to make AWS the premier platform for large-scale AI workloads.
The ideal candidate brings strong experience in software development, CI/CD automation, and ML/HPC systems. You'll need to be comfortable managing complex infrastructure across multiple instance types and operating systems, while maintaining high standards for code quality and automation. Join us in shaping the future of cloud-based machine learning infrastructure at AWS.