At AWS AI, we are building the next-generation AI platform optimized for Large Language Models (LLMs) and distributed training. This role is part of the SageMaker team and focuses on making deep learning training workloads accessible in the cloud. As a Senior Software Development Engineer, you'll be instrumental in designing and developing distributed machine learning systems that serve our worldwide customer base.
The position involves working on Amazon SageMaker's HyperPod Data Plane, where you'll build innovative solutions for large-scale model training (GPT models with 100+ billion parameters, trained across thousands of GPUs). You'll collaborate with ML scientists and customers to shape our strategy and roadmap, while also serving as a technical lead on complex projects.
The role combines deep technical expertise in distributed systems, high-performance computing, and machine learning infrastructure. You'll work with cutting-edge technologies including Kubernetes, PyTorch, and NVIDIA GPUs, while having the opportunity to contribute to open-source communities.
AWS offers a collaborative environment with a strong focus on work-life balance. The team embraces diversity and inclusion, supported by employee-led affinity groups and ongoing learning experiences. You'll have opportunities for mentorship and career growth, working alongside experienced engineers in a knowledge-sharing environment.
This is a unique opportunity to have a significant impact on AWS's AI infrastructure and help shape the future of machine learning at scale. The role offers competitive compensation, including base salary, equity, and comprehensive benefits, as part of Amazon's total compensation package.