Software Development Engineer, SageMaker HyperPod Data Plane

Amazon Web Services (AWS) is a leading cloud computing platform building next-generation AI infrastructure and services.
$129,300 - $223,600
Machine Learning
Senior Software Engineer
In-Person
5,000+ Employees
3+ years of experience
AI · Enterprise SaaS

Description For Software Development Engineer, SageMaker HyperPod Data Plane

At AWS AI, we are building the next-generation AI platform optimized for Large Language Models (LLMs) and distributed training. This role is part of the SageMaker team, focusing on making deep learning workload training accessible in the cloud. As a Senior Software Development Engineer, you'll be instrumental in designing and developing distributed machine learning systems that serve our worldwide customer base.

The position involves working on Amazon SageMaker's HyperPod Data Plane, where you'll build innovative solutions for large-scale model training (GPT models with 100+ billion parameters, trained across thousands of GPU devices). You'll collaborate with ML scientists and customers to shape our strategy and roadmap, while also serving as a technical lead on complex projects.

The role combines deep technical expertise in distributed systems, high-performance computing, and machine learning infrastructure. You'll work with cutting-edge technologies including Kubernetes, PyTorch, and NVIDIA GPUs, while having the opportunity to contribute to open-source communities.

AWS offers a collaborative environment with a strong focus on work-life balance. The team embraces diversity and inclusion, supported by employee-led affinity groups and ongoing learning experiences. You'll have opportunities for mentorship and career growth, working alongside experienced engineers in a knowledge-sharing environment.

This is a unique opportunity to have a significant impact on AWS's AI infrastructure and help shape the future of machine learning at scale. The role offers competitive compensation, including base salary, equity, and comprehensive benefits, reflecting Amazon's commitment to total compensation.

Responsibilities For Software Development Engineer, SageMaker HyperPod Data Plane

  • Develop innovative solutions to support Large Language Model training across a cluster of nodes
  • Develop and maintain a performant, resilient, and fully managed service for training large-scale foundation models
  • Optimize distributed training by profiling and addressing performance bottlenecks
  • Serve as technical lead on complex projects, applying best-practice engineering standards
  • Hire and mentor junior development engineers

Requirements For Software Development Engineer, SageMaker HyperPod Data Plane

Python
Go
Kubernetes
  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship design or architecture experience
  • Experience programming in at least one programming language
  • Experience in multi-threaded, asynchronous C++ or Go development
  • Prior experience with Kubernetes and high performance computing
  • Experience in large language model training

Benefits For Software Development Engineer, SageMaker HyperPod Data Plane

  • Medical Insurance
  • 401k
