At AWS AI, we are building the next-generation AI platform to accelerate our customers' development of LLMs and generative AI. This role is part of the Amazon SageMaker team, which focuses on making deep learning training accessible in the cloud. As a Software Development Engineer, you'll be instrumental in designing, developing, and deploying distributed machine learning systems for our global customer base.
The position involves working on cutting-edge technology for large-scale deep learning model training, handling models with 100+ billion parameters and managing thousands of GPU devices. You'll be part of a team that's directly impacting AWS's AI infrastructure and the broader machine learning community.
Key responsibilities include developing innovative solutions for LLM training in clustered environments, optimizing distributed training performance, and serving as a technical lead on complex projects. You'll work with internal teams, technology partners, and the open-source community, particularly around frameworks such as PyTorch and NVIDIA GPU technologies.
The ideal candidate will have strong experience in multi-threaded asynchronous C++/Go development, Kubernetes, high-performance computing, and building scalable systems. You should be comfortable with ambiguity, have strong analytical skills, and thrive in an entrepreneurial environment.
AWS offers a collaborative and inclusive culture with ten employee-led affinity groups across 190 global chapters. The team values work-life balance and provides flexibility in working hours. You'll have opportunities for mentorship and career growth, working alongside experienced professionals in a supportive environment that celebrates knowledge sharing.
This role offers competitive compensation based on location and experience, with additional benefits including equity, sign-on payments, and comprehensive medical and financial benefits. Join us in shaping the future of AI infrastructure and help our customers leverage the power of machine learning at scale.