AWS Utility Computing (UC) provides product innovations that continue to set AWS's services and features apart in the industry. This senior role is part of the AWS Neuron team, focusing on distributed training for cloud-scale Machine Learning accelerators. The position involves working with AWS Inferentia and Trainium, our custom ML accelerators, developing and optimizing solutions for large-scale ML models including LLMs like GPT and Llama.
The role combines deep software engineering expertise with machine learning knowledge, and requires work with frameworks like PyTorch and TensorFlow and with distributed training libraries such as FSDP and DeepSpeed. You'll collaborate with chip architects and compiler engineers to optimize performance on custom silicon.
Annapurna Labs, acquired by AWS in 2015, is fundamental to AWS's infrastructure, delivering products like AWS Nitro, Graviton, and ML accelerators. The team emphasizes knowledge-sharing, mentorship, and career growth, supporting members through code reviews and development opportunities.
AWS values diverse experiences and maintains an inclusive culture through employee-led affinity groups and ongoing learning experiences. Work-life harmony is prioritized, ensuring success at work doesn't compromise personal life. The position offers comprehensive benefits, equity compensation, and competitive salary based on location and experience.
Key responsibilities include leading the development of distributed training support, tuning the performance of ML models, and working across teams to optimize solutions for AWS's custom silicon. The role requires strong software development skills combined with machine learning expertise, making it ideal for candidates passionate about pushing the boundaries of cloud computing and AI technology.