AWS Neuron is seeking a talented Software Engineer to join its Machine Learning Applications (ML Apps) team. This role focuses on developing and optimizing AWS's cloud-scale machine learning accelerators, Inferentia and Trainium, and their corresponding Inf1 and Trn1 servers.
The position involves working with cutting-edge ML technologies, particularly distributed training of large language models such as Llama 4, Mixtral, and DBRX. You'll collaborate closely with chip architects, compiler engineers, and runtime engineers to build and tune distributed training solutions on Trainium.
Key responsibilities include implementing distributed training support in the PyTorch and JAX frameworks using XLA and the Neuron compiler stack. The role requires both strong software development skills and deep machine learning knowledge to optimize model performance on AWS Trainium systems.
The team values work-life balance and fosters an inclusive culture supported by Amazon's 16 Leadership Principles. You'll have opportunities for mentorship and career growth in a collaborative environment that celebrates knowledge sharing. The position offers competitive compensation including base pay, equity, and comprehensive benefits.
This is an exciting opportunity to work at the intersection of machine learning and systems engineering, building the infrastructure that powers next-generation AI applications. You'll be part of a team that's pushing the boundaries of distributed ML training while maintaining Amazon's high standards for engineering excellence.