The Performance Assured Networking (PAN) organization at Amazon is seeking a Principal Network Development Engineer to lead its ML Networking initiatives. This role sits at the intersection of machine learning infrastructure and network engineering, focusing on delivering high-performance networks for ML workloads. The position involves working with specialized network products and custom control plane solutions to meet the scale, performance, and availability needs of those workloads.
The successful candidate will be responsible for transforming Amazon's approach to ML networking, starting with developing comprehensive systems for measuring and optimizing network performance for ML workloads in production. They will need to design and implement intelligent measurement systems, develop new traffic pattern classification methods, and create adaptive network configurations based on workload characteristics.
This role requires someone who can bridge theoretical knowledge with practical implementation, delivering production-grade systems while staying flexible enough to adapt to emerging ML innovations. The position involves working with technologies such as RDMA, RoCEv2, EFA, and InfiniBand, and calls for a deep understanding of ML training patterns and NCCL internals.
As the technical authority for ML networking performance at AWS, this role will influence how Amazon builds and operates its ML infrastructure for years to come. The ideal candidate will combine deep networking expertise with strong leadership skills and will be capable of driving cross-team initiatives and establishing best practices for the organization.
The role offers the opportunity to work with cutting-edge technologies in ML and networking, while solving complex challenges at scale. Amazon provides a collaborative environment focused on innovation and customer success, with opportunities to influence the future of ML infrastructure.