
Principal Network Development Engineer, ML Networking

Global technology company leading in e-commerce, cloud computing, and artificial intelligence.
Principal Software Engineer
In-Person
5,000+ Employees
10+ years of experience
AI · Enterprise SaaS

Description For Principal Network Development Engineer, ML Networking

The Performance Assured Networking (PAN) organization at Amazon is seeking a Principal Network Development Engineer to lead its ML Networking initiatives. This role sits at the intersection of machine learning infrastructure and network engineering, focusing on delivering high-performance networks for ML workloads. The position involves working with specialized network products and custom control plane solutions to meet the scale, performance, and availability needs of ML workloads.

The successful candidate will be responsible for transforming Amazon's approach to ML networking, starting with developing comprehensive systems for measuring and optimizing network performance for ML workloads in production. They will need to design and implement intelligent measurement systems, develop new traffic pattern classification methods, and create adaptive network configurations based on workload characteristics.

This role requires someone who can bridge theoretical knowledge with practical implementation, delivering production-grade systems while maintaining the flexibility to adapt to emerging ML innovations. The position involves working with technologies like RDMA, RoCEv2, EFA, and InfiniBand, while maintaining a deep understanding of ML training patterns and NCCL internals.

As the technical authority for ML networking performance at AWS, this role will influence how Amazon builds and operates its ML infrastructure for years to come. The ideal candidate will combine deep networking expertise with strong leadership skills, capable of driving cross-team initiatives and establishing best practices for the organization.

The role offers the opportunity to work with cutting-edge technologies in ML and networking, while solving complex challenges at scale. Amazon provides a collaborative environment focused on innovation and customer success, with opportunities to influence the future of ML infrastructure.

Last updated 3 hours ago

Responsibilities For Principal Network Development Engineer, ML Networking

  • Own ML network performance that depends on the EC2 interface
  • Design and implement systems for measuring and baselining ML workload performance
  • Develop new approaches to identify and classify network traffic patterns
  • Build systems for automatic network configuration tuning
  • Deliver a production-grade telemetry system
  • Establish best practices for ML infrastructure
  • Work across teams to drive adoption of networking approaches

Requirements For Principal Network Development Engineer, ML Networking

  • Kubernetes
  • Master's degree in Computer Science or Engineering, or equivalent experience
  • Excellent IP networking fundamentals and extensive experience in IP protocols
  • Expertise with major internet routing protocols (BGP, OSPF, MPLS, RSVP, IS-IS)
  • Expertise with major router platforms and internal hardware components
  • Expert-level network analysis fundamentals and troubleshooting skills
  • Ability to lead teams of engineers
  • Excellent written and verbal communication skills

Benefits For Principal Network Development Engineer, ML Networking

  • Medical insurance
  • Dental insurance
  • Vision insurance
  • Equal opportunity employer
  • Disability accommodations available

