Software Development Engineer, HPC/ML Interconnect Engineer

Amazon Web Services (AWS) is a leading cloud computing platform providing a wide range of services including compute, storage, and AI/ML solutions.
$129,300 - $223,600
Distributed Systems
Senior Software Engineer
Hybrid
5,000+ Employees
3+ years of experience
AI · Enterprise SaaS · Cloud

Description For Software Development Engineer, HPC/ML Interconnect Engineer

We are seeking an experienced software engineer with low-latency networking or interconnect expertise to improve the customer experience by designing systems that scale network-intensive workloads across thousands of CPUs, GPUs, and TPUs. This role is at the forefront of AI/ML, focusing on optimizing networking for the latest AI workloads such as LLMs.

As part of the AWS Utility Computing (UC) organization, you'll support the development and management of various AWS services, including Compute, Database, Storage, IoT, Platform, and Productivity Apps. You'll work within Annapurna Labs, designing silicon and software that accelerates innovation for cloud solutions.

Key responsibilities:

  • Design and optimize networking solutions for Machine Learning (ML) and High-Performance Computing (HPC) workloads on AWS
  • Collaborate with cross-functional teams and engage with customers to gather feedback and improve offerings
  • Develop low-latency networking and collective operations for HPC network fabric or machine learning accelerator cluster systems
  • Troubleshoot complex networking issues and implement solutions at scale

Required skills:

  • Extensive experience in low-latency networking and collective operations
  • Proficiency in C/C++ and deep understanding of Linux and kernel-level programming
  • Strong problem-solving skills and ability to troubleshoot complex networking issues
  • Excellent communication skills for effective collaboration in a team environment

The role offers opportunities to work on cutting-edge AI/ML technologies, participate in innovative learning experiences, and benefit from a diverse and inclusive team culture. AWS values work-life balance and offers flexible working hours.

Join the Elastic Collectives team at Annapurna Labs and be part of shaping the future of networking solutions for ML and HPC workloads on AWS!

Last updated 22 days ago

Responsibilities For Software Development Engineer, HPC/ML Interconnect Engineer

  • Design and optimize networking solutions for ML and HPC workloads
  • Collaborate with cross-functional teams
  • Engage with customers to gather feedback and improve offerings
  • Develop low-latency networking and collective operations
  • Troubleshoot complex networking issues
  • Implement solutions at scale

Requirements For Software Development Engineer, HPC/ML Interconnect Engineer

  • 3+ years of non-internship professional software development experience
  • 2+ years of non-internship design or architecture experience
  • Programming experience in at least one programming language
  • Extensive experience in low-latency networking and collective operations
  • Proficiency in C/C++
  • Deep understanding of Linux and kernel-level programming
  • Strong problem-solving skills
  • Excellent communication skills

Benefits For Software Development Engineer, HPC/ML Interconnect Engineer

  • Medical Insurance
  • Dental Insurance
  • Vision Insurance
  • Flexible working hours
  • Career growth opportunities
  • Mentorship programs
  • Diverse and inclusive team culture
  • Work-life balance
