AI/HPC Systems Performance Engineer

Meta builds technologies that help people connect, find communities, and grow businesses, including Facebook, Messenger, Instagram, WhatsApp, and AR/VR technologies.
$213,000 - $293,000
Distributed Systems
Principal Software Engineer
In-Person
5,000+ Employees
10+ years of experience
AI · Enterprise SaaS

Description For AI/HPC Systems Performance Engineer

Meta is seeking an AI/HPC Systems Performance Engineer to join its Infrastructure team, focusing on scaling its AI Training and Inference Infrastructure. This is a senior technical role that combines deep expertise in networking, distributed systems, and AI infrastructure.

The role involves tackling scaling challenges in Meta's AI infrastructure, particularly building and evolving the network infrastructure that connects a large number of training accelerators such as GPUs. The position requires ensuring the network meets the stringent performance and availability requirements of RDMA workloads, which expect a lossless fabric interconnect with minimal latency.

As an AI/HPC Systems Performance Engineer, you'll lead multi-disciplinary teams to develop solutions for large-scale training systems, make critical architectural decisions, and drive the technical vision for Meta's AI infrastructure. The role requires deep expertise in host networking protocols, particularly RDMA, and extensive experience in designing and operating large-scale networks.

The ideal candidate will have 10+ years of experience in network infrastructure, strong leadership abilities, and a deep understanding of AI training workloads. Experience with communication libraries (MPI, NCCL, UCX), machine learning frameworks (PyTorch, TensorFlow), and systems programming in C++ is highly valued.

Meta offers a competitive compensation package ranging from $213,000 to $293,000 per year, plus bonus, equity, and comprehensive benefits. This is an opportunity to work at the cutting edge of AI infrastructure, solving complex technical challenges at unprecedented scale, while contributing to Meta's mission of connecting people and building the future of social technology.

The role is based in Menlo Park, CA, and offers the chance to work with world-class engineers and researchers in Meta's Infrastructure team. You'll be at the forefront of developing the next generation of AI infrastructure that powers Meta's various products and services, making a direct impact on billions of users worldwide.

Last updated 9 days ago

Responsibilities For AI/HPC Systems Performance Engineer

  • Lead multi-disciplinary teams to develop solutions for large-scale training systems
  • Ensure timely milestone delivery through teamwork and close collaboration
  • Own the overall performance of the communication system, including performance benchmarking, monitoring, and troubleshooting production issues
  • Define the technical vision and drive a multi-year roadmap
  • Work with cross-functional teams and provide guidance on the AI network architecture

Requirements For AI/HPC Systems Performance Engineer

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Experience with developing, evaluating and debugging host networking protocols such as RDMA
  • 10+ years of experience in designing, deploying and operating networks
  • Experience with triaging performance issues in complex scale-out distributed applications

Benefits For AI/HPC Systems Performance Engineer

  • Medical Insurance
  • Dental Insurance
  • Vision Insurance
  • Equity
  • 401k
