AI/HPC Systems Performance Engineer

Meta

Meta builds technologies that help people connect, find communities, and grow businesses, including Facebook, Messenger, Instagram, WhatsApp, and AR/VR technologies.

Menlo Park, CA, USA

$213,000 - $293,000

Principal Software Engineer

In-Person

5,000+ Employees

10+ years of experience

AI · Enterprise SaaS


Description For AI/HPC Systems Performance Engineer

Meta is seeking an AI/HPC Systems Performance Engineer to join its Infrastructure team, focusing on scaling its AI training and inference infrastructure. This is a senior technical role that combines deep expertise in networking, distributed systems, and AI infrastructure.

The role involves tackling the scaling challenges in Meta's AI infrastructure, particularly building and evolving the network infrastructure that connects large numbers of training accelerators such as GPUs. The position requires ensuring the network meets the stringent performance and availability requirements of RDMA workloads, which expect a lossless fabric interconnect with minimal latency.

As an AI/HPC Systems Performance Engineer, you'll lead multi-disciplinary teams to develop solutions for large-scale training systems, make critical architectural decisions, and drive the technical vision for Meta's AI infrastructure. The role requires deep expertise in host networking protocols, particularly RDMA, and extensive experience in designing and operating large-scale networks.

The ideal candidate will have 10+ years of experience in network infrastructure, strong leadership abilities, and a deep understanding of AI training workloads. Experience with communication libraries (MPI, NCCL, UCX), machine learning frameworks (PyTorch, TensorFlow), and systems programming in C++ is highly valued.

Meta offers a competitive compensation package ranging from $213,000 to $293,000 per year, plus bonus, equity, and comprehensive benefits. This is an opportunity to work at the cutting edge of AI infrastructure, solving complex technical challenges at unprecedented scale, while contributing to Meta's mission of connecting people and building the future of social technology.

The role is based in Menlo Park, CA, and offers the chance to work with world-class engineers and researchers in Meta's Infrastructure team. You'll be at the forefront of developing the next generation of AI infrastructure that powers Meta's various products and services, making a direct impact on billions of users worldwide.

Last updated 2 months ago

Responsibilities For AI/HPC Systems Performance Engineer

Lead multi-disciplinary teams to develop solutions for large-scale training systems
Ensure timely milestone delivery through teamwork and close collaboration
Take responsibility for the overall performance of the communication system, including performance benchmarking, monitoring, and troubleshooting production issues
Define the technical vision and drive a multi-year roadmap
Work with cross-functional teams and provide guidance on the AI network architecture

Requirements For AI/HPC Systems Performance Engineer

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
Experience with developing, evaluating and debugging host networking protocols such as RDMA
10+ years of experience in designing, deploying and operating networks
Experience with triaging performance issues in complex scale-out distributed applications

Benefits For AI/HPC Systems Performance Engineer

Medical Insurance

Dental Insurance

Vision Insurance

Equity

401k
