Meta is seeking an AI/HPC Systems Performance Engineer to join its Infrastructure team, focusing on scaling its AI Training and Inference Infrastructure. This is a senior technical role that combines deep expertise in networking, distributed systems, and AI infrastructure.
The role involves tackling dramatic scaling challenges in Meta's AI infrastructure, particularly in building and evolving the network infrastructure that connects large numbers of training accelerators such as GPUs. The position requires ensuring that the network meets the stringent performance and availability requirements of RDMA workloads, which expect a lossless fabric interconnect with minimal latency.
As an AI/HPC Systems Performance Engineer, you'll lead multi-disciplinary teams to develop solutions for large-scale training systems, make critical architectural decisions, and drive the technical vision for Meta's AI infrastructure. The role requires deep expertise in host networking protocols, particularly RDMA, and extensive experience in designing and operating large-scale networks.
The ideal candidate will have 10+ years of experience in network infrastructure, strong leadership abilities, and a deep understanding of AI training workloads. Experience with communication libraries (MPI, NCCL, UCX), machine learning frameworks (PyTorch, TensorFlow), and systems programming in C++ is highly valued.
Meta offers a competitive compensation package ranging from $213,000 to $293,000 per year, plus bonus, equity, and comprehensive benefits. This is an opportunity to work at the cutting edge of AI infrastructure, solving complex technical challenges at unprecedented scale, while contributing to Meta's mission of connecting people and building the future of social technology.
The role is based in Menlo Park, CA, and offers the chance to work with world-class engineers and researchers in Meta's Infrastructure team. You'll be at the forefront of developing the next generation of AI infrastructure that powers Meta's various products and services, making a direct impact on billions of users worldwide.