Meta is seeking a Senior Software Engineer to join its Network.AI Software team within the DC networking organization. The role is central to developing and maintaining the software stack around the NVIDIA Collective Communications Library (NCCL), which handles multi-GPU and multi-node data communication in distributed ML training.
The position focuses on enabling Meta-wide ML products and innovations to leverage Meta's large-scale GPU training and inference fleet through an observable, reliable, and high-performance distributed AI/GPU communication stack. A key area of work is building customized features, software benchmarks, performance tuners, and software stacks around NCCL and PyTorch to improve full-stack distributed ML reliability and performance, particularly for large-scale GenAI/LLM training.
As a member of this team, you'll work at the intersection of high-performance computing and machine learning infrastructure, directly impacting the performance of Meta's distributed GPU-based ML workloads. The role requires deep expertise in distributed systems, GPU architecture, and machine learning frameworks, with a particular emphasis on scaling reliability and performance for GenAI/LLM applications.
The ideal candidate has strong experience with NCCL, distributed GPU systems, and deep learning frameworks such as PyTorch. Knowledge of both data-parallel and model-parallel training techniques, including Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP), is highly valuable. Compensation ranges from $70,670 to $208,000 annually, plus bonus, equity, and comprehensive benefits.
This position is an opportunity to work on cutting-edge AI infrastructure at one of the world's leading technology companies, contributing directly to the advancement of large-scale machine learning systems. The work involves collaborating with teams across Meta to optimize and scale AI training systems, making it a good fit for someone passionate about both systems engineering and machine learning.