Meta is seeking a Senior Software Engineer to join its AI Networking Software team within the data center (DC) networking organization. The role focuses on developing and maintaining the software stack around NCCL (NVIDIA Collective Communications Library), which handles multi-GPU and multi-node data communication in distributed ML training. The position sits at the intersection of AI infrastructure and high-performance computing, working on the systems that support Meta's large-scale GPU training and inference fleet.
The team's mission is to enable Meta-wide ML products and innovations through a reliable, observable, and high-performance distributed AI/GPU communication stack. A key focus area is building customized features, benchmarks, performance tuners, and supporting software around NCCL and PyTorch to improve the reliability and performance of distributed ML, particularly for large-scale GenAI/LLM training.
The role suits someone with strong technical skills in distributed systems, machine learning infrastructure, and high-performance computing. It requires expertise in C/C++ and Python programming, along with specialized knowledge in areas such as distributed ML training, GPU architectures, or machine learning frameworks. The ideal candidate will have experience with NCCL, distributed GPU performance analysis, and deep learning frameworks such as PyTorch.
The position offers compensation ranging from $85,100 to $251,000 per year, plus bonus and equity opportunities. Working at Meta provides exposure to current AI technologies and the chance to influence machine learning systems at massive scale. The role is based in Menlo Park, CA, where you'll collaborate with engineers and researchers in the AI infrastructure space.