Software Engineer, SystemML - AI Networking

Meta

Meta builds technologies that help people connect, find communities, and grow businesses, including Facebook, Messenger, Instagram, WhatsApp, and AR/VR technologies.

Menlo Park, CA, USA

$85,100 - $251,000

Machine Learning

Senior Software Engineer

In-Person

5,000+ Employees

5+ years of experience

AI · Enterprise SaaS

Description For Software Engineer, SystemML - AI Networking

Meta is seeking a Senior Software Engineer to join its AI Networking Software team within the DC networking organization. The role focuses on developing and maintaining the software stack around NCCL (NVIDIA Collective Communications Library), which is crucial for multi-GPU and multi-node data communication in distributed ML training. The position sits at the intersection of AI infrastructure and high-performance computing, working on critical systems that enable Meta's large-scale GPU training and inference fleet.

The team's mission is to enable Meta-wide ML products and innovations through a reliable, observable, and high-performance distributed AI/GPU communication stack. A key focus area is building customized features, software benchmarks, performance tuners, and software stacks around NCCL and PyTorch to improve distributed ML reliability and performance, particularly for large-scale GenAI/LLM training.

This is an excellent opportunity for someone with strong technical skills in distributed systems, machine learning infrastructure, and high-performance computing. The role requires expertise in C/C++ and Python programming, along with specialized knowledge in areas such as distributed ML training, GPU architectures, or machine learning frameworks. The ideal candidate will have experience with NCCL, distributed GPU performance analysis, and deep learning frameworks like PyTorch.

The position offers competitive compensation ranging from $85,100 to $251,000 per year, plus bonus and equity opportunities. Working at Meta provides exposure to cutting-edge AI technologies and the chance to impact machine learning systems at a massive scale. The role is based in Menlo Park, CA, where you'll collaborate with world-class engineers and researchers in the AI infrastructure space.

Responsibilities For Software Engineer, SystemML - AI Networking

Tech-lead collective communication library development on Meta's large-scale GPU training infrastructure, with a focus on GenAI/LLM scaling
Develop and maintain the software stack around NCCL for multi-GPU and multi-node data communication
Build customized features, software benchmarks, and performance tuners for distributed ML reliability
Improve full-stack distributed ML performance for large-scale GenAI/LLM training

Requirements For Software Engineer, SystemML - AI Networking

Python

Kubernetes

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
Proven C/C++ and Python programming skills
Proven track record of leading successful projects
Effective leadership and communication skills
Specialized experience in machine learning/deep learning domains
Experience with distributed ML training, GPU architecture, ML systems, or AI infrastructure

Benefits For Software Engineer, SystemML - AI Networking

Bonus
Equity
Medical Benefits
