Software Engineer, SystemML - AI Networking

Meta builds technologies that help people connect, find communities, and grow businesses, including Facebook, Messenger, Instagram, WhatsApp, and AR/VR technologies.
$85,100 - $251,000
Machine Learning
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Software Engineer, SystemML - AI Networking

Meta is seeking a Senior Software Engineer to join its AI Networking Software team within the DC networking organization. The role focuses on developing and maintaining the software stack around NCCL (NVIDIA Collective Communications Library), which is crucial for multi-GPU and multi-node data communication in distributed ML training. The position sits at the intersection of AI infrastructure and high-performance computing, working on critical systems that enable Meta's large-scale GPU training and inference fleet.

The team's mission is to enable Meta-wide ML products and innovations through a reliable, observable, and high-performance distributed AI/GPU communication stack. A key focus area is building customized features, software benchmarks, performance tuners, and software stacks around NCCL and PyTorch to enhance distributed ML reliability and performance, particularly for Large-Scale GenAI/LLM training.

This is an excellent opportunity for someone with strong technical skills in distributed systems, machine learning infrastructure, and high-performance computing. The role requires expertise in C/C++ and Python programming, along with specialized knowledge in areas such as distributed ML training, GPU architectures, or machine learning frameworks. The ideal candidate will have experience with NCCL, distributed GPU performance analysis, and deep learning frameworks like PyTorch.

The position offers competitive compensation ranging from $85,100 to $251,000 per year, plus bonus and equity opportunities. Working at Meta provides exposure to cutting-edge AI technologies and the chance to impact machine learning systems at a massive scale. The role is based in Menlo Park, CA, where you'll collaborate with world-class engineers and researchers in the AI infrastructure space.

Last updated a month ago

Responsibilities For Software Engineer, SystemML - AI Networking

  • Tech-lead collective communication library development on Meta's large-scale GPU training infrastructure, with a focus on GenAI/LLM scaling
  • Develop and maintain the software stack around NCCL for multi-GPU and multi-node data communication
  • Build customized features, software benchmarks, and performance tuners to improve distributed ML reliability and performance
  • Improve full-stack distributed ML performance for Large-Scale GenAI/LLM training

Requirements For Software Engineer, SystemML - AI Networking

Python
Kubernetes
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Proven C/C++ and Python programming skills
  • Proven track record of leading successful projects
  • Effective leadership and communication skills
  • Specialized experience in machine learning/deep learning domains
  • Experience with distributed ML training, GPU architecture, ML systems, or AI infrastructure

Benefits For Software Engineer, SystemML - AI Networking

  • Bonus
  • Equity
  • Medical Benefits
