Software Engineer, SystemML - Scaling / Performance

Meta builds technologies that help people connect, find communities, and grow businesses, including Facebook, Messenger, Instagram, WhatsApp, and AR/VR technologies.
$70,670 - $208,000
Machine Learning
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Software Engineer, SystemML - Scaling / Performance

Meta is seeking a Senior Software Engineer to join its Network.AI Software team within the DC networking organization. The role centers on developing and maintaining the software stack around the NVIDIA Collective Communications Library (NCCL), which handles multi-GPU and multi-node data communication in distributed ML training.

The position focuses on enabling ML products and innovations across Meta to use the company's large-scale GPU training and inference fleet through an observable, reliable, and high-performance distributed AI/GPU communication stack. A key focus area is building customized features, software benchmarks, performance tuners, and software stacks around NCCL and PyTorch to improve full-stack distributed ML reliability and performance, particularly for large-scale GenAI/LLM training.

As a member of this team, you'll work at the intersection of high-performance computing and machine learning infrastructure, directly impacting the performance of Meta's distributed GPU-based ML workloads. The role requires deep expertise in distributed systems, GPU architecture, and machine learning frameworks, with a particular emphasis on scaling reliability and performance for GenAI/LLM applications.

The ideal candidate has strong experience with NCCL, distributed GPU systems, and deep learning frameworks such as PyTorch. Knowledge of both data parallel and model parallel training techniques, including Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP), is highly valuable. The role offers compensation ranging from $70,670 to $208,000 annually, plus bonus, equity, and comprehensive benefits.

This position represents an opportunity to work on cutting-edge AI infrastructure at one of the world's leading technology companies, directly contributing to the advancement of large-scale machine learning systems. The work will involve collaboration with various teams across Meta to optimize and scale AI training systems, making it an ideal role for someone passionate about both systems engineering and machine learning technology.


Responsibilities For Software Engineer, SystemML - Scaling / Performance

  • Enable reliable and highly scalable distributed ML training on Meta's large-scale GPU training infrastructure
  • Focus on GenAI/LLM scaling
  • Develop and maintain software stack around NCCL for multi-GPU and multi-node data communication
  • Improve full-stack distributed ML reliability and performance

Requirements For Software Engineer, SystemML - Scaling / Performance

  • Python
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Specialized experience in distributed ML training, GPU architecture, ML systems, AI infrastructure, high-performance computing, or machine learning frameworks
  • Experience with NCCL and distributed GPU reliability/performance improvement
  • Experience with DL frameworks such as PyTorch, Caffe2, or TensorFlow
  • Knowledge of GPU architectures and CUDA programming
  • Knowledge of ML, deep learning, and LLMs

Benefits For Software Engineer, SystemML - Scaling / Performance

  • Medical insurance
  • Equity
  • 401k
  • Bonus
  • Health benefits

