AI/HPC Systems Performance Engineer

Meta builds technologies that help people connect, find communities, and grow businesses, including Facebook, Messenger, Instagram, WhatsApp, and AR/VR experiences.
$147,000 - $208,000
Distributed Systems
Senior Software Engineer
In-Person
5,000+ Employees
4+ years of experience
AI · Enterprise SaaS

Description For AI/HPC Systems Performance Engineer

Meta is seeking an AI/HPC Systems Performance Engineer to join its AI Training and Inference Infrastructure team. The role addresses the scaling challenges created by the exponential growth of Meta's AI infrastructure. It focuses on building and evolving the network infrastructure that connects large numbers of training accelerators, such as GPUs, keeping that infrastructure running smoothly and meeting stringent performance requirements for RDMA workloads.

The ideal candidate will work on performance optimization across multiple layers: network fabric, host networking, communications libraries, and scheduling infrastructure. This role requires deep expertise in distributed systems, RDMA protocols, and AI infrastructure performance optimization. The position offers an opportunity to work with cutting-edge AI technologies and solve complex scaling challenges at one of the world's leading tech companies.

The role combines elements of systems engineering, performance optimization, and AI infrastructure development. You'll be part of a multi-disciplinary team developing solutions for large-scale training systems, with responsibilities including performance benchmarking, monitoring, and troubleshooting production issues. The position requires strong technical skills in networking protocols, distributed systems, and performance optimization.

Meta offers a competitive compensation package including base salary ranging from $147,000 to $208,000 annually, plus bonus and equity opportunities. The company provides comprehensive benefits and the chance to work on technologies that impact billions of users globally. This role is perfect for someone passionate about high-performance computing, AI infrastructure, and solving complex technical challenges at scale.


Responsibilities For AI/HPC Systems Performance Engineer

  • Serve as an active member of a multi-disciplinary team developing solutions for large-scale training systems
  • Own the overall performance of the communication system, including performance benchmarking, monitoring, and troubleshooting production issues
  • Identify potential performance issues across the stack: communication libraries, RDMA transport, host networking, scheduling, and network fabric
  • Develop and deploy innovative solutions to address these performance issues

Requirements For AI/HPC Systems Performance Engineer

  • Python
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • BS/MS/PhD in relevant fields (EE, CS), with 4+ years work experience
  • Experience using communication libraries such as MPI, NCCL, and UCX
  • Experience developing, evaluating, and debugging host networking protocols such as RDMA
  • Experience triaging performance issues in complex scale-out distributed applications

Benefits For AI/HPC Systems Performance Engineer

  • Medical Insurance
  • Dental Insurance
  • Vision Insurance
  • 401(k)
  • Bonus
  • Equity
  • Benefits package
