Meta is seeking an AI/HPC Systems Performance Engineer to join their AI Training and Inference Infrastructure team. This role is critical in addressing the exponential growth and scaling challenges of Meta's AI infrastructure. The position focuses on building and evolving network infrastructure that connects numerous training accelerators like GPUs, ensuring smooth operation and meeting stringent performance requirements for RDMA workloads.
The ideal candidate will work on performance optimization across multiple layers: network fabric, host networking, communications libraries, and scheduling infrastructure. This role requires deep expertise in distributed systems, RDMA protocols, and AI infrastructure performance optimization. The position offers an opportunity to work with cutting-edge AI technologies and solve complex scaling challenges at one of the world's leading tech companies.
The role combines elements of systems engineering, performance optimization, and AI infrastructure development. You'll be part of a multi-disciplinary team developing solutions for large-scale training systems, with responsibilities including performance benchmarking, monitoring, and troubleshooting production issues. The position requires strong technical skills in networking protocols, distributed systems, and performance optimization.
Meta offers a competitive compensation package including base salary ranging from $147,000 to $208,000 annually, plus bonus and equity opportunities. The company provides comprehensive benefits and the chance to work on technologies that impact billions of users globally. This role is perfect for someone passionate about high-performance computing, AI infrastructure, and solving complex technical challenges at scale.