Production Systems Engineer, AI Systems

Meta builds technologies that help people connect, find communities, and grow businesses through social platforms and immersive experiences.
$132,000 - $191,000
DevOps
Senior Software Engineer
In-Person
5,000+ Employees
4+ years of experience
AI · Enterprise SaaS

Description For Production Systems Engineer, AI Systems

Meta is seeking a Systems Engineer for its Release to Production (RTP) team, with a focus on AI/ML initiatives. This role is crucial for managing the end-to-end hardware lifecycle of Meta's servers, including prototyping, debugging, and system monitoring. The position involves working with cutting-edge AI infrastructure at data center scale and collaborating with various teams to deploy new systems in production data centers.

The ideal candidate will support scale-up and scale-out network technologies for Meta's AI systems. This requires deep knowledge of network technologies (NICs, switches, optics, DACs, and protocols such as TCP/IP and RDMA) and hands-on experience across the phases of the hardware/software lifecycle. The role combines hardware expertise with software engineering skills and suits someone passionate about large-scale AI infrastructure.

Working at Meta offers the opportunity to impact billions of users through its technology platforms. The company is actively moving beyond traditional social media into immersive experiences such as AR and VR. The position offers competitive compensation ($132,000-$191,000/year) plus bonus, equity, and comprehensive benefits.

The role requires both technical depth in systems engineering and the ability to work cross-functionally with various teams. You'll be at the forefront of AI infrastructure development, working with state-of-the-art technology while helping Meta push the boundaries of what's possible in the AI space.

Last updated 4 days ago

Responsibilities For Production Systems Engineer, AI Systems

  • Support the introduction of new AI platforms into the Meta fleet by driving scale-up and scale-out interface integration
  • Create experiments and tooling to detect and diagnose hardware/firmware/software health issues
  • Develop an understanding of AI workload traffic and incorporate it into new product introduction (NPI)
  • Contribute to enabling hacks for future technology explorations in AI space
  • Troubleshoot, diagnose, and root-cause system failures
  • Develop visibility through data visualization
  • Implement systemic solutions to hardware health issues
  • Drive continuous product quality improvement

Requirements For Production Systems Engineer, AI Systems

  • Linux
  • Python
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 4+ years of work experience in network ASIC/Platform development, network product deployment, or Interconnect Technologies
  • Knowledge of server architecture and components
  • Experience working with Linux
  • Knowledge of TCP/IP and experience using iperf
  • Hands-on troubleshooting and debugging experience

Benefits For Production Systems Engineer, AI Systems

  • Medical insurance
  • Dental insurance
  • Vision insurance
  • Bonus
  • Equity
  • Benefits package

