Meta is seeking an experienced Production Systems Engineer to join its Release to Production (RTP) team, focusing on AI/ML initiatives and large-scale AI training and inference systems. This role sits at the intersection of hardware and software, working with the server infrastructure that powers Meta's AI services.
The position involves managing the end-to-end hardware lifecycle of Meta's servers, including prototyping experimental hardware, conducting pre-production debugging, implementing system monitoring, and developing automated provisioning solutions. The role directly supports Meta's efforts to scale its AI infrastructure.
As a Production Systems Engineer, you'll work closely with cross-functional teams, including hardware designers, networking teams, system manufacturers, and data center operations, to enable and optimize new systems for production deployment. The role requires deep expertise in network technologies, including NICs, switches, optics, and a range of networking protocols.
The ideal candidate has strong technical skills in server architecture, Linux systems, and networking protocols, with particular emphasis on AI platform integration and scale-out network technologies. In this role, you'll be responsible for creating experimental frameworks, developing diagnostic tools, and implementing solutions for hardware health monitoring.
This is an excellent opportunity for someone passionate about large-scale infrastructure and AI systems. Compensation is $132,000-$191,000 per year plus bonus, equity, and comprehensive benefits. The role is based in Menlo Park, CA, and offers the chance to work on AI infrastructure at one of the world's leading technology companies.
Meta provides a collaborative environment where you'll work with industry experts and have the opportunity to influence the future of AI infrastructure. The company offers excellent career growth potential and the chance to work on technologies that impact billions of users worldwide.