Meta is seeking an experienced Lab Engineer to join the Release to Production (RTP) team in the Infrastructure Foundation organization at Meta Platforms. The key focus for this role will be supporting large-scale experiments and new hardware product development within Meta's data center environment. These products are generally compute, storage, and AI platforms, with the engineering focus resting at the server systems level. We're seeking an engineer with a breadth of knowledge in supporting and troubleshooting complex problems at scale, including in Linux environments. This role will be critical to the development of next-generation hardware platforms deployed to connect Meta's growing global community. Strong cross-functional engagement and independent problem-solving skills are core competencies for this team. Additionally, we need someone with a deep technical understanding and the ability to drive projects full cycle in one of the following key areas: Networking Systems/Hardware, Compute/Storage Hardware, Tooling and Automation, Systems Administration, New Product Validation/Integration (NPI), or similar.
Responsibilities:
- Work with hardware design and validation teams, vendors, and others to test and deploy new server and storage products across our data center infrastructure.
- Identify, characterize, and root-cause hardware failures and error conditions in pre-production hardware environments.
- Manage hardware systems projects (experiments, NPI, and product testbeds) that require customization and integration of engineering sample hardware within Meta's production environment.
- Participate in program design, test, phase exit, and retrospective efforts.
- Test and troubleshoot new hardware products and components with minimal documentation and direction.
- Manage the full lifecycle for lab hardware assets from initial bring-up through end of life.
- Collaborate with hardware teams by running small-scale experiments, collecting data, and providing feedback on failure symptoms for hardware platforms under test.
- Assist in the design and implementation of large-scale experiments requiring custom integration of single or multiple racks of hardware, potentially including modification of the data center environment.
- Communicate and coordinate with other data center technical operations teams.
- Provide serviceability feedback on new hardware platforms.
- Serve as an escalation point for local data center staff on all lab activities and NPI hardware.
- Maintain a hardware test lab operation within Meta's production data center environment.
Minimum Qualifications:
- Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.
- 5+ years of experience with Linux and hardware systems support in an internet operations environment.
- Experience working in a Linux CLI, leveraging Linux sysadmin tools to investigate and solve problems.
- Experience supervising, training, mentoring, and/or leading lab technicians.
- Knowledge of out-of-band/lights-out server communication methods, such as IPMI and serial console.
- Cross-functional communication skills, with experience documenting and presenting to teams at all levels.
- Experience managing technical projects with ambiguous requirements.
Preferred Qualifications:
- Experience working with hardware engineering teams in developing compute, storage, or AI products.
- Experience working with software development teams through full product lifecycles, including hardware integration.
- Experience with one or more of the following: Bash, PHP, Python, Perl.