AI Infrastructure Operations Engineer

Cerebras Systems

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs, delivering industry-leading training and inference speeds for machine learning applications.

Sunnyvale, CA, USA • Toronto, ON, Canada • Bengaluru, Karnataka, India

DevOps

Senior Software Engineer

In-Person

7+ years of experience

This job posting is no longer active.

Job Description

Cerebras Systems, a pioneering company in AI hardware, is seeking an AI Infrastructure Operations Engineer to manage its cutting-edge machine learning compute clusters. The role involves working with the world's largest computer chip, the Wafer-Scale Engine (WSE), and the systems that harness its power. The position requires deep expertise in Linux systems, containerization, and distributed systems management.

The successful candidate will be responsible for ensuring the health, performance, and availability of Cerebras' infrastructure while maximizing compute capacity for AI initiatives. This is a critical role that combines hands-on technical work with strategic infrastructure management, requiring both deep technical knowledge and strong operational skills.

Cerebras Systems has established itself as a leader in AI computing, with partnerships across multiple industries including healthcare, where it recently announced a multi-year partnership with Mayo Clinic. Its technology delivers unprecedented AI computing power: the chip is 56 times larger than traditional GPUs, and the inference solution is 10 times faster than GPU-based cloud services.

The role offers an opportunity to work at the forefront of AI infrastructure, managing some of the most advanced computing systems in the world. The position demands a combination of technical expertise, operational excellence, and the ability to work in a fast-paced environment. The ideal candidate will have extensive experience with Linux systems, containerization, and large-scale infrastructure management, along with strong problem-solving and communication skills.

This is an excellent opportunity for someone passionate about AI infrastructure who wants to work with cutting-edge technology while making a significant impact on the future of AI computing. The role offers the chance to work with a team that's pushing the boundaries of what's possible in AI hardware and infrastructure.

Last updated 4 months ago

Responsibilities For AI Infrastructure Operations Engineer

Manage and operate multiple advanced AI compute infrastructure clusters
Monitor and oversee cluster health, proactively identifying and resolving potential issues
Maximize compute capacity through optimization and efficient resource allocation
Deploy, configure, and debug container-based services using Docker
Provide 24/7 monitoring and support
Handle engineering escalations and collaborate with other teams
Contribute to improving monitoring and support processes
Stay up-to-date with AI compute infrastructure technologies

Requirements For AI Infrastructure Operations Engineer

Python

Linux

Kubernetes

6-8 years of experience in managing complex compute infrastructure
Strong proficiency in Python scripting
Deep understanding of Linux-based compute systems
Extensive knowledge of Docker containers and orchestration platforms
Experience with monitoring and alerting systems
Proven track record of owning challenges and driving them to completion
Excellent communication and collaboration skills
Willingness to participate in 24/7 on-call rotation