AI Infrastructure Operations Engineer

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs, delivering industry-leading training and inference speeds for machine learning applications.
DevOps
Senior Software Engineer
In-Person
7+ years of experience
AI

Description For AI Infrastructure Operations Engineer

Cerebras Systems, a pioneering company in AI hardware, is seeking an AI Infrastructure Operations Engineer to manage its cutting-edge machine learning compute clusters. The role involves working with the world's largest computer chip, the Wafer-Scale Engine (WSE), and the systems that harness its power. The position requires deep expertise in Linux systems, containerization, and distributed systems management.

The successful candidate will be responsible for ensuring the health, performance, and availability of Cerebras' infrastructure while maximizing compute capacity for AI initiatives. This is a critical role that combines hands-on technical work with strategic infrastructure management, requiring both deep technical knowledge and strong operational skills.

Cerebras Systems has established itself as a leader in AI computing, with partnerships across multiple industries including healthcare, where it recently announced a multi-year partnership with Mayo Clinic. Its chip is 56 times larger than traditional GPUs, and its inference solution is 10 times faster than GPU-based cloud services.

The role offers an opportunity to work at the forefront of AI infrastructure, managing some of the most advanced computing systems in the world. The position demands a combination of technical expertise, operational excellence, and the ability to work in a fast-paced environment. The ideal candidate will have extensive experience with Linux systems, containerization, and large-scale infrastructure management, along with strong problem-solving and communication skills.

This is an excellent opportunity for someone passionate about AI infrastructure who wants to work with cutting-edge technology while making a significant impact on the future of AI computing. The role offers the chance to work with a team that's pushing the boundaries of what's possible in AI hardware and infrastructure.

Last updated 14 days ago

Responsibilities For AI Infrastructure Operations Engineer

  • Manage and operate multiple advanced AI compute infrastructure clusters
  • Monitor and oversee cluster health, proactively identifying and resolving potential issues
  • Maximize compute capacity through optimization and efficient resource allocation
  • Deploy, configure, and debug container-based services using Docker
  • Provide 24/7 monitoring and support
  • Handle engineering escalations and collaborate with other teams
  • Contribute to improving monitoring and support processes
  • Stay up-to-date with AI compute infrastructure technologies

Requirements For AI Infrastructure Operations Engineer

Python
Linux
Kubernetes
  • 6-8 years of experience in managing complex compute infrastructure
  • Strong proficiency in Python scripting
  • Deep understanding of Linux-based compute systems
  • Extensive knowledge of Docker containers and orchestration platforms
  • Experience with monitoring and alerting systems
  • Proven track record of owning and driving challenges to completion
  • Excellent communication and collaboration skills
  • Willingness to participate in 24/7 on-call rotation
