AI Infrastructure Operations Engineer

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs, delivering industry-leading training and inference speeds.
DevOps
Senior Software Engineer
In-Person
501 - 1,000 Employees
7+ years of experience
AI

Description For AI Infrastructure Operations Engineer

Cerebras Systems, a pioneering company in AI hardware, is seeking an AI Infrastructure Operations Engineer to join their team. The company is known for creating the world's largest AI chip, which is 56 times larger than conventional GPUs and provides unprecedented AI compute power with the simplicity of managing a single device.

The role offers a unique opportunity to work with cutting-edge technology, specifically the Wafer-Scale Engine (WSE), and the systems that harness its unparalleled power. As an AI Infrastructure Operations Engineer, you'll be responsible for managing and operating advanced AI compute infrastructure clusters, ensuring optimal performance and availability.

The position requires a deep understanding of Linux-based systems and containerization technologies, along with experience monitoring and troubleshooting complex distributed systems. With 6-8 years of relevant experience required, the ideal candidate should be proficient in Python scripting, have extensive knowledge of Docker and container orchestration platforms, and be comfortable with 24/7 on-call rotations.

Working at Cerebras means being at the forefront of AI technology advancement. The company has established partnerships with global corporations, national labs, and healthcare systems, including a multi-year, multi-million-dollar partnership with Mayo Clinic. In 2024, they launched Cerebras Inference, the fastest generative AI inference solution globally.

The company offers a unique work environment that combines the stability of an established company with the vitality of a startup. They promote a simple, non-corporate work culture that respects individual beliefs and encourages continuous learning and growth. Team members have the opportunity to work on one of the fastest AI supercomputers in the world and contribute to cutting-edge AI research.

This role is available in multiple locations, including Sunnyvale, CA; Toronto, Canada; and Bangalore, India, offering flexibility in work location. The position is ideal for someone who is passionate about AI infrastructure, enjoys solving complex technical challenges, and wants to be part of a team that's pushing the boundaries of what's possible in AI computing.

Responsibilities For AI Infrastructure Operations Engineer

  • Manage and operate multiple advanced AI compute infrastructure clusters
  • Monitor and oversee cluster health, proactively identifying and resolving potential issues (see the sketch after this list)
  • Maximize compute capacity through optimization and efficient resource allocation
  • Deploy, configure, and debug container-based services using Docker
  • Provide 24/7 monitoring and support
  • Handle engineering escalations and collaborate with other teams
  • Contribute to improving monitoring and support processes
  • Stay up to date with AI compute infrastructure technologies
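
As a rough illustration of the proactive health monitoring described above, here is a minimal Python sketch that polls a status endpoint on each cluster node and flags anything unhealthy. The hostnames, port, and /health path are hypothetical placeholders rather than Cerebras's actual tooling, and the `requests` library is assumed to be available.

```python
# Minimal sketch of a proactive cluster health check (illustrative only).
# Node hostnames and the /health endpoint are hypothetical placeholders.
import requests

NODES = ["cs-node-01.example.internal", "cs-node-02.example.internal"]

def node_is_healthy(host: str, timeout: float = 5.0) -> bool:
    """Return True if the node's (assumed) health endpoint responds with HTTP 200."""
    try:
        resp = requests.get(f"http://{host}:8080/health", timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

def main() -> None:
    unhealthy = [host for host in NODES if not node_is_healthy(host)]
    if unhealthy:
        # In practice this would page the on-call rotation rather than print.
        print(f"ALERT: unhealthy nodes: {', '.join(unhealthy)}")
    else:
        print("All nodes healthy.")

if __name__ == "__main__":
    main()
```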

Requirements For AI Infrastructure Operations Engineer

Python
Linux
Kubernetes
  • 6-8 years of experience in managing complex compute infrastructure
  • Strong proficiency in Python scripting
  • Deep understanding of Linux-based compute systems
  • Extensive knowledge of Docker containers and orchestration platforms (a brief illustrative sketch follows this list)
  • Proven ability to troubleshoot complex technical issues
  • Experience with monitoring and alerting systems
  • Proven track record of owning and driving challenges to completion
  • Excellent communication and collaboration skills
  • Ability to work in a fast-paced environment
  • Willingness to participate in 24/7 on-call rotation
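
To give a flavor of the Python-and-Docker scripting this role calls for, below is a minimal sketch that uses the Docker SDK for Python to find and restart containers whose healthchecks report "unhealthy". It assumes the `docker` package is installed and that the script has access to the local Docker daemon; it is illustrative only, not part of Cerebras's actual stack.

```python
# Minimal sketch: restart containers flagged unhealthy by their healthcheck.
# Assumes the Docker SDK for Python ("docker" package) and local daemon access.
import docker

def restart_unhealthy_containers() -> list[str]:
    """Restart every container whose healthcheck status is 'unhealthy'."""
    client = docker.from_env()
    restarted = []
    # "health=unhealthy" is a standard `docker ps` filter passed through by the SDK.
    for container in client.containers.list(filters={"health": "unhealthy"}):
        container.restart()
        restarted.append(container.name)
    return restarted

if __name__ == "__main__":
    names = restart_unhealthy_containers()
    if names:
        print(f"Restarted unhealthy containers: {', '.join(names)}")
    else:
        print("No unhealthy containers found.")
```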

Jobs Related To Cerebras Systems AI Infrastructure Operations Engineer

AI Infrastructure Operations Engineer

Senior AI Infrastructure Operations Engineer role at Cerebras Systems, managing advanced ML compute clusters and working with the world's largest AI chip.

Cloud Services Engineer

Senior DevOps Engineer role at Oracle focusing on autonomous database services, cloud infrastructure, and platform operations in Zapopan, Mexico.

Software Engineer - Infrastructure & Security

Senior Infrastructure & Security Engineer role at Julius, building and scaling cloud infrastructure for AI coding agents.

Software Engineer, Infrastructure

Senior Infrastructure Engineer role at Greenlite, building secure AI systems for financial compliance, $130k-$200k + equity, San Francisco based.

Senior Support Engineer, Smart Device

Senior Support Engineer position at Amazon leading device lab infrastructure operations, combining technical expertise with leadership responsibilities in smart device testing and lab management.