AI Infrastructure Operations Engineer

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs, delivering industry-leading training and inference speeds.
DevOps
Senior Software Engineer
In-Person
501 - 1,000 Employees
7+ years of experience
AI

Description For AI Infrastructure Operations Engineer

Cerebras Systems, a pioneering company in AI hardware, is seeking an AI Infrastructure Operations Engineer to join their team. The company is known for creating the world's largest AI chip, which is 56 times larger than conventional GPUs and provides unprecedented AI compute power with the simplicity of managing a single device.

The role offers a unique opportunity to work with cutting-edge technology, specifically the Wafer-Scale Engine (WSE), and the systems that harness its unparalleled power. As an AI Infrastructure Operations Engineer, you'll be responsible for managing and operating advanced AI compute infrastructure clusters, ensuring optimal performance and availability.

The position requires a deep understanding of Linux-based systems and containerization technologies, along with experience monitoring and troubleshooting complex distributed systems. With 6-8 years of relevant experience required, the ideal candidate should be proficient in Python scripting, have extensive knowledge of Docker and container orchestration platforms, and be comfortable with 24/7 on-call rotations.

Working at Cerebras means being at the forefront of AI technology advancement. The company has established partnerships with global corporations, national labs, and healthcare systems, including a multi-year, multi-million-dollar partnership with Mayo Clinic. In 2024, they launched Cerebras Inference, the fastest generative AI inference solution globally.

The company offers a unique work environment that combines the stability of an established company with the vitality of a startup. They promote a simple, non-corporate work culture that respects individual beliefs and encourages continuous learning and growth. Team members have the opportunity to work on one of the fastest AI supercomputers in the world and contribute to cutting-edge AI research.

This role is available in multiple locations, including Sunnyvale, CA; Toronto, Canada; and Bangalore, India, offering flexibility in work location. The position is ideal for someone who is passionate about AI infrastructure, enjoys solving complex technical challenges, and wants to be part of a team that's pushing the boundaries of what's possible in AI computing.

Responsibilities For AI Infrastructure Operations Engineer

  • Manage and operate multiple advanced AI compute infrastructure clusters
  • Monitor and oversee cluster health, proactively identifying and resolving potential issues (see the sketch after this list)
  • Maximize compute capacity through optimization and efficient resource allocation
  • Deploy, configure, and debug container-based services using Docker
  • Provide 24/7 monitoring and support
  • Handle engineering escalations and collaborate with other teams
  • Contribute to improving monitoring and support processes
  • Stay up to date with AI compute infrastructure technologies
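
As a rough illustration of the proactive health monitoring described above, here is a minimal Python sketch that polls a status endpoint on each cluster node and flags anything unhealthy. The hostnames, port, and /health path are hypothetical placeholders rather than Cerebras's actual tooling, and the `requests` library is assumed to be available.

```python
# Minimal sketch of a proactive cluster health check (illustrative only).
# Node hostnames and the /health endpoint are hypothetical placeholders.
import requests

NODES = ["cs-node-01.example.internal", "cs-node-02.example.internal"]

def node_is_healthy(host: str, timeout: float = 5.0) -> bool:
    """Return True if the node's (assumed) health endpoint responds with HTTP 200."""
    try:
        resp = requests.get(f"http://{host}:8080/health", timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

def main() -> None:
    unhealthy = [host for host in NODES if not node_is_healthy(host)]
    if unhealthy:
        # In practice this would page the on-call rotation rather than print.
        print(f"ALERT: unhealthy nodes: {', '.join(unhealthy)}")
    else:
        print("All nodes healthy.")

if __name__ == "__main__":
    main()
```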

Requirements For AI Infrastructure Operations Engineer

Python
Linux
Kubernetes
  • 6-8 years of experience in managing complex compute infrastructure
  • Strong proficiency in Python scripting
  • Deep understanding of Linux-based compute systems
  • Extensive knowledge of Docker containers and orchestration platforms (a brief illustrative sketch follows this list)
  • Proven ability to troubleshoot complex technical issues
  • Experience with monitoring and alerting systems
  • Proven track record of owning and driving challenges to completion
  • Excellent communication and collaboration skills
  • Ability to work in a fast-paced environment
  • Willingness to participate in 24/7 on-call rotation
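
To give a flavor of the Python-and-Docker scripting this role calls for, below is a minimal sketch that uses the Docker SDK for Python to find and restart containers whose healthchecks report "unhealthy". It assumes the `docker` package is installed and that the script has access to the local Docker daemon; it is illustrative only, not part of Cerebras's actual stack.

```python
# Minimal sketch: restart containers flagged unhealthy by their healthcheck.
# Assumes the Docker SDK for Python ("docker" package) and local daemon access.
import docker

def restart_unhealthy_containers() -> list[str]:
    """Restart every container whose healthcheck status is 'unhealthy'."""
    client = docker.from_env()
    restarted = []
    # "health=unhealthy" is a standard `docker ps` filter passed through by the SDK.
    for container in client.containers.list(filters={"health": "unhealthy"}):
        container.restart()
        restarted.append(container.name)
    return restarted

if __name__ == "__main__":
    names = restart_unhealthy_containers()
    if names:
        print(f"Restarted unhealthy containers: {', '.join(names)}")
    else:
        print("No unhealthy containers found.")
```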

Jobs Related To Cerebras Systems AI Infrastructure Operations Engineer

AI Infrastructure Operations Engineer

Senior AI Infrastructure Operations Engineer role at Cerebras Systems, managing advanced ML compute clusters and working with the world's largest AI chip.

Cloud Services Engineer

Senior DevOps Engineer role at Oracle focusing on autonomous database services, cloud infrastructure, and platform operations in Zapopan, Mexico.

Software Engineer - Infrastructure & Security

Senior Infrastructure & Security Engineer role at Julius, building and scaling cloud infrastructure for AI coding agents.

Software Engineer, Infrastructure

Senior Infrastructure Engineer role at Greenlite, building secure AI systems for financial compliance, $130k-$200k + equity, San Francisco based.

Senior Support Engineer, Smart Device

Senior Support Engineer position at Amazon leading device lab infrastructure operations, combining technical expertise with leadership responsibilities in smart device testing and lab management.