Distributed Software Engineer

Cerebras Systems

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs, delivering industry-leading training and inference speeds for machine learning applications.

Sunnyvale, CA, USA • Toronto, ON, Canada • Bengaluru, Karnataka, India

Staff Software Engineer

In-Person

6+ years of experience

Description For Distributed Software Engineer

Cerebras Systems, a pioneer in AI hardware, is seeking a Distributed Software Engineer to join its team. The company has built the world's largest AI chip, 56 times larger than GPUs. Its wafer-scale architecture provides the computing capability of dozens of GPUs on a single chip, which simplifies the management of machine learning applications.

The role focuses on building and maintaining the software infrastructure for Cerebras' multi-exaflop supercomputers, which are deployed in major datacenters. As part of the Cluster engineering team, you'll be responsible for developing critical software components that manage their Wafer-Scale Cluster technology.

The position offers an opportunity to work on cutting-edge technology at the intersection of distributed systems and AI. Cerebras has established partnerships with global corporations, national labs, and healthcare systems, including a multi-year, multi-million-dollar collaboration with Mayo Clinic. The company recently launched Cerebras Inference, the fastest Generative AI inference solution globally.

The ideal candidate will bring strong expertise in distributed systems, cluster management, and software architecture. You'll be working with modern technologies including Kubernetes, GoLang, and Python, while building systems that operate at an unprecedented scale. This is a chance to contribute to groundbreaking advancements in AI computing infrastructure while working in a non-corporate culture that values individual beliefs and innovation.

Join Cerebras to be part of a team that's pushing the boundaries of what's possible in AI computing, with the stability of an established company combined with the energy and innovation of a startup.

Last updated 14 days ago

Responsibilities For Distributed Software Engineer

Automate bare-metal configuration of networking, OS, and application software in large clusters
Implement push-button workflows for cluster upgrades, downgrades, and security patching
Develop an orchestration and scheduling system for resource allocation and job submission
Support both on-premises and cloud deployment and operations
Build a robust system for monitoring, detecting, and handling failures
Develop cluster and job monitoring and visualization capabilities
Create user-facing tools to monitor job status and collect metrics
Develop administrator-facing tools to manage and operate large clusters

Requirements For Distributed Software Engineer

Strong track record of 6 or more years in software architecture, system design, and development
Strong track record of development in distributed cluster environments
Strong understanding of the Kubernetes (K8s) software ecosystem, Prometheus, and Grafana
Strong development skills in GoLang, Python, and Bash
Strong debugging skills with distributed systems
Strong skills in developing tests for new features and regression tests for existing features
