Distributed Software Engineer

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs, delivering industry-leading training and inference speeds for machine learning applications.
Staff Software Engineer
In-Person
6+ years of experience
AI

Description For Distributed Software Engineer

Cerebras Systems, a pioneer in AI hardware, is seeking a Distributed Software Engineer to join its team. The company has built the world's largest AI chip, 56 times larger than GPUs. Its wafer-scale architecture provides the computing capability of dozens of GPUs on a single chip, which simplifies the management of machine learning applications.

The role focuses on building and maintaining the software infrastructure for Cerebras' multi-exaflop supercomputers, which are deployed in major datacenters. As part of the Cluster engineering team, you'll be responsible for developing critical software components that manage their Wafer-Scale Cluster technology.

The position offers an opportunity to work on cutting-edge technology at the intersection of distributed systems and AI. Cerebras has established partnerships with global corporations, national labs, and healthcare systems, including a multi-year, multi-million-dollar collaboration with Mayo Clinic. The company recently launched Cerebras Inference, the fastest Generative AI inference solution globally.

The ideal candidate will bring strong expertise in distributed systems, cluster management, and software architecture. You'll be working with modern technologies including Kubernetes, GoLang, and Python, while building systems that operate at an unprecedented scale. This is a chance to contribute to groundbreaking advancements in AI computing infrastructure while working in a non-corporate culture that values individual beliefs and innovation.

Join Cerebras to be part of a team that's pushing the boundaries of what's possible in AI computing, with the stability of an established company combined with the energy and innovation of a startup.

Last updated 14 days ago

Responsibilities For Distributed Software Engineer

  • Automate bare-metal configuration of networking, OS, and application software in large clusters
  • Implement push-button workflows for cluster upgrades, downgrades, and security patching
  • Develop an orchestration and scheduling system for resource allocation and job submission
  • Support deployment and operations in both on-premises and cloud modes
  • Build a robust system for monitoring, detecting, and handling failures
  • Develop cluster and job monitoring and visualization capabilities
  • Create user-facing tools to monitor job status and collect metrics
  • Develop administrator-facing tools to manage and operate large clusters

Requirements For Distributed Software Engineer

Skills: Go, Python, Kubernetes
  • Strong track record of 6 or more years in software architecture, system design, and development
  • Strong track record of development in distributed cluster environments
  • Strong understanding of the Kubernetes (K8s) software ecosystem, Prometheus, and Grafana
  • Strong development skills in Go, Python, and Bash
  • Strong skills in debugging distributed systems
  • Strong skills in writing tests for new features and regression tests for existing features
