High Performance Computing Engineer

Boson AI

A startup building large language tools, founded by Alex Smola and Mu Li, focusing on generative AI models for language, audio, and entertainment.

Santa Clara, CA, USA

$150,000 - $250,000

Cloud

Senior Software Engineer

Hybrid

11 - 50 Employees

5+ years of experience

Description For High Performance Computing Engineer

Boson AI, an innovative startup in the AI space, is seeking a Senior High Performance Computing Engineer to join its team. Founded by renowned experts Alex Smola and Mu Li, the company is at the forefront of developing large language tools and generative AI models for language, audio, and entertainment.

The role offers an exceptional opportunity to work with cutting-edge technology, including NVIDIA H100 and A100 GPUs, over 20 PB of storage, terabit networking, and hundreds of computers. You'll be responsible for operating the GPUs, network, and filesystem in the datacenter deployment in Toronto, which requires strong problem-solving skills and an adaptable learning mindset.

As a Senior HPC Engineer, you'll be at the heart of the infrastructure that powers Boson AI's innovative work. The position involves managing high-end GPU clusters, configuring complex networking systems, and maintaining critical infrastructure components like MAAS, Ceph, Slurm, and Kubernetes. You'll need to be comfortable with both software and hardware aspects, as the role involves hands-on configuration and maintenance of physical systems.

The ideal candidate will bring a strong background in high performance computing, experience with data center operations, and proficiency in programming. You'll be working with state-of-the-art technology in a dynamic startup environment, contributing directly to the infrastructure that enables advanced AI development. The role offers competitive compensation and the opportunity to work with leading experts in the field of AI and machine learning.

If you're passionate about high-performance computing, have a strong technical background, and want to be part of a team pushing the boundaries of AI technology, this role presents an exciting opportunity to make a significant impact in a rapidly growing field.

Last updated 2 months ago

Responsibilities For High Performance Computing Engineer

Manage large, private, high-end GPU clusters
Handle the full lifecycle of physical systems, including deployments, operations, triage, and troubleshooting
Configure and maintain network switches (Tomahawk Ethernet, Mellanox InfiniBand)
Configure and maintain MAAS, Ceph, Slurm, and Kubernetes
Configure and automate on-premises Linux-based systems using infrastructure-as-code practices
Configure and maintain Layer 3 networking
Learn and deploy new tools

Requirements For High Performance Computing Engineer

Python

Linux

Kubernetes

Strong background in high performance computing
Experience with on-premises data center operations and technologies
Experience in managing a large hardware cluster
Proficiency in at least one programming language (e.g. Python)
Experience in designing, deploying, and maintaining production-grade machine learning systems at scale
Familiarity with GPU utilization for machine learning workloads and optimization techniques
Experience managing firmware and system updates