High Performance Computing Engineer

A startup building large language tools, founded by Alex Smola and Mu Li, focusing on generative AI models for language, audio, and entertainment.
$150,000 - $250,000
Cloud
Senior Software Engineer
Hybrid
11 - 50 Employees
5+ years of experience
AI
This job posting may no longer be active.

Description For High Performance Computing Engineer

Boson AI, a startup in the AI space, is seeking a Senior High Performance Computing Engineer to join its team. Founded by Alex Smola and Mu Li, the company develops large language tools and generative AI models for language, audio, and entertainment.

The role offers an opportunity to work with cutting-edge technology, including NVIDIA H100 and A100 GPUs, more than 20 PB of storage, terabit networking, and hundreds of computers. You'll be responsible for operating the GPUs, network, and filesystem in the datacenter deployment in Toronto, which calls for strong problem-solving skills and an adaptable learning mindset.

As a Senior HPC Engineer, you'll be at the heart of the infrastructure that powers Boson AI's work. The position involves managing high-end GPU clusters, configuring complex networking systems, and maintaining critical infrastructure components such as MAAS, Ceph, Slurm, and Kubernetes. You'll need to be comfortable with both software and hardware, as the role involves hands-on configuration and maintenance of physical systems.

The ideal candidate will bring a strong background in high performance computing, experience with data center operations, and proficiency in programming. You'll be working with state-of-the-art technology in a dynamic startup environment, contributing directly to the infrastructure that enables advanced AI development. The role offers competitive compensation and the opportunity to work with leading experts in the field of AI and machine learning.

If you're passionate about high-performance computing, have a strong technical background, and want to be part of a team pushing the boundaries of AI technology, this role presents an exciting opportunity to make a significant impact in a rapidly growing field.

Last updated 2 months ago

Responsibilities For High Performance Computing Engineer

  • Manage large, private, high-end GPU clusters
  • Handle full lifecycle of physical systems including deployments, operations, triage and troubleshooting
  • Configure and maintain network switches (Tomahawk Ethernet, Mellanox InfiniBand)
  • Configure and maintain MAAS, Ceph, Slurm and Kubernetes
  • Configure and automate on-premises Linux-based systems using infrastructure-as-code practices
  • Configure and maintain Layer 3 networking
  • Learn and deploy new tools

Requirements For High Performance Computing Engineer

Python
Linux
Kubernetes
  • Strong background in high performance computing
  • Experience with on-premises data center operations and technologies
  • Experience in managing a large hardware cluster
  • Proficiency in at least one programming language (e.g. Python)
  • Experience in designing, deploying, and maintaining production-grade machine learning systems at scale
  • Familiarity with GPU utilization for machine learning workloads and optimization techniques
  • Experience managing firmware and system updates
