HPC Systems Engineer

DevOps
Senior Software Engineer
In-Person
AI

Description For HPC Systems Engineer

Zyphra is seeking an HPC Systems Engineer to lead its machine learning infrastructure efforts. The role combines traditional systems engineering with AI/ML infrastructure management: maintaining and developing core infrastructure for machine learning research and production, managing Linux-based cluster environments, and keeping critical systems running smoothly. The ideal candidate has strong Linux administration skills, experience with containerization and job scheduling, and the ability to work across technical domains. The role directly affects developer productivity and ML training performance at a company working on advanced AI technologies, and it requires both systems engineering expertise and the ability to collaborate with other teams.

Responsibilities For HPC Systems Engineer

  • Maintaining and developing core infrastructure for machine learning research and production
  • Administration and automation of Linux-based cluster environments
  • Managing user onboarding/offboarding, security auditing, and access control
  • Monitoring system resources and job scheduling
  • Supporting and improving developer workflows
  • Enabling and supporting AI/ML workloads

Requirements For HPC Systems Engineer

Linux
Python
Kubernetes
  • Strong experience with Linux system administration, user and access management, and automation
  • Demonstrated expertise in scripting languages for system tooling and automation (Bash, Python, etc.)
  • Familiarity with containerized environments and job scheduling systems like Slurm
  • Experience building tooling for cluster validation and reliability
  • Experience setting up and managing developer tools and third-party services
  • Excellent debugging and troubleshooting skills across compute, storage, and networking
  • Strong communication skills and ability to collaborate across technical and non-technical teams
