Senior HPC AI Cluster Engineer

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twin technology.
DevOps
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Senior HPC AI Cluster Engineer

NVIDIA is seeking an experienced Senior HPC AI Cluster Engineer to join its E2E software verification HPC/AI Infrastructure team. The role involves building and maintaining supercomputers and HPC clusters for accelerated computing and artificial intelligence workloads.

The position combines deep technical expertise in HPC systems with hands-on engineering work, requiring skills across system architecture, infrastructure automation, and performance optimization. You'll be working with the latest accelerated computing and deep learning platforms, collaborating with scientific researchers and developers to improve workflows and develop innovative solutions.

As a Senior HPC AI Cluster Engineer, you'll be responsible for designing and implementing large-scale HPC/AI clusters, managing workload orchestration, developing automation tools, and ensuring optimal system performance. The role requires expertise in Linux systems, networking protocols, storage solutions, and modern DevOps practices.

NVIDIA, as the world leader in accelerated computing, offers an environment where you'll be working with cutting-edge technology and contributing to breakthroughs in AI and GPU computing. The company's focus on innovation and technical excellence makes this an ideal position for someone passionate about high-performance computing and artificial intelligence.

The role offers the opportunity to work with multiple teams across the organization, providing technical leadership and developing standardized methodologies. You'll be involved in research and development activities, participating in proof-of-concepts for future improvements, and helping shape the future of HPC/AI infrastructure.

This position is perfect for a seasoned engineer who combines strong technical skills with a strategic mindset, capable of both hands-on implementation and high-level system architecture. The role offers significant growth potential and the chance to work on some of the most advanced computing systems in the industry.

Responsibilities For Senior HPC AI Cluster Engineer

  • Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting
  • Manage Linux job/workload schedulers and orchestration tools
  • Develop and maintain continuous integration and delivery pipelines
  • Develop tooling to automate deployment and management of large-scale infrastructure
  • Deploy monitoring solutions for servers, network and storage
  • Perform troubleshooting from bare metal to application level
  • Develop and document standard methodologies
  • Support R&D activities and engage in POCs/POVs

Requirements For Senior HPC AI Cluster Engineer

Python
Linux
Kubernetes
  • Degree in Computer Science, Engineering, or related field
  • 5+ years of experience
  • Knowledge of HPC and AI solution technologies
  • Experience with job scheduling and workload orchestration tools (Slurm, Kubernetes)
  • Excellent knowledge of Windows and Linux networking and internals
  • Experience with multiple storage solutions (Lustre, GPFS, ZFS, XFS)
  • Python programming and bash scripting experience
  • Experience with automation tools (Jenkins, Ansible, Puppet/Chef)
  • Deep knowledge of networking technologies and protocols such as InfiniBand and Ethernet
  • Deep understanding of virtual systems
