
Senior HPC AI Cluster Engineer

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology to transform industries.
Cloud
Senior Software Engineer
Remote
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Job Description

NVIDIA is seeking an experienced Senior HPC AI Cluster Engineer to join its E2E software verification HPC/AI Infrastructure team. The role involves working on supercomputers and HPC clusters and offers a chance to contribute to the latest developments in artificial intelligence and GPU computing.

As a Senior HPC AI Cluster Engineer, you'll be responsible for designing and implementing large-scale HPC/AI clusters, managing complex workload schedules, and developing automation tools for infrastructure management. You'll work with state-of-the-art accelerated computing and Deep Learning platforms, collaborating with scientific researchers, developers, and customers to improve workflows and create innovative solutions.

The ideal candidate brings 5+ years of experience and deep expertise in HPC environments, covering both the hardware and software sides of high-performance computing. You'll need strong skills in Python programming, Linux systems, and job scheduling and orchestration tools such as Slurm and Kubernetes. Experience with storage solutions, networking protocols, and cloud platforms is essential.

NVIDIA offers the opportunity to work at the forefront of AI and accelerated computing technology. The company provides competitive compensation and benefits and promotes a diverse and inclusive work environment. This remote position allows you to work from various European locations while contributing to projects that are shaping the future of computing technology.

Join NVIDIA to be part of a team that's driving innovation in AI, GPU computing, and high-performance computing, working with the latest technologies and solving complex technical challenges that impact multiple industries.

Last updated 7 days ago

Responsibilities For Senior HPC AI Cluster Engineer

  • Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting
  • Manage Linux job/workload schedules and orchestration tools
  • Develop and maintain continuous integration and delivery pipelines
  • Develop tooling to automate deployment and management of infrastructure
  • Deploy monitoring solutions for servers, network and storage
  • Troubleshoot and fix issues from bare metal to application level
  • Be a technical resource and document standard methodologies
  • Support R&D activities and engage in POCs/POVs

Requirements For Senior HPC AI Cluster Engineer

Python · Linux · Kubernetes
  • Bachelor's Degree in Computer Science, Engineering, or related field
  • 5+ years of experience
  • Knowledge of HPC and AI solution technologies
  • Experience with job scheduling and orchestration tools (Slurm, K8s)
  • Excellent knowledge of Windows and Linux networking
  • Experience with storage solutions (Lustre, GPFS, ZFS, XFS)
  • Python programming and bash scripting experience
  • Experience with automation tools (Jenkins, Ansible, Puppet/Chef)
  • Deep knowledge of networking protocols (InfiniBand, Ethernet)
  • Deep understanding of virtual systems
  • Familiarity with cloud computing platforms
