Senior HPC AI Cluster Engineer

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology to transform industries.

London, UK • Warsaw, Poland • Paris, France…

Cloud

Senior Software Engineer

Remote

5,000+ Employees

5+ years of experience

AI · Enterprise SaaS

Job Description

NVIDIA is seeking an experienced Senior HPC AI Cluster Engineer to join its E2E software verification HPC/AI Infrastructure team. The role involves working on supercomputers and HPC clusters, and offers a chance to contribute to the latest developments in artificial intelligence and GPU computing.

As a Senior HPC AI Cluster Engineer, you'll be responsible for designing and implementing large-scale HPC/AI clusters, managing complex workload schedules, and developing automation tools for infrastructure management. You'll work with state-of-the-art accelerated computing and Deep Learning platforms, collaborating with scientific researchers, developers, and customers to improve workflows and create innovative solutions.

The ideal candidate brings 5+ years of experience and deep expertise in HPC environments, including knowledge of both hardware and software aspects of high-performance computing. You'll need strong skills in Python programming, Linux systems, and modern orchestration tools like Kubernetes and Slurm. Experience with storage solutions, networking protocols, and cloud platforms is essential.

NVIDIA offers a compelling opportunity to work at the forefront of AI and accelerated computing technology. The company provides competitive compensation and benefits, promoting a diverse and inclusive work environment. This remote position allows you to work from various European locations while contributing to projects that are shaping the future of computing technology.

Join NVIDIA to be part of a team that's driving innovation in AI, GPU computing, and high-performance computing, working with the latest technologies and solving complex technical challenges that impact multiple industries.

Last updated 7 days ago

Responsibilities For Senior HPC AI Cluster Engineer

Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting
Manage Linux job/workload schedulers and orchestration tools
Develop and maintain continuous integration and delivery pipelines
Develop tooling to automate deployment and management of infrastructure
Deploy monitoring solutions for servers, network, and storage
Troubleshoot and fix issues from bare metal to application level
Be a technical resource and document standard methodologies
Support R&D activities and engage in POCs/POVs

Requirements For Senior HPC AI Cluster Engineer

Python · Linux · Kubernetes

Bachelor's Degree in Computer Science, Engineering, or related field
5+ years of experience
Knowledge of HPC and AI solution technologies
Experience with job scheduling and orchestration tools (Slurm, K8s)
Excellent knowledge of Windows and Linux networking
Experience with storage solutions (Lustre, GPFS, ZFS, XFS)
Python programming and bash scripting experience
Experience with automation tools (Jenkins, Ansible, Puppet/Chef)
Deep knowledge of networking protocols (InfiniBand, Ethernet)
Deep understanding of virtual systems
Familiarity with cloud computing platforms

