Senior HPC DevOps Engineer

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twin technology to transform industries.

Yokne'am Illit, Israel

DevOps

Senior Software Engineer

In-Person

5,000+ Employees

5+ years of experience

AI · Enterprise SaaS


Job Description

NVIDIA is seeking an experienced HPC DevOps Engineer to contribute to building next-generation supercomputers and HPC clusters. This role combines cutting-edge technology with practical implementation, focusing on large-scale system design and optimization for AI and GPU computing platforms. As a Senior HPC DevOps Engineer, you'll work at the intersection of hardware and software, collaborating with scientists, developers, and customers to enhance workflows and create innovative solutions. The position requires expertise in infrastructure management, automation, and system architecture, with opportunities to work with state-of-the-art accelerated computing and deep learning platforms. You'll be responsible for designing and maintaining large-scale HPC/AI clusters, implementing infrastructure as code, developing CI/CD pipelines, and ensuring robust monitoring systems. The role demands strong technical skills in areas like containerization, GPU computing, and high-performance networking, while also requiring leadership in best practices and innovation. At NVIDIA, you'll be part of a team pushing the boundaries of technology and making real-world impact, supported by a company culture that values diversity and inclusion.

Last updated 5 months ago

Responsibilities For Senior HPC DevOps Engineer

Design, implement, and maintain large-scale HPC/AI clusters with monitoring systems
Utilize and develop tools to manage infrastructure as code
Develop and maintain CI/CD pipelines
Develop automation scripts and tools
Deploy advanced monitoring solutions
Perform comprehensive troubleshooting
Serve as a technical resource and share best practices
Support R&D activities and engage in proofs of concept (POCs) and proofs of value (POVs)

Requirements For Senior HPC DevOps Engineer

Linux

Kubernetes

B.Sc. in Computer Science, Engineering, or related field with 5+ years of experience
Deep knowledge of HPC and AI solution technologies
Advanced proficiency in programming and scripting languages
Familiarity with Jenkins, Ansible, Puppet/Chef
Excellent knowledge of Windows and Linux
Deep understanding of networking protocols
Experience with job scheduling, workload management, and orchestration tools
Experience with multiple storage solutions
Expertise with virtual systems
Familiarity with cloud platforms