Senior HPC DevOps Engineer

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
DevOps
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Senior HPC DevOps Engineer

NVIDIA is seeking an experienced Senior HPC DevOps Engineer to contribute to building next-generation supercomputers and HPC clusters. This role sits at the intersection of artificial intelligence and GPU computing, offering an opportunity to drive breakthrough advancements in technology. The position involves working with cutting-edge accelerated computing and deep learning platforms, collaborating with scientific researchers and developers to optimize workflows and create innovative solutions.

As a Senior HPC DevOps Engineer, you'll be responsible for designing and maintaining large-scale HPC/AI clusters, implementing infrastructure as code, and developing automated deployment solutions. The role requires expertise in both the hardware and software aspects of high-performance computing, including troubleshooting from bare metal to the application level. You'll work with state-of-the-art technologies including GPUs, high-speed interconnects, and various storage solutions.

The ideal candidate brings strong technical expertise in HPC environments, programming skills, and experience with modern DevOps tools and practices. This position offers the opportunity to work with NVIDIA's cutting-edge technology while contributing to solutions that are pushing the boundaries of what's possible in AI and accelerated computing. The role combines hands-on technical work with leadership responsibilities, including sharing best practices and driving innovation within the team.

Working at NVIDIA means joining a company at the forefront of technological innovation, with a strong commitment to diversity and inclusion. This role offers the chance to make a real impact in the world of high-performance computing while working with some of the industry's most advanced technologies and brightest minds.

Last updated 5 days ago

Responsibilities For Senior HPC DevOps Engineer

  • Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting systems
  • Utilize and develop tools to manage infrastructure as code
  • Develop and maintain CI/CD pipelines
  • Develop automation scripts and tools
  • Deploy advanced monitoring solutions
  • Perform comprehensive troubleshooting from bare metal to application level
  • Serve as a technical resource and share best practices
  • Support R&D activities and engage in proofs of concept

Requirements For Senior HPC DevOps Engineer

Linux
Kubernetes
  • B.Sc. in Computer Science, Engineering, or related field with 5+ years of experience
  • Deep knowledge of HPC and AI solution technologies
  • Advanced proficiency in programming and scripting languages
  • Familiarity with Jenkins, Ansible, Puppet/Chef
  • Excellent knowledge of Windows and Linux
  • Deep understanding of networking protocols
  • Experience with job scheduling, workload management, and orchestration tools
  • Experience with multiple storage solutions
  • Expertise with virtual systems
  • Familiarity with cloud platforms

