Senior DevOps and Automation Engineer, Fabric Networking - GPU

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twin technology.
DevOps
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Job Description

NVIDIA, the pioneer in accelerated computing and inventor of the GPU, is seeking a Senior DevOps and Automation Engineer to join its software infrastructure team. This role is crucial in supporting large-scale GPU clusters interconnected via NVLink and InfiniBand for cutting-edge HPC and AI workloads. The position offers an opportunity to work at the forefront of artificial intelligence and high-performance computing, building and enhancing systems that power groundbreaking developments in AI, HPC, and visualization.

The ideal candidate will be responsible for developing and maintaining CI/CD pipelines, creating automation workflows, and managing infrastructure for GPU clusters. They will work with state-of-the-art technology, including NVIDIA's DGX/HGX systems, and implement modern observability tools like Prometheus and Grafana. The role requires expertise in Python, Ansible, and shell scripting, along with a strong understanding of Linux and distributed systems.

Working at NVIDIA means being part of a company that's transforming industries through AI and digital twin technology. The position offers exposure to cutting-edge technology and the chance to work with global engineering teams. NVIDIA's commitment to diversity and inclusion ensures a welcoming environment for all employees. This role is perfect for someone who is passionate about infrastructure automation and system reliability and who wants to contribute to the next wave of artificial intelligence development.

Last updated 5 days ago

Responsibilities For Senior DevOps and Automation Engineer, Fabric Networking - GPU

  • Build and maintain CI/CD pipelines for complex systems integration and deployment
  • Design tools and automation workflows for software releases and dependency management
  • Accelerate development by modularizing systems and enabling independent release cycles
  • Build infrastructure automation for GPU clusters
  • Automate software updates and monitor system health
  • Troubleshoot and resolve operational issues across distributed infrastructure
  • Manage firmware and software rollouts
  • Work with global engineering teams on infrastructure tools

Requirements For Senior DevOps and Automation Engineer, Fabric Networking - GPU

Python
Linux
Kubernetes
  • BS or MS in Computer Science, Computer Engineering, or related field
  • 5+ years of experience managing infrastructure or systems in high-performance environments
  • Expertise in scripting and automation using Python, Ansible, and Shell
  • Practical experience with modern CI/CD tools and infrastructure-as-code frameworks
  • Strong understanding of Linux, networking, and distributed system design
  • Proven ability to break down monolithic systems into scalable components
  • Ability to work and communicate effectively in a multi-national environment