Senior DevOps and Automation Engineer, Fabric Networking - GPU

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twin technology.

Tel Aviv-Yafo, Israel • Yokne'am Illit, Israel

DevOps

Senior Software Engineer

In-Person

5,000+ Employees

5+ years of experience

AI · Enterprise SaaS

Job Description

NVIDIA, the pioneer in accelerated computing and inventor of the GPU, is seeking a Senior DevOps and Automation Engineer to join its software infrastructure team. The role supports large-scale GPU clusters interconnected via NVLink and InfiniBand for HPC and AI workloads. It offers an opportunity to work at the forefront of artificial intelligence and high-performance computing, building and enhancing the systems that power developments in AI, HPC, and visualization.

The ideal candidate will be responsible for developing and maintaining CI/CD pipelines, creating automation workflows, and managing infrastructure for GPU clusters. They will work with state-of-the-art technology, including NVIDIA's DGX/HGX systems, and implement modern observability tools like Prometheus and Grafana. The role requires expertise in Python, Ansible, and Shell scripting, along with a strong understanding of Linux and distributed systems.

Working at NVIDIA means being part of a company that is transforming industries through AI and digital twin technology. The position offers exposure to cutting-edge technology and the chance to work with global engineering teams. NVIDIA's commitment to diversity and inclusion ensures a welcoming environment for all employees. This role suits someone who is passionate about infrastructure automation and system reliability and who wants to contribute to the next wave of artificial intelligence development.

Last updated 5 days ago

Responsibilities For Senior DevOps and Automation Engineer, Fabric Networking - GPU

Build and maintain CI/CD pipelines for complex systems integration and deployment
Design tools and automation workflows for software releases and dependency management
Accelerate development by modularizing systems and enabling independent release cycles
Build infrastructure automation for GPU clusters
Automate software updates and monitor system health
Troubleshoot and resolve operational issues across distributed infrastructure
Manage firmware and software rollouts
Work with global engineering teams on infrastructure tools

Requirements For Senior DevOps and Automation Engineer, Fabric Networking - GPU

Python · Linux · Kubernetes

BS or MS in Computer Science, Computer Engineering, or related field
5+ years of experience managing infrastructure or systems in high-performance environments
Expertise in scripting and automation using Python, Ansible, and Shell
Practical experience with modern CI/CD tools and infrastructure-as-code frameworks
Strong understanding of Linux, networking, and distributed system design
Proven ability to break down monolithic systems into scalable components
Ability to work and communicate effectively in a multinational environment