
Senior DevOps and Automation Engineer, Fabric Networking - GPU

World leader in accelerated computing, pioneering AI and digital twin technology.
$148,000 - $287,500
DevOps
Senior Software Engineer
Remote
5+ years of experience
AI · Enterprise SaaS
This job posting may no longer be active.

Description For Senior DevOps and Automation Engineer, Fabric Networking - GPU

NVIDIA, the pioneer in accelerated computing and inventor of the GPU, is seeking a Senior DevOps and Automation Engineer for its Fabric Networking - GPU team. The role is central to developing and maintaining the software that enables GPU communication for High Performance Computing and Deep Learning solutions.

The position involves working with cutting-edge technology, including large GPU clusters interconnected via NVLink and InfiniBand. You'll be responsible for developing automated tools for cluster deployment, implementing modern DevOps practices, and ensuring optimal cluster performance. This role combines hands-on technical expertise with collaborative teamwork across multiple time zones.

The ideal candidate will bring strong expertise in automation tools like Ansible and Python, along with deep knowledge of Linux systems and cluster management. Experience with GPU-focused hardware and software, particularly DGX systems and Compute Clusters, would be highly valuable. The role offers exposure to groundbreaking developments in Artificial Intelligence and High-Performance Computing.

NVIDIA offers a competitive compensation package, including a base salary range of $148,000 - $287,500 USD, equity, and comprehensive benefits. This is an opportunity to join a company at the forefront of AI and accelerated computing, working on technology that powers everything from artificial intelligence to autonomous vehicles. The position can be performed remotely, as part of a team that's driving innovation in the industry.

Last updated 5 months ago

Responsibilities For Senior DevOps and Automation Engineer, Fabric Networking - GPU

  • Develop automated tools to deploy, provision, and maintain GPU clusters with NVLink and InfiniBand
  • Implement DevOps tools for software updates, maintenance, and cluster monitoring
  • Handle daily cluster failures and troubleshooting
  • Manage the rollout of cluster software and firmware updates
  • Collaborate with Engineering and Product Teams across multiple time zones

Requirements For Senior DevOps and Automation Engineer, Fabric Networking - GPU

Python · Linux · Kubernetes
  • BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or related field
  • 5+ years of experience deploying and administering clusters, servers, and infrastructure
  • Expertise in Ansible, Python, and shell scripting
  • Deep understanding of operating systems, computer networks, and high-performance applications
  • Proven ability to work with cross-functional teams
  • Proficient with Linux fundamentals

Benefits For Senior DevOps and Automation Engineer, Fabric Networking - GPU

  • Equity
