Senior Site Reliability Engineer

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
DevOps
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Robotics · Automotive

Description For Senior Site Reliability Engineer

NVIDIA is seeking an exceptional Senior Site Reliability Engineer to join its Infrastructure, Planning and Processes organization. The team develops and maintains sophisticated build and test environments for hardware platforms, including NVIDIA GPUs and Tegra processors, across multiple operating systems. The position involves work with cutting-edge technologies in AI, robotics, and autonomous vehicles.

The ideal candidate will be responsible for implementing and managing Kubernetes architectures, establishing high-availability clusters, and developing automation tools. They will work with various business units within NVIDIA Software, including Graphics Processors, Mobile Processors, Deep Learning, and Artificial Intelligence teams. The role requires expertise in infrastructure as code, monitoring solutions, and cloud infrastructure development.

This role suits a seasoned SRE who thrives in a fast-paced environment and wants to work with state-of-the-art technology. It offers competitive compensation and benefits at one of the technology world's most desirable employers. NVIDIA's commitment to innovation in accelerated computing and AI means the work involves transformative technologies that affect many industries.

Last updated a month ago

Responsibilities For Senior Site Reliability Engineer

  • End-to-end implementation of Kubernetes architecture: design, deployment, hardening, networking, sizing, and scaling
  • Implement high-availability clusters and disaster recovery solutions
  • Design and implement logging & monitoring solutions
  • Develop tools for automating workflows
  • Participate in prototyping and developing cloud infrastructure
  • Participate in on-call support and critical issue coverage
  • Implement critical metrics using various analytics methods and dashboards

Requirements For Senior Site Reliability Engineer

Kubernetes
Python
Go
Linux
  • Solid programming background in Python/Go
  • 5+ years of proven experience
  • Bachelor's or Master's degree in Computer Science, Software Engineering, or equivalent
  • Proficient in configuration management & IaC tools (Ansible, Puppet, Chef, Terraform)
  • Strong background with GitLab, Jenkins, Flux, and Argo CD
  • Strong expertise in Kubernetes architecture
  • Proficient in secret management tools
  • Proficient in data analytics/visualization & monitoring tools
  • Excellent debugging, problem-solving, and analytical skills

Benefits For Senior Site Reliability Engineer

  • Competitive salaries
  • Generous benefits package
