Senior Site Reliability Engineer, Cloud

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology.
$144,000 - $270,250
Site Reliability
Senior Software Engineer
Remote
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Job Description

NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer to join its Cloud team. This role sits at the intersection of software and systems engineering, focusing on designing and maintaining large-scale production systems with high efficiency and availability. The SRE team at NVIDIA ensures that the company's GPU cloud services maintain maximum reliability while enabling continuous improvement and innovation.

The position offers an opportunity to work with technologies such as Kubernetes and OpenStack, while focusing on automation, performance tuning, and system optimization. NVIDIA's SRE culture emphasizes diversity, intellectual curiosity, and problem-solving in a blame-free environment. The role combines hands-on technical work with strategic thinking about system architecture and reliability.

As an SRE, you'll be responsible for the entire lifecycle of services, from design through deployment and maintenance. This includes implementing monitoring solutions, conducting launch reviews, and participating in on-call rotations. The role requires strong coding skills, particularly in languages like Python or Go, combined with deep knowledge of Linux systems and container technologies.

NVIDIA offers competitive compensation with a base salary range of $144,000 - $270,250 (depending on level), plus equity and comprehensive benefits. The company's commitment to fostering a diverse work environment and its position at the forefront of AI and digital twins technology make this an exciting opportunity for engineers looking to work on challenging problems at scale.

Last updated a day ago

Responsibilities For Senior Site Reliability Engineer, Cloud

  • Design, implement and support operational and reliability aspects of large-scale Kubernetes clusters
  • Engage in service lifecycle from inception through deployment and refinement
  • Support services through system design consulting and launch reviews
  • Maintain services by monitoring availability, latency and system health
  • Scale systems through automation
  • Practice sustainable incident response and blameless postmortems
  • Participate in on-call rotation

Requirements For Senior Site Reliability Engineer, Cloud

Python · Go · Kubernetes · Linux
  • BS degree in Computer Science or related technical field
  • 5+ years of experience with infrastructure automation and distributed systems design
  • Experience with Python, Go, Perl or Ruby
  • In-depth knowledge of Linux, networking and containers

Benefits For Senior Site Reliability Engineer, Cloud

  • Equity
  • Comprehensive benefits package