Senior Site Reliability Engineer, Cloud

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology.

Santa Clara, CA, USA

$144,000 - $270,250

Site Reliability

Senior Software Engineer

Remote

5,000+ Employees

5+ years of experience

AI · Enterprise SaaS

Job Description

NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer to join its Cloud team. This role sits at the intersection of software and systems engineering, focusing on designing and maintaining large-scale production systems with high efficiency and availability. The SRE team at NVIDIA ensures that the company's GPU cloud services maintain maximum reliability while enabling continuous improvement and innovation.

The position offers an opportunity to work with cutting-edge technologies like Kubernetes and OpenStack, while focusing on automation, performance tuning, and system optimization. NVIDIA's SRE culture emphasizes diversity, intellectual curiosity, and problem-solving in a blame-free environment. The role combines hands-on technical work with strategic thinking about system architecture and reliability.

As an SRE, you'll be responsible for the entire lifecycle of services, from design through deployment and maintenance. This includes implementing monitoring solutions, conducting launch reviews, and participating in on-call rotations. The role requires strong coding skills, particularly in languages like Python or Go, combined with deep knowledge of Linux systems and container technologies.

NVIDIA offers competitive compensation with a base salary range of $144,000 - $270,250 (depending on level), plus equity and comprehensive benefits. The company's commitment to fostering a diverse work environment and its position at the forefront of AI and digital twins technology make this an exciting opportunity for engineers looking to work on challenging problems at scale.

Last updated a day ago

Responsibilities For Senior Site Reliability Engineer, Cloud

Design, implement and support operational and reliability aspects of large-scale Kubernetes clusters
Engage in service lifecycle from inception through deployment and refinement
Support services through system design consulting and launch reviews
Maintain services by monitoring availability, latency and system health
Scale systems through automation
Practice sustainable incident response and blameless postmortems
Participate in on-call rotation

Requirements For Senior Site Reliability Engineer, Cloud

Python

Kubernetes

Linux

BS degree in Computer Science or related technical field
5+ years of experience with Infrastructure automation and distributed systems design
Experience with Python, Go, Perl or Ruby
In-depth knowledge of Linux, Networking and Containers

Benefits For Senior Site Reliability Engineer, Cloud

Equity
Comprehensive benefits package
