Site Reliability Engineering (SRE) at NVIDIA ensures that internal- and external-facing GPU cloud services deliver the reliability and uptime promised to users. The role involves designing, building, and maintaining large-scale production systems with high efficiency and availability. SREs at NVIDIA eliminate manual work through automation, tune performance, and improve the efficiency of production systems. They draw on a broad range of tools and approaches to tackle a wide spectrum of problems, and their practices include limiting time spent on reactive operational work, conducting blameless postmortems, and proactively identifying potential outages.
Key responsibilities include:
The ideal candidate should have:
NVIDIA offers a competitive base salary range of $148,000 - $339,250 USD, along with equity and benefits. The company values diversity and maintains an inclusive work environment.