Senior Production SRE Engineer - Storage

NVIDIA is the world leader in accelerated computing, tackling challenges no one else can solve.
$148,000 - $339,250
Site Reliability
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Senior Production SRE Engineer - Storage

Site Reliability Engineering (SRE) at NVIDIA ensures that internal and external-facing GPU cloud services deliver the reliability and uptime promised to users. The role involves designing, building, and maintaining large-scale production systems with high efficiency and availability. SREs at NVIDIA eliminate manual work through automation, tune performance, and improve the efficiency of production systems. They use a wide range of tools and approaches to solve a broad spectrum of problems, including limiting time spent on reactive operational work, conducting blameless postmortems, and proactively identifying potential outages.

Key responsibilities include:

  • Designing and implementing large-scale storage clusters
  • Working with AI/ML workloads to capture and correlate behavior in large clusters
  • Improving the lifecycle of services from inception to refinement
  • Supporting services before and after they go live
  • Maintaining system health through monitoring and machine learning models
  • Scaling systems sustainably through AI/ML and automation
  • Practicing sustainable incident response
  • Participating in on-call rotation

The ideal candidate should have:

  • BS degree in Computer Science or related field
  • 5+ years of practical experience
  • Experience with algorithms, data structures, and large-scale Linux systems
  • Proficiency in languages like C/C++, Java, Python, Go, Perl, or Ruby
  • Knowledge of infrastructure configuration management tools
  • Experience with observability and tracing tools

NVIDIA offers a competitive base salary range of $148,000 - $339,250 USD, along with equity and benefits. The company values diversity and maintains an inclusive work environment.

Last updated 4 days ago

Responsibilities For Senior Production SRE Engineer - Storage

  • Design and implement large-scale storage clusters
  • Work with AI/ML workloads in large clusters
  • Improve lifecycle of services from inception to refinement
  • Support services before and after going live
  • Maintain system health through monitoring and machine learning
  • Scale systems through AI/ML and automation
  • Practice sustainable incident response
  • Participate in on-call rotation

Requirements For Senior Production SRE Engineer - Storage

Linux
Python
Java
Go
  • BS degree in Computer Science or related technical field
  • At least 5 years of practical experience
  • Experience with algorithms, data structures, complexity analysis, software design
  • Experience maintaining large-scale Linux-based systems
  • Experience in C/C++, Java, Python, Go, Perl or Ruby
  • Knowledge of infrastructure configuration management tools
  • Experience with observability and tracing tools

Benefits For Senior Production SRE Engineer - Storage

  • Equity
  • Medical Insurance

