Senior Site Reliability Engineer, ML Platforms

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology.

Santa Clara, CA, USA • Austin, TX, USA

$224,000 - $425,500

Site Reliability

Senior Software Engineer

Remote

5,000+ Employees

10+ years of experience

AI · Enterprise SaaS

Description For Senior Site Reliability Engineer, ML Platforms

NVIDIA is seeking a Senior Site Reliability Engineer (SRE) for its Data Science & ML Platform(s) team. This role sits at the intersection of AI, machine learning, and infrastructure reliability, and focuses on building and maintaining large-scale production systems that support advanced data science and ML applications. It is an opportunity to work with cutting-edge technology at a company that leads in AI and accelerated computing.

The role involves designing, implementing, and maintaining services that enable real-time data analytics, streaming, data lakes, observability, and ML/AI training and inference. You'll be responsible for keeping the platform efficient and highly available while applying SRE principles to improve production systems and optimize service SLOs. The position requires a strong background in systems engineering, cloud operations, and modern DevOps practices.

As a Senior SRE at NVIDIA, you'll work with a team that values learning, growth, and innovation. The company culture promotes blameless postmortems, iterative improvement, and calculated risk-taking. You'll have the autonomy to work on meaningful projects while receiving the support and mentorship needed to succeed.

The role offers competitive compensation, including a base salary range of $224,000 - $425,500 USD, plus equity and comprehensive benefits. NVIDIA's commitment to diversity and inclusion makes it an excellent workplace for professionals from all backgrounds. The position can be based in Santa Clara, CA; Austin, TX; or remotely, offering flexibility in work location.

This is an excellent opportunity for an experienced SRE professional who wants to work at the forefront of AI and machine learning infrastructure, making a significant impact on systems that power the future of technology. The role combines technical challenges with the opportunity to work on innovative solutions in a collaborative, forward-thinking environment.

Responsibilities For Senior Site Reliability Engineer, ML Platforms

Develop software solutions to ensure reliability and operability of large-scale systems
Analyze system operations, scalability, and failures to identify improvements
Create tools and automation to reduce operational overhead
Establish frameworks and processes to enhance operational maturity
Define reliability metrics to track and improve system performance
Manage capacity and performance across public and private clouds
Build tools for service observability
Practice sustainable incident response and blameless postmortems

Requirements For Senior Site Reliability Engineer, ML Platforms

Python · Kubernetes · Kafka

10+ years of experience in SRE, cloud platforms, or DevOps
Master's or Bachelor's degree in Computer Science, Electrical Engineering, or equivalent
Strong understanding of SRE principles, including error budgets, SLOs, and SLAs
Proficiency in incident, change, and problem management processes
Experience with streaming data infrastructure services (Kafka, Spark)
Expertise in observability platforms (ELK, Prometheus)
Proficiency in Python, Go, Perl, or Ruby
Experience with scaling distributed systems in cloud environments
Experience in deploying and supporting services and platforms

Benefits For Senior Site Reliability Engineer, ML Platforms

Equity · Medical Insurance

Competitive base salary range of $224,000 - $425,500 USD
Equity compensation
Comprehensive benefits package
