Senior Site Reliability Engineer, ML Platforms

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology.
$224,000 - $425,500
Site Reliability
Senior Software Engineer
Remote
5,000+ Employees
10+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer, ML Platforms

NVIDIA is seeking a Senior Site Reliability Engineer (SRE) for its Data Science & ML Platform(s) team. This role sits at the intersection of AI, machine learning, and infrastructure reliability, and focuses on building and maintaining large-scale production systems that support advanced data science and ML applications. It is an opportunity to work with cutting-edge technology at a company that leads in AI and accelerated computing.

The role involves designing, implementing, and maintaining services that enable real-time data analytics, streaming, data lakes, observability, and ML/AI training and inferencing. You'll be responsible for ensuring high efficiency and availability of the platform while applying SRE principles to improve production systems and optimize service SLOs. The position requires a strong background in systems engineering, cloud operations, and modern DevOps practices.

As a Senior SRE at NVIDIA, you'll work with a team that values learning, growth, and innovation. The company culture promotes blameless postmortems, iterative improvement, and calculated risk-taking. You'll have the autonomy to work on meaningful projects while receiving the support and mentorship needed to succeed.

The role offers competitive compensation, including a base salary range of $224,000 - $425,500 USD, plus equity and comprehensive benefits. NVIDIA's commitment to diversity and inclusion makes it an excellent workplace for professionals from all backgrounds. The position can be based in Santa Clara, CA; Austin, TX; or remotely.

This is an excellent opportunity for an experienced SRE professional who wants to work at the forefront of AI and machine learning infrastructure, making a significant impact on systems that power the future of technology. The role combines technical challenges with the opportunity to work on innovative solutions in a collaborative, forward-thinking environment.

Responsibilities For Senior Site Reliability Engineer, ML Platforms

  • Develop software solutions to ensure reliability and operability of large-scale systems
  • Analyze system operations, scalability, and failures to identify improvements
  • Create tools and automation to reduce operational overhead
  • Establish frameworks and processes to enhance operational maturity
  • Define reliability metrics to track and improve system performance
  • Manage capacity and performance across public and private clouds
  • Build tools for service observability
  • Practice sustainable incident response and blameless postmortems

Requirements For Senior Site Reliability Engineer, ML Platforms

Skills: Python, Go, Kubernetes, Kafka
  • 10+ years of experience in SRE, Cloud platforms, or DevOps
  • Master's or Bachelor's degree in Computer Science, Electrical Engineering, or equivalent
  • Strong understanding of SRE principles, including error budgets, SLOs, and SLAs
  • Proficiency in incident, change, and problem management processes
  • Experience with streaming data infrastructure services (Kafka, Spark)
  • Expertise in observability platforms (ELK, Prometheus)
  • Proficiency in Python, Go, Perl, or Ruby
  • Experience with scaling distributed systems in cloud environments
  • Experience in deploying and supporting services and platforms

Benefits For Senior Site Reliability Engineer, ML Platforms

  • Medical insurance
  • Competitive base salary range of $224,000 - $425,500 USD
  • Equity compensation
  • Comprehensive benefits package
