Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.

San Francisco, CA, USA

$148,000 - $419,750

Site Reliability

Senior Software Engineer

Remote

5+ years of experience

AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer to join its Observability and Telemetry Platform team. This role combines software and systems engineering practices to design, build, and maintain large-scale production systems. As an SRE at NVIDIA, you'll work to keep GPU cloud services as reliable as possible while enabling developers to ship changes efficiently. The position requires expertise in systems, networking, coding, database management, and cloud technologies like Kubernetes and OpenStack.

The role focuses on eliminating manual work through automation, performance tuning, and system optimization. You'll be part of a diverse, intellectually curious team that values problem-solving and openness. The position offers the opportunity to work on meaningful projects with support and mentorship for continuous learning and growth.

Key responsibilities include designing and implementing large-scale observability platforms, managing the complete service lifecycle, and maintaining system health through monitoring and automation. The ideal candidate brings 5+ years of experience in infrastructure automation and observability platforms, strong programming skills in languages like Python or Go, and deep knowledge of Linux and containers.

NVIDIA offers a competitive compensation package with a base salary range of $148,000 - $419,750 USD, plus equity and benefits. The company is committed to fostering a diverse work environment and provides equal opportunities to all candidates. This role offers the flexibility of remote work while being part of a team that's transforming industries through AI and digital twin technology.

Last updated 7 months ago

Responsibilities For Senior Site Reliability Engineer - Observability and Telemetry Platform

Design, implement and support the operational and reliability aspects of a large-scale Observability & Telemetry collection platform
Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement
Support services before they go live through system design consulting and tools development
Maintain services by measuring and monitoring availability, latency and system health
Scale systems through automation and evolve systems for improved reliability
Practice sustainable incident response and blameless postmortems
Be part of an on-call rotation to support production systems

Requirements For Senior Site Reliability Engineer - Observability and Telemetry Platform

Python

Linux

Kubernetes

BS degree in Computer Science or related technical field
5+ years of experience with infrastructure automation and distributed systems design
5+ years of experience delivering foundational infrastructure and observability platforms
Experience in Python, Go, Perl or Ruby
In-depth knowledge of Linux, networking and containers

Benefits For Senior Site Reliability Engineer - Observability and Telemetry Platform

Equity
Benefits package