Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology.

Santa Clara, CA, USA

$168,000 - $333,500

Site Reliability

Senior Software Engineer

Remote

5,000+ Employees

8+ years of experience

AI · Enterprise SaaS

Job Description

NVIDIA is seeking a Senior Site Reliability Engineer to join its Observability and Telemetry Platform team. This role combines software and systems engineering practices to design, build, and maintain large-scale production systems. SRE at NVIDIA ensures maximum reliability and uptime for GPU cloud services while enabling efficient system changes and optimizations.

The position requires expertise in distributed systems, infrastructure automation, and observability platforms. You'll work with technologies like Kubernetes, OpenStack, Grafana, OpenTelemetry, and Prometheus. The role involves both proactive system design and reactive incident response, with a focus on automation and elimination of manual work.

NVIDIA offers a collaborative environment that values diversity, intellectual curiosity, and problem-solving. The company promotes self-direction while providing support and mentorship for growth. This is an opportunity to work on meaningful projects at scale, contributing to NVIDIA's mission as the world leader in accelerated computing.

The compensation is competitive, with a base salary range of $168,000 - $333,500 USD depending on level and experience, plus equity and comprehensive benefits. The position offers flexibility with both Santa Clara and remote work options. Join NVIDIA to help shape the future of AI and digital twins technology while working with cutting-edge observability and telemetry systems.

Last updated a day ago

Responsibilities For Senior Site Reliability Engineer - Observability and Telemetry Platform

Design, implement and support operational and reliability aspects of a large-scale observability and telemetry collection platform
Engage in service lifecycle from inception through deployment and refinement
Support services through system design consulting, developing tools, capacity management and launch reviews
Maintain services by measuring and monitoring availability, latency and system health
Scale systems through automation and improve reliability
Practice sustainable incident response and blameless postmortems
Participate in on-call rotation

Requirements For Senior Site Reliability Engineer - Observability and Telemetry Platform

Python

Linux

Kubernetes

BS degree in Computer Science or related technical field
5+ years of experience with infrastructure automation and distributed systems design
8+ years of experience delivering foundational infrastructure and observability platforms
Experience in Python, Go, Perl or Ruby
In-depth knowledge of Linux, networking and containers

Benefits For Senior Site Reliability Engineer - Observability and Telemetry Platform

Equity

Medical Insurance

