Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$148,000 - $419,750
Site Reliability
Senior Software Engineer
Remote
5+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA is seeking a Senior Site Reliability Engineer to join its Observability and Telemetry Platform team. Site Reliability Engineering (SRE) at NVIDIA is a critical discipline focused on designing, building, and maintaining large-scale production systems with high efficiency and availability. The role combines software and systems engineering practices and requires expertise across systems, networking, coding, database management, and cloud technologies.

As an SRE, you'll be responsible for ensuring NVIDIA's GPU cloud services maintain maximum reliability and uptime while enabling developers to implement changes effectively. The position emphasizes automation, performance tuning, and system optimization. You'll work with cutting-edge tools and technologies, including Kubernetes, OpenStack, and various observability platforms like Grafana and Prometheus.

The role offers an opportunity to work in a diverse, intellectually stimulating environment that encourages collaboration, innovation, and risk-taking in a blame-free culture. NVIDIA promotes self-direction on meaningful projects while providing support and mentorship for professional growth. The position comes with competitive compensation, including a base salary range of $148,000 - $419,750, plus equity and benefits.

This is an ideal opportunity for experienced engineers passionate about large-scale distributed systems, infrastructure automation, and observability platforms. You'll be part of a team that values systematic problem-solving, strong communication, and a drive for continuous improvement. The role offers both technical challenges and the chance to impact NVIDIA's critical infrastructure supporting AI and accelerated computing initiatives.

Responsibilities For Senior Site Reliability Engineer - Observability and Telemetry Platform

  • Design, implement, and support the operational and reliability aspects of a large-scale observability and telemetry collection platform
  • Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement
  • Support services before they go live through system design consulting and tools development
  • Maintain services by measuring and monitoring availability, latency and system health
  • Scale systems through automation and evolve systems for improved reliability
  • Practice sustainable incident response and blameless postmortems
  • Take part in an on-call rotation to support production systems

Requirements For Senior Site Reliability Engineer - Observability and Telemetry Platform

Python · Go · Linux · Kubernetes
  • BS degree in Computer Science or a related technical field involving coding
  • 5+ years of experience with infrastructure automation and distributed systems design
  • 5+ years of experience delivering foundational infrastructure and observability platforms
  • Experience in Python, Go, Perl, or Ruby
  • In-depth knowledge of Linux, networking, and containers
  • Experience with Grafana, OpenTelemetry, Prometheus, and similar observability tools
  • Strong communication skills and systematic problem-solving approach

Benefits For Senior Site Reliability Engineer - Observability and Telemetry Platform

  • Equity
  • Benefits package
