
Senior Site Reliability Engineer - Observability and Telemetry Platform

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology.
$168,000 - $333,500
Site Reliability
Senior Software Engineer
Remote
5,000+ Employees
8+ years of experience
AI · Enterprise SaaS

Job Description

NVIDIA is seeking a Senior Site Reliability Engineer to join its Observability and Telemetry Platform team. This role combines software and systems engineering practices to design, build, and maintain large-scale production systems. SRE at NVIDIA ensures maximum reliability and uptime for GPU cloud services while enabling efficient system changes and optimizations.

The position requires expertise in distributed systems, infrastructure automation, and observability platforms. You'll work with technologies like Kubernetes, OpenStack, Grafana, OpenTelemetry, and Prometheus. The role involves both proactive system design and reactive incident response, with a focus on automation and elimination of manual work.

NVIDIA offers a collaborative environment that values diversity, intellectual curiosity, and problem-solving. The company promotes self-direction while providing support and mentorship for growth. This is an opportunity to work on meaningful projects at scale, contributing to NVIDIA's mission as the world leader in accelerated computing.

The compensation is competitive, with a base salary range of $168,000 - $333,500 USD depending on level and experience, plus equity and comprehensive benefits. The position offers flexibility with both Santa Clara and remote work options. Join NVIDIA to help shape the future of AI and digital twins technology while working with cutting-edge observability and telemetry systems.

Last updated a day ago

Responsibilities For Senior Site Reliability Engineer - Observability and Telemetry Platform

  • Design, implement and support the operational and reliability aspects of a large-scale Observability & Telemetry collection platform
  • Engage in the service lifecycle from inception through deployment and refinement
  • Support services through system design consulting, developing tools, capacity management and launch reviews
  • Maintain services by measuring and monitoring availability, latency and system health
  • Scale systems through automation and improve reliability
  • Practice sustainable incident response and blameless postmortems
  • Participate in on-call rotation

Requirements For Senior Site Reliability Engineer - Observability and Telemetry Platform

Python
Go
Linux
Kubernetes
  • BS degree in Computer Science or related technical field
  • 5+ years of experience with infrastructure automation and distributed systems design
  • 8+ years of experience delivering foundational infrastructure and observability platforms
  • Experience in Python, Go, Perl or Ruby
  • In-depth knowledge of Linux, networking and containers

Benefits For Senior Site Reliability Engineer - Observability and Telemetry Platform

  • Equity
  • Medical Insurance
