NVIDIA is seeking a Senior Site Reliability Engineer to join its Observability and Telemetry Platform team. This role combines software and systems engineering practices to design, build, and maintain large-scale production systems. SRE at NVIDIA ensures maximum reliability and uptime for GPU cloud services while enabling efficient system changes and optimizations.
The position requires expertise in distributed systems, infrastructure automation, and observability platforms. You'll work with technologies such as Kubernetes, OpenStack, Grafana, OpenTelemetry, and Prometheus. The role involves both proactive system design and reactive incident response, with a focus on automating processes and eliminating manual work.
NVIDIA offers a collaborative environment that values diversity, intellectual curiosity, and problem-solving. The company promotes self-direction while providing support and mentorship for growth. This is an opportunity to work on meaningful projects at scale, contributing to NVIDIA's mission as the world leader in accelerated computing.
Compensation is competitive, with a base salary range of $168,000 - $333,500 USD depending on level and experience, plus equity and comprehensive benefits. The position can be based in Santa Clara or worked remotely. Join NVIDIA to help shape the future of AI and digital twin technology while working with cutting-edge observability and telemetry systems.