NVIDIA is seeking a Senior Site Reliability Engineer to join its Observability and Telemetry Platform team. Site Reliability Engineering (SRE) at NVIDIA is a critical discipline focused on designing, building, and maintaining large-scale production systems with high efficiency and availability. The role combines software and systems engineering practices and requires expertise across systems, networking, coding, database management, and cloud technologies.
As an SRE, you'll be responsible for keeping NVIDIA's GPU cloud services reliable and available while enabling developers to ship changes effectively. The position emphasizes automation, performance tuning, and system optimization. You'll work with cutting-edge tools and technologies, including Kubernetes, OpenStack, and observability platforms such as Grafana and Prometheus.
The role offers an opportunity to work in a diverse, intellectually stimulating environment that encourages collaboration, innovation, and risk-taking in a blame-free culture. NVIDIA promotes self-direction on meaningful projects while providing support and mentorship for professional growth. The position comes with competitive compensation, including a base salary range of $148,000 - $419,750, plus equity and benefits.
This is an ideal opportunity for experienced engineers passionate about large-scale distributed systems, infrastructure automation, and observability platforms. You'll be part of a team that values systematic problem-solving, strong communication, and a drive for continuous improvement. The role offers both technical challenges and the chance to impact NVIDIA's critical infrastructure supporting AI and accelerated computing initiatives.