NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer to join its Cloud team. This role sits at the intersection of software and systems engineering, focusing on designing and maintaining large-scale production systems with high efficiency and availability. The SRE team at NVIDIA keeps the company's GPU cloud services highly reliable while enabling continuous improvement and innovation.
The position offers an opportunity to work with cutting-edge technologies like Kubernetes and OpenStack, while focusing on automation, performance tuning, and system optimization. NVIDIA's SRE culture emphasizes diversity, intellectual curiosity, and problem-solving in a blame-free environment. The role combines hands-on technical work with strategic thinking about system architecture and reliability.
As an SRE, you'll be responsible for the entire lifecycle of services, from design through deployment and maintenance. This includes implementing monitoring solutions, conducting launch reviews, and participating in on-call rotations. The role requires strong coding skills, particularly in languages like Python or Go, combined with deep knowledge of Linux systems and container technologies.
NVIDIA offers competitive compensation with a base salary range of $144,000 - $270,250 (depending on level), plus equity and comprehensive benefits. The company's commitment to fostering a diverse work environment and its position at the forefront of AI and digital twin technology make this an exciting opportunity for engineers looking to work on challenging problems at scale.