NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer for its DGX Cloud platform. This role sits at the intersection of software engineering and systems operations, focusing on designing and maintaining large-scale production systems with high efficiency and availability.
The position requires expertise in Kubernetes, distributed systems, and automation, with a strong emphasis on eliminating manual work through sophisticated tooling and optimization. As an SRE at NVIDIA, you'll be responsible for ensuring maximum reliability and uptime of GPU cloud services while enabling developers to make system changes safely and efficiently.
The role offers a unique opportunity to work with cutting-edge technology in AI and cloud computing, while being part of a diverse and intellectually curious team that values problem-solving and openness. You'll be involved in the entire service lifecycle, from design consulting to production support, and will have the chance to work on meaningful projects with significant impact.
NVIDIA offers competitive compensation, including a base salary range of $168,000 to $333,500 (depending on level), equity, and comprehensive benefits. The company promotes a blame-free environment that encourages self-direction and provides support and mentorship for professional growth.
The ideal candidate will bring 10+ years of experience, strong coding skills in languages like Python or Go, and deep knowledge of Linux and containers. You'll be joining a company at the forefront of AI and digital twin technology, transforming major industries and making a profound impact on society.
Working at NVIDIA means being part of a team that tackles challenges no one else can solve, with the opportunity to contribute to groundbreaking technological advancements. The role offers flexibility with remote work options and the chance to collaborate with some of the most forward-thinking professionals in the technology industry.