NVIDIA, a pioneer in accelerated computing, is seeking a Senior Site Reliability Engineer for its GPU Cloud team. This role is part of a dynamic SRE team that manages cloud and on-prem infrastructure for High-Performance & Distributed Computing. The position involves working with cutting-edge technology in AI, LLMs, and cloud computing in support of NVIDIA's GPU cloud platform, which serves both internal R&D teams and external AI/ML customers.
The role requires managing infrastructure that spans thousands of GPU nodes, with a focus on automation, monitoring, and analytics solutions. You'll be responsible for the complete lifecycle of tools and services, from design to deployment, while ensuring high reliability and availability of the platform. This is an opportunity to work with state-of-the-art technology while solving complex infrastructure challenges.
The ideal candidate will bring 8+ years of experience in large-scale distributed systems, strong programming skills, and expertise in cloud infrastructure and Kubernetes. You'll be joining a company at the forefront of AI and accelerated computing innovation, working on technology that's transforming major industries.
NVIDIA offers a collaborative environment where creativity and autonomy are valued. The company is committed to diversity and inclusion, providing equal opportunities to all qualified candidates. This role offers the chance to work with some of the most forward-thinking professionals in the technology industry while contributing to groundbreaking innovations in AI, cloud computing, and high-performance computing.