NVIDIA is seeking a Senior Site Reliability Engineer (SRE) for its Data Science & ML Platform(s) team. The role sits at the intersection of AI, machine learning, and infrastructure reliability, and focuses on building and maintaining large-scale production systems that support advanced data science and ML applications. It offers the opportunity to work on this infrastructure at a company that leads in AI and accelerated computing.
The role involves designing, implementing, and maintaining services that enable real-time data analytics, streaming, data lakes, observability, and ML/AI training and inference. You'll be responsible for keeping the platform efficient and highly available while applying SRE principles to improve production systems and optimize service SLOs. The position requires a strong background in systems engineering, cloud operations, and modern DevOps practices.
As a Senior SRE at NVIDIA, you'll work with a team that values learning, growth, and innovation. The company culture promotes blameless postmortems, iterative improvement, and calculated risk-taking. You'll have the autonomy to work on meaningful projects while receiving the support and mentorship needed to succeed.
The role offers competitive compensation, including a base salary range of $224,000 - $425,500 USD, plus equity and comprehensive benefits. NVIDIA's commitment to diversity and inclusion makes it an excellent workplace for professionals from all backgrounds. The position can be based in Santa Clara, CA; Austin, TX; or a remote location, offering flexibility in where you work.
This is an excellent opportunity for an experienced SRE who wants to work at the forefront of AI and machine learning infrastructure and have a significant impact on the systems that power it. The role combines technical challenges with the chance to build innovative solutions in a collaborative, forward-thinking environment.