NVIDIA is seeking a Senior Site Reliability Engineer (SRE) for its Data Science & ML Platform(s) team. This role sits at the intersection of AI, machine learning, and infrastructure reliability, and focuses on building and maintaining large-scale production systems that support advanced data science applications. The position involves designing and maintaining services for real-time data analytics, streaming, data lakes, and ML/AI operations.
The ideal candidate will bring strong expertise in SRE practices, systems engineering, and cloud operations. You'll work with AI and data science technologies, implementing software solutions that keep NVIDIA's ML platforms efficient and highly available. The role offers significant autonomy along with the support and mentorship needed to succeed.
As an SRE at NVIDIA, you'll be responsible for applying software and systems engineering practices to maintain high service availability, using SRE principles to improve production systems, and collaborating with customers to plan and implement system changes. The position requires a deep understanding of distributed systems, strong problem-solving abilities, and excellent communication skills.
NVIDIA's environment promotes a culture of diversity, intellectual curiosity, and continuous improvement. You'll be part of a team that values blameless postmortems, iterative improvement, and calculated risk-taking. The company's position as a leader in AI and accelerated computing means you'll be working on technologies that are shaping the future of computing and artificial intelligence.
This role offers the opportunity to work at a company at the forefront of AI and high-performance computing. NVIDIA's GPU technology serves as the foundation for many developments in artificial intelligence, autonomous vehicles, and scientific discovery.