Senior Site Reliability Engineer, ML Platforms

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twin technology.
Site Reliability
Senior Software Engineer
Hybrid
5,000+ Employees
6+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer, ML Platforms

NVIDIA is seeking a Senior Site Reliability Engineer (SRE) for its Data Science & ML Platform(s) team. This role sits at the intersection of AI, machine learning, and infrastructure reliability, and focuses on building and maintaining large-scale production systems that support advanced data science applications. The position involves designing and maintaining services for real-time data analytics, streaming, data lakes, and ML/AI operations.

The ideal candidate will bring strong expertise in SRE practices, systems engineering, and cloud operations. You'll be working with cutting-edge technologies in AI and data science, implementing software solutions to ensure high efficiency and availability of NVIDIA's ML platforms. The role offers significant autonomy while providing the support and mentorship needed to succeed.

As an SRE at NVIDIA, you'll be responsible for implementing software and systems engineering practices to maintain high service availability, applying SRE principles to improve production systems, and collaborating with customers to plan and implement system changes. The position requires a deep understanding of distributed systems, strong problem-solving abilities, and excellent communication skills.

NVIDIA's environment promotes a culture of diversity, intellectual curiosity, and continuous improvement. You'll be part of a team that values blameless postmortems, iterative improvement, and calculated risk-taking. The company's position as a leader in AI and accelerated computing means you'll be working on technologies that are shaping the future of computing and artificial intelligence.

This role offers the opportunity to work with a company that's at the forefront of technological innovation, particularly in AI and high-performance computing. NVIDIA's GPU technology serves as the foundation for many groundbreaking developments in artificial intelligence, autonomous vehicles, and scientific discovery.

Last updated a day ago

Responsibilities For Senior Site Reliability Engineer, ML Platforms

  • Develop software solutions for large-scale system reliability
  • Analyze system operations and identify improvement opportunities
  • Create automation tools to reduce operational overhead
  • Establish frameworks and processes for operational maturity
  • Define and track reliability metrics
  • Manage capacity and performance across cloud infrastructure
  • Build observability tools
  • Handle incident response and postmortems

Requirements For Senior Site Reliability Engineer, ML Platforms

Python · Go · Kubernetes · Kafka
  • 6+ years of experience in SRE, Cloud platforms, or DevOps
  • Master's or Bachelor's degree in Computer Science, Electrical Engineering, or equivalent
  • Strong understanding of SRE principles
  • Proficiency in incident, change, and problem management
  • Experience with streaming data infrastructure services
  • Expertise in observability platforms
  • Proficiency in Python, Go, Perl, or Ruby
  • Experience with distributed systems in cloud environments
  • Experience in deploying and supporting services and platforms
