Senior Site Reliability Engineer, Data Science and ML Platforms

NVIDIA is the world leader in accelerated computing, a field it pioneered to tackle challenges no one else can solve. Our work in AI and digital twins is transforming the world's largest industries and profoundly impacting society.
Site Reliability
Senior Software Engineer
Hybrid
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer, Data Science and ML Platforms

NVIDIA is seeking a Senior Site Reliability Engineer (SRE) for the Data Science and ML Platforms team. The role involves designing, building, and maintaining services that enable real-time data analytics, streaming, data lakes, observability, and ML/AI training and inference. Responsibilities include applying software and systems engineering practices to keep the platform efficient and highly available, using SRE principles to improve production systems and optimize service SLOs, and collaborating with customers to plan and implement changes to existing systems.

Key responsibilities:

  • Develop software solutions for large-scale system reliability
  • Gain deep understanding of system operations and identify improvement opportunities
  • Create tools and automation to reduce operational overhead
  • Establish frameworks and processes to enhance operational maturity
  • Define reliability metrics and oversee capacity and performance management
  • Build tools for improved service observability
  • Practice sustainable incident response and blameless postmortems

Requirements:

  • 5-8 years of experience in SRE, Cloud platforms, or DevOps
  • Master's or Bachelor's degree in Computer Science, Electrical Engineering, or equivalent
  • Strong understanding of SRE principles
  • Proficiency in incident, change, and problem management
  • Experience with streaming data infrastructure services
  • Expertise in large-scale observability platforms
  • Proficiency in programming languages like Python, Go, Perl, or Ruby
  • Experience with distributed systems in cloud environments

This role offers the opportunity to work on innovative technologies powering the future of AI and data science, as part of a dynamic team that values learning and growth. Join NVIDIA to accelerate the next wave of artificial intelligence and make a significant impact in the field.

Last updated 2 months ago

Requirements For Senior Site Reliability Engineer, Data Science and ML Platforms

Python · Go · Kubernetes · Kafka
