Senior Site Reliability Engineer, Data Science and ML Platforms

NVIDIA

NVIDIA is the world leader in accelerated computing. NVIDIA pioneered accelerated computing to tackle challenges no one else can solve. Our work in AI and digital twins is transforming the world's largest industries and profoundly impacting society.

Bengaluru, Karnataka, India • Hyderabad, Telangana, India • Pune, Maharashtra, India…

Site Reliability

Senior Software Engineer

Hybrid

5,000+ Employees

5+ years of experience

AI · Enterprise SaaS

Description For Senior Site Reliability Engineer, Data Science and ML Platforms

NVIDIA is seeking a Senior Site Reliability Engineer (SRE) for the Data Science & ML Platform(s) team. The role involves designing, building, and maintaining services that enable real-time data analytics, streaming, data lakes, observability, and ML/AI training and inference. Responsibilities include implementing software and systems engineering practices to ensure high efficiency and availability of the platform, applying SRE principles to improve production systems and optimize service SLOs, and collaborating with customers to plan and implement changes to existing systems.

Key responsibilities:

Develop software solutions for large-scale system reliability
Gain deep understanding of system operations and identify improvement opportunities
Create tools and automation to reduce operational overhead
Establish frameworks and processes to enhance operational maturity
Define reliability metrics and oversee capacity management
Build tools for improved service observability
Practice sustainable incident response and blameless postmortems

Requirements:

5-8 years of experience in SRE, Cloud platforms, or DevOps
Master's or Bachelor's degree in Computer Science, Electrical Engineering, or equivalent
Strong understanding of SRE principles
Proficiency in incident, change, and problem management
Experience with streaming data infrastructure services
Expertise in large-scale observability platforms
Proficiency in programming languages like Python, Go, Perl, or Ruby
Experience with distributed systems in cloud environments

This role offers the opportunity to work on innovative technologies powering the future of AI and data science, as part of a dynamic team that values learning and growth. Join NVIDIA to accelerate the next wave of artificial intelligence and make a significant impact in the field.

Last updated 8 months ago

Responsibilities For Senior Site Reliability Engineer, Data Science and ML Platforms

Develop software solutions for system reliability
Understand system operations and identify improvements
Create tools and automation to reduce operational overhead
Establish frameworks and processes for operational maturity
Define reliability metrics
Oversee capacity and performance management
Build tools for improved service observability
Practice sustainable incident response and blameless postmortems

Requirements For Senior Site Reliability Engineer, Data Science and ML Platforms

Python · Kubernetes · Kafka
