Senior Site Reliability Engineer, ML Platforms

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology.

Bengaluru, Karnataka, India • Hyderabad, Telangana, India • Pune, Maharashtra, India…

Site Reliability

Senior Software Engineer

Hybrid

5,000+ Employees

6+ years of experience

AI · Enterprise SaaS

Description For Senior Site Reliability Engineer, ML Platforms

NVIDIA is seeking a Senior Site Reliability Engineer (SRE) for its Data Science & ML Platform(s) team. The role sits at the intersection of AI, machine learning, and infrastructure reliability, and focuses on building and maintaining large-scale production systems that support advanced data science applications. It involves designing and maintaining services for real-time data analytics, streaming, data lakes, and ML/AI operations.

The ideal candidate will bring strong expertise in SRE practices, systems engineering, and cloud operations. You'll be working with cutting-edge technologies in AI and data science, implementing software solutions to ensure high efficiency and availability of NVIDIA's ML platforms. The role offers significant autonomy while providing the support and mentorship needed to succeed.

As an SRE at NVIDIA, you'll be responsible for implementing software and systems engineering practices to maintain high service availability, applying SRE principles to improve production systems, and collaborating with customers to plan and implement system changes. The position requires deep understanding of distributed systems, strong problem-solving abilities, and excellent communication skills.

NVIDIA's environment promotes a culture of diversity, intellectual curiosity, and continuous improvement. You'll be part of a team that values blameless postmortems, iterative improvement, and calculated risk-taking. The company's position as a leader in AI and accelerated computing means you'll be working on technologies that are shaping the future of computing and artificial intelligence.

This role offers the opportunity to work with a company that's at the forefront of technological innovation, particularly in AI and high-performance computing. NVIDIA's GPU technology serves as the foundation for many groundbreaking developments in artificial intelligence, autonomous vehicles, and scientific discovery.

Last updated a day ago

Responsibilities For Senior Site Reliability Engineer, ML Platforms

Develop software solutions for large-scale system reliability
Analyze system operations and identify improvement opportunities
Create automation tools to reduce operational overhead
Establish frameworks and processes for operational maturity
Define and track reliability metrics
Manage capacity and performance across cloud infrastructure
Build observability tools
Handle incident response and postmortems

Requirements For Senior Site Reliability Engineer, ML Platforms

Python · Kubernetes · Kafka

6+ years of experience in SRE, Cloud platforms, or DevOps
Master's or Bachelor's degree in Computer Science, Electrical Engineering, or equivalent
Strong understanding of SRE principles
Proficiency in incident, change, and problem management
Experience with streaming data infrastructure services
Expertise in observability platforms
Proficiency in Python, Go, Perl, or Ruby
Experience with distributed systems in cloud environments
Experience in deploying and supporting services and platforms
