Site Reliability Engineer - Cloud

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology.

Santa Clara, CA, USA

$136,000 - $212,750

Site Reliability

Senior Software Engineer

In-Person

5,000+ Employees

5+ years of experience

AI · Enterprise SaaS

Description For Site Reliability Engineer - Cloud

NVIDIA, a global leader in AI and accelerated computing, is seeking a Site Reliability Engineer to join its Digital Marketing Organization. This role combines technical expertise with operational excellence, focusing on maintaining and improving AWS infrastructure and ensuring the reliability of NVIDIA's Digital Marketing Services.

The position offers an opportunity to work with cutting-edge technology at a company that has continuously reinvented itself over two decades. From inventing the GPU in 1999 to becoming "the AI computing company," NVIDIA has been at the forefront of technological advancement. The role involves working with AWS Infrastructure, Kubernetes, and various programming languages to ensure service reliability and efficiency.

As an SRE, you'll be responsible for critical tasks including debugging user-reported issues, implementing monitoring solutions, and automating deployment pipelines. The role requires a blend of technical skills in Python, Java, and cloud technologies, along with strong problem-solving abilities and excellent communication skills. You'll be part of a team that values innovation and autonomous thinking, with opportunities to make a significant impact on service reliability and performance.

The position offers a competitive compensation package with a base salary range of $136,000 to $212,750, plus equity and comprehensive benefits. NVIDIA is known for its inclusive work environment and commitment to diversity, making it an ideal place for professionals looking to advance their careers in technology while working on meaningful projects that shape the future of computing.

Responsibilities For Site Reliability Engineer - Cloud

Debug and triage user-reported issues for the Digital Marketing Organization
Onboard new applications and services onto AWS infrastructure
Contribute to the health, performance, and uptime of services running on Linux and Windows
Implement monitors, alerts, and SOPs for early detection and response
Automate and script daily tasks, with a goal of 100% automation

Requirements For Site Reliability Engineer - Cloud

Python · Java · Kubernetes · Linux

MS or BS in Computer Science/Engineering or related field
5+ years of experience supporting technical operations in a production environment
Experience with critical production services in Python/Java on Windows or Linux
Strong knowledge of the Kubernetes platform, deployments, and automation
Experience with incident management and SRE on-call duties
Advanced-level experience with Python scripting
Strong problem-solving and root cause analysis skills

Benefits For Site Reliability Engineer - Cloud

Medical Insurance · Equity

Competitive base salary
Equity grants
Comprehensive benefits package
