Site Reliability Engineer - Cloud

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twin technology.
$136,000 - $212,750
Site Reliability
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Site Reliability Engineer - Cloud

NVIDIA, a global leader in AI and accelerated computing solutions, is seeking a Site Reliability Engineer to join its Digital Marketing Organization. This role combines technical expertise with operational excellence, focusing on maintaining and improving AWS infrastructure and ensuring the reliability of NVIDIA's Digital Marketing Services.

The position offers an opportunity to work with cutting-edge technology at a company that has continuously reinvented itself over two decades. From inventing the GPU in 1999 to becoming "the AI computing company," NVIDIA has been at the forefront of technological advancement. The role involves working with AWS Infrastructure, Kubernetes, and various programming languages to ensure service reliability and efficiency.

As an SRE, you'll be responsible for critical tasks including debugging user-reported issues, implementing monitoring solutions, and automating deployment pipelines. The role requires a blend of technical skills in Python, Java, and cloud technologies, along with strong problem-solving abilities and excellent communication skills. You'll be part of a team that values innovation and autonomous thinking, with opportunities to make significant impacts on service reliability and performance.

The position offers a competitive compensation package with a base salary range of $136,000 to $212,750, plus equity and comprehensive benefits. NVIDIA is known for its inclusive work environment and commitment to diversity, making it an ideal place for professionals looking to advance their careers in technology while working on meaningful projects that shape the future of computing.

Last updated 2 days ago

Responsibilities For Site Reliability Engineer - Cloud

  • Debug and triage user-reported issues in the Digital Marketing Organization
  • On-board new applications and services on AWS Infrastructure
  • Contribute to the health, performance, and uptime of services running on Linux and Windows
  • Implement monitors, alerts and SOPs for early detection and response
  • Script and automate daily tasks, with a goal of 100% automation

Requirements For Site Reliability Engineer - Cloud

Python · Java · Kubernetes · Linux
  • MS or BS in Computer Science/Engineering or related field
  • 5+ years of experience supporting technical operations in a production environment
  • Experience with critical production services in Python/Java on Windows or Linux
  • Strong knowledge of the Kubernetes platform, deployments, and automation
  • Experience with incident management and SRE on-call duties
  • Advanced-level experience with Python scripting
  • Strong problem-solving and root cause analysis skills

Benefits For Site Reliability Engineer - Cloud

Medical Insurance · Equity
  • Competitive base salary
  • Equity grants
  • Comprehensive benefits package
