Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.
$148,000 - $276,000
Cloud
Senior Software Engineer
Remote
5+ years of experience
AI · Enterprise SaaS
This job posting may no longer be active.

Description For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

NVIDIA, the world leader in accelerated computing, is seeking a Senior Software Engineer to join its DGX Cloud team, focusing on reliability and operational excellence. This role is critical to ensuring maximum reliability and uptime for both internal and external GPU cloud services.

The position combines systems engineering with software development, focusing on building tooling, reporting, and automation to enable operational excellence across a highly dynamic organization. You'll be working with cloud infrastructure, developing essential data pipelines, and streamlining incident management processes.

As a Senior Software Engineer in this role, you'll be at the forefront of maintaining and improving NVIDIA's cloud services, working with cutting-edge technologies including Python, Go, TypeScript, and Kubernetes. The role offers an exciting opportunity to work with distributed systems at scale while contributing to NVIDIA's groundbreaking developments in Artificial Intelligence and High-Performance Computing.

The position offers competitive compensation with a base salary range of $148,000 to $276,000, plus equity and benefits. NVIDIA provides an environment that values creativity, autonomy, and technical innovation. You'll be joining a company that's leading the way in AI and digital twins, transforming the world's largest industries and profoundly impacting society.

This role is perfect for someone who combines strong technical skills with excellent problem-solving abilities and communication skills. You'll have the opportunity to work on challenging problems, collaborate with talented peers, and make a significant impact on the reliability and efficiency of NVIDIA's cloud infrastructure. The position offers the flexibility of remote work while being part of a team that's pushing the boundaries of what's possible in cloud computing and AI.

Last updated 4 months ago

Responsibilities For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

  • Design, build, deploy, and run internal tooling built on top of cloud infrastructure
  • Design, implement, ship, and maintain essential data pipelines for executive leadership
  • Integrate tooling with internal and customer workflows
  • Reduce the toil of running incidents, writing postmortems, and being on call
  • Evangelize sustainable blameless incident prevention and incident response
  • Consult with peer teams on operations best practices

Requirements For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

Python
Go
TypeScript
Java
Kubernetes
Linux
  • BS degree in Computer Science or related technical field
  • 5+ years of experience
  • Experience with infrastructure automation and distributed systems design
  • Experience in Python, Go, TypeScript, C/C++, or Java
  • In-depth knowledge of Linux, Networking, Storage, and Containers
  • Track record of project initiation and collaboration

Benefits For Senior Software Engineer, Reliability and Operational Excellence - DGX Cloud

  • Equity
  • Benefits package
