Distinguished Engineer, Observability, Monitoring, and Remediation

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins that transform industries.
$308,000 - $471,500
Cloud
Principal Software Engineer
In-Person
5,000+ Employees
18+ years of experience
AI · Enterprise SaaS

Job Description

NVIDIA, the industry leader in accelerated computing, is seeking a Distinguished Engineer to lead its observability, monitoring, and remediation initiatives for DGX Cloud. This senior-level position combines deep technical expertise with strategic leadership, focusing on building robust infrastructure for cloud and enterprise environments.

The role requires an exceptional engineer with 18+ years of experience to develop and implement comprehensive strategies for observability across all infrastructure layers. You'll be responsible for defining technical approaches for data collection, analysis, and automated remediation systems that maintain the highest standards of availability and performance.

As a Distinguished Engineer, you'll work directly with NVIDIA leadership and collaborate across multiple functions to shape the future of accelerated computing infrastructure. The position offers an opportunity to impact millions of lives through groundbreaking developments in AI, High-Performance Computing, and Visualization.

The ideal candidate brings extensive experience in cloud-native architectures, demonstrated success in delivering sophisticated technical solutions, and strong leadership capabilities. You'll be working with cutting-edge technologies, including AI/ML platforms, containerization, and cloud infrastructure, while helping to set industry standards for operational excellence.

NVIDIA offers a competitive compensation package with a base salary range of $308,000 - $471,500 USD, plus equity and comprehensive benefits. This is a unique opportunity to join a forward-thinking team that's solving some of the world's biggest challenges through innovative technology.

Last updated 15 days ago

Responsibilities For Distinguished Engineer, Observability, Monitoring, and Remediation

  • Define and drive technical implementation for DGX Cloud offerings in observability, monitoring, and remediation
  • Collaborate across domains and disciplines
  • Guide technical delivery into DGX Cloud systems across all delivery environments
  • Collaborate with customers, infrastructure providers, and strategic partners
  • Ensure operational excellence across all environments
  • Lead the technical aspects of planning and continuously evolving a large technical scope

Requirements For Distinguished Engineer, Observability, Monitoring, and Remediation

Kubernetes
  • 18+ years in technical roles focusing on observability and monitoring for cloud infrastructure
  • 5+ years of lead experience
  • BS/MS or higher in systems/software engineering or related fields
  • Technical proficiency in multi-tenant data center and cloud-native architectures
  • Experience with bare metal, virtualization, containerization, and higher-level abstractions
  • Proven success delivering high-impact, technically sophisticated solutions
  • Strong technical leadership abilities
  • Strong collaboration and influence skills

Benefits For Distinguished Engineer, Observability, Monitoring, and Remediation

  • Competitive base salary
  • Equity compensation
  • Company benefits

Related Jobs

Principal Cloud Software Engineer

Principal Cloud Software Engineer position at NVIDIA, leading GPU cloud services development with 15+ years experience required, offering $272K-$425.5K salary plus benefits.

Principal Cloud Software Engineer

Principal Cloud Software Engineer position at NVIDIA focusing on GPU Cloud computing and AI, offering competitive salary and benefits.

Principal Engineer - Enterprise Applications

Lead enterprise IT systems development at NVIDIA, designing and implementing scalable solutions while driving innovation in cloud computing and AI/ML technologies.