Principal AI Infrastructure SRE Engineer

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology to transform industries.
$248,000 - $391,000
Site Reliability
Principal Software Engineer
Hybrid
5,000+ Employees
15+ years of experience
AI · Enterprise SaaS

Description For Principal AI Infrastructure SRE Engineer

NVIDIA, a pioneer in accelerated computing and AI technology for 30 years, is seeking a Principal AI Infrastructure SRE Engineer to lead its next-generation IT computing environment. This role combines the challenges of traditional SRE work with cutting-edge AI infrastructure development, making it a unique opportunity in the tech industry.

The position requires a seasoned professional with 15+ years of experience in compute platform engineering, focusing on building and maintaining large-scale infrastructure that supports AI and software development. You'll be at the forefront of transforming NVIDIA's IT Infrastructure platform architecture, particularly for modern AI workloads and semiconductor development.

As a Principal Engineer, you'll lead initiatives that bridge traditional infrastructure with modern AI capabilities, working with technologies like Kubernetes, Go, Python, and various infrastructure automation tools. The role demands both technical expertise and leadership skills, as you'll collaborate with NVIDIA's senior leadership to shape the future of the company's IT products and services.

The compensation is highly competitive, with a base salary range of $248,000 to $391,000, plus equity and comprehensive benefits. NVIDIA's position as a leader in AI and accelerated computing means you'll be working on infrastructure that powers some of the most advanced computing workloads in the industry.

This hybrid role is based in Santa Clara, CA, putting you at the heart of NVIDIA's operations. The company's commitment to innovation, coupled with its supportive and diverse work environment, makes this an exceptional opportunity for a senior infrastructure engineer looking to work at the intersection of traditional IT infrastructure and cutting-edge AI technology.

Last updated a day ago

Responsibilities For Principal AI Infrastructure SRE Engineer

  • Lead IT Infrastructure platform architecture transformation for modern AI workloads
  • Design, build, and operate platforms that transform storage, compute, and middleware
  • Build software and automation for managing infrastructure at scale
  • Develop and maintain tools for data collection, analysis, and visualization
  • Conduct capacity planning and system data analysis
  • Collaborate with NVIDIA leadership to develop IT products and services

Requirements For Principal AI Infrastructure SRE Engineer

  • Bachelor's degree in Engineering, Computer Science, Mathematics, or related field, or equivalent experience
  • 15+ years of proven experience in compute platform engineering with automation focus
  • Experience with AI and software development infrastructure at scale, including Kubernetes
  • Experience integrating application architectures and containerization
  • Proficiency in Go and/or Python
  • Experience with Terraform and configuration management tools
  • Experience managing large environments with bare-metal servers and virtualized environments
  • Deep understanding of infrastructure components (storage, DNS, Active Directory, security tools)

Benefits For Principal AI Infrastructure SRE Engineer

  • Competitive base salary range: $248,000 - $391,000
  • Equity compensation
  • Comprehensive benefits package, including medical insurance

