Senior AI Infrastructure Engineer

NVIDIA

World leader in accelerated computing, pioneering AI and digital twins technology to transform industries.

San Francisco, CA, USA

$144,000 - $270,250

Machine Learning

Senior Software Engineer

In-Person

5,000+ Employees

5+ years of experience

AI · Enterprise SaaS

This job posting is no longer active.

Job Description

NVIDIA, the world leader in accelerated computing, is seeking a Senior AI Infrastructure Engineer for its Compute Architecture Group. This role involves managing a diverse cluster of GPU-accelerated systems to support AI and software development. The position requires expertise in system administration, performance analysis, automation, and architecture. You'll be working with cutting-edge technology, enabling groundbreaking experimentation in designing the world's most powerful systems.

The role combines hands-on technical work with strategic planning, requiring you to administer AI clusters, maintain Slurm configurations, and implement DevOps practices using tools like Ansible and GitLab. You'll work directly with developers and hardware architects, making a meaningful impact at a company spearheading the next wave in computing technology.

Ideal candidates should have 5+ years of experience with large-scale clusters, strong technical knowledge of distributed systems, and expertise in Linux administration. The position offers competitive compensation ($144,000-$270,250) plus equity and benefits. This is an excellent opportunity for someone passionate about AI infrastructure who wants to work at the forefront of technology innovation.

NVIDIA's commitment to diversity and inclusion, combined with its position as a leader in AI and digital twins technology, makes this an attractive opportunity for those looking to make a significant impact in the field of accelerated computing. The role offers the chance to work with a technically diverse team of GPU architects, software engineers, and infrastructure experts in a fast-paced, innovation-driven environment.

Last updated 7 months ago

Responsibilities For Senior AI Infrastructure Engineer

Administer NVIDIA's internal AI cluster, composed of Linux systems
Maintain the Slurm resource management system configuration
Automate configuration management and software updates using DevOps tools
Plan and maintain systems supporting the NVIDIA software stack
Debug issues and improve workflows with developers and hardware architects
Communicate with users and management regarding resource planning

Requirements For Senior AI Infrastructure Engineer

Python

Linux

Kubernetes

5+ years of experience deploying and administering large-scale clusters for AI
MS in Computer Science, Computer Engineering, or EECE; or BS with equivalent experience
Deep knowledge of distributed resource scheduling systems (Slurm preferred)
Scripting ability in Bash and Python
Experience with container technologies (Docker, Singularity)
Deep understanding of operating systems, computer networks, and high-performance hardware
Strong collaboration skills with developers and architects
Dedication to providing quality user support

Benefits For Senior AI Infrastructure Engineer

Equity
Comprehensive benefits package