Senior AI Infrastructure Engineer

World leader in accelerated computing, pioneering AI and digital twins technology to transform industries.
$144,000 - $270,250
Machine Learning
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS
This job posting is no longer active.

Job Description

NVIDIA, the world leader in accelerated computing, is seeking a Senior AI Infrastructure Engineer for its Compute Architecture Group. This role involves managing a diverse cluster of GPU-accelerated systems to support AI and software development. The position requires expertise in system administration, performance analysis, automation, and architecture. You'll work with cutting-edge technology, enabling groundbreaking experimentation in designing the world's most powerful systems.

The role combines hands-on technical work with strategic planning, requiring you to administer AI clusters, maintain Slurm configurations, and implement DevOps practices using tools like Ansible and GitLab. You'll work directly with developers and hardware architects, making a meaningful impact at a company spearheading the next wave in computing technology.

Ideal candidates should have 5+ years of experience with large-scale clusters, strong technical knowledge of distributed systems, and expertise in Linux administration. The position offers competitive compensation ($144,000-$270,250) plus equity and benefits. This is an excellent opportunity for someone passionate about AI infrastructure who wants to work at the forefront of technology innovation.

NVIDIA's commitment to diversity and inclusion, combined with its position as a leader in AI and digital twins technology, makes this an attractive opportunity for those looking to make a significant impact in the field of accelerated computing. The role offers the chance to work with a technically diverse team of GPU architects, software engineers, and infrastructure experts in a fast-paced, innovation-driven environment.

Last updated 7 months ago

Responsibilities For Senior AI Infrastructure Engineer

  • Administer NVIDIA's internal AI cluster, composed of Linux systems
  • Maintain the Slurm resource management system configuration
  • Automate configuration management and software updates using DevOps tools
  • Plan and maintain systems supporting the NVIDIA software stack
  • Debug issues and improve workflows with developers and hardware architects
  • Communicate with users and management regarding resource planning

Requirements For Senior AI Infrastructure Engineer

Python
Linux
Kubernetes
  • 5+ years of experience deploying and administering large-scale clusters for AI
  • MS in Computer Science, Computer Engineering, or EECE; or BS with equivalent experience
  • Deep knowledge of distributed resource scheduling systems (Slurm preferred)
  • Scripting ability in bash and Python
  • Experience with container technologies (Docker, Singularity)
  • Deep understanding of operating systems, computer networks, and high-performance hardware
  • Strong collaboration skills with developers and architects
  • Dedication to providing quality user support

Benefits For Senior AI Infrastructure Engineer

  • Equity
  • Comprehensive benefits package