Senior Site Reliability Engineer, HPC and LSF

NVIDIA

NVIDIA is the world leader in accelerated computing, AI, and digital twins technology.

Santa Clara, CA, USA • Westford, MA 01886, USA • Austin, TX, USA

$184,000 - $287,500

Site Reliability

Senior Software Engineer

Hybrid

5,000+ Employees

10+ years of experience

AI · Enterprise SaaS

Description For Senior Site Reliability Engineer, HPC and LSF

NVIDIA, the global leader in AI and accelerated computing, is seeking a Senior Site Reliability Engineer to join its Hardware Infrastructure Farm team. This role combines the challenges of managing large-scale HPC environments with the excitement of contributing to cutting-edge chip development.

The position offers a unique opportunity to work at the intersection of infrastructure management and silicon development, where your work directly impacts the creation of next-generation chips. As an SRE, you'll be responsible for designing and implementing groundbreaking compute clusters that power all silicon development across NVIDIA. The role demands expertise in high-reliability system operations, efficiency optimization, and automation implementation to enhance engineer productivity.

NVIDIA's culture emphasizes diversity, intellectual curiosity, and problem-solving in a blame-free environment. The organization brings together professionals from various backgrounds and perspectives, encouraging collaboration and innovative thinking. The role offers significant autonomy while providing the necessary support and mentorship for continuous learning and growth.

Key responsibilities include managing workload schedulers in HPC environments, developing automation solutions, troubleshooting complex issues, and collaborating with domain experts. The ideal candidate will bring extensive experience with job scheduler administration, Linux systems, container technologies, and scripting languages. The position requires both technical expertise and strong communication skills to work effectively across diverse teams.

The compensation package includes a competitive base salary range of $184,000 - $287,500 USD, plus equity and benefits. This role offers a hybrid work arrangement and can be based in Santa Clara, Westford, or Austin. Join NVIDIA in its mission to amplify human imagination and intelligence through technological innovation.

Last updated 6 days ago

Responsibilities For Senior Site Reliability Engineer, HPC and LSF

Manage and support workload and resource schedulers in a large-scale HPC environment
Develop automation scripts for deployment, configuration management, and operational monitoring
Develop solutions for complex computing resource management requirements
Extract and leverage grid performance metrics for troubleshooting and optimization
Perform comprehensive troubleshooting from bare metal to application level
Develop, define and document standard methodologies
Collaborate with domain experts to improve chip development process
Contribute to quality and improve time to market for next-generation chips

Requirements For Senior Site Reliability Engineer, HPC and LSF

Linux

Python

Extensive knowledge of job scheduler administration (IBM Spectrum LSF or SLURM)
Proficient in administering CentOS/RHEL Linux distributions
In-depth understanding of container technologies like Docker
Proficiency in UNIX scripting languages and Python
Excellent problem-solving skills
Excellent communication and teamwork skills
10+ years of experience in large, distributed Linux environments
BS in Computer Science or a similar degree, or equivalent experience

Benefits For Senior Site Reliability Engineer, HPC and LSF

Equity
