
Senior Site Reliability Engineer, HPC and LSF

NVIDIA is the world leader in accelerated computing, AI, and digital twins technology.
Santa Clara, CA, USA · Westford, MA 01886, USA · Austin, TX, USA
$184,000 - $287,500
Site Reliability
Senior Software Engineer
Hybrid
5,000+ Employees
10+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer, HPC and LSF

NVIDIA, the global leader in AI and accelerated computing, is seeking a Senior Site Reliability Engineer to join their Hardware Infrastructure Farm team. This role combines the challenges of managing large-scale HPC environments with the excitement of contributing to cutting-edge chip development.

The position offers a unique opportunity to work at the intersection of infrastructure management and silicon development, where your work directly impacts the creation of next-generation chips. As an SRE, you'll be responsible for designing and implementing groundbreaking compute clusters that power all silicon development across NVIDIA. The role demands expertise in high-reliability system operations, efficiency optimization, and automation implementation to enhance engineer productivity.

NVIDIA's culture emphasizes diversity, intellectual curiosity, and problem-solving in a blame-free environment. The organization brings together professionals from various backgrounds and perspectives, encouraging collaboration and innovative thinking. The role offers significant autonomy while providing the necessary support and mentorship for continuous learning and growth.

Key responsibilities include managing workload schedulers in HPC environments, developing automation solutions, troubleshooting complex issues, and collaborating with domain experts. The ideal candidate will bring extensive experience with job scheduler administration, Linux systems, container technologies, and scripting languages. The position requires both technical expertise and strong communication skills to work effectively across diverse teams.

The compensation package includes a base salary range of $184,000 - $287,500 USD, plus equity and benefits. The role is hybrid and can be based in Santa Clara, Westford, or Austin. Join NVIDIA in their mission to amplify human imagination and intelligence through technological innovation.

Last updated 6 days ago

Responsibilities For Senior Site Reliability Engineer, HPC and LSF

  • Manage and support workload and resource schedulers in a large-scale HPC environment
  • Develop automation scripts for deployment, configuration management, and operational monitoring
  • Develop solutions for complex computing resource management requirements
  • Extract and leverage grid performance metrics for troubleshooting and optimization
  • Perform comprehensive troubleshooting from bare metal to application level
  • Develop, define and document standard methodologies
  • Collaborate with domain experts to improve the chip development process
  • Contribute to quality and improve time to market for next-generation chips

Requirements For Senior Site Reliability Engineer, HPC and LSF

  • Extensive knowledge of job scheduler administration (IBM Spectrum LSF or Slurm)
  • Proficient in administering CentOS/RHEL Linux distributions
  • In-depth understanding of container technologies such as Docker
  • Proficiency in UNIX scripting languages and Python
  • Excellent problem-solving skills
  • Excellent communication and teamwork skills
  • 10+ years of experience in large, distributed Linux environments
  • BS in Computer Science, similar degree or equivalent experience

Benefits For Senior Site Reliability Engineer, HPC and LSF

  • Equity
