Senior Site Reliability Engineer, HPC and LSF

NVIDIA is the world leader in accelerated computing, AI, and machine learning.
DevOps
Senior Software Engineer
Hybrid
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Job Description

NVIDIA, the pioneer in accelerated computing and AI, is seeking a Senior Site Reliability Engineer to join its Hardware Infrastructure Farm team. This role is crucial in designing and implementing cutting-edge compute clusters that power NVIDIA's silicon development. As an SRE, you'll be responsible for building and operating highly reliable, efficient clusters while driving automation and improvements to enhance engineer productivity.

The position combines deep technical expertise in HPC environments with strategic thinking about system interactions. You'll work with technologies like LSF/SLURM, Linux, Docker, and Python to manage complex computing resources. The role emphasizes automation, proactive problem-solving, and maintaining high-reliability systems that directly impact NVIDIA's chip development process.

NVIDIA's culture values diversity, intellectual curiosity, and openness. The company has transformed itself over two decades, from inventing the GPU that revolutionized gaming and graphics to leading AI and machine learning innovation. This role offers the opportunity to work with cutting-edge technology while contributing to NVIDIA's continued innovation in AI and accelerated computing.

The ideal candidate brings 5+ years of experience in large Linux environments, strong expertise in job scheduler administration, and excellent problem-solving abilities. You'll collaborate with diverse teams, automate processes, and directly influence the quality and time-to-market of NVIDIA's next-generation chips. This position offers the chance to work at the intersection of infrastructure management and chip development at a company that consistently pushes technological boundaries.

Last updated 14 days ago

Responsibilities For Senior Site Reliability Engineer, HPC and LSF

  • Manage and support workload and resource schedulers in a large-scale HPC environment
  • Develop automation scripts for deployment, configuration management, and monitoring
  • Develop solutions for complex computing resource management
  • Extract and leverage grid performance metrics for optimization
  • Perform comprehensive troubleshooting from bare metal to application level
  • Develop and document standard methodologies
  • Collaborate with domain experts to improve chip development infrastructure
  • Contribute to quality and improve time to market for next-generation chips

Requirements For Senior Site Reliability Engineer, HPC and LSF

Linux
Python
  • Extensive knowledge of job scheduler administration (IBM Spectrum LSF or SLURM)
  • Proficient in administering CentOS/RHEL Linux distributions
  • In-depth understanding of container technologies like Docker
  • Proficiency in UNIX scripting languages and Python
  • Excellent problem-solving skills
  • Excellent communication and teamwork skills
  • 5+ years of experience in a large, distributed Linux environment
  • BS in Computer Science, similar degree or equivalent experience
