Senior Site Reliability Engineer, HPC and LSF

NVIDIA

NVIDIA is the world leader in accelerated computing, AI, and machine learning.

Bengaluru, Karnataka, India

DevOps

Senior Software Engineer

Hybrid

5,000+ Employees

5+ years of experience

AI · Enterprise SaaS

Job Description

NVIDIA, the pioneer in accelerated computing and AI, is seeking a Senior Site Reliability Engineer to join its Hardware Infrastructure Farm team. This role is central to designing and implementing the compute clusters that power NVIDIA's silicon development. As an SRE, you'll be responsible for building and operating reliable, efficient clusters while driving automation and improvements that raise engineer productivity.

The position combines deep technical expertise in HPC environments with strategic thinking about system interactions. You'll work with technologies like LSF/SLURM, Linux, Docker, and Python to manage complex computing resources. The role emphasizes automation, proactive problem-solving, and maintaining high-reliability systems that directly impact NVIDIA's chip development process.

NVIDIA's culture values diversity, intellectual curiosity, and openness. The company has transformed itself over two decades, from inventing the GPU that revolutionized gaming and graphics to leading AI and machine learning innovation. This role offers the opportunity to work with cutting-edge technology while contributing to NVIDIA's continued innovation in AI and accelerated computing.

The ideal candidate brings 5+ years of experience in large Linux environments, strong expertise in job scheduler administration, and excellent problem-solving abilities. You'll collaborate with diverse teams, automate processes, and directly influence the quality and time-to-market of NVIDIA's next-generation chips. This position offers the chance to work at the intersection of infrastructure management and chip development at a company that consistently pushes technological boundaries.

Last updated 14 days ago

Responsibilities For Senior Site Reliability Engineer, HPC and LSF

Manage and support workload and resource schedulers in a large-scale HPC environment
Develop automation scripts for deployment, configuration management, and monitoring
Develop solutions for complex computing resource management
Extract and leverage grid performance metrics for optimization
Perform comprehensive troubleshooting from bare metal to application level
Develop and document standard methodologies
Collaborate with domain experts to improve chip development infrastructure
Contribute to the quality of next-generation chips and improve their time to market

Requirements For Senior Site Reliability Engineer, HPC and LSF

Linux

Python

Extensive knowledge of job scheduler administration (IBM Spectrum LSF or SLURM)
Proficient in administering CentOS/RHEL Linux distributions
In-depth understanding of container technologies like Docker
Proficiency in UNIX scripting languages and Python
Excellent problem-solving skills
Excellent communication and teamwork skills
5+ years of experience in large, distributed Linux environments
BS in Computer Science or a similar degree, or equivalent experience
