Senior Production Engineer - Storage

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering solutions in AI and digital twins.

San Francisco, CA, USA

$148,000 - $356,500

DevOps

Senior Software Engineer

In-Person

5+ years of experience

AI · Enterprise SaaS

Description For Senior Production Engineer - Storage

NVIDIA is seeking a Senior Production Engineer for its Storage team to join its Site Reliability Engineering (SRE) organization. This role combines software engineering practices with systems operations to build and maintain large-scale production systems. The position focuses on ensuring reliable storage solutions and managing data efficiently for NVIDIA's GPU cloud services.

The role requires expertise in several domains, including systems, networking, storage, coding, and database management. You'll work with cutting-edge technologies including Kubernetes, containers, and virtualization while ensuring high reliability and uptime for both internal and external-facing services.

As a Senior Production Engineer, you'll be responsible for designing and implementing large-scale storage clusters, working with AI/ML workloads, and improving service lifecycles. The role involves hands-on work with monitoring systems, automation, and performance optimization. You'll be part of a diverse team that values intellectual curiosity and problem-solving in a blame-free environment.

Key responsibilities include supporting services before they go live through system design consulting, developing software frameworks, and managing capacity. You'll maintain services by monitoring availability and system health, often leveraging machine learning models. The role requires participating in an on-call rotation and practicing sustainable incident response.

The ideal candidate will have strong experience with Linux systems, infrastructure configuration management tools, and observability solutions. You'll need to demonstrate excellent debugging skills and thrive in collaborative environments. This position offers competitive compensation, including equity, and the opportunity to work with some of the most forward-thinking professionals in technology.

NVIDIA's culture promotes self-direction and provides support for learning and growth. You'll be part of an organization that brings together people with diverse backgrounds and perspectives, encouraging collaboration and innovation. This role offers the chance to work on meaningful projects while contributing to the reliability and efficiency of NVIDIA's critical storage infrastructure.

Last updated 5 months ago

Responsibilities For Senior Production Engineer - Storage

Design, implement, and support large-scale storage clusters
Work with AI/ML workloads to analyze behavior in large clusters
Improve service lifecycle from design through deployment and refinement
Support services through system design consulting and capacity management
Monitor and maintain service availability, latency, and system health
Implement automation and machine learning models for system scaling
Participate in on-call rotation for production systems
Practice sustainable incident response and blameless postmortems

Requirements For Senior Production Engineer - Storage

Python

Linux

Kubernetes

BS degree in Computer Science or related technical field
5+ years practical experience
Experience with algorithms, data structures, and large-scale Linux systems
Experience in C/C++, Java, Python, Go, Perl, or Ruby
Knowledge of infrastructure tools like Ansible, Chef, Puppet, and Terraform
Experience with observability tools like InfluxDB, Prometheus, and Elastic stack

Benefits For Senior Production Engineer - Storage

Equity
