Senior SRE Software Engineer, Storage and Data

NVIDIA is a technology company specializing in AI computing, graphics, and accelerated computing solutions.
Site Reliability
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS · Cloud

Description For Senior SRE Software Engineer, Storage and Data

NVIDIA is seeking a Senior SRE Software Engineer to join its DGX Cloud platform team. This role focuses on ensuring the reliability, availability, and performance of the storage infrastructure that supports NVIDIA's mission-critical applications. The ideal candidate will work at the intersection of storage systems and site reliability engineering, implementing modern automation practices and performance tuning.

The position offers an opportunity to work with cutting-edge AI/ML workloads and large-scale distributed systems. You'll be responsible for developing and maintaining storage solutions that power NVIDIA's cloud platform, implementing monitoring systems, and ensuring high availability through careful planning and automation.

As an SRE at NVIDIA, you'll be part of a team that values self-direction and modern engineering practices. You'll work with cross-functional teams, participate in on-call rotations, and have the opportunity to impact the reliability and performance of systems that support NVIDIA's AI and GPU cloud infrastructure.

The role combines hands-on technical work with strategic thinking, requiring both deep technical knowledge of storage systems and the ability to implement SRE best practices. You'll work with a state-of-the-art technology stack, including Kubernetes, various storage solutions, and modern observability tools, while contributing to the success of NVIDIA's cloud platform.

Last updated 5 days ago

Responsibilities For Senior SRE Software Engineer, Storage and Data

  • Develop strategies to ensure reliability and availability of storage systems
  • Analyze and fine-tune storage systems for optimal performance
  • Develop and maintain automation scripts and tools
  • Implement monitoring and alerting systems
  • Participate in on-call rotation
  • Conduct root cause analysis of outages
  • Collaborate with cross-functional teams
  • Work with AI/ML workloads in large clusters

Requirements For Senior SRE Software Engineer, Storage and Data

Python
Go
Java
Linux
Kubernetes
  • BS degree in Computer Science or related technical field
  • 5+ years of practical experience
  • Experience with Git, RESTful APIs, and Linux service operation
  • Experience with Ansible, Bash, Python, Go, YAML, Java
  • Knowledge of infrastructure configuration management tools
  • Experience with observability tools like Prometheus and Elastic stack
  • Strong Linux and network troubleshooting skills
  • Experience with storage solutions such as OpenStack Swift and AWS S3
