
Senior SRE Software Engineer, Storage and Data

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology to transform industries.
Site Reliability
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS
This job posting may no longer be active.

Description For Senior SRE Software Engineer, Storage and Data

NVIDIA is seeking a Senior SRE Software Engineer to join its Storage and Data team, focusing on the reliability and performance of the DGX Cloud platform. This role is crucial in maintaining and optimizing the storage infrastructure that supports NVIDIA's mission-critical applications and services. As an SRE, you'll work with AI and ML workloads, design and implement scalable storage solutions, and collaborate with cross-functional teams.

The position requires strong expertise in storage systems and reliability engineering, with a focus on automation and performance optimization. You'll be responsible for developing strategies for system reliability, implementing monitoring solutions, and participating in on-call rotations to maintain 24/7 system availability. The role combines hands-on technical work with strategic planning and cross-team collaboration.

NVIDIA offers an exciting opportunity to work with some of the world's most advanced technology in AI and accelerated computing. The company is known for pushing boundaries in technology and innovation, making it an ideal place for engineers who want to work on challenging problems at scale. The role provides exposure to large-scale distributed systems and the chance to work with modern cloud technologies and automation tools.

The ideal candidate will bring a strong background in storage system administration, site reliability engineering, and software development. You'll need expertise in various programming languages and infrastructure tools, along with a deep understanding of Linux systems and networking. This position offers the opportunity to work with cutting-edge technology while ensuring the reliability and performance of systems that power AI and machine learning workloads.

Last updated 2 months ago

Responsibilities For Senior SRE Software Engineer, Storage and Data

  • Develop strategies to ensure reliability and availability of storage systems
  • Analyze and fine-tune storage systems for optimal performance
  • Develop and maintain automation scripts for storage provisioning
  • Implement monitoring and alerting systems
  • Participate in on-call rotation
  • Collaborate with cross-functional teams
  • Work with AI/ML workloads in large clusters

Requirements For Senior SRE Software Engineer, Storage and Data

Python · Go · Linux · Kubernetes

  • BS degree in Computer Science or related technical field
  • 5+ years of equivalent practical experience
  • Experience with Git, RESTful APIs, and Linux service operation
  • Experience with Ansible, Bash, Python, Go, YAML, Java
  • Knowledge of infrastructure configuration management tools
  • Experience with observability tools like InfluxDB, Prometheus
  • Strong Linux and network troubleshooting skills
