Senior SRE Software Engineer, Storage and Data

World leader in accelerated computing, pioneering AI and digital twins technology to transform industries.
Site Reliability
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS · Cloud

Description For Senior SRE Software Engineer, Storage and Data

NVIDIA is seeking a Senior SRE Software Engineer to join its Storage and Data team, focused on maintaining the reliability and performance of the DGX Cloud platform. The role combines software engineering with infrastructure management and requires expertise in storage systems and reliability engineering. You will design and implement scalable storage solutions, automation tools, and monitoring systems, working with large-scale distributed systems and cutting-edge AI/ML technologies while ensuring system stability and efficiency, and contributing to NVIDIA's mission of transforming industries through AI and digital twins technology.

The ideal candidate has strong technical skills in Linux, storage systems, and programming languages such as Python or Go, combined with a customer-first approach and a passion for problem-solving. NVIDIA offers competitive compensation, is known as one of the most desirable employers globally, and provides an inclusive environment that values diversity and innovation.

Last updated 5 days ago

Responsibilities For Senior SRE Software Engineer, Storage and Data

  • Develop strategies to ensure reliability and availability of storage systems
  • Analyze and fine-tune storage systems for optimal performance
  • Develop and maintain automation scripts and tools
  • Implement monitoring and alerting systems
  • Participate in on-call rotation
  • Collaborate with cross-functional teams
  • Work with AI/ML workloads in large clusters

Requirements For Senior SRE Software Engineer, Storage and Data

Skills: Python, Go, Java, Linux, Kubernetes
  • BS degree in Computer Science or related technical field
  • 5+ years of equivalent practical experience
  • Experience with Git, RESTful APIs, and Linux service operations
  • Experience with Ansible, Bash, Python, Go, YAML, and Java
  • Knowledge of infrastructure configuration management tools
  • Experience with observability tools such as InfluxDB, Prometheus, and the Elastic Stack
  • Strong Linux and network troubleshooting skills
  • Experience with storage solutions

Benefits For Senior SRE Software Engineer, Storage and Data

  • Medical insurance
  • Competitive salaries
  • Generous benefits package

Jobs Related To NVIDIA Senior SRE Software Engineer, Storage and Data

Senior Site Reliability Engineer, Data Science and ML Platforms

Senior Site Reliability Engineer role at NVIDIA focusing on maintaining and scaling data science and ML platforms, requiring expertise in SRE practices and distributed systems.

Senior Site Reliability Engineer - AI Research Clusters

Senior SRE position at NVIDIA focusing on AI research clusters, requiring expertise in GPU computing, cluster management, and automation with 5+ years of experience.

Senior Site Reliability Engineer - AI Research Clusters

Senior Site Reliability Engineer position at NVIDIA focusing on AI research clusters, offering competitive compensation and the opportunity to work with cutting-edge GPU technology.

Senior Software Developer, Site Reliability Engineering, Google Cloud

Senior SRE position at Google Cloud focusing on building and maintaining large-scale distributed systems, requiring 5+ years of software development experience and strong system design skills.