NVIDIA is seeking a Senior SRE Software Engineer to join its Storage and Data team, focusing on the reliability and performance of the DGX Cloud platform. The role is central to maintaining and optimizing the storage infrastructure that supports NVIDIA's mission-critical applications and services. As an SRE, you'll work with cutting-edge technology for AI and ML workloads, design and implement scalable storage solutions, and collaborate with cross-functional teams.
The position requires strong expertise in storage systems and reliability engineering, with a focus on automation and performance optimization. You'll be responsible for developing strategies for system reliability, implementing monitoring solutions, and participating in on-call rotations to maintain 24/7 system availability. The role combines hands-on technical work with strategic planning and cross-team collaboration.
NVIDIA offers an exciting opportunity to work with some of the world's most advanced technology in AI and accelerated computing. The company is known for pushing boundaries in technology and innovation, making it an ideal place for engineers who want to work on challenging problems at scale. The role provides exposure to large-scale distributed systems and the chance to work with modern cloud technologies and automation tools.
The ideal candidate will bring a strong background in storage system administration, site reliability engineering, and software development. You'll need expertise in several programming languages and infrastructure tools, along with a deep understanding of Linux systems and networking. The position centers on keeping the systems that power AI and machine learning workloads reliable and performant.