NVIDIA is seeking a Senior SRE Software Engineer to join its DGX Cloud platform team. This role focuses on ensuring the reliability, availability, and performance of the storage infrastructure that supports NVIDIA's mission-critical applications. The ideal candidate will work at the intersection of storage systems and site reliability engineering, implementing modern automation practices and performance tuning.
The position offers an opportunity to work with cutting-edge AI/ML workloads and large-scale distributed systems. You'll be responsible for developing and maintaining storage solutions that power NVIDIA's cloud platform, implementing monitoring systems, and ensuring high availability through careful planning and automation.
As an SRE at NVIDIA, you'll be part of a team that values self-direction and modern engineering practices. You'll work with cross-functional teams, participate in on-call rotations, and have the opportunity to impact the reliability and performance of systems that support NVIDIA's AI and GPU cloud infrastructure.
The role combines hands-on technical work with strategic thinking, requiring both deep technical knowledge of storage systems and the ability to implement SRE best practices. You'll work with a state-of-the-art technology stack, including Kubernetes, various storage solutions, and modern observability tools, while contributing to the success of NVIDIA's cloud platform.