NVIDIA is seeking a Senior Site Reliability Engineer (SRE) for the Data Science & ML Platform(s) team. This role involves designing, building, and maintaining services that enable real-time data analytics, streaming, data lakes, observability, and ML/AI training and inference. Responsibilities include applying software and systems engineering practices to keep the platform efficient and highly available, using SRE principles to improve production systems and meet service SLOs, and collaborating with customers to plan and implement changes to existing systems.
Key responsibilities:
- Develop software solutions for large-scale system reliability
- Gain deep understanding of system operations and identify improvement opportunities
- Create tools and automation to reduce operational overhead
- Establish frameworks and processes to enhance operational maturity
- Define reliability metrics and oversee capacity management
- Build tools for improved service observability
- Practice sustainable incident response and blameless postmortems
Requirements:
- 5 to 8 years of experience in SRE, cloud platforms, or DevOps
- Master's or Bachelor's degree in Computer Science, Electrical Engineering, or equivalent
- Strong understanding of SRE principles
- Proficiency in incident management and problem-solving
- Experience with streaming data infrastructure services
- Expertise in observability platforms
- Proficiency in a programming language such as Python, Go, Perl, or Ruby
- Experience with scaling distributed systems in cloud environments
This role offers the opportunity to work on innovative technologies powering the future of AI and data science, as part of a dynamic team that values learning and growth. Join NVIDIA in shaping the future of accelerated computing and AI!