NVIDIA, the world leader in accelerated computing, is seeking a Senior AI Observability Engineer to join its AI Infrastructure organization. The role focuses on architecting and implementing distributed observability systems for AI and HPC clusters, and involves working directly with NVIDIA's growing AI, Hardware, and Software engineering teams.
The position involves developing sophisticated systems for data collection, aggregation, enrichment, storage, retrieval, and visualization to enhance the efficiency and performance of AI and HPC workloads. You'll be responsible for deploying and operating observability solutions across multiple global compute clusters.
The ideal candidate has 8+ years of experience with distributed observability systems and a strong background in Python programming. Experience with platforms like Apache Spark, Elasticsearch, Grafana, and Prometheus is essential. The role requires both technical expertise and strong collaborative skills, as you'll be working closely with data scientists, researchers, and engineering teams.
NVIDIA offers competitive compensation with a base salary range of $184,000 - $356,500 USD (depending on level), plus equity and comprehensive benefits. The company is committed to fostering a diverse and inclusive work environment, making it an excellent opportunity for professionals looking to make an impact in the AI and accelerated computing space.
This role presents an exciting opportunity to work at the forefront of AI infrastructure, helping to build and maintain the systems that power NVIDIA's cutting-edge research and development. The position combines technical challenges with strategic thinking, requiring someone who can both architect complex systems and understand the broader business impact of their work.