Senior AI Observability Engineer

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology to transform industries.
$184,000 - $356,500
DevOps
Senior Software Engineer
Hybrid
5,000+ Employees
8+ years of experience
AI · Enterprise SaaS

Description For Senior AI Observability Engineer

NVIDIA, the world leader in accelerated computing, is seeking a Senior AI Observability Engineer to join its AI Infrastructure organization. This role focuses on architecting and implementing distributed observability systems for AI and HPC clusters, working directly with NVIDIA's growing AI, Hardware, and Software engineering teams.

The position involves developing sophisticated systems for data collection, aggregation, enrichment, storage, retrieval, and visualization to enhance the efficiency and performance of AI and HPC workloads. You'll be responsible for deploying and operating observability solutions across multiple global compute clusters.

The ideal candidate has 8+ years of experience with distributed observability systems and a strong background in Python programming. Experience with platforms like Apache Spark, Elasticsearch/OpenSearch, Grafana, and Prometheus is essential. The role requires both technical expertise and strong collaborative skills, as you'll be working closely with data scientists, researchers, and engineering teams.

NVIDIA offers competitive compensation with a base salary range of $184,000 - $356,500 USD (depending on level), plus equity and comprehensive benefits. The company is committed to fostering a diverse and inclusive work environment, making it an excellent opportunity for professionals looking to make an impact in the AI and accelerated computing space.

This role presents an exciting opportunity to work at the forefront of AI infrastructure, helping to build and maintain the systems that power NVIDIA's cutting-edge research and development. The position combines technical challenges with strategic thinking, requiring someone who can both architect complex systems and understand the broader business impact of their work.

Last updated 13 days ago

Responsibilities For Senior AI Observability Engineer

  • Collaborate with AI, HW, and SW engineering and research teams to deliver observability solutions
  • Develop, test, and deploy data collectors, pipelines, visualization and retrieval services
  • Build a self-serve platform
  • Define data collection and retention policies
  • Provide operational and strategic data to improve performance and efficiency
  • Continuously improve quality, workloads, and processes through better observability

Requirements For Senior AI Observability Engineer

Python
Kubernetes
  • Experience developing large scale, distributed observability systems
  • Ability to collaborate with data scientists and engineering teams
  • Experience with turning raw data into actionable reports
  • Experience with observability platforms (Apache Spark, Elasticsearch/OpenSearch, Grafana, Prometheus)
  • Python programming experience, including making API calls
  • MS (preferred) or BS in Computer Science, Electrical Engineering, or related field
  • 8+ years of proven experience
  • Excellent planning and interpersonal skills

Benefits For Senior AI Observability Engineer

  • Equity
  • Medical Insurance
