Senior Site Reliability Engineer — AI Studio (Inference Platform)

Nebius is a leading cloud computing company serving the global AI economy, creating tools and resources for AI/ML infrastructure.
Site Reliability
Senior Software Engineer
Remote
501 - 1,000 Employees
5+ years of experience
AI

Description For Senior Site Reliability Engineer — AI Studio (Inference Platform)

Nebius, a leading AI cloud infrastructure company headquartered in Amsterdam and listed on Nasdaq, is seeking a Senior Site Reliability Engineer for its AI Studio team. The role is part of Nebius Cloud, one of the world's largest GPU clouds, running tens of thousands of GPUs. The position focuses on building and maintaining an inference platform that makes foundation models fast, reliable, and easy to deploy at scale.

The ideal candidate will own the reliability, performance, and observability of the entire inference stack. Daily tasks include designing telemetry pipelines, optimizing GPU efficiency through Kubernetes autoscaling, implementing resilient infrastructure with Terraform, and improving system reliability. The role requires expertise in Kubernetes, Prometheus, Grafana, and infrastructure-as-code, along with strong scripting abilities in Python or Bash.

The company offers a compelling opportunity to work at the cutting edge of AI cloud infrastructure alongside experienced leaders and innovators. With over 800 employees globally, including 400+ skilled engineers, Nebius provides a dynamic environment for professional growth. The position offers competitive compensation, comprehensive benefits, and flexible working arrangements, making it an ideal opportunity for those passionate about building the future of AI infrastructure.

This role is perfect for someone who thrives on debugging performance from kernel to application layer, enjoys building self-healing systems, and wants to contribute to the infrastructure powering next-generation AI technologies. The position offers the flexibility of remote work with the backing of a rapidly growing, innovative company at the forefront of AI cloud computing.

Last updated 3 days ago

Responsibilities For Senior Site Reliability Engineer — AI Studio (Inference Platform)

  • Own the reliability, performance, and observability of the entire inference stack
  • Design and refine telemetry pipelines for metrics, logs, and traces
  • Tune Kubernetes autoscalers for GPU efficiency
  • Create Terraform modules for cluster resilience
  • Improve request-routing and retry logic
  • Handle incident response and drive post-mortem culture
  • Scale the platform while meeting cost and reliability targets

Requirements For Senior Site Reliability Engineer — AI Studio (Inference Platform)

Kubernetes
Python
Linux
  • Deep fluency with Kubernetes, Prometheus, Grafana, Terraform
  • Infrastructure-as-code expertise
  • Proficiency in Python or Bash scripting
  • Understanding of alert design and SLOs for high-throughput APIs
  • Production experience with distributed back-ends
  • Experience with GPU workloads (vLLM, Triton, Ray)
  • Background in MLOps or model-hosting platforms preferred

Benefits For Senior Site Reliability Engineer — AI Studio (Inference Platform)

  • Competitive salary and comprehensive benefits package
  • Opportunities for professional growth
  • Hybrid working arrangements
  • Dynamic and collaborative work environment
