Senior Site Reliability Engineer — AI Studio (Inference Platform)

Nebius is a leading cloud computing company serving the global AI economy, creating tools and resources for AI/ML infrastructure.
Site Reliability
Senior Software Engineer
Remote
501 - 1,000 Employees
5+ years of experience
AI

Job Description

Nebius, a leading AI cloud infrastructure company headquartered in Amsterdam and listed on Nasdaq, is seeking a Senior Site Reliability Engineer for its AI Studio team. The role is part of Nebius Cloud, one of the world's largest GPU clouds, running tens of thousands of GPUs. The position focuses on building and maintaining an inference platform that makes foundation models fast, reliable, and easy to deploy at scale.

The ideal candidate will be responsible for the entire inference stack's reliability, performance, and observability. Daily tasks include designing telemetry pipelines, optimizing GPU efficiency through Kubernetes autoscaling, implementing resilient infrastructure with Terraform, and improving system reliability. The role requires expertise in Kubernetes, Prometheus, Grafana, and infrastructure-as-code, along with strong scripting abilities in Python or Bash.

The company offers a compelling opportunity to work at the cutting edge of AI cloud infrastructure alongside experienced leaders and innovators. With over 800 employees globally, including 400+ skilled engineers, Nebius provides a dynamic environment for professional growth. The position offers competitive compensation, comprehensive benefits, and flexible working arrangements, making it an ideal opportunity for those passionate about building the future of AI infrastructure.

This role is perfect for someone who thrives on debugging performance from kernel to application layer, enjoys building self-healing systems, and wants to contribute to the infrastructure powering next-generation AI technologies. The position offers the flexibility of remote work with the backing of a rapidly growing, innovative company at the forefront of AI cloud computing.

Last updated 3 months ago

Responsibilities For Senior Site Reliability Engineer — AI Studio (Inference Platform)

  • Own the reliability, performance, and observability of the entire inference stack
  • Design and refine telemetry pipelines for metrics, logs, and traces
  • Tune Kubernetes autoscalers for GPU efficiency
  • Create Terraform modules for cluster resilience
  • Improve request-routing and retry logic
  • Handle incident response and drive post-mortem culture
  • Scale the platform while meeting cost and reliability targets

Requirements For Senior Site Reliability Engineer — AI Studio (Inference Platform)

Kubernetes
Python
Linux
  • Deep fluency with Kubernetes, Prometheus, Grafana, Terraform
  • Infrastructure-as-code expertise
  • Proficiency in Python or Bash scripting
  • Understanding of alert design and SLOs for high-throughput APIs
  • Production experience with distributed back-ends
  • Experience with GPU workloads (vLLM, Triton, Ray)
  • Background in MLOps or model-hosting platforms preferred

Benefits For Senior Site Reliability Engineer — AI Studio (Inference Platform)

  • Competitive salary and comprehensive benefits package
  • Opportunities for professional growth
  • Flexible working arrangements
  • Dynamic and collaborative work environment
