Senior Site Reliability Engineer — AI Studio (Inference Platform)

Nebius

Nebius is a leading cloud computing company serving the global AI economy, creating tools and resources for AI/ML infrastructure.

Amsterdam, Netherlands • Berlin, Germany • London, UK…

Site Reliability

Senior Software Engineer

Remote

501 - 1,000 Employees

5+ years of experience

Job Description

Nebius, a leading AI cloud infrastructure company headquartered in Amsterdam and listed on Nasdaq, is seeking a Senior Site Reliability Engineer for its AI Studio team. The role is part of Nebius Cloud, one of the world's largest GPU clouds, running tens of thousands of GPUs. The position focuses on building and maintaining an inference platform that makes foundation models fast, reliable, and easy to deploy at scale.

The ideal candidate will be responsible for the entire inference stack's reliability, performance, and observability. Daily tasks include designing telemetry pipelines, optimizing GPU efficiency through Kubernetes autoscaling, implementing resilient infrastructure with Terraform, and improving system reliability. The role requires expertise in Kubernetes, Prometheus, Grafana, and infrastructure-as-code, along with strong scripting abilities in Python or Bash.

The company offers a compelling opportunity to work at the cutting edge of AI cloud infrastructure alongside experienced leaders and innovators. With over 800 employees globally, including 400+ skilled engineers, Nebius provides a dynamic environment for professional growth. The position offers competitive compensation, comprehensive benefits, and flexible working arrangements, making it an ideal opportunity for those passionate about building the future of AI infrastructure.

This role is perfect for someone who thrives on debugging performance from kernel to application layer, enjoys building self-healing systems, and wants to contribute to the infrastructure powering next-generation AI technologies. The position offers the flexibility of remote work with the backing of a rapidly growing, innovative company at the forefront of AI cloud computing.

Last updated 3 months ago

Responsibilities For Senior Site Reliability Engineer — AI Studio (Inference Platform)

Own the reliability, performance, and observability of the entire inference stack
Design and refine telemetry pipelines for metrics, logs, and traces
Tune Kubernetes autoscalers for GPU efficiency
Create Terraform modules for cluster resilience
Improve request-routing and retry logic
Handle incident response and drive post-mortem culture
Scale the platform while meeting cost and reliability targets

Requirements For Senior Site Reliability Engineer — AI Studio (Inference Platform)

Kubernetes

Python

Linux

Deep fluency with Kubernetes, Prometheus, Grafana, Terraform
Infrastructure-as-code expertise
Proficiency in Python or Bash scripting
Understanding of alert design and SLOs for high-throughput APIs
Production experience with distributed back-ends
Experience with GPU workloads (vLLM, Triton, Ray)
Background in MLOps or model-hosting platforms preferred

Benefits For Senior Site Reliability Engineer — AI Studio (Inference Platform)

Competitive salary and comprehensive benefits package
Opportunities for professional growth
Hybrid working arrangements
Dynamic and collaborative work environment
