Site Reliability Engineer, AI/ML Platforms

Adobe is a global leader in digital experiences, helping everyone from emerging artists to global brands create and deliver exceptional digital content.
$133,900 - $242,000
Site Reliability
Staff Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Site Reliability Engineer, AI/ML Platforms

Adobe is seeking an exceptional Site Reliability Engineer to join its AI Training and Inference Platforms team within Adobe Firefly. This is an opportunity to work at the intersection of Site Reliability Engineering and cutting-edge AI technology at one of the world's leading software companies.

The role focuses on building, scaling, and securing Adobe's AI Platform, which enables Firefly product teams to efficiently manage and deploy Machine Learning capabilities across Adobe's client applications. You'll be working with a team of SREs to support a platform that will handle thousands of ML models from Adobe's Applied Research groups and App Teams, spanning various lifecycle stages from early research to production deployment.

As an SRE, you'll be responsible for ensuring the platform's reliability, scalability, and efficiency across multiple cloud environments. Your work will directly impact Adobe's ability to deliver high-quality AI services to its customers, making you a crucial part of Adobe's AI innovation journey.

The ideal candidate brings a strong background in distributed systems and container orchestration, particularly with Kubernetes. You should be comfortable with programming in Python or Go, and have experience with modern DevOps practices and tools. Your expertise in observability, automation, and system reliability will be essential in maintaining and improving the platform's performance.

This role offers the opportunity to work with cutting-edge AI/ML technologies while solving complex technical challenges at scale. You'll collaborate with talented engineers across Adobe, contributing to the company's mission of changing the world through digital experiences. The position comes with competitive compensation, comprehensive benefits, and the chance to work on technology that impacts millions of users worldwide.

If you're passionate about reliability engineering, excited about AI/ML technologies, and want to work on infrastructure that powers next-generation creative tools, this role at Adobe could be your next career milestone.

Responsibilities For Site Reliability Engineer, AI/ML Platforms

  • Identify and implement methodologies to increase reliability, scalability, security, and efficiency
  • Ensure the highest possible uptime and Quality of Service (QoS) through operational excellence
  • Define service level objectives (SLOs) and indicators (SLIs)
  • Support and maintain globally distributed, multi-cloud environments
  • Automate common, repeatable tasks at large scale
  • Identify areas to improve service resiliency through chaos engineering and performance testing
  • Coordinate with other Adobe platform teams and service providers

Requirements For Site Reliability Engineer, AI/ML Platforms

  • Bachelor's or Master's degree in Computer Science, Electrical Engineering, or related field
  • 5+ years of relevant industry experience
  • Experience in building and scaling distributed systems
  • Production-level expertise with container orchestration engines such as Kubernetes
  • Fundamental programming skills in Python or Go
  • Knowledge of infrastructure configuration management tools like Ansible and Terraform
  • Experience with observability tools like InfluxDB, Prometheus, and Elastic Stack
  • Understanding of the AI/ML landscape, including ML frameworks and both public cloud and commercial AI/ML solutions
