Lead/Principal - Site Reliability Engineering

Salesforce is the Customer Company, inspiring the future of business with AI + Data + CRM. It helps companies blaze new trails and connect with customers in meaningful ways.
Site Reliability
Principal Software Engineer
In-Person
5,000+ Employees
8+ years of experience
AI · Enterprise SaaS

Description For Lead/Principal - Site Reliability Engineering

As a Lead/Principal Software Engineer in Site/Product Reliability Engineering at Salesforce, you will play a pivotal role in ensuring and scaling the reliability of the AgentForce platform. The role is based in the India operations center and requires shift work, including weekends, to support services aligned with US hours. You'll be part of a high-impact team dedicated to maintaining the availability and performance of the AgentForce platform, with a focus on production support for the generative and predictive AI platform.

Key responsibilities include:

  • Triaging and solving complex problems in production systems
  • Establishing reliability processes and collaborating with lead engineers
  • Multi-system debugging and triage across Salesforce platforms and LLM providers
  • Leading production triage for AgentForce, focusing on service, infrastructure, and performance issues
  • Managing infrastructure and scaling, including capacity modeling and forecasting
  • Creating automation and maintaining operational excellence
  • Monitoring and trust management, including adjusting SLOs and SLIs
  • Cross-functional collaboration with Customer Support Groups and other teams
  • Stakeholder collaboration for operational excellence and process improvements

The ideal candidate will have:

  • Bachelor's degree in Computer Science, Engineering, or related field
  • 8+ years of experience in production support and triaging roles
  • Expertise in implementing reliability processes for full-stack, end-to-end ML platforms
  • Strong knowledge of cloud services (AWS preferred), container technologies, and CI/CD tools
  • Proficiency in scripting languages and AI model deployment

This role offers an opportunity to lead key initiatives within Salesforce's AI platform, work in a collaborative environment focused on innovation, and receive competitive compensation and benefits.

Last updated 6 months ago

Responsibilities For Lead/Principal - Site Reliability Engineering

  • Lead production triage for AgentForce AI platform
  • Implement automated solutions to enhance reliability
  • Maintain documentation of production incidents
  • Collaborate with AI, product, and platform teams
  • Manage infrastructure and scaling, including capacity modeling
  • Create and maintain playbooks and knowledge articles
  • Utilize availability and trust dashboards, adjust SLOs and SLIs
  • Participate in 24x7 on-call support

Requirements For Lead/Principal - Site Reliability Engineering

Python
Linux
Kubernetes
  • Bachelor's degree in Computer Science, Engineering, or related technical field
  • 8+ years of experience in production support and triaging roles
  • Expertise in implementing reliability processes for full-stack, end-to-end ML platforms
  • Strong knowledge of cloud services (AWS preferred)
  • Proficiency in scripting languages (Python, Shell, Golang)
  • Experience in DevOps or data center management roles
  • Knowledge of container technologies (Docker, Kubernetes) and CI/CD tools

Benefits For Lead/Principal - Site Reliability Engineering

  • Competitive compensation
  • Benefits package
