Lead/Principal - Site Reliability Engineering

Salesforce

Salesforce is the Customer Company, inspiring the future of business with AI + Data + CRM. They help companies blaze new trails and connect with customers in meaningful ways.

Hyderabad, Telangana, India

Site Reliability

Principal Software Engineer

In-Person

5,000+ Employees

8+ years of experience

AI · Enterprise SaaS

Description For Lead/Principal - Site Reliability Engineering

As a Lead/Principal Software Engineer in Site/Product Reliability Engineering at Salesforce, you will play a pivotal role in ensuring and scaling the reliability of the AgentForce platform. The role is based in the India operations center and requires shift work, including weekends, to support services aligned with US hours. You'll be part of a high-impact team dedicated to maintaining the availability and performance of Salesforce's AgentForce platform, with a focus on production support for the generative and predictive AI platform.

Key responsibilities include:

Triaging and solving complex problems in production systems
Establishing reliability processes and collaborating with lead engineers
Multi-system debugging and triage across Salesforce platforms and LLM providers
Leading production triage for AgentForce, focusing on service, infrastructure, and performance issues
Managing infrastructure and scaling, including capacity modeling and forecasting
Creating automation and maintaining operational excellence
Monitoring and trust management, including adjusting SLOs and SLIs
Cross-functional collaboration with Customer Support Groups and other teams
Stakeholder collaboration for operational excellence and process improvements

The ideal candidate will have:

Bachelor's degree in Computer Science, Engineering, or related field
8+ years of experience in production support and triaging roles
Expertise in implementing reliability processes for full-stack, end-to-end ML platforms
Strong knowledge of cloud services (AWS preferred), container technologies, and CI/CD tools
Proficiency in scripting languages and AI model deployment

This role offers an opportunity to lead key initiatives within Salesforce's AI platform, work in a collaborative environment focused on innovation, and receive competitive compensation and benefits.

Last updated 9 months ago

Responsibilities For Lead/Principal - Site Reliability Engineering

Lead production triage for AgentForce AI platform
Implement automated solutions to enhance reliability
Maintain documentation of production incidents
Collaborate with AI, product, and platform teams
Manage infrastructure and scaling, including capacity modeling
Create and maintain playbooks and knowledge articles
Use availability and trust dashboards, and adjust SLOs and SLIs
Participate in 24x7 on-call support

Requirements For Lead/Principal - Site Reliability Engineering

Python

Linux

Kubernetes

Bachelor's degree in Computer Science, Engineering, or related technical field
8+ years of experience in production support and triaging roles
Expertise in implementing reliability processes for full-stack, end-to-end ML platforms
Strong knowledge of cloud services (AWS preferred)
Proficiency in scripting languages (Python, Shell, Golang)
Experience in DevOps or data center management roles
Knowledge of container technologies (Docker, Kubernetes) and CI/CD tools

Benefits For Lead/Principal - Site Reliability Engineering

Competitive compensation
Benefits package