As a Lead/Principal Software Engineer in Site/Product Reliability Engineering at Salesforce, you will play a pivotal role in ensuring and scaling the reliability of the AgentForce platform. The role is based in the India operations center and requires shift work, including weekends, to support services aligned with US hours. You'll be part of a high-impact team dedicated to maintaining the availability and performance of Salesforce's AgentForce platform, with a focus on production support for its generative and predictive AI platform.
Key responsibilities include:
- Triaging and solving complex problems in production systems
- Establishing reliability processes and collaborating with lead engineers
- Multi-system debugging and triage across Salesforce platforms and LLM providers
- Leading production triage for AgentForce, focusing on service, infrastructure, and performance issues
- Managing infrastructure and scaling, including capacity modeling and forecasting
- Creating automation and maintaining operational excellence
- Monitoring and trust management, including adjusting SLOs and SLIs
- Cross-functional collaboration with Customer Support Groups and other teams
- Stakeholder collaboration for operational excellence and process improvements
The ideal candidate will have:
- Bachelor's degree in Computer Science, Engineering, or a related field
- 8+ years of experience in production support and triage roles
- Expertise in implementing reliability processes for full-stack, end-to-end ML platforms
- Strong knowledge of cloud services (AWS preferred), container technologies, and CI/CD tools
- Proficiency in scripting languages and AI model deployment
This role offers an opportunity to lead key initiatives within Salesforce's AI platform, work in a collaborative environment focused on innovation, and receive competitive compensation and benefits.