As a Principal Site Reliability & Availability Engineer at Salesforce, you'll be part of a specialist unit focused on availability and resilience. You'll embed with delivery teams in a lead capacity, creating bandwidth for, and prioritizing, corrective and proactive availability measures. Your responsibilities include:
- Designing, developing, debugging, and operating resilient applications and platforms deployed across distributed systems running on thousands of compute nodes in multiple data centers.
- Championing resiliency best practices, including observability tool integration, horizontal/vertical sizing & auto-scaling, release rollback & recovery workflows, and integration tests.
- Using and contributing to open-source technology (e.g., Spinnaker, ZooKeeper).
- Developing and leveraging Infrastructure-as-Code using Terraform.
- Building and integrating with APIs and microservices deployed on container platforms and orchestrators such as Docker, Kubernetes, and Mesos.
- Resolving complex technical issues and driving innovations to improve system availability, resilience, and performance.
- Balancing live runtime management, feature delivery, and retirement of technical debt.
- Participating in the team's on-call rotation to address complex problems in real time and maintain high service availability.
Required qualifications include:
- A related technical degree (master's preferred)
- 15+ years of hands-on software development experience
- 5+ years in a Tech Lead, Principal, or Architect capacity
- Mastery of object-oriented languages such as Java, Go, Apex, or Python
- Deep experience with core web technologies and databases
- Expertise in service ownership best practices, SLO/SLI/SLA definition, and incident management
Join Salesforce to work on cutting-edge technology and contribute to the reliability and availability of systems used by millions of users worldwide.