I'm currently drafting a design document for a high-TPS service at a big tech company. I was surprised to find that this service doesn't have a framework for gradually releasing features from 0% to 100%. Instead, it relies solely on binary switches. The recommendation is that if a gradual feature rollout is necessary, it should be implemented using these binary switches. So I thought, why don't I propose implementing a gradual rollout framework for the service?
After looking into the existing binary switch configuration, I was surprised to find numerous switches already in place. Features are launched without gradual rollout, yet the service runs fine. Most features try to reduce the blast radius by going through stages: a few customers, a group of customers, regional customers, then worldwide customers. Is a gradual dial-up framework a required feature of a good production system?
One thing I learned from my time at Meta was that very few things are "required" in engineering. So no, I don't think a sophisticated rollout system is required for a production system.
Case in point: WhatsApp has a very unique culture within Facebook. (This was true several years ago, but may no longer be true.) I was shocked when I learned:
Related discussion on system design: How do I improve at building systems?
If you have robust logging and an automated way to turn off the switch in the case of a high failure rate, you could make the case that you'd be able to detect any issues as soon as you enabled the switch and roll back immediately. I have experienced cases where we gradually ramped up an experiment, and we only started to detect issues when we rolled out to a higher percentage of the population.
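To make the "automated way to turn off the switch" part concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `AutoKillSwitch` name, the 5% threshold, and the sliding-window sizes are hypothetical, and a real service would feed this from its metrics pipeline rather than in-process counters.

```python
from collections import deque


class AutoKillSwitch:
    """Hypothetical binary switch that turns itself off on a high failure rate."""

    def __init__(self, max_failure_rate=0.05, window=1000, min_samples=100):
        self.enabled = True
        self.max_failure_rate = max_failure_rate
        # Require a minimum number of samples so one early failure
        # does not trip the switch.
        self.min_samples = min_samples
        # Sliding window of the most recent outcomes (True means failure).
        self.outcomes = deque(maxlen=window)

    def record(self, failed):
        """Record one request outcome; disable the feature if failures are too frequent."""
        if not self.enabled:
            return
        self.outcomes.append(bool(failed))
        if len(self.outcomes) >= self.min_samples:
            failure_rate = sum(self.outcomes) / len(self.outcomes)
            if failure_rate > self.max_failure_rate:
                self.enabled = False  # automatic rollback
```

The request path would check `switch.enabled` before taking the new code path and call `switch.record(...)` afterwards. Note that this sketch never turns the switch back on by itself; re-enabling is left as a manual decision.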
But if the service is a high-TPS service at the scale of a big tech company, I'd imagine you'd want to avoid whatever failures you can, and that would mean having a gradual ramp-up period. Definitely ramp up to internal users first to make sure no silly or obvious mistakes get rolled out to external users. The higher the impact of the service, the more care you want to take.
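A gradual ramp with internal users first does not need much machinery. Below is a rough sketch in Python, not a description of any particular company's framework: `Rollout`, `bucket`, and the field names are all hypothetical. It assumes each user has a stable ID and hashes that ID into one of 100 buckets, so a given user keeps the same answer as the dial moves up.

```python
import hashlib
from dataclasses import dataclass


def bucket(feature, user_id):
    """Map (feature, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


@dataclass
class Rollout:
    """Hypothetical dial for one feature: internal users first, then a percentage."""

    feature: str
    percent: int = 0                  # share of external users, 0 to 100
    internal_enabled: bool = False    # separate switch for internal users
    internal_users: frozenset = frozenset()

    def is_enabled(self, user_id):
        # Internal users get the feature before any external traffic does.
        if self.internal_enabled and user_id in self.internal_users:
            return True
        # Everyone whose bucket falls below the dial is in the rollout.
        return bucket(self.feature, user_id) < self.percent
```

Including the feature name in the hash means different features pick different subsets of users, so the same customers are not always the first to hit new code. Raising `percent` only ever adds users; nobody who already has the feature loses it until the dial is turned back down.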
Either way, you should have a very good logging system in place to make sure you're not flying blindly into a mountain.