Is a gradual dial-up framework a required feature of a good production system?

Mid-Level Software Engineer at Taro Community
3 months ago

I'm currently drafting a design document for a high-TPS service at a big tech company. I was surprised to find that this service doesn't have a framework for gradually releasing features from 0% to 100%. Instead, it relies solely on binary switches. The recommendation is that if a gradual feature rollout is necessary, it should be implemented on top of these binary switches. That made me wonder: why not propose implementing a gradual rollout framework for the service?

After looking into the existing binary switch configuration, I was surprised to find numerous switches already in place. Features are launched without gradual rollout, yet the service runs fine. Most features reduce their blast radius by going through stages: a few customers, a group of customers, regional customers, then worldwide customers. Is a gradual dial-up framework a required feature of a good production system?
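For concreteness, here is a minimal sketch of what "gradual rollout on top of a binary switch" could look like. It assumes a per-feature percentage and a stable user ID are available; the names `bucket` and `is_enabled` are hypothetical and not part of any existing framework. Hashing the feature name together with the user ID gives each user a stable bucket, so raising the percentage only ever adds users.

```python
import hashlib


def bucket(feature: str, user_id: str) -> int:
    """Map a (feature, user) pair to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    # Use the first 8 bytes as an integer; the same input always lands
    # in the same bucket, across processes and hosts.
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(feature: str, user_id: str, switch_on: bool, percent: int) -> bool:
    """Binary switch first, then a percentage dial (0 to 100)."""
    if not switch_on:
        # The existing binary switch still acts as the master off button.
        return False
    return bucket(feature, user_id) < percent
```

Calling `is_enabled("new_checkout", "customer-42", switch_on=True, percent=10)` returns the same answer on every call, and any user enabled at 10% stays enabled at 50%.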

Discussion

(2 comments)
  • 1
    Tech Lead/Manager at Meta, Pinterest, Kosei
    3 months ago

    One thing I learned from my time at Meta was that very few things are "required" in engineering. So no, I don't think a sophisticated rollout system is required for a production system.

    Case in point: WhatsApp has a very unique culture within Facebook. (This was true several years ago, but may no longer be true.) I was shocked when I learned:

    • WhatsApp doesn't do experiments. They form a core thesis around what they're building, and once they have conviction from user feedback and have done enough QA testing, they ship the feature to everyone. 😨 This is anathema in core Facebook culture.
      • If WhatsApp didn't need a rollout service with 1B+ users, you probably don't need one. Sure, it could be nice, but it's not required for success.
    • WhatsApp didn't do as much code review. Engineers would often submit code without a formal stamp of approval from another engineer.

    For a related discussion on system design, see: How do I improve at building systems?

  • 0
    Eng @ Taro
    3 months ago

    If you have robust logging and an automated way to turn off the switch in the case of a high failure rate, you could make the case that you'd be able to detect any issues as soon as you enabled the switch and roll back immediately. I have experienced cases where we gradually ramped up an experiment, and we only started to detect issues when we rolled out to a higher percentage of the population.

    But if the service is a high-TPS service at the scale of a big tech company, I'd imagine you'd want to minimize failures wherever you can, and that would mean having a gradual ramp-up period. Definitely ramp up to internal users first to make sure no silly or obvious mistakes get rolled out to external users. The higher the impact of the service, the more care you want to take.

    Either way, you should have a very good logging system in place to make sure you're not flying blindly into a mountain.
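
To make the "automated way to turn off the switch" idea from the last comment concrete, here is a rough sketch of a switch that disables itself when the recent failure rate gets too high. The class name `KillSwitch`, the thresholds, and the window size are all assumptions for illustration; a real service would feed this from its own logging or metrics pipeline.

```python
from collections import deque


class KillSwitch:
    """A binary switch that turns itself off on a high recent failure rate."""

    def __init__(self, max_failure_rate: float = 0.05,
                 window: int = 100, min_samples: int = 20) -> None:
        self.enabled = True
        self.max_failure_rate = max_failure_rate
        self.min_samples = min_samples
        # Keep only the most recent `window` outcomes.
        self._outcomes = deque(maxlen=window)

    def record(self, success: bool) -> None:
        """Record one request outcome and trip the switch if needed."""
        self._outcomes.append(success)
        if len(self._outcomes) < self.min_samples:
            # Too little data to judge; avoid tripping on a single error.
            return
        failures = sum(1 for ok in self._outcomes if not ok)
        if failures / len(self._outcomes) > self.max_failure_rate:
            # Latch off; re-enabling is left as a deliberate manual step.
            self.enabled = False
```

The switch latches off on purpose: once it trips, a later run of successes does not turn the feature back on, so someone has to look at the logs before re-enabling it.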