Monitor System Health

Fixing a bug isn’t the end — true resolution means confirming the fix worked at scale. Here are the core points from the lesson:

  • Always monitor system health after a fix using real metrics, not just local device testing, to verify the problem is truly resolved
  • Track both user-facing metrics (like revenue, call minutes, login attempts) and code-facing metrics (like error rates, ingress/egress traffic, build failures)
  • Be aware of natural variations in user metrics across days, weekends, and holidays so you don’t misinterpret normal fluctuations
  • Revisit your original blast radius notes and make sure the impacted areas are reflected in your dashboards to fully validate recovery
  • Understand the difference between root causes (e.g., spike in Firestore reads) and lagging indicators (e.g., billing costs) when monitoring your fix

Example metrics dashboard: https://convertkit.baremetrics.com/