Today I read *The Calculus of Service Availability* ("You're only as available as the sum of your dependencies"), and it summarizes some of the most salient wisdom on designing for reliability targets.
The takeaways are:
- The humans and systems surrounding any service are imperfect enough that 4 or 5 9s of reliability is the maximum worth targeting
- Problems come from the service itself or its critical dependencies.
- Availability = MTTF/(MTTF+MTTR)
- Rule of the extra 9: any critical dependency should exceed the depending service's SLA by one 9, i.e. a 3-9s service should only have 4-9s services in its critical path.
  - When a service depends on something not meeting that threshold, the gap must be accounted for via resilient design.
  - The math: a 99.99% service has a 0.01% error budget. Split it as 0.005% for the service itself and 0.005% for critical dependencies; with 5 critical dependencies, each gets 1/5 of 0.005%, or 0.001%, i.e. each must be 99.999% available.
- Outage frequency, detection time, and recovery time together determine the impact of outages and the feasibility of a given SLA. The levers for improvement are therefore: reduce outage frequency, reduce blast radius (sharding, cellular architectures, or customer isolation), and reduce MTTR.
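The availability formula and the extra-9 budget split above can be sketched in a few lines. This is just the post's worked example as code; the MTTF/MTTR figures are made-up illustrative numbers, not anything from the article.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical service: fails once every ~30 days, takes 1 hour to recover.
a = availability(mttf_hours=720, mttr_hours=1)  # ~0.9986, roughly three 9s

# Rule of the extra 9, using the article's numbers:
service_slo = 0.9999                       # a 99.99% service
error_budget = 1 - service_slo             # 0.01% total error budget
own_budget = error_budget / 2              # 0.005% kept for the service itself
deps_budget = error_budget / 2             # 0.005% for critical dependencies
per_dep_budget = deps_budget / 5           # 0.001% each across 5 dependencies
required_dep_availability = 1 - per_dep_budget  # 99.999% - one extra 9
```

The split in half between the service and its dependencies is the article's example allocation, not a law; the point is that dividing a fixed budget across dependencies forces each one to be an order of magnitude more reliable than the service it supports.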
I’ve been evolving our Online Database Platform at work, and the themes of the "rule of the extra 9" and moving quickly yet safely with a limited blast radius are top of mind.
We’ve made some major changes (cluster topology, NoSQL and SQL upgrades, automation tooling) that have moved our stack to a point where I’m proud of the accomplishments.
Hundreds of TB of online data and hundreds of clusters managed by a team that I can count on one hand :).