The proverbial "six nines" availability (99.9999% uptime) means no more than 30 seconds downtime per year. That's really kind of ridiculous...Think of it this way: If your six nines system goes down mysteriously just once and it takes you an hour to figure out the cause and fix it, well, you've just blown your downtime budget for the next century.
After some internal discussion we all agreed that rather than imposing a statistically meaningless measurement and hoping that the mere measurement of something meaningless would cause it to get better, what we really needed was a process of continuous improvement. Instead of setting up a SLA for our customers, we set up a blog where we would document every outage in real time, provide complete post-mortems, ask the five whys, get to the root cause, and tell our customers what we're doing to prevent that problem in the future. In this case, the change is that our internal documentation will include detailed checklists for all operational procedures in the live environment.
Our customers can look at the blog to see what caused the problems and what we're doing to make things better, and, hopefully, they can see evidence of steadily improving quality.
No comments:
Post a Comment