Yesterday, I was asked for my thoughts on High-Availability (HA) for 24x7x365, 99.999% uptime software. Today, a friend forwarded this article about experiences with Apple and Dell Customer Service. The common theme between the two is “recover gracefully, recover fast!”.
My answer (yesterday) was elaborate and went into some detail about the various features required in the software, but I started with “A good HA architecture starts with accurate detection, followed closely by recovery. Recovery is more important than determining the root cause of the failure.” If you have worked on widely deployed, sufficiently complex, leading-edge products, you may appreciate what I just said. You may never find every failure case in your software, so do your best to ensure a flawless, graceful recovery. Think of what I just said again.
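To make the "detect, then recover" idea concrete, here is a minimal Python sketch. It is only an illustration, not a real HA framework: the names `Supervisor`, `FlakyService`, `is_healthy`, and `recover` are all hypothetical, and a production system would add timeouts, retries, and redundancy. The point it shows is the ordering: recover first, and only note the incident for root-cause analysis later.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Supervisor:
    """Hypothetical supervisor: detect a failure, recover, investigate later."""
    is_healthy: Callable[[], bool]   # detection: an accurate health check
    recover: Callable[[], None]      # recovery: e.g. restart or fail over
    incidents: List[float] = field(default_factory=list)

    def check_once(self) -> bool:
        """Run one detection pass; return True if a recovery was triggered."""
        if self.is_healthy():
            return False
        # Recover first; service is restored before anyone asks "why?".
        self.recover()
        # Then record a timestamp so root-cause analysis can happen offline.
        self.incidents.append(time.time())
        return True


class FlakyService:
    """Hypothetical service that can crash and be restarted."""

    def __init__(self) -> None:
        self.running = True
        self.restarts = 0

    def crash(self) -> None:
        self.running = False

    def restart(self) -> None:
        self.running = True
        self.restarts += 1


if __name__ == "__main__":
    service = FlakyService()
    supervisor = Supervisor(is_healthy=lambda: service.running,
                            recover=service.restart)
    service.crash()
    recovered = supervisor.check_once()
    print("recovered:", recovered, "restarts:", service.restarts)
```

In practice `check_once` would run on a short timer, because the faster the detection loop, the faster the recovery.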
Similarly, in the Customer Service story, Apple figured out a way to “recover gracefully, recover fast” and even found a way to make this challenging customer touch-point enjoyable. Wow.
As we engineer software, build applications and manage products, I encourage everyone to think of this post. If you remember nothing else, just remember the associated image, and what the person being resuscitated will appreciate more. It will do just as well!