I was recently asked a good question by one of the readers: “What is indestructible computing, and why should I care?” To answer it, here are a few common terms. What you should keep firmly in mind is that whatever aspect of a system you look at, the actual service level you experience is usually that of the weakest of your components. Guess which one is almost always the weakest? If you’ve been following the blog, you already know: software change.
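To put rough numbers on the weakest-component point, here is a minimal sketch. It assumes a simple model that is not from this post: every component must be up for the service to be up, and failures are independent, so the availabilities multiply. The component names and figures are made up for illustration.

```python
from math import prod
from typing import Mapping

# Hypothetical availability figures for four aspects of one service.
# These numbers are illustrative assumptions, not measurements.
components = {
    "hardware": 0.99999,
    "network": 0.9999,
    "operations": 0.9999,
    "software change": 0.999,
}


def end_to_end_availability(parts: Mapping[str, float]) -> float:
    """Availability of a chain where every part must be up.

    Assumes failures are independent, so the individual
    availabilities simply multiply.
    """
    return prod(parts.values())


weakest = min(components, key=components.get)
overall = end_to_end_availability(components)

print(f"weakest component: {weakest} ({components[weakest]:.5f})")
print(f"end-to-end availability: {overall:.5f}")
```

Under this model the end-to-end figure can never be better than the weakest component, and it is usually a little worse, which is why improving the strong components does so little for the service level you actually experience.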
General Purpose Computing
Well, you’re probably reading this blog from a general purpose environment. A workstation or laptop can be considered general purpose hardware. Your browser could probably be considered general purpose software. The combination of the two gives you a general purpose environment.
Highly Available Systems
These systems are available most of the time – generally 99.99% of the time, or slightly under 5 minutes of planned or unplanned downtime a month. Banking systems are typical of these. Fitting maintenance into even a five-minute window is difficult, particularly when you’re upgrading disks or restructuring your Operational Data Store (ODS).
Continuously Available Systems
These systems are available virtually all of the time – generally 99.999% of the time (under 30 seconds of downtime a month). Extensive use of independent components allows these systems to operate virtually without any unplanned outages. Planned outages do occur for upgrades, but the window for these outages is very small. There’s a lot of confusion between Highly Available and Continuously Available systems. The lines are pretty blurry, and I won’t really differentiate between them much. That there is even a distinction is arguable.
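The downtime figures quoted for these two classes follow directly from the percentages. Here is a small sketch of the arithmetic, assuming a 30-day month (the function and constant names are my own, for illustration only):

```python
# Downtime budget implied by an availability target.
# Assumption: a 30-day month; other month lengths shift the result slightly.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds


def downtime_seconds_per_month(availability_percent: float) -> float:
    """Return the downtime allowed in one 30-day month, in seconds."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return SECONDS_PER_MONTH * unavailable_fraction


for target in (99.99, 99.999):
    budget = downtime_seconds_per_month(target)
    print(f"{target}% -> {budget:.0f} seconds a month ({budget / 60:.2f} minutes)")
```

That works out to roughly 259 seconds (a little over 4 minutes) a month for 99.99%, and roughly 26 seconds a month for 99.999%. Each extra nine cuts the budget by a factor of ten.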
Critical Systems
These systems include some of the obvious life-critical systems: flight control systems, rockets, and many health-monitoring devices. Systems like this do not have the same level of long-term availability that continuously available systems have, but during their duty cycle, no outage is permitted at all. Fortunately, changes are generally not permitted while the systems are up. How many launches were delayed because of sensor or software issues?
Long-Life Systems
In long-life systems, reliability is the number one priority. Unscheduled maintenance is usually impossible or cost-prohibitive. Scheduled maintenance is possible but not desirable, and usually involves only software components. During maintenance, rigorous testing is done to ensure that the system will function reliably when back online. Communication satellites and the Mars Explorers fall into this category. Even then, subtle defects, like miles vs. kilometres per hour in a calculation, can cause disastrous failures.
Indestructible Systems
A truly indestructible system builds on the best of all of these systems. The systems are expected to be long-life, yet dynamic. Change is not only possible, but expected. Yet there are no unplanned outages and no planned outages. Not only small components, but major components like data centres can go offline without a perceived outage or noticeable reduction in service levels. Maintenance is done while the system is up.
And I don’t blame anyone for thinking indestructibility is unattainable. It’s very hard to get right and even then, it’s always possible that something will go wrong. In future posts I’ll go into what it takes to make this work. Hopefully you’ll see that indestructible systems are practical in the real world and understand what it takes to make them work for you.
The next post will go into the starting points of view for building these systems and how money gets wrapped up in it.