So, my system blew up. Ok, not really, but what does that mean to you? It means a lot of different things to me. Here are a few things that have gone wrong with my computers recently:
- A monitor refused to power on at start-up.
- A patch came in that caused a test system to refuse to boot.
- A software license key got corrupted – you saw this one last week.
- An external battery melted into a puddle of smoking silicon, plastic, cadmium, and goo.
- A laptop wouldn’t even power up, but I had complete backups.
There are more, but I think you get the point: I’ve had my share of interesting experiences. So which one caused me the most grief (and embarrassment)? Well, I have many monitors, so losing one is never an issue. External batteries are interesting devices, but losing this one only cost me some working time on the plane, which I used for some much-needed rest anyway. The software vendor with the license key issue responded quickly, so, while I lost six hours, it was overnight and I should have been sleeping. So what bothers me the most?
From a convenience standpoint, there’s no question that losing my laptop was the worst. It took me six weeks to get back to full speed. The laptop was at the end of its life, and because I had a backup, I lost virtually nothing. However, I couldn’t buy a comparable laptop that had the version of the operating system and productivity software I wanted on it, so reconstructing my environment was painful. But the key part of it was that I didn’t slow down at all for my customers. Not a single document or presentation revision was lost, nor any e-mails, nor anything else of importance.
From an operational standpoint, my company took a hit because one of our servers refused to boot. We had to run without our core document repository for almost a day. There was nothing wrong with the hardware, though. The problem was that an operating system patch came down automatically, but the BIOS couldn’t handle it. The hardware vendor issued a BIOS patch after the operating system patch came down, and that resolved the problem. So who’s to blame?
In a word: Me. It’s always my fault. Hey, when you run a company, it’s your fault. Take the blame and move on. I’ve since disabled automatic patch installation on all my servers, and patches now go through quarantine before they are installed on business-critical systems. The operating system vendor has to take some of the blame for not testing their patch adequately on very common BIOS releases where their operating system is installed as OEM software. And it’s also the hardware vendor’s fault for not warning their customers in advance of the issue. I can’t believe that we were the first company to encounter the problem – although, as some of you readers already know, it’s happened before.
So what did I learn? There was egg on my face, of course. My staff was pretty upset that the server was down, even if only for a short period. Fortunately, my customers didn’t notice, because it was an internal server. But mostly, I learned that it’s not the hardware. The physical machine had no problems, although there were lots of amber warning lights flashing at me. It’s not the individual software components, because they all worked to specification, according to the vendors anyway. It was the interrelationship between changing software components. And you know what? That’s usually what burns you.
A few years ago, I gave a keynote where I asked the audience to choose which of a set of scenarios represented the most likely risk to their organization: an asteroid, nuclear war, floods, mould, or a picture of my (then toddler) son at a keyboard. I have to give the group (one of the HP NonStop user groups) credit for getting the answer right – my son. Changing software represents something far different from a disaster scenario. It is a quantifiable, known, and expected risk. Each time you change your system, you are putting your company, your customers, and your stakeholders at risk, because something may break. On the other side of the coin, not changing your system puts the same group at risk too, the risk of no longer being competitive. So what do you do?
It’s simply not an option to stand still. Could we surf the web if we were all using bicycle-chain-driven computers or upgraded looms? I don’t know about you, but I can’t blog on a loom. My cats can blog on a rug, but that’s a different story and very messy. So change is something we want and have to embrace. Weighing the risk against the reward of moving forward is pervasive in technology. We can’t ignore it, so we have to change and put in new releases. The questions are where to find a balance and how to do it safely.
Stay tuned for upcoming entries where I’ll talk about this balance. The next post will compare different levels of expected reliability.