HCF – the reanimated flaming zombie server

This is a true IT horror story for the spooky season. Enjoy.

One day many years ago, on a Friday afternoon, I was visiting a client site to upgrade the OS on their Unix minicomputer (for those who don't know, that's about half the size of a modern fridge/freezer). I think it might have been an ICL DRS 3000, if anyone cares that much ... so early 1990s.

It was raining outside, and I was just verifying the latest backup tape, prior to the upgrade. There were rumbles of thunder in the distance.

Then, all of a sudden ... boom as a lightning strike connected somewhere nearby. The office lights went out. The server and its screen stayed on. At the bottom of the screen, interrupting the backup log, was some text I've never seen since and can't completely remember, but it was something like

ERROR: MAIN POWER SUPPLY FAILU

In the dark, I was staring at the green screen in a small amount of shock. I had no idea how such a message could be generated, let alone still be visible ...

Then I noticed the smell of smoke. Blue smoke. Something was either burning or had already burst open ... and it was not pleasant.

The server beeped.

I hadn't touched it.

It beeped again, and the screen cleared ...

It started to boot back up.

It was still smoking; perhaps it was still on fire!

In a rush, I looked around the darkened room for the power lead going to the wall socket, pushed the server aside and pulled the power lead out as quickly as possible ...

The office lights came back on – I guess someone found a fusebox or breaker switch – and the server continued to boot. It wasn't plugged in to the wall but it was still booting up ...

After a moment or two more of panic (I didn't want this thing to come back up until I'd contacted proper hardware support engineers!) I looked again at the power lead I'd pulled out ... and it didn't actually go into the server after all ... it went into the UPS unit sitting next to it! The UPS, which had done its job of dealing with the loss of mains power, had allowed the server to ... catch fire? Tell me it was on fire? HCF isn't a real instruction, is it?

Well, I found the correct power supply switch and finally got it switched off properly, before the OS could come up (machines used to take ages to boot back in the day) and before some enthusiastic fsck could wreak havoc on my filesystems ... and before anything else could happen to the tape drive with the latest backup tape still in it ...

But there were still wisps of smoke rising from the back of the server ...

So pushing the server around again, I exposed the back where the cables all went in and the option cards were connected – the site had a lot of terminals all directly connected to the server by serial lines, connecting to RS-232 'concentrator' cards. I can't clearly remember how to open the case of one of these machines today, but it must have been easy enough, because it wasn't long before I found the source of the blue smoke ... one of the serial cards had exploded. So no active fire risk ... but still a nasty smell.

So a little investigation and back-tracking was needed, but pretty quickly it turned out that some of these serial cables were run up to the roof and out the side of the building, strung across the carpark bundled with telephone lines, and into one of the smaller portacabin offices out the back, and then directly down to the terminals. No separation, no isolation, no grounding. They always were a little ... unreliable ... but never bad enough for anyone to spend any time or money improving them.

The lightning strike had contacted some of those cables. So at the same time that the site power went down, the cables jumped a little bit above their normal 5-12V states and ... out came the magic blue smoke. The UPS hadn't been quick enough switching over, so the server was reporting PSU errors ... and the message stayed on the screen (until the server rebooted) because it was a serial console; as long as it still had power, whatever was on the screen stayed on the screen ...

So nothing supernatural ...

For the aftermath ... I spent a lot of time on the phone with engineering, I removed the broken serial card, and we restarted the server. Which came back up again just fine; fsck had its way but there were no detectable problems there. The backup tape was fine, all the other hardware was fine ... (a couple of terminals in the external office were toasted, but not dramatically so). We took the machine slowly through its paces over the weekend (which is when we should have been upgrading it), and got everything back to normal – obviously with only half the number of terminals. The customer replaced the serial cabling with more of the same, and when the new serial card arrived on Monday we were able to get them all back up again.

But I'll not forget that server, telling me its PSU had failed and restarting while the power was off ... all while it was still on fire.