Improving Software for Society
SOME PEOPLE CAN PICK THEM
The IAP at 50. Reflections from an IAP Member.
This story sounds better when told. But here goes…
It concerns a Pick system. Pick used hashed files, with the hashing based on a modulo specified for each file. The speed of a file search depended upon having this value correctly specified when the file was first created. At the time, if you wanted to resize a file, you had to calculate the new modulo, manually set this value for the file, save the file to tape, delete the file and then restore it. If you needed to do this for many files, you set all the new values, then did a complete system save followed by a complete system restore. This was often how files were resized. The issue, of course, was that once the restore started the whole system was effectively ‘erased’ and could only be used again once the restore had completed successfully. It was very common at the time to do a complete save every day, followed by a verify that checked that the written data could be read back and matched the original. An actual restore was only done when it had to be – for the obvious reason.
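For readers who have not met hashed files, here is a minimal Python sketch of why the modulo mattered. It is an illustration only: the hash function, the function names and the item IDs are all assumptions made for this example, not Pick's actual algorithm or file layout. It shows that a modulo that is too small leaves long groups to scan, and that changing the modulo means re-placing every item, which is roughly why a resize meant a save and restore.

```python
def group_for(item_id: str, modulo: int) -> int:
    """Map an item ID to a group number.

    Illustrative hash only (an assumption for this sketch), not Pick's own.
    """
    total = 0
    for ch in item_id:
        total = total * 10 + ord(ch)
    return total % modulo


def build_file(item_ids, modulo):
    """Place every item into one of `modulo` groups."""
    groups = [[] for _ in range(modulo)]
    for item_id in item_ids:
        groups[group_for(item_id, modulo)].append(item_id)
    return groups


def longest_group(groups):
    """Worst case number of items scanned to find one item."""
    return max(len(group) for group in groups)


def resize(groups, new_modulo):
    """Changing the modulo means rehashing every item into new groups."""
    all_items = [item_id for group in groups for item_id in group]
    return build_file(all_items, new_modulo)


# Hypothetical data: 1000 items with four-digit IDs.
item_ids = [f"{n:04d}" for n in range(1000)]

too_small = build_file(item_ids, 3)    # modulo chosen when the file was tiny
resized = resize(too_small, 101)       # modulo recalculated for current size

print("longest group, modulo 3:  ", longest_group(too_small))
print("longest group, modulo 101:", longest_group(resized))
```

With the small modulo every lookup has to scan a very long group; after the resize the same items are spread thinly, so each lookup touches only a handful.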
OK. Now to the story. I’d installed a large multi-user system (32 users, which was large back then) which had worked fine for a few months. Then I got a telephone call saying it was now running much slower than when first installed. Well, the reason was obvious. The file modulos used for the hashing were now too small for the amount of data. The end user knew nothing about how to resize files, so I went to the user site to do it. Knowing the save itself took a few hours, it was agreed I’d get there Friday lunchtime, do the required file calculations and set the new modulo values, so that when usage was finished for the day we could do the save/restore overnight.
Well, everything went OK to start with. Around 6pm on the Friday we were all ready for the save. The backup medium was 1/2-inch tape. To be sure there would be no tape issues, I used a new tape. Set the backup going. OK. Went to the pub for a couple of hours. Came back. Checked the system. The verification was just finishing. OK. Crossed fingers and started the restore. Watched it start. OK for 15 minutes. I then went to a restaurant for dinner. Came back. And then the problem. Parity error when reading data. Retries failed. Restore terminated. This was just what I didn’t want. No restore meant no system and no data – and quite possibly no company if I couldn’t get this system back up and running.
Well, I tried the restore again. This time it didn’t get as far through the restore as it did the first time before the same error. What restored the first time wouldn’t now. Panic was starting to set in. Now from experience I knew that the tapes and the tape drive were susceptible to excess heat. That’s why, when the system was designed, plenty of fans had been included to keep the inside cool. So I felt the air flow where the fans were located on the back. Hardly anything from one set. Oh ****. I took the back off to see what was happening. Some of the fans weren’t moving. More ****. Checked the power to the fans. Nope. No power. After some more investigation I found a blown fuse. OK. Just replace the fuse (I had spares – be prepared!). Did that. Switched back on and the fuse blew straight away. There was a problem with the fan tray. Well, there was no way that fan tray could be replaced until Monday PM at the earliest. Oh dear. This system had to be running by Monday morning at the latest.
Now convinced that the issue was heat, I put the backup tape in the fridge to cool it down. I turned the room air conditioning to the coldest it could do. I got the hairdryer that I used when I stayed in hotels overnight, set it to cold and balanced it so that the cool air coming out went over the tape drive. Waited 30 minutes. I crossed my fingers – and everything else – got the tape from the fridge and tried the restore again. Got past the last error point. Got past the first error point. Got to an hour – everything working OK. Got to 2 hours. OK. Got to the end of the tape and the system started. Whepeeee!
It was now after 2am. I left a note, switched off the system and finally got to my hotel. The next day I went back and explained to the user what had happened and what the issue was. The system was started Monday morning, still using the hairdryer. The fan tray was replaced Monday night and everything then went smoothly.
We added a heat detector to the hardware and I added Pick OS code to detect and report excess heat, which also shut down the system if overheating persisted. We never had the same issue again.
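The original was Pick OS code talking to our own hardware, so it can't be reproduced here, but the logic was roughly as in this Python sketch. Everything in it is an assumption for illustration: the function name, the temperature limit and the number of consecutive hot readings that counts as "persisted" are all hypothetical.

```python
def monitor(readings, limit_c, max_consecutive):
    """Decide what to do for each temperature reading.

    Reports every reading above `limit_c` and shuts down once
    `max_consecutive` readings in a row are above the limit.
    All names and values are hypothetical.
    """
    actions = []
    over = 0  # consecutive readings above the limit
    for temp in readings:
        if temp > limit_c:
            over += 1
            if over >= max_consecutive:
                actions.append("shutdown")
                break  # nothing more to do once the system is down
            actions.append("report")
        else:
            over = 0  # heat did not persist, so start counting again
            actions.append("ok")
    return actions


# Hypothetical readings in degrees C: one brief spike, then sustained heat.
actions = monitor([30, 50, 31, 50, 51, 52], limit_c=45, max_consecutive=3)
print(actions)
```

A single hot reading only produces a report; the shutdown happens only when the heat persists, which avoids taking a busy system down for a momentary spike.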