When I was writing an article on updating FreeBSD from version 11.2 to the new major release, 12, I tried to add something extra for those who may read the information I publish. FreeBSD, as a UNIX operating system, has functionality similar to the old-school UNIX systems such as AIX, Solaris and the like. Of course they differ in some ways, sometimes weird ways, but they share common concepts. The GNU/Linux distributions also share those concepts and sometimes add some more weirdness to the mix. We, the humans, the weirdo species.
That little extra, not seen in many other upgrade guides, is the rollback chapter. Reading that article back, I decided to write this one to make the case for rollbacks out loud. While my eyes were scrutinizing my own words (and my bad spelling, joined by my ‘English’), memories of my time as a UNIX operator at a major European bank struck me. Operators do ‘simple’ tasks and deal with day-to-day operations. You do not need to be a mastermind, nor a proficient ‘C’ programmer, to do that job. Indeed, if you were such a person you would quit that job very quickly.
Dealing with application changes, configuration tweaks and the validation of both is a regular task for an operator. Oftentimes everything has already been planned beforehand by systems engineers and developers. Approvals have been granted, and the changes have passed tests in development and integration environments before going up to production, where the real deal sits. So little room is left for improvisation that the operator becomes the man feeding the machine: a monkey playing keystrokes some other, intelligent, being already laid out.
Many configuration changes are trivial and last less than five minutes. Sometimes the preparation phase takes longer than the execution itself. Only a set of regular configuration and application changes lasts longer than an hour. A few last for hours, and a select one or two stand out above all: those last so long, and span so many teams, that tasks are passed from one shift of operators to the next. Spending two hours of a hot sunny afternoon sitting at a ‘terminal’ just to make a backup copy of the configuration files speaks to the scale of the change. And you’d better be careful with those backups. You’d better compare checksums after the copy has been made. You’d better not miss a single step of the play. Otherwise you may ruin hours of work by others.
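That backup-then-verify discipline can be sketched in a few lines of shell. The file names below are purely illustrative (a real change document would name the actual paths), and I use GNU `sha256sum` here; on FreeBSD the equivalent is `sha256 -q`.

```shell
#!/bin/sh
# Illustrative paths only -- the real change plan names the actual files.
CONF=/tmp/demo_app.conf
BACKUP=/tmp/demo_app.conf.bak

# Create a sample config so this sketch is self-contained.
printf 'listen_port=8080\nlog_level=info\n' > "$CONF"

# Take the backup copy...
cp "$CONF" "$BACKUP"

# ...then compare checksums before trusting it. On FreeBSD: sha256 -q "$CONF"
orig_sum=$(sha256sum "$CONF" | awk '{print $1}')
back_sum=$(sha256sum "$BACKUP" | awk '{print $1}')

if [ "$orig_sum" = "$back_sum" ]; then
    echo "backup verified"
else
    echo "CHECKSUM MISMATCH - do not proceed" >&2
    exit 1
fi
```

The point is not the three commands; it is that the copy is never assumed good, it is proven good before the next step of the play begins.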
Hours go by and changes are applied rigorously, without a hassle. Sometimes it almost seems like a dance happening before your eyes, though nobody is really moving. Keystrokes here and there and some confirmation calls. The minutes go by at a hugely boring pace. Until something breaks, breaks in the middle of a big application change, and it is production stuff. Then, oh Lord, the bell rings. Alarms pop up and late calls are made. Coordination teams are set up and expectation levels rise. Operators are now the engineers’ fingers and eyes. Some traces here, some conversations there, a bit of panic, a bit of a call, a bit of another call, the whole lot. Some engineer suggests something, but it is not a good solution in the eyes of the others, and the operator just waits for the call to end to apply whatever he or she is commanded.
I’ve seen something I thought I’d never see in my entire life. You know, in the past I was a ski bum who made it into the ski instruction realm: I trained kids, built races, skied big mountains of powder, crud snow, icy slopes, everything. And when you are out there you sometimes see things you never expected to see, like a chair from a chairlift going around the cable, upside down, because of the high-speed winds, and the very next day that same cable completely derailed, fallen from the pylons, chairs broken, pulleys bent,… I’ve seen huge avalanches. I’ve been ‘trapped’ on top of a chairlift for a few hours, with more than twenty customers, because of a sudden and dangerous weather change. That same day people died a few miles away. Shit happens in nature, there is no way around it. So when I was working at the bank, with central heating for the winter, air conditioning for the summer, vending machines and hordes of clever and smart people, I thought I’d never see a bank stop. Until the mainframe went down. Oh boy.
All banking operations ceased on a Friday afternoon. I believe almost nobody is buying stuff at the mall at that particular time. There was nothing my team or I could do but wait and see. The operations chief appeared in the room and a small team was assembled to resolve the issues. The order was clear: shut down the whole thing, stop the mainframe. Other services had to stop too, so you can imagine what kind of situation that was. Once the engine was stopped, step by step, with very carefully thought-out commands, operations resumed little by little. A bad combination of issues had made a ‘job’ jam and stop making progress. This ‘small’ incident, combined with other events, was traced back a couple of days later as the root cause of the mainframe incident, although it was a suspect from the very beginning.
Yes, I had seen it with my own eyes. A bank stopped because its mainframe stopped. Never in my wildest dreams did I think I would see that, live, in the operations room. It was a pity I couldn’t smell the big machine, or at least know it was sitting in the next room. But yes, I had witnessed a major incident, something I believe happens every ten years or more. Something that is not supposed to happen. But it did.
Back to our operator. Back to your particular day-to-day routine. We all have to do updates. For God’s sake, Windows does this all the time. You are very familiar with this. Let’s say there is a twelve-hour intervention on a major application for a bank, and in the seventh hour something breaks. Validations of intermediate steps are giving unexpected errors. Conference calls are arranged, and people are out of the office but dealing with a serious problem. Technicians of all sorts are trying to figure out what is happening. The operator is sending the traces he is asked for and applying whatever other steps the engineers believe will work. Despite all the efforts time is running out, normal operations will resume on Monday morning, and there is not much time left. The rollback order comes in.
The operator now has to move down to the ‘darkest’ part of the document, where the rollback instructions are placed. Configuration files have to be restored, whole folders have to be replaced with other folders, and sometimes backup copies of particular file systems have to be requested and put back in place. All these pieces have to come in order, and validations are again necessary to confirm the system is as it was before. With luck, on Monday morning, after an intense weekend or a middle-of-the-week night, things will be as they were, as if nothing had happened. But it has.
I’ve seen, and you have too, very old kernels up and running. I know for a fact that big companies and governmental institutions are still running some services on Red Hat version 5 boxes. Who wants to run the risk of upgrading a huge custom application, or worse yet, upgrading the operating system underneath without touching the app substantially? I guess you know the answer. And that partially explains why a set of boxes running old UNIX or Linux versions still sits in every big corporation’s basement. With ZFS and boot environments you can safely try out OS upgrades without being too worried. You know you can safely roll back the upgrade. Your OS will run again, and your application too. There are commercial proprietary implementations of this technology, but there are open source ones too, like FreeBSD and the Illumos derivatives such as SmartOS. So yes, in conclusion, you can roll back anytime now. Ooooh yes! I know you are already doing it with hypervisors, and some other shiny red and shifty software. Is that cheating?
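To make the boot-environment idea concrete, here is a minimal sketch of the workflow using `bectl`, which ships with FreeBSD 12 (on 11.x the `sysutils/beadm` port offers the same workflow). The environment name is my own illustrative choice; these commands need root and a ZFS-on-root install, so treat this as a sketch, not a transcript.

```shell
# Snapshot the current boot environment before touching anything
# (the name "pre-upgrade" is illustrative).
bectl create pre-upgrade

# List boot environments; the Active column shows which one is
# running now (N) and which will be used on reboot (R).
bectl list

# ...perform the OS upgrade, reboot, and test the application...

# If the upgrade misbehaves, point the loader back at the old
# environment and reboot into the system exactly as it was.
bectl activate pre-upgrade
shutdown -r now
```

Because a boot environment is a ZFS clone, creating one is nearly instant and costs almost no space until the upgraded system starts diverging, which is what makes the ‘try it, roll it back’ habit cheap enough to become routine.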