Friday, March 26, 2010

How NOT to back up a UNIX system

I'm a fan of Pixar. Not a casual fan, mind you. More like a fanatic, watch every movie multiple times, plus every piece of bonus material and director commentary, read every article, follow every artist and animator on Twitter kind of fan. When I have spare time I even work on the Pixar Wikia site.

So of course I bought the new Toy Story/Toy Story 2 Blu-rays when they came out this past Tuesday (and saved $10 each since I had previous copies of the movies on DVD!). One of the great bonus features on the discs is the "Studio Stories": cute, short stories told by the artists and animators reminiscing about some of the funny moments during the production of the films.

How does this relate to the title of this post? One of the Studio Stories on the Toy Story 2 disc is called "The Movie Vanishes". Oren Jacob and Galyn Susman tell the story of how someone ran the "rm -rf *" command on the UNIX server holding the Toy Story 2 film. For you non-UNIX folks, this command removes every file and directory below the current directory, without asking for confirmation. In this case that included large parts of the still-under-construction film! Fortunately they were able to recover the data, but I can imagine how much panic there must have been!

This story reminded me of one of my early experiences as a UNIX administrator. It was around 1986 and I was at U S WEST, where they were beginning to "mechanize" the engineering department - i.e. provide PCs to all staff and introduce email, word processing and other office automation tools. We were working with Power 6/32 mini computers from CCI, using commands like "cu" for remote communication (which would occasionally crash the system) and protocols like SLIP and UUCP.

One of my tasks was to create a backup process for the mini computers. I wrote a shell script using "find" and "cpio" to copy the different partitions to their backup counterparts (like /usr to /usrback, /home to /homeback). This script was to create a "full" backup of each partition. To do this the script had to first remove everything in the backup partition. For some reason I couldn't just use newfs, so I used "rm -r" to clean the partition.

So now the script is written, I add it to cron to run overnight (it would take hours to run) and go home. I come in the next day to see the results of my effort and what do I find? A crashed system with most of its files gone! After a couple of days of investigation I discovered my error. The script was looping through a list of partitions to back up. Within the loop I first removed all the data from the backup directory, then copied the source directory to the backup. The logic was something like this:

for dir in usr home var
do
rm -r /$dirback
find /$dir -print | cpio -pvd /$dirback # Don't remember the options to cpio
done

See the mistake? Pretty obvious, but it took me a while to figure it out. While /$dir was defined (say /usr), I thought /$dirback would be converted to /usrback. What, no? The shell reads $dirback as a variable named "dirback", which was never set and so expands to nothing. Ah, so that's what "{" and "}" are used for! I thought the interpreter was smarter than that and knew I meant "/${dir}back". So instead of removing everything under /usrback, I was actually doing a recursive remove from "/"!
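
To make the difference concrete, here is a minimal sketch that only prints the paths the loop would have used, with and without braces. Nothing is removed. The variable names "wrong", "right" and "guard" are made up for this illustration, and the ${name:?message} check at the end is one possible safeguard, not something the original script had.

```shell
#!/bin/sh
# Demonstration only: paths are printed with echo, nothing is removed.
unset dirback   # make sure no variable named "dirback" exists

for dir in usr home var
do
    wrong="/$dirback"      # parsed as the variable "dirback": unset, so this is just "/"
    right="/${dir}back"    # braces end the name at "dir": /usrback, /homeback, /varback
    echo "without braces: $wrong    with braces: $right"
done

# One possible guard (an addition, not part of the original script):
# ${name:?message} fails with an error when name is unset or empty.
# The subshell keeps the failure from ending this demonstration.
if ( : "${dirback:?is not set}" ) 2>/dev/null
then
    guard="passed"
else
    guard="stopped"
fi
echo "guard on \$dirback: $guard"
```

On every pass the "without braces" path comes out as a bare "/", which is exactly what "rm -r" was handed. Written as "rm -r /${dirback:?}", the same typo would most likely have stopped the script with an error instead of starting a recursive remove from the root.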

So rather than spending the day being congratulated for my superb programming skills, I spent it reloading the OS (back then the system crashed quite often, usually requiring us to re-install, so I was fairly adept at doing it).

And do you know what the worst, most embarrassing part of it was? I just couldn't understand how my backup script could wipe out the entire system. So after spending the day reinstalling, I was dumb enough to add my script to cron and try again. I came in the next morning with the system in the same state as the previous morning. At that point I had to admit it must have something to do with my script and went researching. Nothing more to say but FAIL.
