Background
The whole topic of backing up computer data is fairly complex, and it is difficult to describe in a single linear document. For this reason I have split it into a number of separate articles that discuss different aspects of the issue.
The ArcvBack Solution
Why Did I Write This?
I needed a backup system for user data on my home LAN. Over the years I have used a number of commercial systems (both traditional file-based and imaging systems) and kept running into issues. Eventually I settled on NovaNet, a commercial system designed to back up a LAN of Windows machines. I was generally happy with it, until a couple of things happened:
At this time DVD+RW media had fallen to about $0.50 to $0.75/GB, which made it less than half the price of any blank tape, and DVD burners were in the $100-200 range, which made them about 1/10th the price of the sort of tape drive I needed. So I started to look into what it would take to write my own software to address this issue.
The first version of arcvback was pretty simple, and as I used it I thought of useful enhancements. These were added after about a year, along with some other features I thought might be useful, to make version 2. After using version 2 for another year I noticed that the way the backup media was structured could be changed to reduce the time it took to burn the final backup DVDs. Also, during the two years or so since the first version was written, the cost of IDE drives had dropped significantly and the USB 2.0 interface had made it possible to easily add and remove external drives, making this sort of drive an excellent backup medium. With these issues in mind, in the fall of 2006 I rewrote arcvback for the third time. At this point I did some code simplification and dropped a number of features that had sounded like good ideas earlier, but which over time I had not found to be that useful. The end result was less code and fewer features, but a program that did a better job of its main task.
After another 2 years (in the summer of 2008) I revised arcvback to fix a number of bugs, update it to Python 2.5, and replace the file-based version database with an object database using the Zope Object Database (ZODB).
Intended Application
ArcvBack is primarily intended to back up user data files that are distributed over a set of directories on multiple computers on a LAN. It is not intended to back up operating system files, installed applications, or such things as the Windows Registry or boot blocks.
ArcvBack uses a backup cache (usually on one or more local drives) to save the first copy of the backup data.
This allows the backups to happen at maximum speed, with minimum operator intervention (especially if you install it as a Windows Service). ArcvBack also employs an online backup version database so that the user can locate older versions, which makes restoration easier. In most configurations the backup cache will be as large as the backup media set, so that if a restore is needed quickly it can be done from the data in the cache without further bother. I envision two common ways in which ArcvBack might be deployed, which are discussed in the following sections.
For a small system
For a system with smaller amounts of data (say less than 50GB or so) ArcvBack would be used with a DVD writer, and the user would have three sets of read/write DVDs. He would select the oldest set, erase it, and then burn a full backup to it, followed by a number of days of incremental backups until he runs out of blank disks in the set. Then he would save the version database along with the set that just filled up, put it aside (perhaps taking it off site for storage) and repeat the whole process again. Because he has three sets of media he will have one in use and two others (perhaps containing the previous month or two of backups) available for file restore purposes. This gives him a certain degree of redundancy, especially with data that is not frequently revised (like the family photos). If there is a problem with the media in the most recent set for a file he needs to restore, there is a good chance that the same file is available in one of the other two sets.
Assuming DVD+RW media is used, you need to protect about 50GB of data, you are modifying about 10% of that per month, and you want the media for a set to last about a month, the cost of doing this is $80.00 total (including the drive).
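The media arithmetic for this small system can be sketched in a few lines of Python. This is only an illustration and not part of ArcvBack: the function name `plan_media_set` is hypothetical, the 4.7GB disc capacity is an assumed nominal figure for single-layer DVD+RW media, and a 30 day cycle is assumed for "about a month".

```python
import math

# Assumption: nominal capacity of a single-layer DVD+RW disc, in GB.
DVD_CAPACITY_GB = 4.7


def plan_media_set(full_gb, changed_gb_per_cycle, cycle_days,
                   disc_gb=DVD_CAPACITY_GB):
    """Estimate the size of one media set (hypothetical helper).

    full_gb              -- size of the initial full backup
    changed_gb_per_cycle -- data modified over the whole cycle
    cycle_days           -- how long one media set should last
    Returns (total GB per set, discs per set, days between discs).
    """
    total_gb = full_gb + changed_gb_per_cycle
    discs = int(math.ceil(total_gb / disc_gb))
    days_per_disc = cycle_days / float(discs)
    return total_gb, discs, days_per_disc


# Small system from the text: 50GB of data, 10% (5GB) modified per month.
total_gb, discs, days_per_disc = plan_media_set(50.0, 5.0, 30)
print("data per set: %.0f GB" % total_gb)
print("discs per set: %d" % discs)
print("one disc every %.1f days" % days_per_disc)
```

Under these assumptions the set holds 55GB on 12 discs, or one disc about every two and a half days. Changing `disc_gb` lets you try the same estimate for a different kind of media.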
If you don't want to bother with the time taken to burn on average about one DVD every 2 days, you could implement the same thing with three USB attached hard drives of at least 55GB capacity each. Of course, you can't buy drives this small today in the more cost effective 3.5 inch size (though you could use the smaller, slower and more expensive laptop 2.5 inch drives for this), so let's say you use the common 500GB size, which can typically be purchased for about $100 each (including the USB case). You are still only looking at about a $300 total investment. You can't even buy the least expensive tape drive for that, let alone the 10 or more tapes you would need to have a decent rotation frequency.
For a larger system
Consider a system (a home or small office LAN) where there is a total of 200GB to be backed up. Assume a small fraction of this data changes on a daily basis, about 10GB per day. This data may be distributed across a number of hard disks and across a number of computers. Most of these computers are running some version of Windows, but there might be data that resides on Linux file servers and is accessible through SMB (Samba) drive shares.
This is probably about the limit of what one might want to back up using DVDs. If you assume a one month duration per media set, then you would have the 200GB full backup plus about 30 daily backups of 10GB each, for about 500GB total. That's about 110 DVDs, or an average of about 4 DVDs burned per day. The cost of the drive plus three sets of RW media would be about $300, which is still quite reasonable. If instead you use 500GB USB attached hard drives (three of them, one per "set") you are looking at about $300 total, as you can pick up drives of this sort for about $100 including the case. You could even pick up external 1TB drives for about $140 each and not have to worry about backup space for even longer. If you are worried about the reliability of hard drives consider the following points:
If you have a much larger set of static data the ArcvBack solution will still work, but you will probably want to extend the duration of each backup cycle so that the load of the full backup is distributed over a longer period of time. You could also configure more than one machine to act as a backup server, thus sharing the load across a number of machines. You might want to do this anyway to make use of free drive space that happens to be available. Note that if you have all the backup data on your cache drive, restores are easily done and are not greatly affected by the number of incremental passes, so having longer runs of incrementals is not a particularly bad thing. This is the main reason why ArcvBack provides support for spreading the cache across several drives if needed.
What I Do
I use ArcvBack to back up key data on three PCs, one running Linux (configured as a file server) and the other two running Windows XP Pro. In total there is about 130GB of data that is backed up, so the initial pass takes about 5 hours to run (an average rate of about 25GB per hour on my gigabit LAN; when my LAN was only 100Mbit/s this took about twice as long, at about 13GB per hour). Once the initial pass is complete, the daily incremental backups take about 10-20 minutes to search out any updated or new files on these machines and to record the new contents to the backup disk. Typically the daily backup process requires about 700MB of additional storage. Over a 6 month period the total backup media ended up requiring 251GB (so about 130GB of initial files and then about 121GB of changed files, or about 670MB per day of revised data). The database for this was about 105MB and compressed to 35MB. In my case much of the revised data is due to my large email folders (inbox, trash, etc.) that get updated on a daily basis.
I have configured one of the Windows XP workstations as the backup server. It has a RAID-5 array with enough free space to hold a full 6 months or more of backups (i.e. the initial full backup and all the incrementals). I have ArcvBack configured as a Windows Service to run one backup pass a day, first thing in the morning. The backup saves all its data and the backup database to the RAID-5 array. About once a week I connect an external USB drive (300GB or so) to the system and run the arcvpkgcopy.py utility to copy all the new backup media files and the current database to the external drive. In this way I have enough redundancy that I can tolerate losing 2 hard drives at once without losing any data (remember that most of the time the data is available in 3 places, unless it is less than a week old, in which case it is in just two places, but one of them is RAID-5 redundant, so I would still have to lose 2 drives on the RAID at once to lose data). If data is less than a day old it is at some risk, as it only exists on one drive. Periodically I take the external drive off site and swap it for a second unit so that I have protection against fire, theft and flood. Since the off site unit is still in the same city I'm not fully protected against a nuclear strike or anything larger than a very small asteroid impact.
With this setup (since I have all the incrementals for all the files for the last few months online) I can use, and have used, the arcvrestore programs to recover lost files and earlier versions of existing files. This works quite well, and is quite quick to do, since all of the database and the necessary media files are online all the time in the RAID array. About once every 6 months or so it's time to consider throwing out the current backup and starting a new one. At this point you might consider keeping a copy of the old backup around on a second USB drive in case you later need to get to an earlier version of something.
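The throughput and daily change figures quoted in this section can be cross-checked with a short Python sketch. It is only an illustration and not part of ArcvBack: the helper names are hypothetical, "6 months" is assumed to mean 180 days, and 1GB is assumed to be 1000MB.

```python
# Assumptions for this sketch (not taken from ArcvBack itself):
MB_PER_GB = 1000.0          # decimal units
DAYS_IN_SIX_MONTHS = 180    # "a 6 month period" taken as 180 days


def throughput_gb_per_hour(data_gb, hours):
    """Average backup rate for a pass that moved data_gb in the given hours."""
    return data_gb / hours


def average_daily_change_mb(total_media_gb, initial_gb, days):
    """Average amount of revised data per day, in MB."""
    changed_gb = total_media_gb - initial_gb
    return changed_gb * MB_PER_GB / days


# Figures from the text: 130GB initial pass in about 5 hours,
# 251GB of backup media after about 6 months.
initial_rate = throughput_gb_per_hour(130.0, 5.0)
daily_change = average_daily_change_mb(251.0, 130.0, DAYS_IN_SIX_MONTHS)

print("initial pass: %.0f GB/hour" % initial_rate)
print("revised data: %.0f MB/day" % daily_change)
```

With these assumptions the sketch gives 26GB per hour and about 672MB per day, close to the rounded figures of 25GB per hour and 670MB per day quoted above.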
Downloading
The latest version of ArcvBack can be downloaded here: arcvback2009-10-18.zip (version 4.4).
If you need the older version 3 you can download it here: arcvback2007-01-03.zip. The documentation for it is in this arcvback3-docs.zip zip file. Release notes are here.
Installation
Note that as the version database storage system changed completely from version 3 to 4, you will need to start a fresh backup cycle. Version 4 commands cannot be used to work with a version 3 backup, and version 3 commands will not work with a version 4 database either.
It is recommended that you install ArcvBack by just unzipping the download archive into a directory called C:\programs\arcvback4, as this will minimize the installation effort. See the section on the config.ini file for more details. Before you attempt to use any ArcvBack utilities you will also need to install Python (any version from 2.5 and up should work), and if you are running on a Windows machine, the Win32 extensions to Python as well. To install these you should download their installers and then just follow the installation instructions. I have written this using Python 2.5, and it should work with newer versions of Python too. It might still work with an older Python, but I seem to recall there was one place where I used something that was new in Python 2.5. You can find the appropriate install packages here:
I typically install my Python to C:\programs\python25. If you install it to a different directory you may need to change one or two things. You may need to add the C:\programs\python25 directory to your Windows PATH environment variable.
When running ArcvBack on Windows you can choose to run it as a Windows Service, which will allow it to run automatically once the machine is booted, even when no one is logged into the computer. To allow it to work in this way you must have the Python Win32 Extensions installed, and you then need to use the following additional installation instructions:
Install Python as a Windows service (this only needs to be done once on a machine, so if you have other Python services running you don't need to do this again). I have my Python installed in c:\Programs\Python25, so the commands to execute (in a CMD console window) are:
cd C:\Programs\Python25\lib\site-packages\win32
Before you start the service you will want to configure it to start automatically when the machine boots up. To do this, right-click on the ArcvBack Backup Server title in the control panel and select "Properties" from the popup menu. The property sheet should be configured like:
If you need to remove the arcvback service you can do:
Configuration
Most of the configuration settings for ArcvBack are stored in a file called config.ini. The various settings are explained here, along with what you need to change if you have installed ArcvBack to some directory other than c:\programs\arcvback4. A sample config.ini file is included in the distribution zip file, and you can also read the comments in it.
Command summaries:
arcvback.py is the main backup utility. You use this for manual backups (or include it in a script file). It shares the same version database as the Windows ArcvBack Service, and as ZODB locks the database while it is in use you will probably have to stop the service to release the database lock before running this.
arcvdbrebuild.py is an emergency repair utility. In the event the version database is lost or corrupted and all the backups of it are also missing, you can run this utility against a set of backup package files to build a new version database so that you can select and restore files as normal.
arcvlist.py is a command line utility to list the contents of the version database. This should be considered a last resort, as arcvrestoregui is a lot easier to use and is more powerful.
arcvpkgcopy.py is a command line utility that copies (with verify) package files from the cache directories to another device (typically an external USB drive) so that you can have extra redundancy (and even take that drive off-site for better protection against data loss).
arcvpkglist.py is used to list the contents of package files. Normally this is not required (arcvlist.py and arcvrestoregui.py are better). Note that the arcvpkglist program does have a verification function, and with it you can check a single package or all the packages in a directory for integrity.
arcvpkgrestore.py is an emergency command used to extract the contents of a package file. Normally this should never be needed.
arcvreset.py is a utility command to manipulate the next package and event IDs that the backup will use. Normally this should not be needed. It is primarily used for testing a few special cases.
arcvreschedule.py is a utility used when setting up a schedule of backup events for the service.py program.
arcvrestore.py is the file restore command line utility. It is probably best to just use the arcvrestoregui program instead.
arcvrestoregui.py is the file restore utility with a graphical user interface. This is the preferred way of selecting files or directories to be restored. You can restore single files, previous versions of single files, the latest version of a directory tree, just the files that were saved in one backup pass in a directory tree, or the most recent version of all files in a directory tree up to and including a particular backup pass. It also verifies all restored data with SHA1 hashes to check the integrity of the restore.
arcvrisk.py is a utility to report on files that may be at risk. It will search the database and list any files that:
service.py is the Windows ArcvBack Service application (largely described in the Installation section above).
treecopy.py is a command line utility for copying the contents of a directory tree. I wrote this largely for testing the arcvback backup/restore processing to see that it was all correct. However, from time to time I find this utility useful for migrating data from one drive to another. This utility does SHA1 hashes on the data to check that the destination files were correctly written and that the source data was read the same way twice. This way, if one of the drives (or controllers) is misbehaving there is a chance that you will notice it. If your computer has faulty memory there is also a chance this utility will notice it.
treediff.py is a command line utility for comparing the contents of two directory trees. I wrote this largely for testing the arcvback backup/restore processing to see that it was all correct. However, from time to time I find this utility useful for determining what has changed between two directory trees.
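The kind of hash-checked copy described for treecopy.py can be sketched as follows. This is not the code from treecopy.py, only a minimal illustration of the idea written for Python 3: the function names are hypothetical, and hashing the source before and after the copy is one assumed way of checking that it "was read the same way twice".

```python
import hashlib
import os
import shutil


def sha1_of(path, chunk_size=1 << 20):
    """Return the SHA1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(src, dst):
    """Copy one file and check it with SHA1 (hypothetical helper).

    The source is hashed before and after the copy, so a drive,
    controller or memory fault that changes what is read may show up
    as a mismatch. The destination is hashed after it is written.
    """
    first_read = sha1_of(src)
    shutil.copyfile(src, dst)
    second_read = sha1_of(src)
    written = sha1_of(dst)
    if first_read != second_read:
        raise IOError("source read differently twice: %s" % src)
    if written != first_read:
        raise IOError("destination does not match source: %s" % dst)
    return first_read


def copy_tree_verified(src_root, dst_root):
    """Copy a directory tree, verifying every file; return digests by relative path."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        target_dir = dst_root if rel_dir == "." else os.path.join(dst_root, rel_dir)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            digests[os.path.relpath(src, src_root)] = copy_verified(src, dst)
    return digests
```

Calling `copy_tree_verified` with a source and a destination directory raises an `IOError` on the first mismatch, which is a deliberately simple policy for a sketch. A real tool would more likely log the failure and continue with the remaining files.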
Licensing
This is free for non-commercial use. If you have a web site or a blog I would appreciate a link to this page, so that I can see how much interest there is in this package.
If you want to use it for commercial purposes you are free to evaluate it for 100 days, then contact me for licensing if you find it suitable.