Block Based Backup

Copyright 2008 by Stephen Vermeulen
Last updated: 2008 Oct 12


This is part of a series of articles on backing up computers. The top page is Design for an Archiving Backup System.

The paper Venti: a new approach to archival storage (saved here as venti.html) got me thinking that there might be better ways to manage the computer file backup problem. As I see it, most backup approaches fall into one of the following areas:
  1. none, or ad hoc saving of some critical things to archive media such as CD-R/DVD-R
  2. image based backup, usually of a computer's boot drive or partition to allow rapid recovery from a dead drive
  3. traditional file based systems, with some combination of periodic full backups, perhaps interlaced with incremental or differential backups
For the home user, who may only have a single machine, the first two approaches appear to be moderately acceptable and cost effective using a CD-R or DVD-R type drive, but the third approach really needs a tape drive and may prove to be rather expensive. Once you need to back up more than one or two machines on a regular basis, CD-R or DVD-R type drives become too cumbersome to use (especially the CD-R with its smaller capacity), so larger capacity devices (like tape drives or removable disks) become attractive.

Much the same can be said for the small business environment, except now the financial costs of performing the backups on a regular basis need to be weighed against the cost of lost data (for example due to a disk failure).

For the home user the financial costs of lost data are hard to evaluate; some even look on a disk failure as an opportunity to upgrade a system. However, with the advent of widespread digital photography the need to reliably back up photos is rising, and the difficulty of doing a good job of this is also rising because of the volume of photos that are taken.

Block Oriented Storage

As an alternative to storing data on a file-by-file basis, a backup system could break up the files into blocks (say 8K bytes each) and work with these instead. This would allow a further reduction of redundancy when only part of a file has changed, or when different but largely similar versions of the same file are scattered across a network. It might also make retrieval of data from backup media more rapid, especially with tape devices, which can often seek forward to a particular block quite quickly.
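To make the idea concrete, here is a minimal Python sketch of a block-oriented store. It is only an illustration under stated assumptions, not a description of any existing backup tool: the `BlockStore` class and its method names are hypothetical, blocks are held in memory rather than on backup media, and MD5 is assumed as the block digest. Each file is reduced to a "recipe" (an ordered list of block digests), and a block is kept only the first time its digest is seen.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # 8K-byte blocks, the example size used in the text


def split_blocks(data, block_size=BLOCK_SIZE):
    """Yield consecutive fixed-size blocks; the last one may be shorter."""
    for offset in range(0, len(data), block_size):
        yield data[offset:offset + block_size]


class BlockStore:
    """Hypothetical in-memory store keeping one copy of each distinct block."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes

    def put_file(self, data):
        """Store a file's blocks and return its recipe (ordered digest list)."""
        recipe = []
        for block in split_blocks(data):
            digest = hashlib.md5(block).hexdigest()
            self.blocks.setdefault(digest, block)  # keep only the first copy
            recipe.append(digest)
        return recipe

    def get_file(self, recipe):
        """Rebuild a file by concatenating the blocks named in its recipe."""
        return b"".join(self.blocks[digest] for digest in recipe)
```

If two versions of a file differ in only one block, storing the second version adds just that one block; the rest of its recipe points at blocks that are already held. Note that fixed-size blocks only line up this way when a change does not shift the data that follows it.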

Use of block oriented storage might also allow for easier implementation of a caching mechanism within the backup system.

The disadvantage of the block based approach is that the file database gets larger, since additional data is required to track the blocks. The costs of this are examined later, but with a block size in the 256k to 1M range (rather than the 8K example above) the extra overhead is not prohibitive.
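The way this overhead scales with block size can be estimated with a little arithmetic. The sketch below is a back-of-envelope estimate only: the function name `index_overhead` and the figure of 32 bytes per block entry (a 16-byte MD5 digest plus an assumed 16 bytes of location data) are my assumptions for illustration, and it ignores both the database's own overhead and the fact that every file needs at least one block however small it is.

```python
def index_overhead(total_bytes, block_size, entry_bytes=32):
    """Return (block_count, index_bytes) for data split into fixed-size blocks.

    entry_bytes is an assumed per-block record size, not a measured one.
    """
    block_count = -(-total_bytes // block_size)  # ceiling division
    return block_count, block_count * entry_bytes


if __name__ == "__main__":
    total = 100 * 2**30  # a hypothetical 100 GiB of data
    for size in (8 * 1024, 256 * 1024, 1024 * 1024):
        blocks, index_bytes = index_overhead(total, size)
        print(size, blocks, index_bytes)
```

Under these assumptions, 100 GiB of data needs 13,107,200 entries (about 400 MiB of index) at 8K blocks, but only 409,600 entries (12.5 MiB) at 256k blocks and 102,400 entries at 1M blocks.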

On the same network I looked at how much redundancy there actually was. In this case I used the 256k byte chunk size, and out of a total of 1024525 blocks (this is slightly less than the 1035201 quoted above because there were files that could not be opened for various reasons), 914029 blocks were unique (had unique MD5 digests). This left 110496 blocks which were duplicates (usually between machines, but sometimes on the same machine due to copies of directories being made). The savings that could be made if the duplicate blocks were not re-saved came to 6,655,101,923 bytes, which is only 3.4% of the network total. If one drops the two drives #8 and #15 from the network total (as they contain archived photos, MP3 and video), this rises to 18.8%. With the savings only at this sort of level it does not seem worth doing, which is maybe why I have been unable to find any software that does this already.
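A survey of this kind can be sketched in a few lines of Python. This is a reconstruction of the general method, not the script that produced the figures above: the function name `survey` and its return values are hypothetical. It reads each file in 256k-byte chunks, takes the MD5 digest of each chunk, counts a chunk as a duplicate if its digest has been seen before, and skips files that cannot be opened.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # the 256k-byte chunk size used in the survey


def survey(paths, chunk_size=CHUNK_SIZE):
    """Return (total_chunks, unique_chunks, saved_bytes) for the given files.

    saved_bytes is the size of all chunks whose digest was already seen,
    i.e. the data that would not need to be saved again.
    """
    seen = set()
    total = 0
    saved_bytes = 0
    for path in paths:
        try:
            handle = open(path, "rb")
        except OSError:
            continue  # files that cannot be opened are left out of the count
        with handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                total += 1
                digest = hashlib.md5(chunk).digest()
                if digest in seen:
                    saved_bytes += len(chunk)
                else:
                    seen.add(digest)
    return total, len(seen), saved_bytes
```

The number of duplicate chunks is then the total minus the unique count, which is how the 110496 above follows from 1024525 and 914029. The last chunk of each file is usually shorter than 256k, so the bytes saved cannot be obtained by simply multiplying the duplicate count by the chunk size.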
