Future Ideas in Backup

Copyright 2008 by Stephen Vermeulen
Last updated: 2008 Oct 12


Introduction

This is part of a series of articles on backing up computers. The top page is Design for an Archiving Backup System.

Further Research

  • Is there a Python API for writing directly to CD-R or DVD-R burners?
  • Test out using the win32 file API for writing/reading tapes on NT
  • Test using win32api to set/clear the archive bit
  • Estimate how many chunks will end up in active use in the system (i.e. including the effect of having to keep old chunks around for the archive period so that deleted files can still be restored).

Experiment 1

Write a program to tally the MD5 digests across a set of machines via the administrative shares. Examine the results to determine:
  1. number of files per machine
  2. number of chunks per machine
  3. number of bytes per machine
  4. number of places the same hash digest is found (get an idea of how much redundancy there is across the network)
The program works in two passes: the first gathers per-file and per-chunk data for all files, and the second examines that data to build the overall statistics. The first pass writes two primary output files; it also reports any access errors to fileerrors.txt.
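As a sketch of the second pass, the tallying of digests (item 4 above) could look like the following. The function name and the sample digests are hypothetical; the input lines use the fileNumber blockNumber BID format of the chunk hash file.

```python
from collections import Counter

def tally_digests(lines):
    """Count how many (file, block) places each chunk digest (BID) occurs.

    Each input line has the form: fileNumber blockNumber BID
    Returns a Counter mapping BID -> occurrence count.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed lines
        counts[parts[2]] += 1
    return counts

# Example: three chunks, one digest seen in two places
sample = [
    "0 0 d41d8cd98f00b204e9800998ecf8427e",
    "0 1 9e107d9d372bb6826bd81d3542a419d6",
    "1 0 d41d8cd98f00b204e9800998ecf8427e",
]
counts = tally_digests(sample)
duplicated = sum(1 for c in counts.values() if c > 1)
print(duplicated)  # number of digests found in more than one place
```

A digest count greater than one means the same chunk exists in more than one place, which is exactly the redundancy measurement the experiment is after.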

filehash.txt

This contains the chunk hash data, each line is of the format:

fileNumber blockNumber BID

filename.txt

This contains the file object data, each line is of the format:

driveNumber fileNumber fileSize fileUNCName
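A minimal sketch of the first pass, producing both output formats above, might look like this. The chunk size and function name are assumptions; the real program would scan UNC paths over the administrative shares rather than local paths.

```python
import hashlib
import os

CHUNK_SIZE = 64 * 1024  # assumed chunk size; the real value may differ

def scan_file(path, drive_number, file_number, hash_out, name_out):
    """Write one filename.txt line for the file and one filehash.txt
    line per chunk, using the formats described above."""
    size = os.path.getsize(path)
    # filename.txt line: driveNumber fileNumber fileSize fileUNCName
    name_out.write("%d %d %d %s\n" % (drive_number, file_number, size, path))
    with open(path, "rb") as f:
        block_number = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            bid = hashlib.md5(chunk).hexdigest()
            # filehash.txt line: fileNumber blockNumber BID
            hash_out.write("%d %d %s\n" % (file_number, block_number, bid))
            block_number += 1
```

The caller would wrap each scan_file call in a try/except and log any access failures to fileerrors.txt.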


Some Estimation Statistics

On FLARE (a Windows 2K box) there are about 1591MB in 18364 system files and installed applications, for about 86K/file. Plus there is 5480MB of user data in 3934 files, for about 1400K/file.

On GALAXY (a Windows XP box) there is about 3867MB in 36650 files (system and apps), for about 105K/file.

On NOVA (a Windows NT 4.0 box) there is about 2476MB in 24943 files (system and apps), for about 100K/file, but on the data drives there is 5141MB in 47909 files for 107K/file, and 138170MB in 84778 files for about 1630K/file on the network media drive (largely audio and digital photographs, but some video).

So figure on 50-100K files per machine in a network, with each file typically about 100K in size, giving about 5-10GB per machine. For a file server holding a central repository of media files the number of files probably will not be vastly larger, but the average file size may be much larger (about 16 times in the above case).
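The per-file averages above follow directly from the totals; a quick calculation (assuming decimal units, 1MB = 1000K) confirms them:

```python
def avg_kb(total_mb, file_count):
    """Average file size in K, from a total in MB and a file count."""
    return total_mb * 1000.0 / file_count

# FLARE system files: about 86-87K per file
print(avg_kb(1591, 18364))
# NOVA network media drive: about 1630K per file
print(avg_kb(138170, 84778))
```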
