
Compression and Backup

In transferring data from synchrotrons, compressing files is essential to reduce disk space usage and speed up transfers. On the home disks, much space can be saved if you compress some of your larger or infrequently used files. (Deleting temporary files, etc., will also help.)

Compressing Files

With compress, compressed files have a .Z extension:
compress *img         - compress all files matching the pattern
uncompress *img.Z     - uncompress all files with a .Z extension
With gzip, compressed files have a .gz extension:
gzip *img             - gzip all files matching the pattern
gunzip *img.gz        - gunzip all those files
With bzip2, compressed files have a .bz2 extension:
bzip2 *img            - bzip2 all files matching the pattern
bunzip2 *img.bz2      - bunzip2 all those files

Finding large or old files

If you're running out of space on your home disk, a sensible approach is to compress files that you don't use often. The biggest gain is made when applying this to large files. In order to find those files, use the find command:
find ~ -size +2000 -print
which prints (-print) the names of all files from your home directory (~) on down that are larger (+) than 2000 blocks (512-byte blocks by default). If you omit the + it only prints files that are exactly 2000 blocks, which is of less use. The command:
find ~ -mtime +180 -print
prints the names of all files that were modified at least 180 days ago. You can combine these features:
find ~ -size +3000 -mtime +365 -print
which prints only large (>3000 blocks), old (> 1 year) files. To compress these files using gzip, use:
find ~ -size +3000 -type f -mtime +365 -print -exec gzip '{}' \;
gzip will whine if some of the selected files are already gzipped, so you can use more elaborate find syntax (-not -name '*.gz') to skip them, as sketched below, or just ignore the error messages. I include "-type f" in the above command to stop find handing links and directories to gzip - gzip won't compress them anyway, but excluding them reduces the potential error messages. Consult the manual page for find for more options and features - it's really a very powerful program.
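Here's a sketch of the more selective version, assuming GNU find (which understands -not; on older finds you may have to use ! instead):
find ~ -size +3000 -type f -mtime +365 -not -name '*.gz' -print -exec gzip '{}' \;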

Which compression program?

The old Unix compression program compress is reasonably efficient but does no error checking. The newer Free Software Foundation program gzip is more efficient, does error checking and is widely available for all flavors of Unix. An even newer program, bzip2, is slower than gzip but compresses even more tightly. It's available on Linux and OS X, and I have installed it on our SGIs. I most frequently use gzip, but where disk space is at a premium bzip2 is probably better unless you are short on CPU.

Here's a comparison of compression efficiencies and times on nine data frames from X9A. This was run under Linux on a local SCSI disk, with timings averaged over three runs and default program options:
Program     Time Taken   Final Size   Fraction of original size
compress    13.2 sec     5010353      0.597
gzip        14.7 sec     4458008      0.531
bzip2       33.8 sec     2957671      0.352

Actual compressed sizes depend on the strength of the data. However, note that bzip2 compresses these files much more tightly, at the cost of a whole lot of CPU time. Decompression times for the same frames (less of an issue):

Program      Time Taken
uncompress   5.2 sec
gunzip       3.3 sec
bunzip2      20.6 sec

Note that much greater compression efficiencies can often be achieved for ASCII files (e.g. PDB files). Gzip and bzip2 are always preferable to compress because they error-check the archive.

Testing archive integrity

If you are feeling sufficiently paranoid, or you suspect that a data transfer may not have been correct, you can test the integrity of compressed frames with some variant of the following, which takes all .gz files in the current directory and reads them, discarding the output. Error messages (e.g. from corrupted files) are echoed to the terminal.
find . -name '*gz' -print -exec zcat '{}' \; > /dev/null
zcat should be replaced by gzcat on the SGI.
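Alternatively, gzip and bzip2 both have a built-in integrity test via their -t flag, so a sketch along these lines (assuming reasonably modern versions of both programs) does the same job without piping any output around:
find . -name '*.gz' -print -exec gzip -t '{}' \;
find . -name '*.bz2' -print -exec bzip2 -t '{}' \;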

Note that recent experience with older machines is that they can sometimes silently fail when writing files that they should otherwise have no problem with. It's not clear how this happens, but it certainly happened at APS beamline 8BM, and may also have happened with xtreme4 locally.
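If you're copying data between machines and want to be sure the copies are intact, one way to catch that sort of silent failure is to compare checksums of the original and the copy - a sketch assuming md5sum is available (it is on Linux; the directory names here are just placeholders):
(cd /original/directory; md5sum *.img) > original.md5
(cd /copied/directory; md5sum *.img) > copy.md5
diff original.md5 copy.md5
Any frame that diff complains about did not copy correctly.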

Backup Options

We have a variety of drives within the floor: a DDS-1 tape drive (SGI - currently Xtreme4), a DDS-4 tape drive in the Nikolov lab, an Exabyte 8500c tape drive (Ximpact1 - now unplugged), CD-RW (most Macs) and DVD-R (Facility G4, some Linux workstations, some Macs). The advantage of tape is the generally larger capacity. The advantage of optical media is that they are random access - you can read the last file off a CD-R quickly, whereas you have to read through the entire tape to get to the last file. Tapes are inherently re-usable; CD-R and DVD-R are not, and even their -RW versions have limited overwrite counts.
Media                     Capacity   Notes
DDS-1 4mm DAT tape        2.0 Gb     Obsolete - drives no longer made
DDS-4 4mm DAT tape        20.0 Gb    Drives cannot read DDS-1 ?
Exabyte 8500c 8mm tape    8.0 Gb     Obsolete - drives no longer made
CD-R, CD-RW               700 Mb     -RW allows rewrites
DVD-R, DVD-RW             4.3 Gb     -RW allows rewrites
DVD+R, DVD+RW             4.3 Gb     Different format to -R, -RW

Note that -RW media are not as long-lived as -R media, and it's become pretty obvious that -RW media may only last several years, not several decades. In short they're no better than magnetic tapes for longevity unless you specifically buy archival media.

I've basically drifted away from tapes and toward DVD+R and DVD-R as the media of choice for regular backups of datasets. I may add a DDS-4 drive in 2004, because I probably need something to transfer all the old DDS-1 tapes to for long-term storage, and DDS-4 is still a nominal standard. However, since development of DDS as a storage medium has halted, there are questions about its longevity. The 8500c format is no longer developed either, so along with the DDS-1 drive these are basically legacy support issues.

I usually use the program tar for Unix archiving on tapes. The program bru has similar (and sometimes more sophisticated) options, but is sometimes less widely available (e.g. no bru on OS X). The basic format is:

tar -cvf tape_drive file1 [file2 ....]
where this is interpreted as: create (c) a tar archive on the tape drive (f tape_drive) containing the files and directories listed on the remainder of the command line. The -v flag makes it verbose - i.e. it lists the files as it archives them. On the SGIs the name of the default tape drive can be omitted. The 8500c tape drive is called /dev/rmt/tps1d4.8500c:
tar -cvf /dev/rmt/tps1d4.8500c file1 [file2 ....]
Do not attempt to put multiple archives on one tape unless you know exactly what you are doing. The default Unix tape devices rewind the tape after each command, so successive tar commands simply overwrite one another and only the most recent archive ends up on the tape.
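If you want to double-check what actually ended up on the tape, tar's -t flag lists an archive rather than creating one - sketched here with the 8500c device name from above (again, on the SGIs the default drive name can be omitted):
tar -tvf /dev/rmt/tps1d4.8500c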

DVD-R and CD-R archives

In most cases under Linux the optical drives are only writable by root, i.e. you will have to get Phil to do such archives. With a CD-R drive (or a DVD-R drive with CD-R media) you use mkisofs and cdrecord:
/usr/bin/mkisofs -J -R /some/FILES | /usr/bin/cdrecord -v -speed=4 -dev=2,0,0 -
With a DVD-R drive and DVD-R or DVD+R media on Linux you use:
~xtal/dvd+rw-tools-5.14.4.7.4/growisofs -Z /dev/scd1 -J -R /some/FILES
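As an aside, the -dev=2,0,0 triple in the cdrecord example is the SCSI bus,target,lun of the burner, and the /dev/scd1 device name is specific to that particular machine - both are likely to differ on other boxes. If in doubt, cdrecord can list the candidate devices:
/usr/bin/cdrecord -scanbus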
There's a more thorough description of all this in my Linux Setup notes.

Macs are easier - with supported drives you just drag files onto the CD icon, then select burn. With unsupported drives you must use some sort of 3rd party software like Toast. Toast is pretty much a drag-and-drop program as well, so it's all fairly intuitive.

Unix Command Line Methods for Moving Files

Here, when I talk about "source" and "destination", the destination machine and directory are the ones you want to back the data up TO, and the source machine and directory are the ones you want to copy the data FROM.

Increasingly, we eschew tapes and optical media for backups and write the data straight to USB or FireWire hard drives. Synchrotrons differ in how these drives get attached, but it's fairly obvious that attaching the drive to the machine that holds the data is an advantage, since anything that does a lot of network I/O is going to be slower.

If the source disk is readable by the machine holding the destination disk, then you can use simple Unix commands. "cp -R" (recursive copy) is the obvious one. Read the man page for cp, but it's pretty darn obvious how to use it (cp -R /source/directory/* /destination/directory/.)
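If you want the copies to keep their original timestamps, ownership and permissions (often desirable for backups), add the -p flag - a sketch, with placeholder paths:
cp -Rp /source/directory/* /destination/directory/.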

If the source disk is on a remote machine then you have a couple of options. If the machine is on a very remote site then you mostly have to use ftp:

cd /the/destination/directory

ftp the.source.machine.org
(supply username and password)
cd /the/source/directory
bin
prompt
mget *.img

cd /another/source/directory
lcd /another/destination/directory
mget *img
quit
FTP is simple - the mget command grabs files from the current directory on the machine you've FTP'd to, and stuffs them into the current directory of the machine you've FTP'd from. bin just ensures that files transfer in binary mode (important), and prompt turns off interactive prompting so that mget doesn't ask you about every file. You can move around the remote (source) and local (destination) directory structures using cd and lcd respectively. The secure version of FTP is SFTP, which has basically the same syntax but a slightly different method of logging in:
sftp username@machine
Some machines no longer support ftp but will allow sftp connections.

The program scp is an alternative to sftp. Basically scp works like a secure version of rcp and allows you to do copies between remote machines. Syntax is as follows:

scp -r 'username@source.machine:/the/source/directory/*' /the/destination/directory/.
Some elaboration is required. The -r flag is the "recursive" flag, just like -R in "cp -R", so it copies the specified files including subdirectories. The username@machine syntax is fairly obvious, and the source files are specified as a single path after the colon - it's not possible to specify multiple directories at once with this syntax. The source location is enclosed in single quotes in the above example because otherwise the local shell will interpret the *. You don't want that: you want scp to interpret the * so it knows which files/directories to get. You can use *img if you just want the image files in that directory.

Scp is reasonably efficient, shows you what it is doing, and is flexible enough that I use it as the default method for my file transfers at synchrotrons (e.g. if I'm transferring processing files from the processing machine to my laptop). It is rather less efficient if you have lots of files. Also, "scp -r" is potentially dangerous since scp follows symbolic links rather than simply copying the link itself (as tar and ftp do). This means that if you use "ln -s" to make links to other directories, scp will copy the contents of the link destinations as well, which can massively expand the amount copied !!
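The same syntax works in the other direction if you want to push files from the local machine to a remote one - a sketch with made-up names:
scp -r /the/source/directory username@destination.machine:/the/destination/directory/.
No quotes are needed in this case because there's no wildcard for the shell to expand.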

Rsync is also a means to copy files, in that it synchronizes a copy of the source directory onto the destination directory. The upside is that if the source directory contents change, the next rsync command updates the copy. The downside is that if you run rsync automatically with the --delete flag, files that you deleted by accident in the source directory also get deleted from the destination directory...

Nevertheless, with that caveat, rsync is often quite efficient for large directory collections:

rsync -azv /the/source/directory/  /the/destination/directory/.
will do the trick. Anything new that appears in source will get copied to destination, with the downside that if you uncompress files already in source, the compressed and uncompressed versions are both likely to turn up in destination. To make destination an exact mirror of source, include the --delete flag:
rsync -azv --delete /the/source/directory/  /the/destination/directory/.
with the aforementioned issue that things deleted from source also get deleted from destination.

Rsync is easiest to use when the disks are mounted via NFS (or on the same machine). To use rsync between networked machines that don't share NFS mounts, use the ssh protocol flag:

rsync -e ssh -azv --delete /the/source/directory/  user@hostname:/the/destination/directory/.
where the -e ssh flag tells rsync to use ssh to connect to the remote machine. In principle, if you're not transferring too much data, machines like xray3 or xtreme3 accept ssh logins from select hosts.
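If you're nervous about what --delete is actually going to remove, rsync's -n (dry run) flag reports what it would do without changing anything - a sketch using the same placeholder names as above:
rsync -e ssh -azvn --delete /the/source/directory/  user@hostname:/the/destination/directory/.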

Phil Jeffrey, January 2004 - revised July 2004