How to Preserve Your Data on Synchrotron Trips

What follows is to some extent specific to X9A and BNL, but the underlying principles apply to more or less any beamline, provided you use the correct name for the tape drive and have FTP access.

Option 1: Good Olde Magnetic Tapes

We use 4mm DAT tapes in the lab, of the relatively short 90m variety (i.e. DDS1). This is relatively old technology, but the advantage is that these tapes can be read and written almost anywhere.

90m tapes can hold about 2.0 GB in native (uncompressed) format. As a practical matter, it's best to assume that no more than 1.8 GB will fit on a tape because of overheads with the backup programs, etc. I advise against using tape drive compression options (and most tape drives write in uncompressed format as a default) because not all tape drives can read compressed tapes. Instead, pre-compress all your files with gzip, which is almost universally available on Unix systems these days. Gzip can compress a file by 50% or more, unless the file is already compressed. (FYI: using a compressing tape drive to back up gzip'd files usually doesn't yield any improvement in tape capacity since the data is already shrunk.) A more recent program, bzip2, generally compresses more tightly than gzip, at the cost of being slower, but it is not available for our SGIs.
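The pre-compression step might look something like the sketch below. It assumes a POSIX shell with find, gzip and du on the path; compress_tree is a made-up helper name, and hy110 is just the example directory used elsewhere on this page.

```shell
# compress_tree DIR: gzip every file under DIR that is not already a .gz file.
# (compress_tree is a hypothetical helper name, not a standard command.)
compress_tree() {
    dir="$1"
    echo "before: $(du -sk "$dir" | cut -f1) KB"
    # gzip replaces each file with file.gz; files already ending in .gz are skipped
    find "$dir" -type f ! -name '*.gz' -exec gzip {} +
    echo "after:  $(du -sk "$dir" | cut -f1) KB"
}

# Example: compress_tree hy110
```

The before and after sizes give a quick idea of how much tape you have saved.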

IMPORTANT: it's a common misconception that you can put several archives on a tape simply by writing them one after another. The default behavior with Unix is to rewind the tape after each archive is written. Therefore if you try to put multiple archives on a tape using the usual syntax, each subsequent archive overwrites the previous one because the tape rewinds to the start each time. The only way to put multiple archives on a tape is to use the non-rewinding tape device (such as /dev/nrtape instead of /dev/tape). However, successful use of this technique requires you to be very awake at the time and to know a fair amount about Unix tape handling. It's better just to select carefully which data to put on a tape, and to use more tapes, rather than getting cutesy with non-rewinding devices.

To select which data you want to fit on a tape, use the Unix command "du -sk" to figure out how much data is in each directory (specifically, try du -sk *). The output is very straightforward to interpret: one line per directory, with the size first. The -k option displays the sizes in kilobytes; without it, many Unix systems report 512-byte blocks.
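To turn the du output into a rough tape plan, something like the following sketch may help. It assigns directories to tapes in the order listed, starting a new tape whenever the next directory would push the total past the usable capacity (1.8 GB by default, taken here as 1,800,000 KB). plan_tapes is a made-up name, and the simple in-order packing is my assumption rather than anything clever.

```shell
# plan_tapes [CAPACITY_KB]: read "du -sk" output on stdin and print a tape
# number for each entry. (plan_tapes is a hypothetical helper name.)
plan_tapes() {
    awk -v cap="${1:-1800000}" '
        BEGIN { tape = 1; used = 0 }
        {
            size = $1
            name = $0
            sub(/^[ \t]*[0-9]+[ \t]+/, "", name)   # strip the size column
            if (size > cap) {                      # will never fit on one tape
                printf "TOO BIG\t%d\t%s\n", size, name
                next
            }
            if (used + size > cap) {               # start a fresh tape
                tape++
                used = 0
            }
            used += size
            printf "tape %d\t%d\t%s\n", tape, size, name
        }'
}

# Example: du -sk * | plan_tapes
```

Anything flagged TOO BIG has to be split by hand before it goes to tape.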

Then, to actually back up the data, first insert the tape in the drive. Wait until any lights have stopped flashing, then start the tape backup. I personally always use tar for backups. It doesn't have quite as many fancy features as bru, but it still does a decent job. The generic command format is:
tar -cvf [name of tape drive] [list of files and directories]
e.g. tar -cvf /dev/tape hy110 hy111 hy112

On the SGI at X9A you don't actually have to specify the tape drive, since it defaults to the 4mm DAT drive (i.e. /dev/tape), so you can use a more minimal "tar -cv hy110 hy111 hy112". Under Linux operating systems no such default exists, apparently, but the tape drives are commonly called things like "/dev/st0" (Scsi Tape 0). Note that tar completely backs up the entire contents of each directory you specify, so you can use "hy110" to back up all of hy110, rather than having to use "hy110/*". This is very useful if the directory contains a great number of files.

If at all possible, use the local names of the directories and files to be backed up rather than the absolute path names, i.e. "hy110" or "../hy110" rather than "/data1/ccd/marccd/pavletich/hy110" since the former method makes it a bit easier to extract the frames at the other end if you need to restore data to disk. There are ways around this, so don't panic if you've done it the other way - see the man page for tar.
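As a sketch of the whole step, the helper below changes into the parent directory so that tar stores relative names, writes the archive, then reads it back as a check. backup_and_verify and the TAPE variable are names I have made up for illustration; /dev/tape is the X9A SGI default mentioned above, and you would set TAPE to something like /dev/st0 on a Linux box.

```shell
# Device to write to. Override it before calling, e.g. TAPE=/dev/st0
TAPE="${TAPE:-/dev/tape}"

# backup_and_verify PARENT_DIR NAME...: hypothetical helper, not a standard tool.
backup_and_verify() {
    parent="$1"
    shift
    (
        # Work from the parent so the archive holds "hy110/...", not "/data1/..."
        cd "$parent" || exit 1
        tar -cvf "$TAPE" "$@" || exit 1
        # The rewinding device is back at the start of the tape by now,
        # so this lists the archive that was just written
        tar -tvf "$TAPE"
    )
}

# Example: backup_and_verify /data1/ccd/marccd/pavletich hy110 hy111 hy112
```

Reading the listing back costs a few extra minutes per tape, which is cheap compared with finding out at home that the tape is blank.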

After the tape archive has finished, press the little black eject button to eject the tape, or do the command line equivalent of the same thing - mt unload. Sometimes you have to give mt the name of the tape drive (using the -t or -f flags), but again, not on the X9A SGI.

Sometimes DAT drives get dirty. Most errors during backups are attributable to the write heads needing cleaning. Just find a tape cleaning cartridge, pop it into the tape drive, and wait for it to eject itself. It is automatic.

Option 2: FTP

Practical Realities: it's not always practical to FTP your frames back to MSKCC. At APS, data collection rates are often far too fast to do much more than make a vague attempt at keeping up with data collection, much less FTPing things. For unbinned frames, collecting one frame every 10 seconds, your average collection rate is 1.8 MB/sec (one 18 MB frame every 10 sec). The maximum FTP rate that I usually see is 300 kbyte/sec. To put this in perspective, if you collect an hour's worth of data at 1.8 MB/sec, it will take you SIX hours to FTP it back. Often exposure times are a little longer than 10 seconds, but even assuming 20 second frames with 50% efficiency on the beamline (one 18 MB frame every 40 seconds, or 450 kbyte/sec), you're still collecting faster than the sustained FTP rate.

Similarly, 120 GB of data will take 400,000 seconds at 300 kbyte/sec. That's 111 hours.
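If you want to redo this arithmetic for your own frame size and network, a small sketch along these lines will do it. ftp_hours is a made-up name, and 1 MB is taken as 1000 kbyte to match the round numbers above.

```shell
# ftp_hours SIZE_MB RATE_KBYTE_PER_SEC: hours needed to move SIZE_MB megabytes
# at a sustained transfer rate. (ftp_hours is a hypothetical helper name.)
ftp_hours() {
    awk -v mb="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", mb * 1000 / rate / 3600 }'
}

ftp_hours 6480 300      # one hour of collection at 1.8 MB/sec is 6480 MB
ftp_hours 120000 300    # 120 GB
```

These print 6.0 and 111.1, matching the figures quoted above.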

At CHESS and APS, you can FTP directly to MSKCC. At BNL you have to go through some sort of proxy setup (via ftpgw.sec.bnl.local). At any location the fundamentals are the same.

First, compress your data (using gzip) as much as possible. FTP doesn't do any compression on the fly, so you want to minimize the number of kilobytes that you are transferring over the 'net. The maximum transfer speed I have seen is a little over 1 megabyte/sec, translating to 8-12 sec per gzip'd frame; during weekdays it can be much slower.

The IP addresses of our two open machines are 140.163.179.210 (xray3) and 140.163.179.204 (xtreme3) and if all else fails you can use xray2.mskcc.org (140.163.179.209). If you are at CHESS simply FTP to these machines and log in as username raxis. At BNL you have to undergo a more tedious procedure:

ftp ftpgw.sec.bnl.local
(use defaults for username and passwd)
quote site 140.163.179.210
user raxis
password (enter the raxis password)

and then you will be logged into the machine as user raxis under the FTP system. For both CHESS and BNL the next procedures are identical:

cd /xtreme4/data2/raxis
bin
prompt

The command "cd" moves you around on the remote machine. The corresponding command "lcd" moves you around on the local machine. The "bin" command makes sure all transfers are done in binary mode. It is very important to specify bin during BNL sessions since the default is not necessarily correct. Most direct Unix-to-Unix FTP sessions set binary mode by default, but it will do you no harm to specify it. The command "prompt" just turns off (actually it toggles) the prompting of FTP for each file transfer. Unless you like hitting "y" several hundred times, make sure prompting is turned off.

If a directory does not exist on the remote machine, use the command "mkdir" to make it. Once you are in the right place, use "mput" to transfer the files:

mkdir X9A_Sept
cd X9A_Sept
mkdir hy110
cd hy110
lcd hy110
mput *

and then once the transfer is complete, enter "quit" to drop the connection. Most FTP connections time out after 10 minutes of idle time, but this is not carved in stone.
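If you end up typing the same FTP commands for every dataset, one option is to generate the command list and pipe it into the client. The sketch below only covers the direct (CHESS-style) case, not the BNL proxy login. It assumes your ftp client accepts the -i flag (prompting off from the start) and that a ~/.netrc entry takes care of the login. Neither assumption comes from the setup described here, and ftp_batch is a made-up name.

```shell
# ftp_batch REMOTE_BASE TRIP_DIR DATASET: print the FTP commands that upload
# one dataset directory. (ftp_batch is a hypothetical helper name.)
ftp_batch() {
    cat <<EOF
cd $1
bin
mkdir $2
cd $2
mkdir $3
cd $3
lcd $3
mput *
quit
EOF
}

# Example, run from the directory that contains hy110:
#   ftp_batch /xtreme4/data2/raxis X9A_Sept hy110 | ftp -i 140.163.179.210
```

A mkdir on a directory that already exists just produces an error message and the session carries on, so rerunning for the next dataset should be harmless.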

Option 3: Laptops

Increasingly, labs use laptops at synchrotrons to back up their data. The Facility laptop runs Linux 8.0 and has a considerable amount of crystallographic software on it. You can do computations independently of the setup at the beamline, and you can transfer your own data to it.

The particular methods for putting a laptop on the network at a synchrotron vary a lot by site. In some places one picks up the network IP address via DHCP; at other sites it may be a fixed available IP address (rarer these days). The current implementation requires some tinkering with /etc/resolv.conf to get name service to work well. Some sites (e.g. BNL) do a quick security scan of your laptop once it is attached. At some sites (e.g. BNL) you have to set up proxies to let your web browser access external sites.

In all probability, you will not be able to mount disks from the data collection or processing machines onto your laptop via NFS, even if you are running Linux. So you will either have to do some version of the FTP commands or run something like "scp". I use scp for file transfers at BNL because it gives me continuous updates as to the state of the transfer and also can transfer files recursively (i.e. it scans into subdirectories) which is quite a lot more convenient than messing around with FTP. Read the man page for scp. My usual syntax for the scp command is:

scp -r x9auser@bnlx9y:/disk1/marccd/pavletich/phil/crappy_data .

Note that this command will prompt for the password on the remote machine; the rest of the command specifies the user name, machine name and source directory. The -r flag tells scp to copy entire directories recursively (i.e. everything in and below the directory you specify). If you are copying a single file you don't need the -r flag (not that it will hurt).

Scp is the secure equivalent of rcp. Many sites require that you use the secure versions (ssh, scp etc) rather than the older insecure versions (rsh, rcp).

Cutesy Methods

Please, to avoid complete mayhem, make sure you've understood all the issues I've listed above, before trying anything in this section. It relies on some adeptness with the Unix operating system.

When at CHESS and BNL I often transfer the processing part of the data directory ahead of transferring the data itself, because if all else fails at least you have some data. Because the processing files are quite small I use the following method to transfer them:

tar -cvf - hy110/proc | gzip >! h110.tgz

This command creates a compressed tar archive on disk rather than a conventional one on tape. What this actually does is create a tar archive of the contents of hy110/proc (relative to the current directory), but instead of sending the tar archive to tape (the default), the "-f -" option sends the archive to stdout, where it is picked up by gzip, compressed on the fly and stuffed into the file h110.tgz on disk. (The ">!" is csh/tcsh syntax for overwriting an existing file when noclobber is set; under sh or bash use a plain ">".) Then you can FTP the .tgz file up to MSKCC and unpack it at the other end with:

cd /xtreme4/data2/raxis/X9A
gzcat h110.tgz | tar -xvf -

This creates hy110/proc and dumps the processing files into it. The directory hy110 will be created if it doesn't already exist.
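For anyone working in sh or bash rather than csh, the same round trip can be sketched as three small functions. The function names are made up for illustration, and "gzip -dc" is used in place of gzcat because it does the same job and is available wherever gzip is.

```shell
# pack_dir ARCHIVE PATH: write PATH as a gzip-compressed tar archive on disk.
pack_dir() {
    tar -cf - "$2" | gzip > "$1"
}

# list_archive ARCHIVE: show what is inside without unpacking anything.
list_archive() {
    gzip -dc "$1" | tar -tvf -
}

# unpack_here ARCHIVE: extract the archive into the current directory.
unpack_here() {
    gzip -dc "$1" | tar -xf -
}

# Example, from the directory that contains hy110:
#   pack_dir h110.tgz hy110/proc
#   list_archive h110.tgz
```

Listing the archive before you FTP it is a cheap way to confirm that the paths inside are relative and that nothing is missing.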

This syntax is a little arcane, so you may want to practice it a little before trying it in times of stress and sleep deprivation, but once you get the hang of it it can be quite a lot faster than transferring the contents of the directory using the conventional methods outlined above (although scp is pretty convenient). It's not a good idea to transfer frames by this method, since the .tgz files get very large and it is no more efficient than transferring via the standard FTP method.

Remember to delete the disk tar archives (.tgz files) from disks at both ends once you have everything transferred and unpacked - they can be quite large for high resolution or low symmetry datasets.

Rsync is another way to copy files: it synchronizes a copy of the source directory onto the destination directory. The upside is that if the source directory contents change, the next rsync command updates the copy. The downside is that if you are running rsync automatically with the --delete flag (see below), you end up deleting files in the destination directory that you deleted by accident in the source directory...

Nevertheless, with that caveat, rsync is often quite efficient for large directory collections:

rsync -azv /the/source/directory/  /the/destination/directory/.

will do the trick. Anything new that appears in source will get copied to destination, with the downside that if you uncompress files already in source, the compressed and uncompressed versions are likely to turn up in destination. To make a precise copy of the two directories, include the --delete flag:

rsync -azv --delete /the/source/directory/  /the/destination/directory/.

with the aforementioned issue that things deleted from source also get deleted from destination.
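One way to take some of the fear out of --delete is to do a dry run first. The sketch below adds rsync's -n (dry run) flag, so it only reports what would be copied or deleted and changes nothing. preview_sync is a made-up name.

```shell
# preview_sync SRC DST: show what a --delete sync would do, without doing it.
# (preview_sync is a hypothetical helper name.)
preview_sync() {
    # -n makes this a dry run: nothing is copied and nothing is removed
    rsync -azvn --delete "$1/" "$2/"
}

# Example: preview_sync /the/source/directory /the/destination/directory
```

Lines in the output that start with "deleting" are the files you would lose from the destination, so read those before running the real command.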

Rsync is easiest to use when the disks are mounted via NFS (or on the same machine). To use rsync between networked machines that don't share NFS mounts, use the ssh protocol flag:

rsync -e ssh -azv --delete /the/source/directory/  user@hostname:/the/destination/directory/.

where the -e ssh flag tells rsync to use ssh to connect to remote machines. In principle, if you're not transferring too much data, machines such as xray3 or xtreme3 accept ssh logins from select hosts.

-Phil Jeffrey 9/13/2002 (updated July 2004)