How to Preserve Your Data on Synchrotron Trips

What follows is to a certain extent specific to X29/X25 and BNL, yet the underlying principles apply to more or less any beamline, provided you use the correct name for the tape drive and have FTP access.

Preparing Your Data

You want to compress your data as much as possible for two reasons: it fits on fewer media, and it takes less time to traverse the network. My favorite program for this is gzip:
gzip *.img
which has a fairly heavily-tested compression algorithm that does error/sanity checking and is also tolerably fast. If you are short of space you might try bzip2, which compresses rather better at the expense of taking far more CPU to achieve it. If I'm on top of data backup I use bzip2 now, but gzip is significantly faster.
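The error checking mentioned above can be exercised directly with gzip's -t (test) option. The following is a minimal sketch, not a prescribed procedure: the function name bad_frames is made up for illustration, and it assumes your compressed frames end in .img.gz.

```shell
# bad_frames DIR: print the name of every *.img.gz under DIR that fails "gzip -t".
# (bad_frames is a hypothetical helper name; DIR defaults to the current directory.)
bad_frames() {
    find "${1:-.}" -type f -name '*.img.gz' | while IFS= read -r f; do
        # gzip -t checks the compressed data's integrity without writing anything.
        gzip -t "$f" 2>/dev/null || printf '%s\n' "$f"
    done
}

# Example: bad_frames ./mydata     (prints nothing if every frame checks out)
```

Running something like this on the backup copy, rather than only on the original, is what tells you the data survived the trip.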

I strongly recommend compressing your data immediately after you've finished processing it, rather than waiting until the end of the data collection run - running gzip on a few thousand files takes a while. It is only necessary to compress the frames - data processing files are usually much smaller than images, so compressing them is not an obvious gain.
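If your frames are spread over several subdirectories, a small wrapper saves retyping the gzip command in each one. This is a sketch under assumptions: the function name compress_frames and the example directory are hypothetical, and frames are assumed to end in .img as in the command above.

```shell
# compress_frames DIR: gzip every uncompressed *.img frame under DIR.
# (compress_frames is a hypothetical helper name; DIR defaults to the current directory.)
compress_frames() {
    dir=${1:-.}
    # Already-compressed frames end in .img.gz and do not match '*.img',
    # so running this again after collecting more data is harmless.
    # "{} +" hands many files to each gzip invocation instead of one at a time.
    find "$dir" -type f -name '*.img' -exec gzip {} +
}

# Example: compress_frames ./mydata
```

Processing files are left alone because only names ending in .img match. Swapping gzip for bzip2 in the function gives the smaller-but-slower variant.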

Option 1: BNL's DVDBackup

A brief manual for this is at http://www.px.nsls.bnl.gov/databackup/DVDRobotDataBackup.html. The ones we use are at the "X29 cyber cafe" right outside the X29 user area - there is also one at X12. Basically you "cd" to the directory you want to back up, then type dvdbackup, whereupon the program will prepare the data and cut one or more DVDs. The machine on which you are logged in determines which DVD burner it goes to.

It is important to note that creating the DVDs takes a while, in fact 20-40 minutes from typing the command to the first DVD appearing. The DVD backup machinery takes even longer when the network is slow or the machine is burning DVDs for multiple people. Only one DVDbackup job can run at a time on a given machine, but it appears to be possible to run several DVDbackup jobs at once from different machines.

To make the most of this approach it makes sense to collect multiple datasets within the same master directory, and not to have too many Gb of data in each master directory. For example if you know you're going to collect 30-40 Gb of data, waiting for a single DVDbackup job to complete 10 DVDs at the end of the run will take several hours - better to split up the data into 3 or 4 different directories and start the DVDbackup on the previous one once you start writing to the next one. This way the overhead backing up the last few datasets at the end of the run is as small as possible. The new demented NSLS method for assigning frames in directories can make this more difficult than it need be.
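To judge how big a master directory can get before it turns into a long queue of discs, a rough disc count helps. This is only a sketch: dvds_needed is a hypothetical helper, and the default of 4400 Mbyte per disc is my assumption for a single-layer DVD, not a figure from the DVDbackup documentation.

```shell
# dvds_needed SIZE_MB [CAPACITY_MB]: number of discs needed for SIZE_MB of data.
# CAPACITY_MB defaults to 4400, an ASSUMED usable size for a single-layer DVD.
dvds_needed() {
    size_mb=$1
    cap_mb=${2:-4400}
    # Integer ceiling division: a partly filled disc still counts as a disc.
    echo $(( ($size_mb + $cap_mb - 1) / $cap_mb ))
}

# Example: size of a directory in Mbyte via du (which reports kbyte), then the disc count:
#   dvds_needed $(( $(du -sk ./mydata | cut -f1) / 1024 ))
```

With these assumptions 40000 Mbyte comes out at 10 discs, in line with the figure quoted above, while a 10000 Mbyte directory needs only 3.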

Option 2: Hard Drives

The "X29 cyber cafe" has the facility to mount firewire drives directly onto the Linux PCs there. This has the advantage that backing up onto an external hard drive is often much faster than doing anything with optical media. Ideally you'd do both for a combination of short- and long-term archiving but the hard drive option is quite viable for quickly restoring data back in the lab. Eventually hard drives fill up and older data is removed, so this is not a long-term solution.

Instructions for mounting hard drives on the cyber cafe machines are rather minimal. Basically if you have a FAT32-formatted hard drive (the default for things like the external 250Gb Maxtor drives) then you can simply do a:

mount /mocha/1394_data1
(where you should replace "mocha" with the name of the machine you have your hard drive plugged into) then cd to this directory and use it as a hard drive. To unmount it, make sure you are not cd'd to any directory on the hard drive and then do:
cd
umount /mocha/1394_data1
This doesn't always work, probably because of users failing to unmount cleanly, so you may need to bug the PX operator. The systems can also mount HFS (Mac) and Linux formatted hard drives, but both of these typically require extra effort. If Linux says something like "disk is busy" you're cd'd to a directory on the drive from somewhere.
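The "cd first, then umount" habit can be wrapped up so you don't forget it. This is a minimal sketch with hypothetical function names (is_inside, safe_umount). It only checks the shell you run it from, so another terminal sitting on the drive will still produce the "busy" complaint.

```shell
# is_inside DIR MOUNTPOINT: succeed if DIR is MOUNTPOINT itself or lies beneath it.
is_inside() {
    case "${1%/}/" in
        "${2%/}"/*) return 0 ;;
        *)          return 1 ;;
    esac
}

# safe_umount MOUNTPOINT: step off the drive if this shell is on it, then unmount.
safe_umount() {
    if is_inside "$(pwd -P)" "$1"; then
        cd || return 1     # a bare "cd" goes to your home directory
    fi
    umount "$1"
}

# Example: safe_umount /mocha/1394_data1
```

If the unmount still fails, tools such as fuser -m or lsof (where installed) can list which processes are holding the drive.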

To do the actual backup you can simply "cp -r" from one directory to another but this tends to be wasteful if you go back and reprocess the data again, then have to "cp" it all over again. You could use the somewhat messy tar option:

cd /wherever/I/want/my/data
# pack "data" at the source, send it down the pipe, and unpack it here
(cd /where/my/data/is && tar czf - data) | tar xvzf -
in which case you need to read up on the usage of tar in order to figure out what I've just done. The method I prefer uses rsync in the following manner:
cd /wherever/I/want/my/data
rsync -azv --modify-window=1 --delete /where/my/data/is/. .
The combination of trailing / and . is significant in rsync so it pays to have them the same. Rsync has the advantage that in principle it will not transfer data that's already in the destination directory (it compares file sizes/modification times using a rapid algorithm). This algorithm is not perfect, and sometimes seems to fall afoul of FAT32. The --delete removes files on the destination (backup) directory that are no longer present in the source directory. This can be an advantage if you delete (or compress) a whole bunch of frames, but understand that you will lose the deleted items if you utilize this option. Read the manual for rsync.
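Because --delete is unforgiving, it can be worth previewing a sync before running it for real. rsync's -n (--dry-run) option reports what would be copied or deleted without touching anything. The wrapper below is a sketch, and preview_sync is a hypothetical name.

```shell
# preview_sync SRC DEST: show what rsync would copy or delete, changing nothing.
# (preview_sync is a hypothetical helper; the flags mirror the command above.)
preview_sync() {
    # -n is --dry-run; the matching "/." on both paths follows the
    # trailing-slash advice in the text.
    rsync -azvn --modify-window=1 --delete "$1/." "$2/."
}

# Example: preview_sync /where/my/data/is /wherever/I/want/my/data
```

Once the listing looks right, the same command without the n does the actual transfer.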

Option 3: Laptops

No longer remotely viable at BNL - you can thank them, as this is a function of their general failure at security, or of not having much of an idea how to implement it.

These are often useful in that the issue with mounting the hard drive on the remote systems is removed - you just mount the hard drive to your own laptop. Laptops are also an advantage in that you can do some data manipulation (I've solved structures on my old 800MHz iBook G4 with 256Mb memory). At CHESS it's pretty straightforward to plug them into the wall. Last time I used APS it wasn't too complicated either. At Brookhaven it's increasingly hampered by BNL's IT department having its head firmly wedged up its [anatomical reference removed] - now requiring you to install ORDO on your own laptop (and something they refused to give me additional information about). However if you use a lab laptop that you are willing to do this with, the instructions for laptop installation are here. You need to remember to fill out the web form after you plug the computer into the wall, however, or bad things can happen with the entire beamline's ethernet connection.

Option 4: FTP

Practical Realities: it's not always practical to FTP your frames back to Princeton. At APS, data collection rates are often considerably too fast to do much more than make a vague attempt at keeping up with data collection, much less FTPing things. For unbinned frames, collecting one frame every 10 seconds, your average collection rate is 1.8 Mbyte/sec (18 Mbyte frames every 10 sec). The maximum FTP rate that I usually see is 300 kbyte/sec. To put this in perspective, if you collect an hour's worth of data at 1.8 Mbyte/sec, it will take you SIX hours to FTP it back. Often exposure times are a little longer than 10 seconds, but even assuming 20 second frames with 50% efficiency on the beamline, you're still generating data faster than the continuous FTP rate. 1.8 Mbyte/sec is not a bad estimate of how much data you generate for binned frames at X29 using a 2 second/frame exposure time, either.
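The arithmetic above is easy to redo for your own frame size and network. A minimal sketch follows. transfer_seconds is a hypothetical helper, and it uses decimal units (1 Mbyte = 1000 kbyte) to match the round numbers in the text.

```shell
# transfer_seconds SIZE_KBYTE RATE_KBYTE_PER_SEC: whole seconds needed to move the data.
# (transfer_seconds is a hypothetical helper name.)
transfer_seconds() {
    echo $(( $1 / $2 ))
}

# One hour of collection at 1.8 Mbyte/sec, i.e. 1800 kbyte/sec:
collected_kbyte=$(( 1800 * 3600 ))
secs=$(transfer_seconds "$collected_kbyte" 300)
echo "$(( secs / 3600 )) hours to FTP at 300 kbyte/sec"
# prints: 6 hours to FTP at 300 kbyte/sec
```

Substitute the rate you actually measure. The 300 kbyte/sec figure is just the maximum quoted above.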

120 Gb of data will take 400,000 seconds at 300 kbyte/sec. That's 111 hours. Back at MSKCC the P.I. was particularly "optimistic" about the practicality of FTP'ing multiple gigabytes of data across the network. Of course he didn't actually have to try doing it while collecting data at the same time. I suggest avoiding it at all costs. If you want to transfer processing files, you can just create compressed tar archives and email them to yourself (or someone else). They are rarely more than 1Mb and so work just fine as attachments. Do:

  tar -cvzf my_proc_stuff.tgz proc99/.
and RTFM for tar if you're not clear how that line works.
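Before emailing an archive it is worth confirming that it reads back and checking how big it is. The sketch below assumes a directory of processing files like the proc99 example above. pack_proc is a hypothetical name.

```shell
# pack_proc DIR ARCHIVE: tar+gzip DIR into ARCHIVE, read it back as a check,
# then print the archive size in kbyte. (pack_proc is a hypothetical helper name.)
pack_proc() {
    # -c create, -z gzip-compress, -f write to the named archive file
    tar -czf "$2" "$1" || return 1
    # -t lists the contents without extracting; a damaged archive makes this fail.
    tar -tzf "$2" > /dev/null || return 1
    # du -k prints "size<TAB>name"; keep just the size.
    du -k "$2" | cut -f1
}

# Example: pack_proc proc99 my_proc_stuff.tgz
```

If the number printed is much over a thousand (kbyte), the archive is probably too big to be a comfortable attachment.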

At CHESS and APS, you can FTP directly to Princeton. At BNL you have to go through some sort of proxy setup (via ftpgw.bnl.local) which is particularly slow - BNL's IT department like to make our lives difficult, so they seem to have disabled this. BNL also have an Anonymous FTP option (see http://www.px.nsls.bnl.gov/databackup/howto_ftp.htm) that lets you put your data on a local anonymous FTP site and retrieve it once you get home. I would call this the option of last desperation if all else fails. Apparently the recent change in the way PXRR archive images means that you can retrieve them from the lab via FTP after data collection - I have no information on that at this time, because I always do my own hard backups before I leave. I'm paranoid like that.

Options no longer considered viable

4mm DAT and 8mm tapes are supported at increasingly fewer beamlines. They also tend to be at least as slow as modern optical drives, are not random access, have greater longevity issues and basically are NOT RECOMMENDED. There's a small legacy issue with DVDs in that older DVD drives do not read the newest DVDs. Optical drives are relatively cheap compared to the cost of your project, however.

-Phil Jeffrey last revised Feb 16th 2007, updated Aug 2010.