X-ray Data Wrangling

Phil Jeffrey, August 2017, v0.3

Preamble

This is a small piece on how to handle your data after HKL/Scalepack and how to get it into the various formats required for further calculations. There's not much theory in this section - it's mostly workflow and code snippets. The previous version of this guide was written from a command line perspective - one I still use in many contexts - but it's an undeniable fact that many programs now have a GUI component which can make for easier and/or faster use. Some GUIs are better than others. So this document has been edited to partially reflect that, and the old version is preserved as DataWrangling_CmdLine.html


Scalepack Format

More often than not we use HKL3000 for home source data processing, and not infrequently HKL2000 for synchrotron data processing, so you end up manipulating the output of SCALEPACK from the HKL suite. This is certainly not your only option - if you are using MOSFLM, where the output typically travels through POINTLESS and AIMLESS, you'd use a different method to achieve the steps I outline below. AIMLESS is also a convenient route out of XDS, although you can directly use the output of the CORRECT step. There are an increasing number of "meta" processing programs that typically use MOSFLM or XDS to drive the data integration step - here you're also likely to end up in MTZ format without having to do anything. My favorite program of this kind is autoPROC, but XIA2 and DIALS are also worth checking out.

SCALEPACK outputs an ASCII (human-readable) "flat" file with the reflection data:

    1
 -985
   108.547   160.025   270.755    90.000    90.000    90.000 p212121         
   0   0   6    -5.1    10.1
   0   0  10  3034.3   153.9
   0   0  12   840.6    49.0
   0   0  14   651.1    43.0
   0   0  16  2475.6   136.3
   0   0  18  8179.2   388.2
   0   0  20 29222.3  1326.0
I have no idea about the first two cryptic lines but the third line contains your unit cell dimensions and space group. If you've got the
 REFERENCE FILM 1 
line in your scale.com/scale.in file - or you've requested Global Refinement in HKL2000 - then the cell dimensions are the post-refined (i.e. optimal) ones out of Scalepack. The remaining lines in the file contain reflection data. For data output without the SCALE ANOMALOUS or ANOMALOUS flags, the reflection data all looks like:
   0   0   6    -5.1    10.1
   0   0  10  3034.3   153.9
   0   0  12   840.6    49.0
   0   0  14   651.1    43.0
   0   0  16  2475.6   136.3
   0   0  18  8179.2   388.2
   0   0  20 29222.3  1326.0
which is h, k, l, intensity, sigma(intensity): the Miller indices of the reflection, its estimated average intensity, and the standard deviation thereof.

For data output with the anomalous flags, it looks more like:

    .
    .
    .
   1   0  38 17916.5   524.3
   1   0  39  6971.1   225.9
   1   0  40 11223.0   344.4
   1   0  41  1858.6   119.9
   1   0  42   180.7    78.3
   1   0  43   132.9    84.1
   1   1   1   814.6    22.2
   1   1   2 22016.1   667.7
   1   1   4       0      -1  5599.1   148.8
   1   1   5       0      -1 16369.3   331.3
   1   1  10 74860.2  2886.0 69208.2  1818.2
   1   1  11 24627.8   818.8 30576.5   941.0
   1   1  12  9907.5   398.1 10845.7   403.4
   1   1  13  5758.6   250.4  5909.7   256.7
   1   1  14   412.8    53.4   486.1    53.8
    .
    .
    .

Centric reflections contain only h,k,l,Int,sigma(Int) because they have no anomalous difference. Acentric reflections should have Int(h,k,l) and Int(-h,-k,-l) - [we'll call these I(+) and I(-) from now on] - on the same line with the format h,k,l,I(+),sigma(I(+)),I(-),sigma(I(-)). For reflections with I(+) missing, the values are replaced with 0 and -1 - see the (1,1,4) and (1,1,5) reflections in the above example, which have I(-) observed but are missing I(+) observations. For reflections with I(-) missing, the reflection just looks like a Centric reflection (e.g. the (1,1,1) and (1,1,2) reflections in the above example) with just the I(+) values listed. If you're not clear what Centric and Acentric reflections are: Centric reflections have phase values restricted to one of a few values (typically 0 and 180 degrees). Centric reflections usually have one Miller index of zero, but a zero index doesn't guarantee that a reflection is Centric - e.g. there are no Centric reflections at all in space group P1. For a formal definition of what makes a Centric reflection see: CCP4's maths for protein crystallographers on this very subject.

Scalepack format gotcha: in what seems like a dubious design decision, those people at HKL make Scalepack change format for really strong reflections that do not fit within the standard format:

    .
    .
    .
   8   1   5  167487    3612
   8   2   2  168299    6854  178745    6459
   8   2   3  133771    2903  125313    3261
   8   3   3  144493    3074  136694    5072
   8   3   5  164070    3787  142771    4140
    .
    .
    .
i.e. the decimal point is removed, allowing 100x larger numbers to be stored in the same column width.

All this makes it very ugly to start writing your own data conversion program - leave it to the pros.  
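That said, if you ever need to eyeball a file for the overflow problem, here's a minimal sketch of a sanity check (the file name my_data.sca is a stand-in): count the reflection lines that contain no decimal point at all, which are the fully overflowed ones.

#!/bin/csh -f
#
#  Count reflection lines in a Scalepack file with no decimal point at
#  all, i.e. very strong reflections where every value overflowed the
#  fixed-width fields.  Skips the three header lines.  Note that lines
#  where only one value overflowed still contain a "." and won't be
#  caught by this crude test.
#
awk 'NR > 3 && !/\./' my_data.sca | wc -l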

Scalepack for SHELX

I'm personally a big fan of George Sheldrick's SHELX suite of programs for macromolecular structure phasing. They are extremely fast and efficient, and in favorable cases I have phased entire structures in 10 minutes. They can use SCALEPACK format .sca files directly. Things have got even better now that the Tk-based GUI HKL2MAP is available, which directs you through the process of constructing command files, will run SHELXC/SHELXD/SHELXE directly from the GUI, and writes an MTZ conversion script for you - however it still doesn't actually write the MTZ.

Point it at the .sca file(s) in either SAD or MAD mode and let it run SHELXC. You'll have to specify a run name (I typically choose "a" and other simple names) to get started. Subsequent file names (e.g. a_fa.ins) use that base name. HKL2MAP is one of those GUI programs that is a wrapper around the command line approach to running the programs, so you can see the input for SHELXC in the file "a_shelxc.in". Use the checkboxes to open up sections for each program (SHELXC in this case) and click Run SHELXC to run the program with the output showing up in the tabbed window below. When running a slower program (e.g. SHELXD) the output shows up in real time.

You can sequentially run SHELXC and SHELXD and then run SHELXE with the two possible absolute structures. To view the phasing solutions in e.g. Coot you'll have to convert the .phs SHELXE phase files into MTZ format - these have a naming scheme like a_m20_s.5.phs and a_m20_s.5_i.phs - the latter for the inverted enantiomorph. HKL2MAP writes a script called something like a_phs2mtz.csh to convert the phs files into mtz format. Sometimes the script needs a little tweaking (e.g. change spacegroup from R3 to H3 if using the hexagonal setting) but it's just using CCP4's F2MTZ and CAD programs to do the conversion.
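If you ever need to recreate that conversion by hand, the heart of the script is a single F2MTZ run. Here's a minimal sketch - the cell and space group are placeholders borrowed from the MAD example later in this document, the column order h, k, l, F, FOM, phi, sigF is the standard .phs layout, and the HKL2MAP-written script additionally runs CAD on the result to sort it:

#!/bin/csh -f
#
#  Convert a SHELXE .phs phase file to MTZ format.  Replace the cell
#  and space group with your own values (and remember the R3 vs H3
#  gotcha mentioned above).  F2MTZ reads the columns in free format.
#
f2mtz hklin a_m20_s.5.phs hklout a_m20_s.5.mtz << EOF
CELL 35.104  63.496  76.458 90. 90. 90.
SYMMETRY P212121
LABOUT H K L F FOM PHI SIGF
CTYPOUT H H H F W P Q
EOF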

If you still want to use the command line

If you want more control, want to play with SHELXD options, or want to script the process it's still quite possible to do so from the command line. SHELXC/D/E are now distributed via CCP4 but you might want to download the full SHELX suite. Since HKL2MAP writes the input files to the disk you can use those as the basis for your scripts or you can use the templates below.

SHELXC uses Scalepack files directly so there's no conversion process - but SHELXC does expect the Scalepack files to have a .sca extension, so you might have to do some renaming (I personally always use .sca as a Scalepack output file extension).

A typical SHELXC/SHELXD/SHELXE script looks like:

#!/bin/csh -f
#
#   Script to run SHELXC/SHELXD to find heavy atoms in the MAD case
#
#   SHELXC prepares Fa coefficients for SHELXD to chew on, finding the
#   effective resolution of the MAD/SAD/SIRAS data by correlation between
#   wavelengths (where possible) or signal/noise ratios.
#   It reads SCALEPACK files, WHICH MUST HAVE A .sca EXTENSION and contain
#   anomalous differences (use the ANOMALOUS keyword in Scalepack).
#   (if you have a low-energy remote it's keyword LREM, if you have a
#   reasonably isomorphous native you can include it with keyword NAT)
#   
#   (Files with extension .hkl are assumed to be SHELX files, which will 
#   give the wrong result if they are really scalepack files)
#   
#   Change: names of files, cell dimensions, spacegroup, and number of sites
#   to FIND.  The number of tries can be left at 50 and then sharply reduced
#   if SHELXD obviously finds a solution quickly.
#
#   The name "se1" and "se1_fa" are the job names.  The .lst
#   (log), .res (result) files are all given that name with various extensions
#
##
##  SHELXC prepares the data - the instruction (.ins) file is given the job
##  name with _fa appended to it.
##
#
/usr/local/shelx/macosx/shelxc se1 << EOF
HREM se1rm.sca
PEAK se1pk.sca
INFL se1in.sca
LREM se1lo.sca
CELL 35.104  63.496  76.458 90. 90. 90.
SPAG P212121
FIND 8
NTRY 50
EOF
#
#
##
##  Actually run SHELXD - the best result is in the .res file, the log file
##  is in the .lst file, results also listed to the screen
##
#
/usr/local/shelx/macosx/shelxd se1_fa
#
##
##  Run solvent flattening in SHELXE with 45% solvent (-s0.45) for 60 cycles (-m60)
##  Include the -h flag if you don't have NATIVE data included in
##  SHELXC (since your "native" data for flattening will contain the heavy atom)
##  The -i flag tries the inverse of the heavy atom solution
##  Supposedly the one with the highest contrast is the correct one.  Note that
##  inverting the sites will also invert the space group (e.g. P3121 goes 
##  to P3221)
##
##  The program automatically picks up the best result from the .res file
##  produced by SHELXD
##
#
/usr/local/shelx/macosx/shelxe se1 se1_fa -s0.45 -m60 
/usr/local/shelx/macosx/shelxe se1 se1_fa -s0.45 -m60 -i
#
After SHELXC the program(s) use their own reflection format with the extension .hkl (consult the documentation). SHELXC is a pretty good program for analyzing your MAD data while you are collecting it (d''/sig signal/noise ratio, for example).  

Scalepack for CCP4

CCP4 is one of the more dominant crystallographic software packages - with good reason - although by no means the only one. A common step is therefore to import your data into CCP4 reflection file format, namely MTZ format, which conveniently some other packages can use as well. I actually trust this method via CCP4 far more than e.g. importing into the Phenix package because it's less likely to try and outsmart me during the data conversion step. This involves two steps, typically: SCALEPACK2MTZ to convert the data into the format, and TRUNCATE to convert the data into structure factors. Documentation for each program is easily found at the CCP4 site.  

CCP4: native data example

This is a fairly simple example where we read the data using SCALEPACK2MTZ, convert it to structure factor moduli using TRUNCATE, generate Free-R flags using the "uniqueify" script and then write it out as CNS format using MTZ2VARIOUS.
#!/bin/csh -f
#
##
##  Scalepack2MTZ takes SCALEPACK output files (with or without anomalous
##  data) and converts them to MTZ files containing the same intensity data
##  For anomalous data it also computes Imean and SIGImean
##
#
scalepack2mtz HKLIN bxc-063.sca HKLOUT bxc063.mtz << EOF
ANOMALOUS NO
SYMMETRY P212121
TITLE Convert scalepack data to MTZ format
SCALE 10.0
EOF
#
##
##  Convert the INTENSITIES to STRUCTURE FACTORS
##  If TRUNCATE NO is used the conversion is a simple SQRT()
##  If TRUNCATE YES is used, very weak reflections - which may have small
##  negative observed I's - will be given small positive F's based on the
##  expected intensity distribution (French & Wilson Bayesian statistics)
##
##  NRESIDUE <#residues> will attempt to put the data on an ABSOLUTE SCALE
##          assuming you have an educated guess as to the a/u contents
##  Absolute scales can be useful for analysis of difference maps
##
##  Part of the truncate output gives a Wilson-like description of the 
##  projected atomic/solvent content based on NRESIDUE
##
#
truncate hklin bxc063.mtz hklout bxc063_truncate.mtz << EOF
TITLE Truncate run on native data
TRUNCATE YES
LABIN IMEAN=IMEAN SIGIMEAN=SIGIMEAN 
NRESIDUE 340
EOF
#
uniqueify bxc063_truncate.mtz
#
#
mtz2various HKLIN bxc063_truncate-unique.mtz HKLOUT bxc063.free << EOF
RESOLUTION 500.0 2.9
OUTPUT CNS  
FREEVAL 1
MISS 0.0
LABIN FP=F SIGFP=SIGF FREE=FreeR_flag
END
EOF
In SCALEPACK2MTZ we can override the cell dimensions, if we prefer, and define the space group, which might be different to the arbitrary one you assigned during scaling. We tell it whether there is anomalous data present using the ANOMALOUS keyword. For reasons of precision it's usually better to multiply the supplied data by a factor of 10. From the Scalepack .sca file we get an MTZ-format file, bxc063.mtz, that contains essentially the same contents as the original .sca file. TRUNCATE converts intensities to structure factors, and puts the data on an absolute scale if you supply the number of residues in the asymmetric unit. Here we tell TRUNCATE to actually remap the intensity data (TRUNCATE YES) based on Bayesian statistics, which can be useful for weak data where a decent chunk of the reflections would otherwise have |F|=0 and might then get omitted from CNS refinement. The uniqueify line adds randomly-selected FreeR flags to the MTZ file, creating bxc063_truncate-unique.mtz. We can then write the contents of this file out to CNS format using MTZ2VARIOUS for use in refinement.

This is a very simple script and many variants are possible. You can use the program MTZDUMP to probe the data stored in the binary (not human-readable) MTZ files.
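For example, a minimal sketch (file name from the native example above) that prints the header - cell, space group, column labels, resolution range - plus the first ten reflections:

#!/bin/csh -f
#
#  Inspect an MTZ file: the header summary plus the first 10 reflections
#
mtzdump hklin bxc063_truncate-unique.mtz << EOF
NREF 10
END
EOF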

Some attention needs to be paid to the output of TRUNCATE - in particular it gives you an estimate of the solvent content given the number of residues that you specified, which is useful when estimating the number of molecules per asymmetric unit. Also, the cumulative intensity statistics can be a fairly good indicator of the potential presence of twinning (which is guaranteed to ruin your day).

If you put your MTZ file into X-PLOR format (like CNS format but without the headers) then you can use Todd Yeates' twinning server to test for twinning. Alternatively take a look at the Xtriage section below.  

CCP4: MAD data example

For MAD data, you've got multiple reflection files, now with anomalous data. The methodology is basically the same, using SCALEPACK2MTZ and TRUNCATE, but there's an intermediate step using the program CAD to merge all the individual files together. Typically I call this l123_dano.mtz (three lambdas with anomalous data) and use this as input to SHARP (renaming it to l123_dano.data.mtz to keep SHARP happy). Apart from the CAD step the only new thing here is the wrangling of multiple wavelengths.
#!/bin/csh -f
#
##
##  Scalepack2MTZ takes SCALEPACK output files (with or without anomalous
##  data) and converts them to MTZ files containing the same intensity data
##  For anomalous data it also computes Imean and SIGImean
##
#
scalepack2mtz HKLIN se1in.sca HKLOUT inflection.mtz << EOF
ANOMALOUS YES
CELL 35.104  63.496  76.458 90. 90. 90.
SYMMETRY P212121
TITLE Convert scalepack data to MTZ format
SCALE 10.0
EOF
#
scalepack2mtz HKLIN se1pk.sca HKLOUT peak.mtz << EOF
ANOMALOUS YES
CELL 35.104  63.496  76.458 90. 90. 90.
SYMMETRY P212121
TITLE Convert scalepack data to MTZ format
SCALE 10.0
EOF
#
scalepack2mtz HKLIN se1rm.sca HKLOUT remote.mtz << EOF
ANOMALOUS YES
CELL 35.104  63.496  76.458 90. 90. 90.
SYMMETRY P212121
TITLE Convert scalepack data to MTZ format
SCALE 10.0
EOF
#
scalepack2mtz HKLIN se1lo.sca HKLOUT lowremote.mtz << EOF
ANOMALOUS YES
CELL 35.104  63.496  76.458 90. 90. 90.
SYMMETRY P212121
TITLE Convert scalepack data to MTZ format
SCALE 10.0
EOF
#
scalepack2mtz HKLIN nat1.sca HKLOUT native.mtz << EOF
ANOMALOUS NO
CELL 34.859  63.271  76.360 90. 90. 90.
SYMMETRY P212121
TITLE Convert scalepack data to MTZ format
SCALE 10.0
EOF
#
##
##  Convert the INTENSITIES to STRUCTURE FACTORS
##  If TRUNCATE NO is used the conversion is a simple SQRT()
##  If TRUNCATE YES is used, very weak reflections - which may have small
##  negative observed I's - will be given small positive F's based on the
##  expected intensity distribution (French & Wilson Bayesian statistics)
##
##  Probably the simplest option is TRUNCATE NO, reserving TRUNCATE YES
##  for difficult cases or if you want to fantasize about dropping your
##  R-free by 0.1%
##
##  NRESIDUE <#residues> will attempt to put the data on an ABSOLUTE SCALE
##          assuming you have an educated guess as to the a/u contents
##  Absolute scales can be useful for analysis of difference maps
##
##  Part of the truncate output gives a Wilson-like description of the 
##  projected atomic/solvent content based on NRESIDUE
##
#
truncate hklin inflection.mtz hklout inflection_truncate.mtz << EOF
TITLE Truncate run on SAD data
TRUNCATE NO
LABIN IMEAN=IMEAN SIGIMEAN=SIGIMEAN I(+)=I(+) SIGI(+)=SIGI(+) -
      I(-)=I(-) SIGI(-)=SIGI(-)
NRESIDUE 180
EOF
#
truncate hklin peak.mtz hklout peak_truncate.mtz << EOF
TITLE Truncate run on SAD data
TRUNCATE NO
LABIN IMEAN=IMEAN SIGIMEAN=SIGIMEAN I(+)=I(+) SIGI(+)=SIGI(+) -
      I(-)=I(-) SIGI(-)=SIGI(-)
NRESIDUE 180
EOF
#
truncate hklin remote.mtz hklout remote_truncate.mtz << EOF
TITLE Truncate run on SAD data
TRUNCATE NO
LABIN IMEAN=IMEAN SIGIMEAN=SIGIMEAN I(+)=I(+) SIGI(+)=SIGI(+) -
      I(-)=I(-) SIGI(-)=SIGI(-)
NRESIDUE 180
EOF
#
truncate hklin lowremote.mtz hklout lowremote_truncate.mtz << EOF
TITLE Truncate run on SAD data
TRUNCATE NO
LABIN IMEAN=IMEAN SIGIMEAN=SIGIMEAN I(+)=I(+) SIGI(+)=SIGI(+) -
      I(-)=I(-) SIGI(-)=SIGI(-)
NRESIDUE 180
EOF
#
truncate hklin native.mtz hklout native_truncate.mtz << EOF
TITLE Truncate run on native data
TRUNCATE NO
LABIN IMEAN=IMEAN SIGIMEAN=SIGIMEAN
NRESIDUE 180
EOF
#
#####
#   MERGE all the MTZ files into one huge file
#####
#
cad     HKLIN1 inflection_truncate.mtz \
        HKLIN2 peak_truncate.mtz \
        HKLIN3 remote_truncate.mtz \
        HKLIN4 lowremote_truncate.mtz \
        HKLIN5 native_truncate.mtz \
        HKLOUT l123_dano.mtz    << EOF

CELL 35.104  63.496  76.458 90. 90. 90.
SYMM P212121
RESOLUTION OVERALL 100000.0  1.0
TITLE Combined native and derivative data
LABIN  FILE 1  E1=F     E2=SIGF   E3=DANO  E4=SIGDANO
CTYP   FILE 1  E1=F     E2=Q      E3=D     E4=Q
LABOUT FILE 1  E1=F1    E2=SIGF1  E3=ANO1  E4=SIGANO1

LABIN  FILE 2  E1=F     E2=SIGF   E3=DANO  E4=SIGDANO
CTYP   FILE 2  E1=F     E2=Q      E3=D     E4=Q
LABOUT FILE 2  E1=F2    E2=SIGF2  E3=ANO2  E4=SIGANO2

LABIN  FILE 3  E1=F     E2=SIGF   E3=DANO  E4=SIGDANO
CTYP   FILE 3  E1=F     E2=Q      E3=D     E4=Q
LABOUT FILE 3  E1=F3    E2=SIGF3  E3=ANO3  E4=SIGANO3

LABIN  FILE 4  E1=F     E2=SIGF   E3=DANO  E4=SIGDANO
CTYP   FILE 4  E1=F     E2=Q      E3=D     E4=Q
LABOUT FILE 4  E1=F4    E2=SIGF4  E3=ANO4  E4=SIGANO4

LABIN  FILE 5  E1=F     E2=SIGF   
CTYP   FILE 5  E1=F     E2=Q      
LABOUT FILE 5  E1=FN    E2=SIGFN  

END
EOF
#
Here we've got FIVE runs of SCALEPACK2MTZ: one each for the high energy remote, peak, inflection and (unusually) the low energy remote, plus the native data. We flag all the MAD datasets with ANOMALOUS YES in SCALEPACK2MTZ. Conversion is via TRUNCATE in each case; this time there are more labels to deal with because of the presence of I(+) and I(-). Finally the CAD run is new: there are five input files going into one output file. Each dataset gets a unique "F" column label (F1, F2, F3, F4, FN) and the MAD datasets get the corresponding ΔAno column labels (ANO1, ANO2, ANO3, ANO4), plus the standard deviations of all of those. The RESOLUTION keyword makes sure that all the input data makes it to the output. Note that because the native and MAD datasets have different unit cells, we force a specific unit cell within CAD to eliminate the inconsistency.

Using the CCP4i or CCP4i2 GUIs

In recent years CCP4 has put considerable effort into creating GUI programs that are wrappers around the existing (and expanding) program suite. I've mostly used the older Tk-based CCP4i, but the newer CCP4i2 sports a newer interface and is arranged more in terms of tasks. CCP4i and CCP4i2 are based on the idea of a "project", which typically corresponds to a crystal structure, and these projects are linked to data directories in which the GUIs write the task results and look for the source data.  

Scalepack for PHENIX

If you use the above CCP4 protocols, be aware that phenix.refine still has a nasty habit of grabbing your intensities and using them instead of your structure factors, despite the fact that it uses structure factors internally for refinement - if it does this it will override your choice of TRUNCATE YES for converting your I's to F's, for example. So you'll want to make sure you've removed intensities from the MTZ files before you start refinement - I use CCP4's CAD for this, which can read in an MTZ file and write out a subset of the columns: F, SIGF and FreeR_flag are my typical outputs. I've ranted about this "feature" previously. For Phenix's AUTOSOL you can use .sca files or MTZ files. Phenix.hyss is similarly liberal in its inputs. For heavy atom structure determination you might as well feed it the .sca's, since the closer to the native data you get, probably the better.  
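As a minimal sketch, that CAD step can look like the following, reusing the file names from the native example above (the output name is made up):

#!/bin/csh -f
#
#  Copy only F, SIGF and the Free-R flags to a new MTZ file so that
#  phenix.refine has no intensities to grab
#
cad HKLIN1 bxc063_truncate-unique.mtz HKLOUT bxc063_for_phenix.mtz << EOF
LABIN FILE 1 E1=F E2=SIGF E3=FreeR_flag
END
EOF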

Scalepack for SOLVE

I don't use SOLVE, so I need some additional help here; however, I am aware that RESOLVE can write MTZ format. SOLVE can probably read the original data in .sca (Scalepack) format, or more processed data in .mtz (CCP4) format. Probably the easiest way to feed data into SOLVE is through the PHENIX GUI.  

Scalepack for SHARP

SHARP reads CCP4 format so all you need to do is combine your MIR/MAD data into a single file. However SHARP wants the file to be called (e.g.) l123_dano.data.mtz so typically you'll have to rename it.  

Diagnostic Tools - Phenix.xtriage

One of the more powerful tools for analysing your reflection data is mmtbx.xtriage - now part of the phenix package so phenix.xtriage - which does a good job with quick analysis of Scalepack files (also works on simple MTZ files).
phenix.xtriage residues=240 my_data.sca
produces the desired behavior (defining the number of residues is rather useful). The twinning analysis is pretty good, and it also looks for the effects of ice rings and pseudo-centering to warn you of potential pitfalls in your data. It uses the L-test and other statistical tests to look for weird intensity distributions that suggest twinning or pseudo-centering, both of which can have devastating effects on your ability to solve the structure. Xtriage is now more widely installed than previously, so most synchrotron beamlines should have it.  

Matthews Coefficient Probabilities

Bernhard Rupp has a nice server that will calculate the Matthews Coefficient and solvent content for your new crystal. It's rather more sophisticated and voluminous than the output from TRUNCATE or WILSON in CCP4. The server is at http://ruppweb.dyndns.org/mattprob/. The first part of the PHASER log file is useful for working out how many molecules you might have in your asymmetric unit. The number of molecules you can actually find is of course more relevant. For MAD it's used mainly as a preliminary guess for the number of heavy atom sites, but since SHELX runs so quickly you can start conservative and just try increasing the number.  
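If you just want the back-of-the-envelope version, the underlying arithmetic is:

    V(M) = V(unit cell) / (Z x MW)       [in cubic Angstroms per Dalton]
    solvent fraction = 1 - 1.23/V(M)     [approximately]

where Z is the total number of molecules in the unit cell (molecules per asymmetric unit times the number of symmetry operators) and MW is the molecular weight in Daltons. As a worked example with made-up contents: the P212121 cell from the MAD script above (35.104 x 63.496 x 76.458 = ~170,000 cubic Angstroms, 4 symmetry operators) with one hypothetical 20 kDa molecule per asymmetric unit gives V(M) = 170000/(4 x 20000) = ~2.1 and about 42% solvent - comfortably inside the usual ~1.7-3.5 range for protein crystals.  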

Data for PDB Deposition

You don't want to give your original SCALEPACK files to the PDB. You want to supply the files that you actually used in refinement, along with the R-free flags that you used. The PDB prefers that you deposit structure factors in the accursed mmCIF format, although allegedly they will take MTZ files. You can fairly easily use MTZ2VARIOUS to do the conversion for you. Following on from the native data example above:
#!/bin/csh -f
#
#
#
mtz2various HKLIN bxc063_truncate-unique.mtz HKLOUT bxc063.cif << EOF
RESOLUTION 50.0 2.5
OUTPUT CIF data_sf
FREEVAL 1
MISS 0.0
LABIN FP=F SIGFP=SIGF FREE=FreeR_flag
END
EOF
#
and this small script looks a lot like the MTZ2VARIOUS incarnation used to produce the CNS format reflection files - in fact it should be kept as identical as possible, to avoid introducing inconsistencies.