Phenix.refine Defaults Considered Harmful

The title is a riff on Csh Programming Considered Harmful and also an overt indicator of why I'm writing this. You've been warned.

Side note: life evolves, and so do programs. More recent versions of phenix.refine (1.6.2 onward) enforce a cutoff of |F| ≥ 0 rather than |F| > 0, which certainly helps with this phenomenon.

I'm a convert to phenix.refine for protein structure refinement, since it has some nice feature combinations. When I was doing a low resolution refinement of the 20S proteasome structure I used a grouped B-factor plus TLS model that was unavailable in other refinement programs (CNS can't do TLS, REFMAC doesn't do grouped B-factors). While REFMAC may still rule the roost at high resolution, I have the impression that phenix.refine is top dog at lower resolutions. And I work mostly at lower resolutions.

Like other programs it takes a while to get used to the quirks and syntax. Phenix.refine syntax is quirkier than most - if I were going to slur it I'd say it looked like it was designed by computer scientists, but it looks positively idyllic if you've ever looked at MOLREP syntax. It's workable. The biggest problem I run into is the program trying to be smart. In some cases it may in fact be smarter than me, but in other cases it's clearly rather more stupid. This is one of the latter cases.

Phenix.refine grabs your intensities by default and throws some of them out

Neither behavior is documented in the program documentation - not the defaulting to intensity data, nor the rejection of data with F=0 (the latter being a hard limit of the program).

My default route from SCALEPACK (HKL suite) to MTZ (CCP4 suite) is via SCALEPACK2MTZ and TRUNCATE, both programs in CCP4. SCALEPACK2MTZ is just a reformatting program. TRUNCATE (see French, S. and Wilson, K., Acta Cryst. (1978), A34, 517) converts intensities to structure factor moduli and has the option of modifying structure factor values so that weak reflections with negative intensities end up with small positive F's. I use this option frequently. While this certainly squicks some crystallographers out - at least in part because of the assumptions about the intensity data distribution - empirically it doesn't significantly increase working R-factors (the new |F|'s are reasonable), and the great benefit is that you can now use all that weak data in refinement, since it now has |F| > 0. For anisotropic data - which has a lot of weak reflections at the higher resolution limits - this is somewhere between useful and essential. Additionally, since we routinely push resolution limits in data processing so that the outer shells have <I>/σ(I) of about 2 (as is the case for the data discussed below), it's almost inevitable that a decent proportion of the data is weak and can have an intensity < 0 just by statistical fluctuation in the measurements. The "Massage" algorithm mentioned below in phenix.reflection_file_converter is broadly related to the French & Wilson approach but is of this form:

    |Fnew| = sqrt((Io + sqrt(Io^2 + 2*sigma^2))/2)

and, at least in this particular test case, it gives the best results. "Massage" is their nomenclature, not mine, and I've got no opinion on whether it's generally better or worse than French and Wilson; in this single test case it appears to be a little better.
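The formula is easy to sanity-check outside Phenix. A minimal Python sketch of it (the function name is mine, not Phenix nomenclature):

```python
import math

def massage_amplitude(i_obs: float, sigma: float) -> float:
    """The 'massage' conversion quoted above:
    |Fnew| = sqrt((Io + sqrt(Io**2 + 2*sigma**2)) / 2).
    Unlike a plain sqrt(Io), this stays defined (and positive) for
    negative intensities, so weak reflections survive the conversion."""
    return math.sqrt((i_obs + math.sqrt(i_obs**2 + 2.0 * sigma**2)) / 2.0)

# A strong reflection: massage is nearly identical to sqrt(Io).
print(massage_amplitude(10000.0, 50.0))   # ~100.0
# A weak, negative intensity: plain sqrt would be undefined; massage
# yields a small positive amplitude instead of losing the reflection.
print(massage_amplitude(-20.0, 30.0))     # small positive number
```

Note how the sigma term only matters when Io is comparable to or smaller than sigma, which is exactly the weak data at issue here.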

Problem is, phenix.refine tries very hard to circumvent what you've done with TRUNCATE. When you convert a .sca file with SCALEPACK2MTZ, it creates IMEAN and SIGIMEAN intensity data from the SCALEPACK file. When TRUNCATE converts the intensities to structure factor moduli, it leaves the unmodified IMEAN data in the output MTZ file, and that data stays there if you add Free R flags using the "uniqueify" script. If you then use this data in phenix.refine:

phenix.refine mydata_truncate-unique.mtz mydata.pdb
the first thing the program does is ignore that structure factor data and use the IMEAN data instead. On top of that it does a simple, dumb conversion that discards all data with IMEAN ≤ 0, which means phenix.refine will by default reject all that weak data you tried so hard to preserve using TRUNCATE. You can check for this by looking for a line like:
      labels = "IMEAN,SIGIMEAN"
in the .eff or .def files created by refinement. You can force phenix.refine to use the labels that you want it to by either editing the .def/.eff files or putting
refinement.input.xray_data.labels="F,SIGF"
on the command line during refinement, i.e.:
phenix.refine mydata_truncate-unique.mtz mydata.pdb refinement.input.xray_data.labels="F,SIGF"
Or you can simply run CCP4's CAD program and omit the intensity data from the output file. This is dumbing down your MTZ file for consumption by phenix.refine, but it might not be the worst idea to prevent this undocumented stealth feature from coming back to haunt you.

Manifestation of Vanishing Data

Compared to using the F's out of TRUNCATE, phenix.refine makes your weakest data vanish. This is a snippet of the log of two phenix.refine runs, one explicitly using TRUNCATE F's:
| 13:  2.1079 -  2.0524 0.97   2546  135 0.1862 0.2383
| 14:  2.0524 -  2.0023 0.96   2545  128 0.1874 0.2623
| 15:  2.0023 -  1.9568 0.96   2499  149 0.1920 0.2562
| 16:  1.9568 -  1.9152 0.92   2452  132 0.2106 0.2385
| 17:  1.9152 -  1.8769 0.95   2500  130 0.2169 0.2895
| 18:  1.8769 -  1.8415 0.83   2182  115 0.2403 0.2680
which is a table of resolution range, completeness, reflection counts for work and free reflections and the R-factors for work and free sets. This next table is from the phenix.refine defaults (IMEAN converted to F, rejecting IMEAN ≤ 0):
| 12:  2.1292 -  2.0684 0.90   2536  123 0.1851 0.2381
| 13:  2.0684 -  2.0139 0.88   2462  123 0.1813 0.2654
| 14:  2.0139 -  1.9648 0.87   2400  136 0.1936 0.2637
| 15:  1.9648 -  1.9201 0.81   2273  130 0.2090 0.2789
| 16:  1.9201 -  1.8793 0.83   2303  118 0.2216 0.2761
| 17:  1.8793 -  1.8417 0.72   1998  106 0.2388 0.2790
It's just the outermost data shells in each case. Obviously there's less data in the latter case. If you take a look at the "REMARK 3" lines in the PDB files you find out that the default (Imean) behavior has 45999 reflections in refinement of which 2339 are in the free set versus the Truncate data which has 48895 reflections of which 2494 are in the free set. So phenix.refine throws away 5.7% of your data overall, and 11% of it in the outermost shell.

I have to emphasize that this is pretty decent data collected to 1.84 Å resolution with quite a low degree of anisotropy - imagine what it's doing to your data if it isn't as good?
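To put a rough number on the statistical point made earlier, here is a quick Monte Carlo sketch. The true intensity, sigma and seed are invented purely for illustration: even an honestly-measured weak shell with <I>/σ(I) around 2 produces a visible fraction of negative observed intensities.

```python
import random

# Simulate repeated measurement of a weak reflection: true intensity 2.0,
# Gaussian measurement noise with sigma 1.0, so <I>/sigma(I) is about 2,
# comparable to the outer shells of the data discussed in this post.
# (These numbers are made up for the demo, not taken from the real data.)
random.seed(42)
n = 100_000
true_i = 2.0
sigma = 1.0
measured = [random.gauss(true_i, sigma) for _ in range(n)]
neg_frac = sum(1 for i in measured if i <= 0) / n
print(f"fraction of observations with I <= 0: {neg_frac:.3f}")
```

In this toy model a couple of percent of the observations come out negative, and the fraction grows rapidly as the true signal weakens relative to sigma - which is why a hard I > 0 cut preferentially deletes the weakest shells.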

What would REFMAC and CNS do?

REFMAC5 appears to ignore data with |F|=0 (tested with v5.4.0077). As I recall, CNS also ignores data with |F|=0 - I have not formally tested this recently, but that is the way it used to behave. What neither program does is unilaterally grab the intensity data from your file. You explicitly tell them which data to use, which avoids confusion. You can either do the hard truncation (I<0 becomes F=0) or you can do French & Wilson, but either way you tell the program what to use. phenix.refine is unique in that it will unilaterally grab data by default in ambiguous cases (more than one I/F data source in the MTZ file).

How Much Does This Really Matter?

Twice now I've tried to get the Phenix developers to fix the rejection of F=0 reflections. Simply including them as F=0 would be a big step forward, since this puts restraints on what the model F can be; deleting them from the data removes any requirement that F-calc be "weak". Replies from the Phenix developers expressing a lack of urgency about this tend to make me grind my teeth a little, because the usefulness of this weak data is not a sudden revelation. It has been obvious for many years, and in fact NOT deleting your weak data via sigma cutoffs has been repeatedly emphasized in the modern era of maximum likelihood targets. (Something tells me I should document this.)

But let's make a test case from the current structure I'm working on. The data goes to 1.84 Å and the model is relatively final (i.e. I'm tinkering with the loops and placating MolProbity). I'm using my usual method of converting the data with SCALEPACK2MTZ and TRUNCATE, with the French and Wilson truncation in action (truncate=yes). By request from Peter Zwart I also included a comparison with the "massage" option of phenix.reflection_file_converter. Because this is an active project, the filenames are generic.

If you want to skip a couple of pages of test case, the bottom line is that inclusion of the weak data improves the model. Comparing the same (common) R-free subsets between three coordinate files refined against Imean, Truncate'd and Massage'd data indicates that the effect is quite large: a difference of up to 1% at an R-free of 24% between the worst case (the phenix.refine default) and the best case ("Massage" data), with the Truncate results close to the Massage results. We can haggle over the meaning of "large", but most of us would be pretty happy with that sort of model improvement at this stage of refinement. I'd assume that in more anisotropic cases, where the proportion of weak data is larger, this effect might be larger too, but that remains to be demonstrated. In any event it's hard to advocate for the current default (undocumented) behavior of phenix.refine.

A Long-winded Exposition of the Test Case

Take the extant MTZ file that contains intensity data IMEAN, SIGIMEAN (via SCALEPACK2MTZ) and the derived structure factors F, SIGF from TRUNCATE with truncate=yes.
Run:
phenix.reflection_file_converter mydata_truncate-unique.mtz --label='IMEAN,SIGIMEAN' --massage-intensities --write_mtz_amplitudes --mtz_root_label=FM --mtz=massage.mtz
to generate, via this alternative method, the structure factors I shall call "Massage" for the rest of this post. These are approximately equivalent to the TRUNCATE structure factors but come from a different algorithm. Use CCP4's CAD program to merge the F,SIGF,IMEAN,SIGIMEAN,FreeR_flag data from the original file with FM,SIGFM from the massaged data. Call this the "superset" dataset, mydata_superset.mtz.

This is the snippet from MTZDUMP for this data:

   1 ASC    -38      35      0  100.00     -5.1     14.4  84.38   1.84   H  H
   2 NONE     0      26      0  100.00      9.7      9.7  84.38   1.84   H  K
   3 NONE     0      49      0  100.00     18.6     18.6  84.38   1.84   H  L
   4 NONE -246.1 39869.3  1489   97.05   678.60   682.10  32.94   1.84   J  IMEAN
   5 NONE    5.0  2335.7  1489   97.05    58.26    58.26  32.94   1.84   Q  SIGIMEAN
   6 NONE   20.9  1970.5  1489   97.05   194.95   194.95  32.94   1.84   F  F
   7 NONE    3.4    59.5  1489   97.05    17.29    17.29  32.94   1.84   Q  SIGF
   8 NONE    0.0    19.0     0  100.00     9.50     9.50  84.38   1.84   I  FreeR_flag
   9 NONE    2.0   199.8  1489   97.05    20.16    20.16  32.94   1.84   F  FM
  10 NONE    0.3     5.8  1489   97.05     1.71     1.71  32.94   1.84   Q  SIGFM
Note the negative minimum for IMEAN in the third column, which lists the lower limit of the value range for each data item.

Now I create an Imean > 0 subset of this data using SFTOOLS via "select col imean > 0" and "purge". Call this mydata_subset.mtz:

   1 ASC    -38      35      0  100.00     -4.9     14.1  32.94   1.84   H  H
   2 NONE     0      26      0  100.00      9.6      9.6  32.94   1.84   H  K
   3 NONE     0      49      0  100.00     18.6     18.6  32.94   1.84   H  L
   4 NONE    0.0 39869.3     0  100.00   723.29   723.29  32.94   1.84   J  IMEAN
   5 NONE    5.0  2335.7     0  100.00    58.64    58.64  32.94   1.84   Q  SIGIMEAN
   6 NONE   28.8  1970.5     0  100.00   204.35   204.35  32.94   1.84   F  F
   7 NONE    3.4    59.5     0  100.00    16.98    16.98  32.94   1.84   Q  SIGF
   8 NONE    0.0    19.0     0  100.00     9.50     9.50  32.94   1.84   I  FreeR_flag
   9 NONE    3.1   199.8     0  100.00    21.11    21.11  32.94   1.84   F  FM
  10 NONE    0.3     5.8     0  100.00     1.64     1.64  32.94   1.84   Q  SIGFM
And there are no longer any IMEAN < 0, as desired.

Because the three conversion procedures give slightly different F's, I refine a starting coordinate file against the MTZ file with three different data selections:

  1. Default (F via Imean, phenix.refine standard default behavior)
  2. Truncate (F via Truncate=yes)
  3. Massage (F via Massaged data)
In each case the SUPERSET mtz file was used to refine the PDBs. Each PDB file was saved separately. I'm using:
phenix.refine start.pdb mydata_superset.mtz --overwrite
phenix.refine start.pdb mydata_superset.mtz --overwrite refinement.input.xray_data.labels="F,SIGF"
phenix.refine start.pdb mydata_superset.mtz --overwrite refinement.input.xray_data.labels="FM,SIGFM"
with all defaults. The first incarnation grabs IMEAN from the MTZ file and creates F from it.
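For clarity, here is a toy Python version of what that default path appears to do - inferred from the observed behavior, not from the Phenix source - including the standard first-order error propagation for the sigmas:

```python
import math

def default_imean_to_f(intensities):
    """Toy model of phenix.refine's default IMEAN handling, as inferred
    from its observed behavior (NOT from the actual source code):
    reject I <= 0 outright, then convert F = sqrt(I), with the usual
    first-order error propagation sigma(F) = sigma(I) / (2*F)."""
    kept = [(i, s) for i, s in intensities if i > 0]
    return [(math.sqrt(i), s / (2.0 * math.sqrt(i))) for i, s in kept]

# Four toy (IMEAN, SIGIMEAN) pairs; two are weak (I <= 0).
data = [(400.0, 20.0), (-8.0, 5.0), (0.0, 4.0), (25.0, 6.0)]
print(default_imean_to_f(data))  # the two weak reflections silently vanish
```

The TRUNCATE and Massage routes differ precisely in the first line of the conversion: instead of discarding I ≤ 0, they map those intensities to small positive amplitudes.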

Superset results - just the Remark 3 from the PDB file

1. (IMEAN F's)
REMARK   3  DATA USED IN REFINEMENT.
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 1.842   
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 32.943  
REMARK   3   MIN(FOBS/SIGMA_FOBS)              : 0.02  
REMARK   3   COMPLETENESS FOR RANGE        (%) : 91.38 
REMARK   3   NUMBER OF REFLECTIONS             : 46001     
REMARK   3  
REMARK   3  FIT TO DATA USED IN REFINEMENT.
REMARK   3   R VALUE     (WORKING + TEST SET) : 0.1927
REMARK   3   R VALUE            (WORKING SET) : 0.1898
REMARK   3   FREE R VALUE                     : 0.2485
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.08  
REMARK   3   FREE R VALUE TEST SET COUNT      : 2339      
This doesn't look too bad until you compare it to the others:

2. (TRUNCATE F's)
REMARK   3  DATA USED IN REFINEMENT.
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 1.841   
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 32.943  
REMARK   3   MIN(FOBS/SIGMA_FOBS)              : 1.34  
REMARK   3   COMPLETENESS FOR RANGE        (%) : 97.10 
REMARK   3   NUMBER OF REFLECTIONS             : 48897     
REMARK   3  
REMARK   3  FIT TO DATA USED IN REFINEMENT.
REMARK   3   R VALUE     (WORKING + TEST SET) : 0.1958
REMARK   3   R VALUE            (WORKING SET) : 0.1930
REMARK   3   FREE R VALUE                     : 0.2475
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.10  
REMARK   3   FREE R VALUE TEST SET COUNT      : 2494      
3. (MASSAGE F's)
REMARK   3  DATA USED IN REFINEMENT.
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 1.841   
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 32.943  
REMARK   3   MIN(FOBS/SIGMA_FOBS)              : 1.45  
REMARK   3   COMPLETENESS FOR RANGE        (%) : 97.11 
REMARK   3   NUMBER OF REFLECTIONS             : 48904     
REMARK   3  
REMARK   3  FIT TO DATA USED IN REFINEMENT.
REMARK   3   R VALUE     (WORKING + TEST SET) : 0.1933
REMARK   3   R VALUE            (WORKING SET) : 0.1905
REMARK   3   FREE R VALUE                     : 0.2447
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.10  
REMARK   3   FREE R VALUE TEST SET COUNT      : 2494      
And in fact the Massage data may be the marginal winner here so far. Again, notice the discrepancies between the Imean and either the Truncate or Massage data in terms of completeness. The inclusion of more weak data in R-work and R-free in the two latter cases means we expect the R-factors to increase in the absence of other effects. R-work is in fact higher, but R-free is already showing signs of being reduced.

Now, to get the runs to use the same R-free reflections for the calculation, I use the subset data. This is important: to do a clean comparison of R-free you have to use the same test set, and I have to use the lowest common denominator of reflections to get past the F>0 cutoff internal to phenix.refine. I use the PDB files refined against the superset with each label selection, because each matches its respective |F| set the closest. The catch is that the bulk solvent correction is not read from the PDB file, so I have to re-refine that. I turn off the xyz and ADP refinements, so all the program should do is an overall anisotropic B and bulk solvent parameter refinement; it will not modify the F-calc values derived directly from the model, since x,y,z,B are not changing. Here's the .def file I use to drive it (edited for each PDB and column label set):

refinement {
  crystal_symmetry {
    unit_cell = 70.44799805 49.14699936 90.79100037 90 111.6679993 90
    space_group = "P 1 21 1"
  }
  refine {
    strategy = individual_sites individual_sites_real_space rigid_body \
               individual_adp group_adp tls occupancies group_anomalous
  }
  input {
    pdb {
      file_name = "start_refine_massage.pdb"
    }
    xray_data {
      file_name = "mydata_subset.mtz"
      labels = "FM,SIGFM"
      r_free_flags {
        file_name = "mydata_subset.mtz"
        label = "FreeR_flag"
        test_flag_value = 0
      }
    }
  }
}
I ran a single test to check that this protocol, applied to the output PDB from the Massage refinement, reproduces the same stats as the original refinement - and it does when using the superset data. With the subset data I have to chop out the hexdigest line in each PDB file, which phenix.refine uses as a checksum on the contents of the test set; that's the extent of the modifications to the coordinate files. What we expect is identical stats for the Imean F (phenix.refine default) case, and lower numbers for Truncate and Massage. Since the weakest data has been removed from the Truncate and Massage data, it's not surprising that R-work decreases. The fact that R-free is lower for the Truncate and Massage cases, despite using the same free set as the Imean case, is a clear indication that the model is better when more data is included in refinement.
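It's worth being concrete about why the R-work drop on the pruned data is expected even with a fixed model. A synthetic sketch (all numbers invented, no real refinement involved): hold a noisy "model" fixed and compute R = Σ|Fo - Fc| / ΣFo with and without the weakest reflections.

```python
import random

# Why pruning weak data flatters R-work: compute R for a fixed imperfect
# "model" against all observations, then against the same observations
# with the weakest ones removed. Everything here is synthetic.
random.seed(7)
f_obs = [random.uniform(1.0, 100.0) for _ in range(5000)]
# Fc = Fo plus noise whose size does not shrink with Fo, so weak
# reflections carry proportionally larger relative errors.
f_calc = [max(fo + random.gauss(0.0, 5.0), 0.0) for fo in f_obs]

def r_factor(obs, calc):
    """Conventional crystallographic R-factor: sum|Fo-Fc| / sum(Fo)."""
    return sum(abs(o - c) for o, c in zip(obs, calc)) / sum(obs)

pairs = list(zip(f_obs, f_calc))
strong = [(o, c) for o, c in pairs if o > 10.0]  # emulate a weak-data cut
r_all = r_factor(*zip(*pairs))
r_cut = r_factor(*zip(*strong))
print(f"R(all) = {r_all:.4f}, R(weak removed) = {r_cut:.4f}")
```

The pruned R is lower purely because the sum in the denominator grows faster than the error sum when the weak terms are dropped - the model has not improved at all. That is why, below, only the behavior of R-free on a common test set is a meaningful comparison.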

1. (IMEAN F's)

REMARK   3  DATA USED IN REFINEMENT.
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 1.842   
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 32.943  
REMARK   3   MIN(FOBS/SIGMA_FOBS)              : 0.02  
REMARK   3   COMPLETENESS FOR RANGE        (%) : 91.38 
REMARK   3   NUMBER OF REFLECTIONS             : 46001     
REMARK   3  
REMARK   3  FIT TO DATA USED IN REFINEMENT.
REMARK   3   R VALUE     (WORKING + TEST SET) : 0.1928
REMARK   3   R VALUE            (WORKING SET) : 0.1899
REMARK   3   FREE R VALUE                     : 0.2485
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.08  
REMARK   3   FREE R VALUE TEST SET COUNT      : 2339      
2. (TRUNCATE F's)
REMARK   3  DATA USED IN REFINEMENT.
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 1.842   
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 32.943  
REMARK   3   MIN(FOBS/SIGMA_FOBS)              : 1.43  
REMARK   3   COMPLETENESS FOR RANGE        (%) : 91.38 
REMARK   3   NUMBER OF REFLECTIONS             : 45998     
REMARK   3  
REMARK   3  FIT TO DATA USED IN REFINEMENT.
REMARK   3   R VALUE     (WORKING + TEST SET) : 0.1910
REMARK   3   R VALUE            (WORKING SET) : 0.1884
REMARK   3   FREE R VALUE                     : 0.2408
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.09  
REMARK   3   FREE R VALUE TEST SET COUNT      : 2339      
3. (MASSAGE F's)
REMARK   3  DATA USED IN REFINEMENT.
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 1.842   
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 32.943  
REMARK   3   MIN(FOBS/SIGMA_FOBS)              : 2.00  
REMARK   3   COMPLETENESS FOR RANGE        (%) : 91.39 
REMARK   3   NUMBER OF REFLECTIONS             : 46003     
REMARK   3  
REMARK   3  FIT TO DATA USED IN REFINEMENT.
REMARK   3   R VALUE     (WORKING + TEST SET) : 0.1885
REMARK   3   R VALUE            (WORKING SET) : 0.1859
REMARK   3   FREE R VALUE                     : 0.2385
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.08  
REMARK   3   FREE R VALUE TEST SET COUNT      : 2339      
The Massage data stats win again, with a full 1% difference in R-free, but the Truncate data is also quite a lot lower than the Imean-derived data. Because I don't think anything exotic is going on with the |F| values, I attribute this solely to the difference in the number of reflections we are refining against. Notice the R-free set counts: this is the SAME R-free set (same Miller indices) in each case. The only difference is the F values, which are in any case quite similar, and we've matched the |F| sets to their respective PDB files. The Imean results from the superset and subset data should be the same, since the only difference between the datasets is the explicit Imean > 0 cut applied via SFTOOLS, which mimics the rejection criterion internal to phenix.refine.

Conclusions? It's easy - albeit tedious - to show that inclusion of the weakest data via Truncate improves the quality of the refinement, even if the numerical values of R-free and R-work appear higher when using the Truncate data.