The title is a riff on Csh Programming Considered Harmful and also an overt indicator of why I'm writing this. You've been warned.
Side note: life evolves, and so do programs. The cutoff on the more recent version (1.6.2) of Phenix.refine now enforces a cutoff that is |F|≥0 not |F|>0 which certainly helps with this phenomenon.
I'm a convert to using phenix.refine for protein structure refinement since it has some nice feature combinations. When I was doing a low resolution refinement of the 20S proteasome structure I used a group B-factor and TLS model that was unavailable in another refinement program (CNS can't do TLS, REFMAC doesn't do grouped B-factors). While REFMAC might still perhaps rule the roost at high resolution I have the impression that phenix.refine may be top dog at lower resolutions. And I work mostly at lower resolutions.
Like other programs it takes a while to get used to the quirks and syntax. Phenix.refine syntax is quirkier than most - if I was going to slur it I'd say it looked like it was designed by computer scientists, but it looks positively idyllic if you've ever looked at MOLREP syntax. It's workable. The biggest problems I run into is the program trying to be smart. In some cases it may in fact be smarter than me, but in other cases it's clearly rather more stupid. This is one of these latter cases.My default method for going from SCALEPACK (HKL suite) to MTZ (CCP4 suite) is via SCALEPACK2MTZ and TRUNCATE, both programs in CCP4. SCALEPACK2MTZ is just a reformatting program. TRUNCATE (see French G.S. and Wilson K.S. Acta. Cryst. (1978), A34, 517) coverts intensities to structure factor moduli and has the option of modifying structure factor values so that the weak reflections whose intensities are negative end up with small positive F's. I use this option frequently. While this certainly squicks some crystallographers out - at least in part because of the assumptions in the intensity data distribution - empirically it doesn't significantly increase working R-factors (the new |F|'s are reasonable) and the great benefit is that you can now use all that weak data in refinement since that weak data now has |F| > 0. For anisotropic data - which has a lot of weak reflections at the higher resolution limits - this is somewhere between useful and essential. Additionally since we routinely push resolution limits in data processing where the outer shells have <I>/σI is about 2 (as is the case for the data discussed below), it's almost inevitable that a decent proportion of the data is weak and can have an intensity of < 0 just by statistical fluctuation in measurements. The "Massage" algorithm mentioned below in phenix.reflection_file_converter is broadly related to the French & Wilson approach but is of this form:
|Fnew| = sqrt((Io+sqrt(Io**2 +2sigma**2))/2.0).and at least in this particular test case gives the best results. "Massage" is their nomenclature, not mine, and I've got no opinion over whether it's better or worse than French and Wilson. In this single test case it appears to be a little better.
Problem is, phenix.refine tries really very hard to circumnavigate what you've done with TRUNCATE. When you convert .sca with SCALEPACK2MTZ it creates IMEAN and SIGIMEAN intensity data from the SCALEPACK file. When TRUNCATE converts the intensities to structure factor moduli it leaves the unmodified IMEAN data in the output MTZ file. That data stays there if you add Free R flags using the "uniqueify" script. If you then use this data in phenix.refine:
phenix.refine mydata_truncate-unique.mtz mydata.pdbthe first thing the program does is ignore that structure factor data and use the IMEAN data instead. On top of that it does a simple dumb conversion procedure of discarding all the data with IMEAN ≤ 0 which means phenix.refine will by default reject all that weak data you tried to hard to preserve using TRUNCATE. You can check for this by looking for a line like:
labels = "IMEAN,SIGIMEAN"in the .eff or .def files created by refinement. You can force phenix.refine to use the labels that you want it to by either editing the .def/.eff files or putting
refinement.input.xray_data.labels="F,SIGF"on the command line during refinement, i.e.:
phenix.refine mydata_truncate-unique.mtz mydata.pdb refinement.input.xray_data.labels="F,SIGF"Or you can simply run CCP4's CAD program and omit the intensity data. This is dumbing down your MTZ file for consumption by phenix.refine but it might not be the worst idea to prevent this undocumented stealth feature coming back to haunt you.
| 13: 2.1079 - 2.0524 0.97 2546 135 0.1862 0.2383 | 14: 2.0524 - 2.0023 0.96 2545 128 0.1874 0.2623 | 15: 2.0023 - 1.9568 0.96 2499 149 0.1920 0.2562 | 16: 1.9568 - 1.9152 0.92 2452 132 0.2106 0.2385 | 17: 1.9152 - 1.8769 0.95 2500 130 0.2169 0.2895 | 18: 1.8769 - 1.8415 0.83 2182 115 0.2403 0.2680which is a table of resolution range, completeness, reflection counts for work and free reflections and the R-factors for work and free sets. This next table is from the phenix.refine defaults (IMEAN converted to F, rejecting IMEAN ≤ 0):
| 12: 2.1292 - 2.0684 0.90 2536 123 0.1851 0.2381 | 13: 2.0684 - 2.0139 0.88 2462 123 0.1813 0.2654 | 14: 2.0139 - 1.9648 0.87 2400 136 0.1936 0.2637 | 15: 1.9648 - 1.9201 0.81 2273 130 0.2090 0.2789 | 16: 1.9201 - 1.8793 0.83 2303 118 0.2216 0.2761 | 17: 1.8793 - 1.8417 0.72 1998 106 0.2388 0.2790It's just the outermost data shells in each case. Obviously there's less data in the latter case. If you take a look at the "REMARK 3" lines in the PDB files you find out that the default (Imean) behavior has 45999 reflections in refinement of which 2339 are in the free set versus the Truncate data which has 48895 reflections of which 2494 are in the free set. So phenix.refine throws away 5.7% of your data overall, and 11% of it in the outermost shell.
I have to emphasize that this is pretty decent data collected to 1.84 Å resolution with a quite low degree of anisotropy - imagine what it's doing to your data if it isn't as good ?
But let's make a test case, given the current structure I'm working on. This data goes to 1.84 Å, the model is relatively final (i.e. I'm tinkering with the loops and to placate Molprobity). I'm using my usual method of converting the data with SCALEPACK2MTZ and TRUNCATE with the French and Wilson truncate model in action (truncate=yes). By request from Peter Zwart I also included a comparison with the "massage" option of phenix.reflection_file_converter. Because this is an active project the filenames are generic.
If you want to skip a couple of pages of test case, the bottom line is that inclusion of the weak data improves the model, and comparing the same (common) R-free subsets between three coordinate files refined against Imean, Truncate'd and Massage'd data indicates that the effect is quite large - a difference of up to 1% at an R-free of 24% between the worst case (phenix.refine default) and best case ("Massage" data) with the Truncate results close to the Massage results. We can haggle over the meaning of "large" but most of us would be pretty happy with that sort of model improvement at this stage of refinement. I'd assume that in more anisotropic cases where the proportion of weak data is larger this effect might be larger too, but this remains to be demonstrated. In any event it's hard to advocate for the current default (undocumented) behavior of phenix.refine.
phenix.reflection_file_converter mydata_truncate-unique.mtz --label='IMEAN,SIGIMEAN' --massage-intensities --write_mtz_amplitudes --mtz_root_label=FM --mtz=massage.mtzto generate the structure factors F's I shall call "Massage" in the rest of this via the alternative method. These are approximately equivalent to the TRUNCATE structure factors but use a different algorithm. Use CCP4's CAD program to merge F,SIGF,IMEAN,SIGIMEAN,Free_R_flag data from the original file and FM,SIGFM from the massaged data. Call this the "superset" dataset mydata_superset.mtz.
This is the snippet from MTZDUMP for this data:
1 ASC -38 35 0 100.00 -5.1 14.4 84.38 1.84 H H 2 NONE 0 26 0 100.00 9.7 9.7 84.38 1.84 H K 3 NONE 0 49 0 100.00 18.6 18.6 84.38 1.84 H L 4 NONE -246.1 39869.3 1489 97.05 678.60 682.10 32.94 1.84 J IMEAN 5 NONE 5.0 2335.7 1489 97.05 58.26 58.26 32.94 1.84 Q SIGIMEAN 6 NONE 20.9 1970.5 1489 97.05 194.95 194.95 32.94 1.84 F F 7 NONE 3.4 59.5 1489 97.05 17.29 17.29 32.94 1.84 Q SIGF 8 NONE 0.0 19.0 0 100.00 9.50 9.50 84.38 1.84 I FreeR_flag 9 NONE 2.0 199.8 1489 97.05 20.16 20.16 32.94 1.84 F FM 10 NONE 0.3 5.8 1489 97.05 1.71 1.71 32.94 1.84 Q SIGFMNote the presence of Imean < 0 in the third column which is the lower limit of the value range for the data.
Now I create an Imean > 0 subset of this data using SFTOOLS via "select col imean > 0" and "purge". Call this mydata_subset.mtz
1 ASC -38 35 0 100.00 -4.9 14.1 32.94 1.84 H H 2 NONE 0 26 0 100.00 9.6 9.6 32.94 1.84 H K 3 NONE 0 49 0 100.00 18.6 18.6 32.94 1.84 H L 4 NONE 0.0 39869.3 0 100.00 723.29 723.29 32.94 1.84 J IMEAN 5 NONE 5.0 2335.7 0 100.00 58.64 58.64 32.94 1.84 Q SIGIMEAN 6 NONE 28.8 1970.5 0 100.00 204.35 204.35 32.94 1.84 F F 7 NONE 3.4 59.5 0 100.00 16.98 16.98 32.94 1.84 Q SIGF 8 NONE 0.0 19.0 0 100.00 9.50 9.50 32.94 1.84 I FreeR_flag 9 NONE 3.1 199.8 0 100.00 21.11 21.11 32.94 1.84 F FM 10 NONE 0.3 5.8 0 100.00 1.64 1.64 32.94 1.84 Q SIGFMAnd there are no longer any IMEAN < 0, as desired.
Because of the slightly different F's in the three cases below by virtue of the three conversion procedures, I refine a starting coordinate file against the MTZ file with three different selections:
phenix.refine start.pdb mydata_superset.mtz --overwrite phenix.refine start.pdb mydata_superset.mtz --overwrite refinement.input.xray_data.labels="F,SIGF" phenix.refine start.pdb mydata_superset.mtz --overwrite refinement.input.xray_data.labels="FM,SIGFM"with all defaults. The first incarnation grabs IMEAN from the MTZ file and creates F from it.
REMARK 3 DATA USED IN REFINEMENT. REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 1.842 REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 32.943 REMARK 3 MIN(FOBS/SIGMA_FOBS) : 0.02 REMARK 3 COMPLETENESS FOR RANGE (%) : 91.38 REMARK 3 NUMBER OF REFLECTIONS : 46001 REMARK 3 REMARK 3 FIT TO DATA USED IN REFINEMENT. REMARK 3 R VALUE (WORKING + TEST SET) : 0.1927 REMARK 3 R VALUE (WORKING SET) : 0.1898 REMARK 3 FREE R VALUE : 0.2485 REMARK 3 FREE R VALUE TEST SET SIZE (%) : 5.08 REMARK 3 FREE R VALUE TEST SET COUNT : 2339This doesn't look too bad until you compare it to the others:
REMARK 3 DATA USED IN REFINEMENT. REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 1.841 REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 32.943 REMARK 3 MIN(FOBS/SIGMA_FOBS) : 1.34 REMARK 3 COMPLETENESS FOR RANGE (%) : 97.10 REMARK 3 NUMBER OF REFLECTIONS : 48897 REMARK 3 REMARK 3 FIT TO DATA USED IN REFINEMENT. REMARK 3 R VALUE (WORKING + TEST SET) : 0.1958 REMARK 3 R VALUE (WORKING SET) : 0.1930 REMARK 3 FREE R VALUE : 0.2475 REMARK 3 FREE R VALUE TEST SET SIZE (%) : 5.10 REMARK 3 FREE R VALUE TEST SET COUNT : 24943. (MASSAGE F's)
REMARK 3 DATA USED IN REFINEMENT. REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 1.841 REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 32.943 REMARK 3 MIN(FOBS/SIGMA_FOBS) : 1.45 REMARK 3 COMPLETENESS FOR RANGE (%) : 97.11 REMARK 3 NUMBER OF REFLECTIONS : 48904 REMARK 3 REMARK 3 FIT TO DATA USED IN REFINEMENT. REMARK 3 R VALUE (WORKING + TEST SET) : 0.1933 REMARK 3 R VALUE (WORKING SET) : 0.1905 REMARK 3 FREE R VALUE : 0.2447 REMARK 3 FREE R VALUE TEST SET SIZE (%) : 5.10 REMARK 3 FREE R VALUE TEST SET COUNT : 2494And in fact the Massage data may be the marginal winner here so far. Again, notice the discrepancies between the Imean and either Truncate or Massage data in terms of completeness. The inclusion of more weak data in R-work and R-free in these two latter cases means that we expect the R-factors to increase in the absence of other effects. The R-work is in fact higher, but the R-free is already showing signs of being reduced. Now in order to try and get them to use the same R-free reflections for the calculation I use the subset data. This is important since to do a clean comparison of R-free you have to use the same test set. I have to use the lowest common denominator to get past the F>0 cutoff internal to phenix.refine. I use PDB files refined against the individual supersets because these match their respective |F| the closest. The issue is that the bulk solvent correction is not read from the PDB file so I have to re-refine that. I turn off the xyz and ADP refinements so all it should do is an overall anisoB and bulk solvent parameter refinement. It will not modify the F-calc values derived directly from the model since x,y,z,B are not changing here. Here's the .def file I use to drive it (edited for each PDB and column label set)
refinement { crystal_symmetry { unit_cell = 70.44799805 49.14699936 90.79100037 90 111.6679993 90 space_group = "P 1 21 1" } refine { strategy = individual_sites individual_sites_real_space rigid_body \ individual_adp group_adp tls occupancies group_anomalous } input { pdb { file_name = "start_refine_massage.pdb" } xray_data { file_name = "mydata_subset.mtz" labels = "FM,SIGFM" r_free_flags { file_name = "mydata_subset.mtz" label = "FreeR_flag" test_flag_value = 0 } } } }I ran a single test to check that I get the same stats from the above refinement for massage and the one using the output PDB and this protocol, and I do when using the superset data. With the subset data I have to chop out the hexdigest line in each PDB file that phenix.refine uses as a checksum on the contents of the test set. That's the extent of modifications to the coordinate file. What we expect is identical stats for the Imean F (phenix.refine default) case, and lower #s for the Truncate and Massage. Since the weakest data has been removed from the Truncate and Massage data it's not surprising that R-work decreases. The fact that R-free is lower for Truncate and Massage cases despite using the same free set as the Imean case is a clear indication that the model is better when including more data in refinement.
1. (IMEAN F's)
REMARK 3 DATA USED IN REFINEMENT. REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 1.842 REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 32.943 REMARK 3 MIN(FOBS/SIGMA_FOBS) : 0.02 REMARK 3 COMPLETENESS FOR RANGE (%) : 91.38 REMARK 3 NUMBER OF REFLECTIONS : 46001 REMARK 3 REMARK 3 FIT TO DATA USED IN REFINEMENT. REMARK 3 R VALUE (WORKING + TEST SET) : 0.1928 REMARK 3 R VALUE (WORKING SET) : 0.1899 REMARK 3 FREE R VALUE : 0.2485 REMARK 3 FREE R VALUE TEST SET SIZE (%) : 5.08 REMARK 3 FREE R VALUE TEST SET COUNT : 23392. (TRUNCATE F's)
REMARK 3 DATA USED IN REFINEMENT. REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 1.842 REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 32.943 REMARK 3 MIN(FOBS/SIGMA_FOBS) : 1.43 REMARK 3 COMPLETENESS FOR RANGE (%) : 91.38 REMARK 3 NUMBER OF REFLECTIONS : 45998 REMARK 3 REMARK 3 FIT TO DATA USED IN REFINEMENT. REMARK 3 R VALUE (WORKING + TEST SET) : 0.1910 REMARK 3 R VALUE (WORKING SET) : 0.1884 REMARK 3 FREE R VALUE : 0.2408 REMARK 3 FREE R VALUE TEST SET SIZE (%) : 5.09 REMARK 3 FREE R VALUE TEST SET COUNT : 23393. (MASSAGE F's)
REMARK 3 DATA USED IN REFINEMENT. REMARK 3 RESOLUTION RANGE HIGH (ANGSTROMS) : 1.842 REMARK 3 RESOLUTION RANGE LOW (ANGSTROMS) : 32.943 REMARK 3 MIN(FOBS/SIGMA_FOBS) : 2.00 REMARK 3 COMPLETENESS FOR RANGE (%) : 91.39 REMARK 3 NUMBER OF REFLECTIONS : 46003 REMARK 3 REMARK 3 FIT TO DATA USED IN REFINEMENT. REMARK 3 R VALUE (WORKING + TEST SET) : 0.1885 REMARK 3 R VALUE (WORKING SET) : 0.1859 REMARK 3 FREE R VALUE : 0.2385 REMARK 3 FREE R VALUE TEST SET SIZE (%) : 5.08 REMARK 3 FREE R VALUE TEST SET COUNT : 2339The Massage data stats win again with a full 1% difference in R-free but the Truncate data is also quite a lot lower than the Imean-derived data. Because I don't think anything exotic is going on with the |F| values, I attribute this solely to the difference in the number of reflections we are refining against. Notice the R-free set counts. This is the SAME R-free set (Miller Indices) in each of the cases. The only difference is the F values which are in any case quite similar, and we've matched up the |F| sets to their respective PDB files. Imean results from superset and subset data should be the same since the difference in the datasets was only an explicit Imean > 0 cut that emulates the rejection criteria internally in phenix.refine. The external cut via SFTOOLS mimics the internal cut that phenix.refine applies to the data.
Conclusions ? It's easy - albeit tedious - to show that inclusion of the weakest data via Truncate improves the quality of the refinement, even if the numerical values of R-free and R-work appear higher when using Truncate data.