Structure Validation at the RCSB PDB

To provide the community with high-quality data, the RCSB has developed a number of tools for depositing and processing X-ray and NMR structures. Deposition is accomplished through ADIT, which collects all the information necessary for data processing. Data processing involves checking various aspects of the structure and the data collected through ADIT.

Some of these checks are available through the Validation Server, a new tool that allows the user to check the format of coordinate and structure factor files and to perform a variety of validation tests on the structure prior to deposition in the PDB. The server is available at http://pdb.rutgers.edu/validate/, and users can run these checks independently.

Validation involves two steps: the coordinate format check and structure validation. The coordinate format precheck produces a brief report identifying any changes that need to be made in a data file in order to obtain a validation report.

Structure validation presents the user with a validation summary letter that contains a collection of structural and nomenclature diagnostics, including bond distance and angle comparisons, chirality, close contacts, and sequence comparisons. This letter is designed to identify features of a structure that may require special attention (e.g., close contacts) and to present this information in a concise summary report. The validation letter is produced by the RCSB program MAXit [1].

In addition, the Validation Server presents an Atlas summary page, graphic images, and diagnostic reports from a variety of programs: PROCHECK v3.4.4 [2], PROCHECK-NMR [3], NUCHECK [4], SFCHECK [5], and CURVES 5.1 [6]. Future versions of the Validation Server will include output from WHAT IF [7].

Once a structure is deposited to the PDB, it is checked using the same procedures that are available through the Validation Server, and many corrections are made automatically. Other aspects of the structure require expert judgment to provide fully processed PDB entries. This is done by the data curators in collaboration with the depositors.

Structure Checks

In this section, we outline the results contained in the validation summary letter for both NMR and X-ray crystal structures.

1. Covalent Bond Distances and Angles

Covalent bond distances and angles for macromolecules are compared against standard values. For proteins, these are taken from Engh and Huber [8]. The standard values for nucleic acid bases are taken from Clowney, et al. [9], and for nucleic acid sugar and phosphates from Gelbin, et al. [10]. Bonds and angles related to hydrogens are not checked.

For each type of bond (e.g., N-CA, N-C) or angle (e.g., N-CA-C, CA-CB-CG), the RMS deviation of that bond or angle (V_actual) relative to the standard value (V_standard) is:

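$$\mathrm{RMSD} = \left( \frac{\sum (V_{\mathrm{actual}} - V_{\mathrm{standard}})^2}{N} \right)^{1/2}$$

where N is the number of individual bonds or angles of a particular type included in the summation.
The value V_actual for a particular bond or angle is listed as a 4*RMSD violation if:

$$\left| V_{\mathrm{actual}} - V_{\mathrm{standard}} \right| > 4 \times \mathrm{RMSD}$$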

In addition, the validation summary provides the total average deviations from standard dictionaries. This offers an overall measure of agreement with the standard values.

Other methods exist to make this comparison; however, this approach tends to highlight only serious outliers. The cutoff value of 4*RMSD was selected to maintain compliance with previous PDB practice. In consultation with the community and in response to various comments, including those of our advisors, we plan to change this value to 6*RMSD to ensure the reporting of truly serious outliers.
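The outlier test described in this section can be sketched in a few lines of Python. This is only an illustration of the formula, not the MAXit implementation: the function name `rmsd_outliers`, its arguments, and the sample values are hypothetical, and in practice the standard value would come from the dictionaries cited above.

```python
import math


def rmsd_outliers(actual_values, standard_value, factor=4.0):
    """Return (rmsd, outliers) for one bond or angle type.

    actual_values  -- observed values (V_actual) for every instance of the type
    standard_value -- the dictionary value (V_standard) for that type
    factor         -- multiple of the RMSD used as the cutoff (4 in the text)
    """
    if not actual_values:
        raise ValueError("need at least one observed value")
    deviations = [value - standard_value for value in actual_values]
    rmsd = math.sqrt(sum(d * d for d in deviations) / len(deviations))
    # Keep the index of each flagged instance so it can be traced back.
    outliers = [
        (index, value)
        for index, (value, d) in enumerate(zip(actual_values, deviations))
        if abs(d) > factor * rmsd
    ]
    return rmsd, outliers


# Illustrative data only: 19 bonds at a hypothetical standard length
# and one bond stretched by 0.1 Angstroms.
observed = [1.458] * 19 + [1.558]
rmsd, outliers = rmsd_outliers(observed, standard_value=1.458)
print(round(rmsd, 4), outliers)
```

One consequence of measuring deviations against the RMSD of the same set is that no value can exceed factor*RMSD unless the type has more than factor squared instances (more than 16 for 4*RMSD), because the largest single deviation is never more than RMSD times the square root of N.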

2. Stereochemical Validation

All chiral centers of proteins and nucleic acids are checked for correct stereochemistry. Violations of standard stereochemistry are reported for both proteins and nucleic acids using the following method:

Neighboring atoms a, b and c of the chiral center form vectors V_a, V_b and V_c with the center.

The chiral volume is:

$$V_C = V_a \cdot (V_b \times V_c)$$

If the sign of the actual chiral volume is different from the standard chiral volume, a chirality violation is listed.
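As a minimal sketch of this sign test, assuming the chiral volume is the scalar triple product of the three vectors, the check could look like the following. The function names and the toy coordinates are hypothetical, and the standard chiral volume would in practice come from a reference dictionary rather than being passed in by hand.

```python
def chiral_volume(center, a, b, c):
    """Signed volume V_a . (V_b x V_c) for neighbors a, b, c of a chiral center."""
    va = [p - q for p, q in zip(a, center)]
    vb = [p - q for p, q in zip(b, center)]
    vc = [p - q for p, q in zip(c, center)]
    cross = (
        vb[1] * vc[2] - vb[2] * vc[1],
        vb[2] * vc[0] - vb[0] * vc[2],
        vb[0] * vc[1] - vb[1] * vc[0],
    )
    return sum(x * y for x, y in zip(va, cross))


def has_chirality_violation(center, a, b, c, standard_volume):
    """True if the actual and standard chiral volumes differ in sign."""
    actual = chiral_volume(center, a, b, c)
    return (actual > 0) != (standard_volume > 0)


# Toy geometry: three neighbors along the coordinate axes.
center = (0.0, 0.0, 0.0)
a, b, c = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
print(chiral_volume(center, a, b, c))   # positive handedness
print(chiral_volume(center, a, c, b))   # swapping two neighbors flips the sign
```

Because swapping any two neighbors flips the sign, the order in which a, b and c are taken must match the order used to define the standard volume.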

3. Atom Nomenclature

The nomenclature of all atoms is checked for compliance with the current PDB standard. In some cases, this nomenclature is not in complete agreement with IUPAC [11] standards. This is particularly true for hydrogen atoms. Correspondence of PDB hydrogen nomenclature with the nomenclature used by IUPAC and many refinement programs is available at http://www.bmrb.wisc.edu/ref_info/atom_nom.tbl. In the near future, the hydrogen nomenclature will be brought into compliance with IUPAC standards.

In addition, particular attention is paid to the nomenclature of hydrogens on the ND2 atoms of Asn residues, the NE2 atoms of Gln residues, and the NH1 and NH2 atoms of Arg residues for agreement with the standard for E/Z orientation presented by the IUPAC [11]. For nucleic acids, the labeling of O1P/O2P atoms is checked against the convention defined by the IUBMB [12].

During processing, the nomenclature of all of the above atoms is adjusted if necessary.

4. Close Contacts

MAXit calculates the distances between all atoms within the asymmetric unit of crystal structures and the unique molecule of NMR structures. For crystal structures, contacts between symmetry-related molecules are checked as well. These checks include ligand and solvent molecules in addition to the macromolecular structure. Atoms less than 2.2 Angstroms apart are listed as close contacts.

Interactions of atoms forming standard bonds, defined through PDB LINK records, or related by 1-4 contacts, are not listed as close contacts.

In crystal structures, atoms that have full occupancy and lie on special positions are listed as having close contacts to indicate that a lower occupancy is appropriate. Atoms with full occupancy related by a crystallographic symmetry element will be listed as having close contacts if they are less than 2.2 Angstroms apart.

If disulfide bridges are denoted in the coordinate file with SSBOND records, they will not be listed as close contacts.

In data annotation, close contacts corresponding to metal coordination are represented as LINK records in the PDB file.
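The basic distance test in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not MAXit's procedure: atoms are passed as a plain dictionary of labels and coordinates, pairs that should be ignored (standard bonds, LINK or SSBOND pairs, 1-4 contacts) are supplied by the caller as `excluded_pairs`, and symmetry-related copies are not generated. All names and coordinates are hypothetical.

```python
import math
from itertools import combinations

CLOSE_CONTACT_CUTOFF = 2.2  # Angstroms, the cutoff quoted in the text


def close_contacts(atoms, excluded_pairs=frozenset(), cutoff=CLOSE_CONTACT_CUTOFF):
    """List atom pairs closer than the cutoff.

    atoms          -- dict mapping an atom label to its (x, y, z) coordinates
    excluded_pairs -- set of frozenset({label1, label2}) pairs to skip
    """
    contacts = []
    for (name1, xyz1), (name2, xyz2) in combinations(atoms.items(), 2):
        if frozenset((name1, name2)) in excluded_pairs:
            continue
        distance = math.dist(xyz1, xyz2)
        if distance < cutoff:
            contacts.append((name1, name2, distance))
    return contacts


# Hypothetical fragment: a bonded N-CA pair, a water too close to N,
# and a distant oxygen.
atoms = {
    "A:N": (0.0, 0.0, 0.0),
    "A:CA": (1.46, 0.0, 0.0),
    "HOH1:O": (0.0, 1.9, 0.0),
    "B:O": (10.0, 0.0, 0.0),
}
bonded = {frozenset(("A:N", "A:CA"))}
for name1, name2, distance in close_contacts(atoms, bonded):
    print(name1, name2, round(distance, 2))
```

The all-pairs loop is quadratic in the number of atoms, which is fine for a sketch; a production check on a large structure would more likely use a spatial grid or tree to limit the pairs examined.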

5. Ligand and Atom Nomenclature

For all ligands, as well as standard residues and bases, the names of residues and atoms are compared against the nomenclature used in the PDB dictionary (ftp://ftp.rcsb.org/pub/pdb/data/monomers/het_dictionary.txt). Unrecognized ligand groups are flagged, and any discrepancies in known ligands are listed as extra or missing atoms.
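The extra/missing atom report amounts to a set comparison between the atom names in the file and those in the dictionary. The sketch below assumes the dictionary has already been loaded into a Python mapping from residue name to atom names; the function name, the residue name "LIG", and its atom names are invented for illustration and do not reflect the real dictionary file format or contents.

```python
def check_ligand(residue_name, observed_names, dictionary):
    """Compare a residue's atom names against a dictionary entry.

    dictionary -- mapping of residue name to an iterable of expected atom names
    """
    if residue_name not in dictionary:
        # Unrecognized group: nothing to compare against, so just flag it.
        return {"recognized": False, "missing": [], "extra": []}
    expected = set(dictionary[residue_name])
    observed = set(observed_names)
    return {
        "recognized": True,
        "missing": sorted(expected - observed),
        "extra": sorted(observed - expected),
    }


# Hypothetical dictionary with one made-up ligand.
dictionary = {"LIG": ["C1", "C2", "O1", "N1"]}
print(check_ligand("LIG", ["C1", "C2", "O1", "O2"], dictionary))
print(check_ligand("UNK", ["C1"], dictionary))
```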

When structures are processed, residue and atom nomenclature for existing HET groups is corrected to follow the residue and atom naming convention given in the PDB HET group dictionary at ftp://ftp.rcsb.org/pub/pdb/data/monomers/het_dictionary.txt. For new ligands, a residue name is assigned, with preference given to the name provided by the author, and the ligand is compared topologically against the dictionary to find similar molecules. Such similar molecules are used to create the most appropriate atom nomenclature for the new group.

We are currently standardizing the existing HET group dictionary in order to make it more usable both for ourselves and the community. Significant effort is being made to classify the contents of the dictionary and to correct errors. For example, the same group may exist with multiple names or similar groups may have widely varying atom nomenclature. As soon as the new dictionary is completed, it will be made publicly available.

6. Sequence Comparison

The sequence given in the PDB SEQRES records is compared against the sequence derived from the coordinate records. This information is displayed in a table where any differences or missing residues are marked.
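A table of this kind could be built as sketched below. The sketch assumes the hard part, mapping each coordinate residue onto its position in the SEQRES sequence, has already been done; how the server performs that mapping is not described here. The function name, the status labels, and the example sequence are hypothetical.

```python
def compare_sequences(seqres, coord_residues):
    """Build comparison rows of (position, SEQRES name, coordinate name, status).

    seqres         -- list of residue names from the SEQRES records
    coord_residues -- dict mapping 1-based SEQRES position to the residue
                      name found in the coordinate records (absent if unmodeled)
    """
    rows = []
    for position, expected in enumerate(seqres, start=1):
        observed = coord_residues.get(position)
        if observed is None:
            status = "missing"
        elif observed != expected:
            status = "mismatch"
        else:
            status = "ok"
        rows.append((position, expected, observed, status))
    return rows


# Hypothetical four-residue chain: residue 1 is not modeled and
# residue 3 was built as alanine.
seqres = ["MET", "ALA", "GLY", "SER"]
coords = {2: "ALA", 3: "ALA", 4: "SER"}
for row in compare_sequences(seqres, coords):
    print(*row)
```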

During structure processing, the sequence database references given by DBREF and SEQADV are checked for accuracy. If no reference is given, a BLAST [13] search is used to find the best match. Any conflicts between the PDB SEQRES records and the sequence derived from the coordinate records are resolved by comparison with various sequence databases. Residues in disordered regions modeled as alanines are switched both in the SEQRES and in the coordinate section to their true residue names. In general, the sequence and coordinates are made to reflect the sequence of the protein studied, even if it was not possible to model every region.

7. Distant Waters

MAXit calculates the distances between all water oxygen atoms and all polar atoms (oxygen and nitrogen) of the macromolecules, ligands, and solvent in the asymmetric unit. Waters farther than 3.5 Angstroms from all of these polar atoms are listed in the validation report. Thus, waters in the second, third, and higher hydration shells are excluded from the list. Water-water groupings that are, as a whole, distant from the macromolecule or ligands are included in the listing. Waters farther than 5.0 Angstroms are listed in the PDB file.

For X-ray crystal structures, the validation summary also lists distant waters (> 3.5 Angstroms from polar atoms of the macromolecules, ligands, or solvent of the asymmetric unit) that can be moved closer to the asymmetric unit through the application of symmetry operations. For example, if the closest contact that the oxygen of a water molecule makes with a polar atom of the macromolecules, ligands, or solvent of the asymmetric unit is 5.5 Angstroms, and a symmetry operation (for example, 2_456 in space group P 21) would place this water in closer proximity to a polar group of the asymmetric unit, then the water will be relocated.
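One way to read the distant-water rule is as a connectivity test: a water is kept off the list if it is within 3.5 Angstroms of a macromolecule or ligand polar atom, or if it can be reached from such a water through a chain of water-water contacts. The sketch below implements that interpretation, which is an assumption rather than a description of MAXit's algorithm. It ignores symmetry operations, and all names and coordinates are hypothetical.

```python
import math


def distant_waters(waters, polar_atoms, cutoff=3.5):
    """Return labels of waters not connected to the macromolecule or ligands.

    waters      -- dict mapping a water label to its oxygen (x, y, z)
    polar_atoms -- list of (x, y, z) for macromolecule and ligand N/O atoms
    """
    # First shell: waters within the cutoff of a polar atom.
    attached = {
        name
        for name, xyz in waters.items()
        if any(math.dist(xyz, polar) <= cutoff for polar in polar_atoms)
    }
    # Outer shells: grow outwards through water-water contacts.
    frontier = list(attached)
    while frontier:
        current = frontier.pop()
        for name, xyz in waters.items():
            if name not in attached and math.dist(xyz, waters[current]) <= cutoff:
                attached.add(name)
                frontier.append(name)
    return sorted(set(waters) - attached)


# Hypothetical example: W1 is first shell, W2 is second shell,
# W3 and W4 form an isolated water-water pair.
polar_atoms = [(0.0, 0.0, 0.0)]
waters = {
    "W1": (3.0, 0.0, 0.0),
    "W2": (6.0, 0.0, 0.0),
    "W3": (20.0, 0.0, 0.0),
    "W4": (22.0, 0.0, 0.0),
}
print(distant_waters(waters, polar_atoms))
```

Under this reading, the isolated pair W3/W4 is reported even though the two waters are close to each other, which matches the statement that water-water groupings distant as a whole are included.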

For all structures, the Atlas Summary presents molecular graphic images so that the overall appearance of the structure can be checked. For X-ray crystal structures, molecular graphic images are generated in GIF and VRML for the asymmetric unit and for crystal packing. For NMR structures, molecular images are generated for the structure and the ensemble.

Any questions regarding the use or the content of the Validation Server should be directed to help@rcsb.rutgers.edu.

  1. Macromolecular Exchange and Input Tool, a program originally developed by the Nucleic Acid Database Project to assist in the processing and curation of macromolecular structure data.
  2. R.A.Laskowski, et al., "PROCHECK: A Program To Check the Stereochemical Quality of Protein Structures," J. Appl. Cryst., 26 (1993): 283-291.
  3. R.A.Laskowski, et al., "AQUA and PROCHECK-NMR: Programs For Checking The Quality of Protein Structures Solved by NMR," J. of Biomol. NMR, 8 (1996): 477-486.
  4. Z.Feng, J.Westbrook, H.M.Berman, Rutgers University, New Brunswick, NJ, (1998): NDB-407.
  5. A.A.Vaguine, J.Richelle, S.J.Wodak, Acta Cryst. D, 55 (1999): 191-205.
  6. R.Lavery and H.Sklenar, "The Definition of Generalized Helicoidal Parameters and of Axis Curvature in Irregular Nucleic Acids," J. Biomol. Struct. Dynam., 6 (1988): 63-91.
  7. G.Vriend, "WHAT IF: A Molecular Modeling and Drug Design Program," J. Mol. Graph. 8 (1990): 52-56.
  8. R.A.Engh and R.Huber, "Accurate Bond and Angle Parameters for X-ray Protein structure refinement," Acta Crystallogr, A47 (1991): 392-400.
  9. L.Clowney et al., "Geometric Parameters in Nucleic Acids: Nitrogenous Bases," J. Am. Chem. Soc., 118 (1996): 509-518.
  10. A.Gelbin et al., "Geometric Parameters in Nucleic Acids: Sugar and Phosphate Constituents," J. Am. Chem. Soc., 118 (1996) 519-529.
  11. J.L.Markley, et al., "Recommendations for the Presentation of NMR Structures of Proteins and Nucleic Acids," Pure & Appl. Chem., 70 (1998): 117-142.
  12. C.Liebecq, Compendium of Biochemical Nomenclature and Related Documents, 2d ed., Portland Press: London and Chapel Hill, 1992.
  13. Z.Zhang, et al., "Protein sequence similarity searches using patterns as seeds," Nucleic Acids Res., 26 (1998): 3986-3990.