Plans for Releasing Data from the Uniformity Project

The goal of the PDB Data Uniformity project is to maintain the greatest possible consistency within the entire archive. Uniformity is a key prerequisite for any meaningful query or systematic analysis of the archive. Two complementary methods have been used to update and unify the data in the PDB archive.

The Data Uniformity project began by examining individual entries within groups of chemically related structures. During its long history, the PDB format has undergone a number of changes. In the file-by-file uniformity processing, each entry is brought up to the current PDB format standard. This includes adding, where possible, records that were not present in early entries, correcting outstanding reported problems, and providing standard nomenclature. Each file is then rechecked using our current validation software. Approximately 3000 entries containing nucleic acids, globins, retroviral proteases, and aspartic proteases have been processed in this way.

In addition to the file-by-file uniformity procedures, the Uniformity Project has targeted key records within each PDB entry for archive-wide uniformity processing. This archive-wide approach has been used to update citation, R-factor, and resolution records. The results have been loaded into the PDB database, where they can be accessed from any of the PDB query interfaces or viewed in the PDB Structure Explorer reports. Other records targeted for archive-wide uniformity processing include ligand descriptions, protein classification, sequence, and source data. Some of these records are now available on the PDB beta test website. This work complements the data clean-up project undertaken by the MSD group at the European Bioinformatics Institute. In the future, all of the records resulting from the archive-wide uniformity processing will be updated in the PDB entries as part of the file-by-file uniformity procedure, using the plan described below.

One of the important issues for PDB's ongoing data processing, including its Data Uniformity project, is the management of multiple nomenclatures. The difficulty of providing alternative nomenclatures within the PDB format is well recognized. In assuming the stewardship of the PDB archive, RCSB was charged with the responsibility for maintaining the greatest possible consistency within the entire archive. Unfortunately, uniformity considerations are often at odds with the preferences of depositors, who provide additional insight into the description of an entry that is outside traditional PDB practice. Recent discussions on the PDB list server regarding the assignment of chain identifiers to ligands and solvent provide important illustrations of this ongoing problem.

In planning for the release of the entries from the various uniformity projects, PDB has sought a release scheme that would: (1) provide the flexibility to permit users of the archive to access alternative nomenclatures within the limitations of the existing PDB format, (2) integrate the results of archive-wide and file-by-file uniformity processing, and (3) preserve the integrity of the archival PDB format files available from the PDB ftp site. In consultation with the PDB Database Committee and the PDB Advisory Committee, we arrived at the following plan:

* Data will continue to be distributed in the current PDB data format from the RCSB PDB FTP site. The nomenclature, including chain ID assignment, will continue to follow the rules previously described.

* Data will also be distributed in mmCIF format from a new ftp area. mmCIF provides a detailed and fully parsable description of macromolecular structure and experiment. mmCIF is also equipped to deal with alternative nomenclatures. This mmCIF ftp area will be used to distribute the remediated data from the Data Uniformity project, and to distribute newly processed entries in mmCIF format with support for multiple nomenclatures.

* Software tools to create PDB format files from the new mmCIF files will be provided. These software tools will permit users to select the particular nomenclature to be written to a PDB format file. For instance, it will be possible to create a PDB file using either PDB hydrogen or IUPAC hydrogen atom name conventions or using either author or PDB chain ID conventions. An important benefit of this approach is that all of this flexibility can be provided using a single archival mmCIF file.
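As a rough illustration of the third point, the sketch below shows how one mmCIF record set could yield PDB-style lines under two different chain ID conventions. It is not the planned RCSB software. It relies on the fact that the mmCIF atom_site category carries both `label_asym_id` and `auth_asym_id`; the sample coordinates, the function names `parse_atom_site` and `to_pdb_lines`, the option names `"auth"` and `"label"`, and the simplified parsing and column layout are all assumptions made for this example.

```python
# Hypothetical sketch, not the RCSB conversion tool. The sample data below
# is invented for illustration only.
SAMPLE_ATOM_SITE = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.auth_asym_id
_atom_site.auth_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1 N MET A A 1 10.000 11.000 12.000
ATOM 2 CA MET A A 1 11.000 12.000 13.000
HETATM 3 O HOH B A 101 20.000 21.000 22.000
"""


def parse_atom_site(text):
    """Parse a simple atom_site loop into a list of dicts.

    Assumes whitespace-delimited values with no quoting and a single loop,
    which is far simpler than general mmCIF syntax.
    """
    columns, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_atom_site."):
            columns.append(line.split(".", 1)[1])
        else:
            rows.append(dict(zip(columns, line.split())))
    return rows


def to_pdb_lines(rows, chain_convention="auth"):
    """Format rows as PDB-style coordinate lines with the chosen chain IDs.

    chain_convention is a hypothetical option: "auth" uses auth_asym_id,
    "label" uses label_asym_id. Only single-character chain IDs fit the
    one-column chain field of the PDB format.
    """
    chain_key = {"auth": "auth_asym_id", "label": "label_asym_id"}[chain_convention]
    lines = []
    for row in rows:
        name = row["label_atom_id"]
        # Simplified alignment: names shorter than 4 characters start in column 14.
        padded_name = name if len(name) == 4 else " " + name
        lines.append(
            "{:<6}{:>5} {:<4} {:>3} {:1}{:>4}    {:8.3f}{:8.3f}{:8.3f}".format(
                row["group_PDB"], row["id"], padded_name, row["label_comp_id"],
                row[chain_key], row["auth_seq_id"],
                float(row["Cartn_x"]), float(row["Cartn_y"]), float(row["Cartn_z"]),
            )
        )
    return lines


if __name__ == "__main__":
    rows = parse_atom_site(SAMPLE_ATOM_SITE)
    for convention in ("auth", "label"):
        print("chain IDs from", convention)
        print("\n".join(to_pdb_lines(rows, convention)))
```

In this sample the solvent atom carries a different chain ID under each convention, so the same source records give two PDB-style outputs without any change to the mmCIF data itself.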

The new ftp area for mmCIF data will be implemented in the Fall. More details about this site will be provided in the near future.

Questions and comments should be sent to