The RCSB has reached out to the NMR community by bringing together experts in the field to form a Task Force that will advise and guide us on those aspects of the PDB that are unique to NMR structure files. Among these issues are protein nomenclature, multiple ID codes, ensemble data, constraint files, and validation concerns. Additionally, we desire Task Force counsel in our development of a new deposition tool and its underlying data dictionary. Task Force members are:
The Task Force held a short organizational meeting at the Keystone Frontiers of NMR in Molecular Biology VI Symposium in Breckenridge, CO, in January, followed by a one-day workshop in Rockville, MD, on April 23, 1999. The workshop format included morning sessions of short presentations and discussions and an afternoon session devoted to working on the deposition tool and validation needs.
Gary Gilliland, RCSB Project Team Member, NIST, welcomed the Task Force in his opening remarks. Helen Berman, director of the RCSB, presented an overview of the RCSB and stressed our desire to work with the community to develop a database that would best serve both the NMR and the user communities. John Westbrook presented an overview of the deposition/annotation system used by the RCSB and discussed the dictionary on which the system is built. He briefly reviewed the current validation procedures and discussed format issues that need to be addressed.
Eldon Ulrich, representing the BioMagResBank (BMRB: http://www.bmrb.wisc.edu/), provided an update on the BMRB. The BMRB has been receiving a substantial portion of its chemical shift data through PDB/EBI, and the RCSB will continue to accept chemical shift data and send them on to the BMRB. The BMRB and the RCSB are committed to working together to develop and implement a single interface for NMR deposition of coordinates and experimental data.
Diane Hancock, RCSB Member, NIST, discussed the current deposition process, addressed format issues in detail, and described ADIT, the AutoDep Input Tool (http://pdb.rutgers.edu/adit/). ADIT's goal is to facilitate access to the maximum amount of deposited data while not placing an undue burden on the depositor. Discussion followed on which data are essential and must be collected from the depositor to populate the database, and which items, while not essential, would be useful to have. It was clearly recognized that many data items, while useful, might by their sheer number preclude a timely deposition; i.e., if we ask for more, we might get less. The deposition form under development relies heavily on pull-down menus, so that many data entries require only the click of a mouse.
Highlights of Task Force discussions on a number of issues follow.
Task Force members agreed that IUPAC nomenclature [1,2] should be adopted in the coordinate files. It was recognized that this would involve substantial changes in the files, as IUPAC nomenclature will require:
Although these changes will establish consistency with the BMRB nomenclature, they will create inconsistencies with existing literature and constraint files. Efforts will have to be made to document these differences.
The Task Force members present agreed that a single study should be assigned one accession code, i.e., all members of the ensemble and the minimized average structure should be placed under one code rather than the multiple codes that now exist. In the old PDB files, the atoms were numbered sequentially from model to model, with the result that the field limit for the atom number often necessitated the creation of multiple files with multiple accession codes, e.g., 1SAF, 1SAJ, 1SAH. In newly deposited PDB files, the atom serial numbers are sequential within each model of the ensemble, so that all models of the ensemble have the same atom serial numbers. The CONECT records, provided for the first model, are therefore applicable to all the models.
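The effect of the two numbering schemes can be illustrated with a minimal Python sketch. It assumes the five-column atom serial field of the fixed-column PDB ATOM record; the function names and the ensemble size used below are hypothetical and chosen only for illustration.

```python
# Assumption: the atom serial number occupies a 5-character field
# (columns 7-11 of a fixed-column PDB ATOM record).
SERIAL_FIELD_WIDTH = 5
MAX_SERIAL = 10 ** SERIAL_FIELD_WIDTH - 1  # largest serial that fits: 99999


def serials_across_models(atoms_per_model, n_models):
    """Old style: numbering continues from one model to the next."""
    return [
        [m * atoms_per_model + i + 1 for i in range(atoms_per_model)]
        for m in range(n_models)
    ]


def serials_within_model(atoms_per_model, n_models):
    """New style: every model restarts at 1, so all models share serials."""
    return [list(range(1, atoms_per_model + 1)) for _ in range(n_models)]


def fits_in_one_file(serials):
    """True if every serial number fits in the fixed-width field."""
    return all(s <= MAX_SERIAL for model in serials for s in model)


if __name__ == "__main__":
    # Hypothetical ensemble: 40 models of 3000 atoms each.
    print(fits_in_one_file(serials_across_models(3000, 40)))
    print(fits_in_one_file(serials_within_model(3000, 40)))
```

With numbering restarted in each model, a given serial number refers to the same atom in every model, which is why one set of CONECT records can serve the whole ensemble.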
The minimized average structure will need to be identified so that it is not inadvertently downloaded and treated as just another member of the ensemble. A suffix tag, e.g., .ave or .rep, could be used to identify the minimized average structure or a representative structure, respectively. Hence a single accession code might include the ensemble (.model1, .model2, .model3 ...) and the average (.ave) or representative (.rep) structures.
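As a rough sketch of how such suffix tags might be used by downstream software, the Python fragment below separates ensemble members from average and representative structures. The `code.suffix` naming pattern, the placeholder accession code `0xyz`, and the function names are assumptions made for illustration; the Task Force proposed only the suffixes themselves.

```python
import re

# Matches the proposed ensemble suffixes: model1, model2, model3, ...
MODEL_SUFFIX = re.compile(r"^model(\d+)$")


def classify(name):
    """Split 'code.suffix' and return (code, kind, model_number_or_None)."""
    code, _, suffix = name.partition(".")
    if suffix == "ave":
        return code, "average", None
    if suffix == "rep":
        return code, "representative", None
    match = MODEL_SUFFIX.match(suffix)
    if match:
        return code, "ensemble", int(match.group(1))
    raise ValueError(f"unrecognized suffix in {name!r}")


def ensemble_members(names):
    """Keep only true ensemble members, dropping .ave and .rep entries."""
    return [n for n in names if classify(n)[1] == "ensemble"]


if __name__ == "__main__":
    # '0xyz' is a placeholder, not a real accession code.
    names = ["0xyz.model1", "0xyz.model2", "0xyz.ave", "0xyz.rep"]
    print(ensemble_members(names))
```

Filtering on the suffix in this way keeps the minimized average structure out of any calculation that should run over ensemble members only.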
The following ensemble issues were raised:
There seemed to be no strong opinions regarding these issues and no consensus was reached.
The Task Force reviewed the current RCSB validation process for the NMR data, which relies on Procheck-NMR [3] for protein ensembles, Procheck [4] for the minimized average protein structures, and Nucheck for nucleic acid structures (ADIT Validation Server: http://pdb.rutgers.edu/validate/). It was felt that these programs did a good job of alerting authors to geometric anomalies that might signal a structural problem. (Note: The RCSB validation procedures were discussed in detail in the April 1999 PDB Newsletter.)
There was limited discussion of what should be done to validate the experimental data. Deposition of the chemical shift data and NOE peak tables, including peak areas or volumes, would allow further validation. There was some interest in constructing a template for these data. Additionally, a listing of atoms that float during structure calculations would be valuable.
The question of how constraint data should be formatted was raised. The opinion was that adopting a single common format would not be possible because the programs differ, e.g., some programs use pseudo atoms and others do not, the types of constraints vary, etc. Validation of experimental data will require further consideration, both with respect to what could or should be done and what additional data the community is willing to deposit.
The ADIT deposition/annotation system is built on the mmCIF dictionary and the NMR STAR extensions. The dictionary provides precise definitions and examples and defines data relationships, data types, range restrictions, allowed values, etc. It is organized in table-like structures called "categories." The Task Force critiqued this new deposition tool and proposed changes that are being incorporated. In the future, we hope that data harvesting can evolve to the point where more data become available with less depositor effort but, in the meantime, we will have to rely on the goodwill of the depositors.
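To give a feel for how a dictionary of this kind can drive checking of deposited values, here is a minimal Python sketch. It is not the mmCIF dictionary or the ADIT implementation: the category name, item names, and rules are hypothetical, and only the idea of data types, range restrictions, and allowed values grouped by category comes from the description above.

```python
# Hypothetical category and item names; not taken from the real dictionary.
DICTIONARY = {
    "example_nmr_ensemble": {
        "conformers_submitted": {"type": int, "range": (1, None)},
        "selection_criteria": {
            "type": str,
            "allowed": {"lowest energy", "fewest violations"},
        },
    }
}


def check_item(category, item, value):
    """Return a list of problems found for one value (empty if none)."""
    rules = DICTIONARY[category][item]
    if not isinstance(value, rules["type"]):
        # A wrong data type makes the remaining checks meaningless.
        return [f"expected {rules['type'].__name__}"]
    problems = []
    low, high = rules.get("range", (None, None))
    if low is not None and value < low:
        problems.append(f"below minimum {low}")
    if high is not None and value > high:
        problems.append(f"above maximum {high}")
    allowed = rules.get("allowed")
    if allowed is not None and value not in allowed:
        problems.append("not an allowed value")
    return problems


if __name__ == "__main__":
    print(check_item("example_nmr_ensemble", "conformers_submitted", 0))
    print(check_item("example_nmr_ensemble", "selection_criteria", "lowest energy"))
```

Items with a fixed set of allowed values are also the natural candidates for pull-down menus in a deposition form, since the menu entries can be read directly from the dictionary.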