Open Discussion of the RCSB PDB and its Role in Modeling and Structure Prediction
December 13, 1998, Asilomar, CA

Taking advantage of the CASP3 meeting in Asilomar, California, the RCSB held a workshop on December 13, 1998. The aim of the workshop was to report on the status of and plans for the PDB and to engage a large subset of PDB users in open discussion of future direction and policy. The meeting was chaired by Helen Berman and Phil Bourne and was attended by key representatives from Glaxo Wellcome, MRC, Columbia University, NIH, University College London, University of Wisconsin, Rutgers University, UC San Diego, EBI, Molecular Simulations, Inc., and Stanford University. The meeting began with introductions of all participants and a statement of purpose: to provide information, address concerns, and gain input.

Helen Berman described the two grand challenges for the RCSB: (1) relating sequence, structure, and function and (2) helping researchers, educators, and industry. These challenges must be met in the face of an increase in both the number and the variety of users.

Helen then provided an overview of all aspects of the RCSB PDB. The key features of the RCSB resource were described as rapid and reliable data processing, a commitment to uniform data, versatile query and reporting capability for individual structures across the archive, and links to other data.

RCSB data processing is a single integrated system for deposition, annotation, validation, database management, and archiving. The system is based on a community-reviewed standard (CIF) and should scale automatically with new content and technology. The data-processing system is adaptable to different types of molecules (protein, nucleic acid), handles multiple input and output formats, can be used both for primary deposition and for creating data uniformity, and can be customized according to the type of experiment (e.g., X-ray, NMR). The deposition system is designed to accept Web input and rapidly return annotated and validated entries. The depositor receives information on sequence neighbors, structural neighbors, and experimental validation, together with a validation summary (ProCheck, Nucheck, SFCheck). In the future, it is expected that the deposition component will be distributed, the ligand database and data-harvesting approaches will be integrated, and more links to other resources will be added.

In discussion, it was pointed out that there is a strong synergy between data quality and obtaining reliable results from queries. A great deal of difficulty in obtaining reliable information from searches across the database could be avoided if the structure files were processed in a consistent manner. Query features supported by the new search engines include iterative queries, multiple structure analysis, and resources including structure alignments and neighboring.

Electronic distribution of the PDB will continue, with SDSC as the primary distribution site and a network of mirrors and partial mirrors. A CD-ROM of the PDB will be distributed from NIST, which will maintain the master archive. Outreach efforts will continue and will involve the community of depositors, users, and software developers via the help desk, newsletter, and Web.

John Westbrook described data-processing requirements and the system being established at the RCSB. Requirements of the data-processing system include the ability to capture information in a flexible and efficient manner, store information in a well-defined and uniform manner (facilitating query and exchange), and use methods of information technology that will scale well with volume and content. The data representation is the mmCIF standard, for which the dictionary currently has more than 1,700 definitions. This representation defines relationships between data items, types, range restrictions, and allowed values. The representation uses a simple table-like organization of data and data definitions. Furthermore, the dictionary is fully software-accessible. This standard representation received a detailed review by community experts, is maintained by the IUCr, evolves with science, and is a foundation for data exchange and interoperability. A point that was stressed is that dictionaries are key to data processing: they provide a standard electronic description of all terminology, they make software extensible to changes in data and content, and they provide a framework in which to handle reference data (e.g., ligands, modified amino acids).
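To illustrate the point that a software-accessible dictionary can drive data processing, the following is a minimal Python sketch of dictionary-driven validation. It is not the RCSB system. The dictionary contents, the field names ("type", "minimum", "allowed"), and the function validate_item are all hypothetical; the item names are written in the mmCIF style, but the definitions attached to them here are invented for illustration and are not copied from the mmCIF dictionary.

```python
# Illustrative only: a toy dictionary, NOT the actual mmCIF dictionary.
# Each entry describes one data item: its type, an optional range
# restriction, and an optional set of allowed values (all assumed here).
TOY_DICTIONARY = {
    "_cell.length_a": {"type": float, "minimum": 0.0},
    "_exptl.method": {
        "type": str,
        "allowed": {"X-RAY DIFFRACTION", "SOLUTION NMR"},
    },
}


def validate_item(name, raw_value, dictionary=TOY_DICTIONARY):
    """Return a list of problems found for one data item (empty if none)."""
    definition = dictionary.get(name)
    if definition is None:
        return [f"{name}: not defined in dictionary"]

    # Type check: convert the raw text using the type named in the definition.
    try:
        value = definition["type"](raw_value)
    except ValueError:
        type_name = definition["type"].__name__
        return [f"{name}: cannot parse {raw_value!r} as {type_name}"]

    problems = []

    # Range restriction, applied only if the definition declares one.
    minimum = definition.get("minimum")
    if minimum is not None and value < minimum:
        problems.append(f"{name}: {value} is below minimum {minimum}")

    # Allowed values, applied only if the definition enumerates them.
    allowed = definition.get("allowed")
    if allowed is not None and value not in allowed:
        problems.append(f"{name}: {value!r} is not an allowed value")

    return problems


if __name__ == "__main__":
    for item, text in [
        ("_cell.length_a", "52.3"),
        ("_cell.length_a", "-1"),
        ("_exptl.method", "GUESSWORK"),
    ]:
        print(item, text, validate_item(item, text))
```

In this sketch, adding a definition to the dictionary changes what is checked without any change to the validation code, which is one way to read the statement that dictionaries make software extensible.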

Phil Bourne provided an overview of the PDB query systems that are being implemented. They include systems that support basic and complex queries. Helge Weissig gave a computer demonstration of the SearchLite system that will be available in the first phase of the RCSB implementation.