PDB COMMUNITY FOCUS: Frank Allen, Cambridge Crystallographic Data Centre

Frank Allen was born in Reading, UK in 1944 and studied chemistry at Imperial College London, receiving a BSc in 1965 and a PhD in 1968. Following postdoctoral work at the University of British Columbia, Vancouver, he joined the University of Cambridge in 1970 to carry out small-molecule structure determinations. Work on the Cambridge Structural Database (CSD) had begun in 1965, and his interest in this project grew during the early 1970s. He has been involved in most major developments at the Cambridge Crystallographic Data Centre (CCDC) since then, including software development and database creation, but with a strong accent on research applications of the CSD. He received the Royal Society of Chemistry Prize for Structural Chemistry in 1994 and the Herman Skolnik Award of the American Chemical Society Division of Chemical Information in 2003. He is now Executive Director of the CCDC and a Visiting Professor of Chemistry at the University of Bristol.

Q: The CCDC turned forty this year. You have been involved for 35 of those 40 years – how have the CCDC and the CSD changed during that time?

A: The CSD has changed from being an academic idea to being an information resource that is used, and relied upon, by several thousand scientists in academia and industry. Annual CSD input has increased by a factor of 30, but amazingly electronic input has only taken over during the last decade. More than 200,000 of our current entries were re-typed from journals and deposition documents right up to the mid-1990s, but the CIF has now changed all that (for the better – although not all CIFs are perfect!). Chemically and crystallographically, we now process much larger structures, more of which are truly novel, and many more of which are metal-organic species with challenging problems of structure representation.

The most fundamental organizational change occurred in 1989, when the CCDC became financially self-supporting and self-managing. This has been both a challenge and a benefit, and one that has enabled us to develop carefully as the business has grown. From a maximum of about 15-20 staff in its agency-funded days, the CCDC is now a non-profit company employing 50 people. Those staff are now highly focused, with software being developed, released and supported to professional standards, and increasing automation being brought to bear on the creation, validation and maintenance of the CSD itself.

Q: The CSD now contains about 370,000 published structures. However, far more structures than that, perhaps approaching a million, have actually been determined but not published. Is there any way in which some or all of these structures can be collected into the CSD?

A: This is indeed a serious problem. Very large numbers of high-quality small-molecule structures are lying dormant in the filing cabinets and internal archives of hundreds of laboratories across the world. The reasons for this situation are varied, and are discussed in an article I wrote recently in Crystallography Reviews (vol. 10, 3-15, 2004). Principally, service crystallographers determine small-molecule structures for chemists, give them the results, and that is as far as many of the structures get. The original structural problem is solved and synthetic or other work can go ahead (or not!). It is really a question of 'ownership' of the crystal structure data, and the will (and time) to place them in the literature, an open archive or a database. Every crystal structure is valuable, and often for reasons that are unrelated to the original research goals. So, it is a major problem for structure-based science that so much valuable data is well on the way to being lost forever. This may also become an issue for macromolecular structures, and it is to be hoped that some of the developments noted below may mitigate the problem. Chemoinformatics and bioinformatics approaches to a whole host of structural problems depend on fully comprehensive (and accurate) data resources. These informatics approaches cannot be maximally effective if the flow of available data is attenuated, as it clearly is now for small molecules.

The CCDC has made strong efforts to attract unpublished data into the CSD, and more than 2,000 unpublished structures have been deposited directly into the CSD in the past 5 years. Section E of Acta Cryst. has also benefited the community by publishing several thousand structures which might otherwise not have seen the light of day. However, these numbers are a drop in the ocean, and the CCDC has supported e-Science initiatives at Southampton, UK and Indiana, USA to encourage unpublished structural data into the public domain. We also welcome proposals announced by the NIH and the UK Research Councils for public archiving of research results obtained with their funding. We must wait to see exactly how these e-Science and funding agency initiatives develop, and will be talking to the organizers of the archives to see how we can be involved, so as to maximize the value of the CSD to the scientific community.

Q: How have your collaborations affected the CCDC?

A: Wholly positively. The CCDC has collaborated with many institutions, both on research applications of the CSD and in the co-development of products. Research collaborations enable CCDC staff to be involved in novel uses of the CSD and to suggest improvements to the distributed software that arise from the work. A recent example is the CCDC's involvement as a partner with Cambridge University and Pfizer Inc. in the Pfizer Institute for Pharmaceutical Materials Science. This has brought together experimental, computational and chemoinformatics approaches to address typical problems faced by development and formulation chemists in the pharmaceutical industry. The research is suggesting new ways of organizing and searching CSD data, and is giving rise to novel software to accomplish these aims. On the product development side, we have co-developed the protein-ligand docking program GOLD with GlaxoSmithKline and the University of Sheffield, Relibase and Relibase+ with Merck, Germany and the University of Marburg, and the DASH software for structure solution from powder diffraction data with the Rutherford Appleton Laboratory in the UK. This has not only broadened the scientific interests of the CCDC, it has also generated additional income which has helped both the CCDC and its academic collaborators, and has enabled us to keep subscriber costs for the CSD itself as low as possible over the past decade.

Q: It would be of great value to users of both the RCSB PDB and the CSD if there were a direct two-way connection between the two databases. What do you foresee as possible collaborations between the CSD and the RCSB PDB, and when do you think they could happen?

A: There are a number of ways in which the RCSB PDB and the CSD could work together. It seems to me that the best way forward is for the CCDC and the RCSB PDB to discuss possibilities together and then bring forward some joint proposals, taking account of their scientific value and the available resources at both organizations. Only then can we determine a way forward and realistic timescales. I don't think it wise for either organization to make unilateral statements.