The Protein Data Bank holdings contain considerable redundancy in sequence and structure. The primary PDB Web site and its mirrors now offer an option to select a subset of structures from which homologous sequences have been largely removed. This option, which is available from all search interfaces, filters the set of structures that match a particular query.
When sequence homologues are removed from queries made via the home page or SearchLite, the results are representative protein structures that share less than 90% sequence similarity with one another. SearchFields lets the user select 50%, 70%, or 90% similarity as the cut-off value. The user can then toggle between the complete set of results and the reduced subset by using the options menu at the top of the Query Result Browser.
While sequence homology is defined on a per-chain basis, results are returned on a per-structure basis. Results may therefore differ from non-redundant sets compiled outside the PDB. The CD-HIT algorithm (Cluster Database at High Identity with Tolerance) is used to remove redundant sequences and leave only the representatives (Li, W., Jaroszewski, L. and Godzik, A. (2001) Bioinformatics 17:282-283). CD-HIT can be found at http://bioinformatics.ljcrf.edu/cd-hi/.
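As a rough illustration of how chain-level clustering can yield a structure-level result, here is a minimal Python sketch of greedy incremental clustering, the general strategy CD-HIT follows: the longest sequence becomes the first representative, and each remaining sequence joins the first representative it matches at or above the cut-off. It is not the CD-HIT implementation, which uses a short-word filter and banded alignment. The `identity` measure is a toy, ungapped comparison. The chain IDs, sequences, and the rule "keep a structure if at least one of its chains is a representative" are assumptions made for illustration only, since the announcement does not specify how chains are mapped back to structures.

```python
from __future__ import annotations


def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions over the shorter sequence.

    Assumption: ungapped, left-aligned comparison (not CD-HIT's alignment).
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n


def greedy_cluster(chains: dict[str, str], cutoff: float) -> dict[str, list[str]]:
    """Greedy incremental clustering of chains.

    Chains are processed longest first. Each chain joins the first existing
    representative it matches at or above `cutoff`; otherwise it becomes a
    new representative. Returns {representative_id: [member_ids]}.
    """
    clusters: dict[str, list[str]] = {}
    ordered = sorted(chains, key=lambda cid: (-len(chains[cid]), cid))
    for chain_id in ordered:
        for rep in clusters:
            if identity(chains[rep], chains[chain_id]) >= cutoff:
                clusters[rep].append(chain_id)
                break
        else:
            clusters[chain_id] = [chain_id]
    return clusters


def representative_structures(clusters: dict[str, list[str]]) -> list[str]:
    """Collapse chain-level representatives to structure IDs.

    Assumption: chain IDs look like '1AAA_A', and a structure is kept if at
    least one of its chains is a cluster representative.
    """
    return sorted({rep.split("_")[0] for rep in clusters})


# Hypothetical chains; the IDs and sequences are invented for this sketch.
example_chains = {
    "1AAA_A": "MKTAYIAKQRQISFVKSHFSRQ",
    "1AAA_B": "GSHMLEDPVDAFKNLT",
    "2BBB_A": "MKTAYIAKQRQISFVKSHFSRA",  # differs from 1AAA_A at one position
    "3CCC_A": "GSHMLEDPVDAFKNLT",        # identical to 1AAA_B
}

if __name__ == "__main__":
    for cutoff in (0.5, 0.7, 0.9):
        clusters = greedy_cluster(example_chains, cutoff)
        print(cutoff, representative_structures(clusters))
```

Because the representatives are chosen per chain but reported per structure, the reduced set depends on both the cut-off and the chain-to-structure rule, which is one reason such sets can differ between resources.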
Further information about this new feature is available at http://www.rcsb.org/pdb/redundancy.html. Questions or comments on this feature may be sent to firstname.lastname@example.org.
The RCSB PDB is funded by a grant from the National Science Foundation, the National Institutes of Health, and the US Department of Energy.