Redundancy in the Protein Data Bank
The following table shows the number of non-redundant sequences as determined by blastclust at several levels of sequence identity.
|Method||Description||# of Clusters|
Blast clustering is performed with the following parameters (example for a 95% sequence identity threshold):
-p T -b T -S 95
The '-b T' parameter in BLASTClust means that the length coverage threshold is enforced for both members of a sequence pair.
The '-p T' parameter means that both input sequences are protein sequences.
The '-S 95' parameter sets the percent identity threshold (here 95%) that two sequences must meet to be included in the same cluster.
Note: BLASTClust is run with the default value of the -L parameter, which specifies the length coverage threshold for including sequences in a cluster. It is set to 0.9 by default. This means that the alignment must cover at least 90% of the sequences for them to be clustered together.
See also BLASTClust documentation at NCBI.
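The way the three thresholds combine can be sketched in a few lines of Python. This is only a simplified model of the criteria described above, not BLASTClust's actual implementation: the function name `pair_passes` and its arguments are hypothetical, coverage is assumed to be the alignment length divided by the sequence length, and '-b T' is assumed to require that coverage for both sequences.

```python
def pair_passes(identical, aligned_len, len_a, len_b,
                min_identity=95.0, min_coverage=0.9, both=True):
    """Simplified clustering test for one aligned sequence pair (illustrative only)."""
    # Percent identity over the aligned region (compared against the -S value).
    identity = 100.0 * identical / aligned_len
    # Fraction of each sequence covered by the alignment (compared against -L).
    cov_a = aligned_len / len_a
    cov_b = aligned_len / len_b
    if both:
        # Assumed meaning of -b T: both sequences must reach the coverage threshold.
        covered = min(cov_a, cov_b) >= min_coverage
    else:
        # Assumed meaning of -b F: it is enough if one sequence reaches it.
        covered = max(cov_a, cov_b) >= min_coverage
    return identity >= min_identity and covered
```

Under these assumptions, a 100-residue alignment with 96 identities between chains of length 100 and 105 passes, while the same alignment against a 200-residue chain fails because only half of the longer chain is covered.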
Raw (unsorted) blast clusters of protein chains in the PDB are posted at
(Note: This also contains sorted cd-hit clusters, but they are not used on this web site.)
Blast clusters that are sorted as described under "Algorithm for Removing Similar Sequences" are available via RESTful web services.
As the single worldwide repository for macromolecular structures, the Protein Data Bank holds a body of data that contains considerable redundancy in regard to both sequence and structure. We have incorporated into the query interface the ability to select a subset of structures from which similar sequences have been largely removed. In most cases, the selected subset will contain far fewer structures than the complete result set. However, the following caveats should be kept in mind:
- Sequence similarity is defined on a chain basis, but results are returned on a structure basis.
- Many structures in the PDB contain multiple protein chains, or even hybrids of DNA and protein chains.
- Sequence similarity is only assessed for protein chains.
- The relationship between sequence similarity and structure similarity is complex. Users seeking structure similarity should refer to the options available on the Structure Summary page under "External Links" (in the left hand navigation menu) in the "Structure Classification" section.
The primary purpose of this feature is to filter a list of likely highly similar structures to provide one or more representatives. Results may differ from other so-called non-redundant sets (e.g. PDB_SELECT [Hobohm U. and Sander C., Protein Science, 3: 522-524, 1994]).
Sequence clustering in the PDB is performed via blastclust.
Detailed information for the clusters containing a given structure is available on the "Seq. Similarity" tab of
the Structure Summary page for the respective structure, e.g.
Algorithm for Removing Similar Sequences
The query implementation for removing similar sequences is based on pre-calculated clusters of protein chains. All protein chains of at least 20 amino acids are clustered by blastclust at 100%, 95%, 90%, 70%, 50%, 40%, and 30% sequence identity (defined as number of identical residues out of total in the sequence alignment).
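As a rough illustration of the identity definition above, the following Python sketch computes the percent identity of two already-aligned sequences and lists which of the clustering levels it would satisfy. The function names, the '-' gap character, and the choice to count gap columns toward the alignment total are assumptions made for this example; the clustering itself is done by blastclust.

```python
# Identity levels at which the clusters are calculated (from the text above).
IDENTITY_LEVELS = (100, 95, 90, 70, 50, 40, 30)

def percent_identity(aligned_a, aligned_b, gap="-"):
    """Identical residues divided by the total number of alignment columns."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    # A column counts as identical only if both residues match and are not gaps.
    identical = sum(1 for x, y in zip(aligned_a, aligned_b)
                    if x == y and x != gap)
    return 100.0 * identical / len(aligned_a)

def levels_met(identity):
    """Clustering levels whose identity threshold is reached."""
    return [level for level in IDENTITY_LEVELS if identity >= level]

identity = percent_identity("MKTAYIAK", "MKTAHIAK")  # 7 of 8 columns match
print(identity, levels_met(identity))
```

For this toy pair the identity is 87.5%, so the two chains could share a cluster at the 70%, 50%, 40%, and 30% levels but not at 90% or above.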
In each cluster, the chains are sorted (i.e. ranked) according to the following criteria (in this order):
- A simple quality factor, calculated as 1/resolution - R-value
- Deposition date (newer structures have higher ranks)
This ranking has the following implications:
- Higher quality structures (better resolution, lower R-value) are preferred.
- Structures determined by X-ray crystallography are preferred over NMR structures.
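The ranking within a cluster can be sketched as a two-key sort. The `Chain` class and its field names are hypothetical, and the handling of chains without a resolution or R-value (for example NMR structures) is an assumption: here they receive the lowest possible quality factor so that they rank after all chains that have both values, which is consistent with the stated preference for X-ray structures.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Chain:
    chain_id: str                        # hypothetical identifier, e.g. "xray1:A"
    deposited: date
    resolution: Optional[float] = None   # in Angstrom; None if not reported
    r_value: Optional[float] = None      # None if not reported

def quality_factor(chain):
    """1/resolution - R-value; assumed to be -inf when either value is missing."""
    if chain.resolution is None or chain.r_value is None:
        return float("-inf")
    return 1.0 / chain.resolution - chain.r_value

def rank_chains(chains):
    """Map chain_id to rank (1 = best): quality factor first, then newer deposition."""
    ordered = sorted(chains,
                     key=lambda c: (quality_factor(c), c.deposited),
                     reverse=True)
    return {c.chain_id: rank for rank, c in enumerate(ordered, start=1)}

cluster = [
    Chain("nmr_old:A", date(2005, 1, 1)),
    Chain("xray2:A", date(2008, 6, 1), resolution=2.0, r_value=0.20),
    Chain("xray1:A", date(2003, 3, 1), resolution=1.5, r_value=0.18),
    Chain("nmr_new:A", date(2010, 1, 1)),
]
print(rank_chains(cluster))
```

In this made-up cluster the 1.5 Å chain ranks first, the 2.0 Å chain second, and the two chains without resolution follow in order of deposition date, newest first.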
The selection of representative structures based on the clusters of chains is performed as follows:
A. All structures that only contain protein chains of 20 or more amino acids are processed as follows:
- Generate a list of all protein chains of 20 or more amino acids in the set of structures.
- Obtain the cluster # and rank # for each chain.
- From each cluster, pick the highest ranked chain (smallest rank #). These chains make up a non-redundant set of chains.
- Return every structure Id present in this non-redundant set of chains.
B. All structures that contain other chain types (e.g. nucleic acid chains, polypeptides with fewer than 20 amino acids) automatically represent themselves.
The combined set of structures from A and B is returned as the selected set of structures.
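The selection procedure can be sketched in Python as follows. This is a minimal illustration of the steps described above, not the site's actual implementation: the `RankedChain` record, its field names, and the structure Ids in the example are hypothetical, and the cluster and rank numbers are assumed to have been looked up beforehand.

```python
from collections import namedtuple

# Hypothetical record: one row per chain in the query result set.
RankedChain = namedtuple("RankedChain",
                         "structure_id is_protein length cluster rank")

MIN_LENGTH = 20  # only protein chains of at least 20 residues are clustered

def is_clustered(chain):
    return chain.is_protein and chain.length >= MIN_LENGTH

def select_representatives(chains):
    """Return the set of structure Ids kept after removing similar sequences."""
    by_structure = {}
    for chain in chains:
        by_structure.setdefault(chain.structure_id, []).append(chain)

    best = {}         # cluster # -> best ranked chain seen so far
    selected = set()
    for structure_id, members in by_structure.items():
        if all(is_clustered(c) for c in members):
            # Protein-only structures: keep the smallest rank # per cluster.
            for c in members:
                current = best.get(c.cluster)
                if current is None or c.rank < current.rank:
                    best[c.cluster] = c
        else:
            # Structures with other chain types represent themselves.
            selected.add(structure_id)

    # Every structure that owns a representative chain is returned.
    selected.update(c.structure_id for c in best.values())
    return selected

chains = [
    RankedChain("S1", True, 100, 1, 1),
    RankedChain("S2", True, 100, 1, 2),
    RankedChain("S3", True, 150, 2, 3),
    RankedChain("S3", True, 100, 1, 5),
    RankedChain("S4", True, 100, 1, 4),
    RankedChain("S4", False, 12, None, None),   # nucleic acid chain
    RankedChain("S5", True, 10, None, None),    # peptide shorter than 20 residues
]
print(sorted(select_representatives(chains)))
```

In this toy result set, S2 is dropped because S1 holds the better ranked chain of cluster 1, S3 is kept because it contributes the only chain of cluster 2, and S4 and S5 are kept because they contain chains that are not clustered.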