

Redundancy in the Protein Data Bank


The following table shows the number of non-redundant sequences as determined by blastclust at several levels of sequence identity.

Method   Description      # of Clusters
blast    100% identity    94393
blast    95% identity     65719
blast    90% identity     61734
blast    70% identity     53376
blast    50% identity     45863
blast    40% identity     41428
blast    30% identity     36477

Notes on Blast Clustering

Sequence clustering in blastclust is performed with the following parameters (example 95%):

-p T -b T -S 95

The '-b T' parameter in BLASTClust means that the sequence identity threshold is enforced over both members of a sequence pair.

The '-p T' parameter means that both input sequences are protein sequences.

The '-S 95' parameter sets the percent identity threshold (here 95%) that two sequences must meet to be included in the same cluster.

Note: BLASTClust is run with the default value of the -L parameter, which specifies the length coverage threshold for including sequences in a cluster. It is set to 0.9 by default. This means that the alignment must cover at least 90% of the sequence length for two sequences to be clustered together.
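As an illustration only, the parameters above could be assembled into one blastclust call per identity level as sketched below. The input and output file names are hypothetical, and the -i (input) and -o (output) options are assumptions taken from the legacy BLASTClust usage; they should be checked against the NCBI documentation before use.

```python
# Sketch: build blastclust argument lists for the identity levels in the table.
# File names are hypothetical; -i/-o are assumed from the legacy BLASTClust usage.

def blastclust_args(fasta_in, clusters_out, identity=95, coverage=0.9):
    """Return the argument list for one blastclust run (nothing is executed)."""
    return [
        "blastclust",
        "-i", fasta_in,        # input FASTA file of protein chains (assumed option)
        "-o", clusters_out,    # output file for the clusters (assumed option)
        "-p", "T",             # input sequences are proteins
        "-b", "T",             # enforce the threshold on both members of a pair
        "-S", str(identity),   # percent identity threshold
        "-L", str(coverage),   # length coverage threshold (0.9 is the default)
    ]

LEVELS = [100, 95, 90, 70, 50, 40, 30]
commands = [
    blastclust_args("pdb_chains.fasta", f"clusters_{level}.txt", level)
    for level in LEVELS
]

# Show the command line for the 95% example.
print(" ".join(commands[1]))
```

The lists can be passed to subprocess.run if a blastclust binary is installed; the sketch only prints the command so that it can be inspected first.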

See also BLASTClust documentation at NCBI.

Related Links and References

As the single worldwide repository for macromolecular structures, the Protein Data Bank holds a body of data that contains considerable redundancy in regard to both sequence and structure. We have incorporated into the query interface the ability to select a subset of structures from which similar sequences have been largely removed. In most cases, the selected subset will contain far fewer structures than the complete result set. However, the following caveats should be kept in mind:

  • Sequence similarity is defined on a chain basis, but results are returned on a structure basis.
  • Many structures in the PDB contain multiple protein chains, or even hybrids of DNA and protein chains.
  • Sequence similarity is only assessed for protein chains.
  • The relationship between sequence similarity and structure similarity is complex. Users seeking structure similarity should refer to the options available on the Structure Summary page under "External Links" (in the left-hand navigation menu) in the "Structure Classification" section.

The primary purpose of this feature is to reduce a list of structures that are likely to be highly similar to one or more representatives. Results may differ from other so-called non-redundant sets (e.g. PDB_SELECT [Hobohm U. and Sander C., Protein Science, 3: 522-524, 1994]).

Sequence clustering in the PDB is performed via mmseqs2 (formerly via blastclust). Detailed information for the clusters containing a given structure is available on the "Seq. Similarity" tab of the Structure Summary page for the respective structure, e.g.

Algorithm for Removing Similar Sequences

The query implementation for removing similar sequences is based on pre-calculated clusters of protein chains. All protein chains of at least 20 amino acids are clustered by blastclust at 100%, 95%, 90%, 70%, 50%, 40%, and 30% sequence identity (defined as the number of identical residues divided by the total number of residues in the sequence alignment).
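The identity definition and the 20-residue cutoff can be illustrated with a small sketch. It assumes that "total in the sequence alignment" means the number of alignment columns, gaps included; the clustering program may count this differently, and the function names are hypothetical.

```python
# Sketch: percent identity of two pre-aligned sequences, plus the length cutoff.
# Assumption: the denominator is the number of alignment columns, gaps included.

MIN_CHAIN_LENGTH = 20  # chains shorter than this are not clustered

def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * identical / len(aligned_a)

def is_clustered(sequence):
    """True if an ungapped chain sequence is long enough to be clustered."""
    return len(sequence) >= MIN_CHAIN_LENGTH

# One mismatch in ten aligned positions.
print(percent_identity("MKTAYIAKQR", "MKTAYLAKQR"))  # 90.0
```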

In each cluster, the chains are sorted (i.e. ranked) according to the following criteria (in this order):

  1. A simple quality factor, calculated as 1/resolution - R-value
  2. Deposition date (newer structures have higher ranks)
  3. Alphabetically

This ranking has the following implications:

  • Higher quality structures (better resolution, lower R-value) are preferred.
  • Structures determined by X-ray crystallography are preferred over NMR structures.
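The ranking criteria above can be sketched as a single sort key. The Chain record and its field names are hypothetical, and the handling of entries without a resolution or R-value (treated here as lowest quality, which would place NMR entries after X-ray entries) is an assumption and is not confirmed by this page.

```python
# Sketch: rank the chains of one cluster by the three criteria listed above.
# Chain and its fields are hypothetical; missing-value handling is an assumption.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Chain:
    chain_id: str                  # e.g. "1AAA.A" (made-up identifier)
    resolution: Optional[float]    # in angstroms; None if not reported
    r_value: Optional[float]       # None if not reported
    deposited: date

def quality_factor(chain):
    """Quality factor 1/resolution - R-value; assumed lowest when values are missing."""
    if chain.resolution is None or chain.r_value is None:
        return float("-inf")
    return 1.0 / chain.resolution - chain.r_value

def rank_chains(chains):
    """Best chain first: quality factor, then newer deposition, then alphabetical."""
    return sorted(
        chains,
        key=lambda c: (-quality_factor(c), -c.deposited.toordinal(), c.chain_id),
    )

cluster = [
    Chain("1AAA.A", 2.0, 0.20, date(2001, 1, 1)),
    Chain("2BBB.A", 1.5, 0.18, date(1999, 5, 1)),
    Chain("3CCC.A", None, None, date(2010, 1, 1)),   # no resolution or R-value
    Chain("4DDD.A", 2.0, 0.20, date(2005, 1, 1)),
]
ranked_ids = [c.chain_id for c in rank_chains(cluster)]
print(ranked_ids)  # ['2BBB.A', '4DDD.A', '1AAA.A', '3CCC.A']
```

In this made-up cluster, 2BBB.A ranks first because its quality factor (1/1.5 - 0.18) is the highest, and 4DDD.A precedes 1AAA.A because the two tie on quality and 4DDD.A was deposited later.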

The selection of representative structures based on the clusters of chains is performed as follows:

  1. All structures that only contain protein chains of 20 or more amino acids are processed as follows:
    1. Generate list of all protein chains of 20 or more amino acids in the set of structures.
    2. Obtain the cluster # and rank # for each chain.
    3. From each cluster, pick the highest ranked chain (smallest rank #). This comprises a non-redundant set of chains.
    4. Return every structure Id present in this non-redundant set of chains.
  2. All structures that contain other chain types (e.g. nucleic acid chains, polypeptides with fewer than 20 amino acids) automatically represent themselves.
  3. The combined set of structures from steps 1 and 2 is returned as the selected set of structures.
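The selection procedure above can be sketched as follows. This is a minimal illustration, not the production implementation: the input layout (a mapping from structure Id to chain tuples, and a mapping from chain Id to cluster # and rank #) and all identifiers are hypothetical.

```python
# Sketch: select representative structures from pre-calculated clusters.
# Data layout and identifiers are hypothetical, chosen only for illustration.

MIN_CHAIN_LENGTH = 20

def select_representatives(structures, cluster_rank):
    """
    structures:   {structure_id: [(chain_id, kind, length), ...]}
    cluster_rank: {chain_id: (cluster_no, rank_no)} for clustered protein chains
    """
    def clusterable(chain):
        _, kind, length = chain
        return kind == "protein" and length >= MIN_CHAIN_LENGTH

    protein_only_chains = []     # step 1: chains of protein-only structures
    self_representing = set()    # step 2: structures with other chain types
    for structure_id, chains in structures.items():
        if all(clusterable(c) for c in chains):
            protein_only_chains.extend((structure_id, c[0]) for c in chains)
        else:
            self_representing.add(structure_id)

    # From each cluster keep the chain with the smallest rank #.
    best = {}  # cluster_no -> (rank_no, structure_id)
    for structure_id, chain_id in protein_only_chains:
        cluster_no, rank_no = cluster_rank[chain_id]
        if cluster_no not in best or rank_no < best[cluster_no][0]:
            best[cluster_no] = (rank_no, structure_id)

    representatives = {structure_id for _, structure_id in best.values()}
    # Step 3: combine both sets.
    return representatives | self_representing

structures = {
    "1AAA": [("1AAA.A", "protein", 120)],
    "2BBB": [("2BBB.A", "protein", 118), ("2BBB.B", "protein", 250)],
    "3CCC": [("3CCC.A", "protein", 120), ("3CCC.B", "dna", 12)],
    "4DDD": [("4DDD.A", "protein", 119)],
}
cluster_rank = {
    "1AAA.A": (1, 2), "2BBB.A": (1, 1), "4DDD.A": (1, 3),
    "2BBB.B": (2, 1),
}
selected = sorted(select_representatives(structures, cluster_rank))
print(selected)  # ['2BBB', '3CCC']
```

Here 2BBB is returned because it holds the top-ranked chain of both clusters, and 3CCC is returned because it contains a DNA chain and therefore represents itself. Because selection works on chains but results are structures, one structure can cover several clusters at once.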