News

Sequence Clustering Update

05/03

The coverage requirement for sequence clustering based on polymer entity IDs has been relaxed from 90% to 80%.

Sequence cluster groups enable exploration of sets of homologous sequences and can reveal trends across hundreds of related proteins.

RCSB.org offers data files that contain the results of the weekly clustering of protein sequences in the PDB by MMseqs2 at 30%, 40%, 50%, 70%, 90%, 95%, and 100% sequence identity. Note that these files use polymer entity identifiers instead of chain identifiers to avoid redundancy. The files are plain text with one cluster per line, sorted from largest to smallest cluster.
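As an illustration of the file layout described above, the following is a minimal Python sketch for reading cluster data. It assumes that the members on each line are separated by whitespace and that identifiers look like `101M_1` (entry ID, underscore, entity number). The function name `parse_clusters` and the sample identifiers are hypothetical and serve only to show the idea; check a downloaded file to confirm its exact layout.

```python
from typing import List


def parse_clusters(text: str) -> List[List[str]]:
    """Parse cluster file contents into a list of clusters.

    Assumption: one cluster per line, members separated by whitespace.
    """
    clusters = []
    for line in text.splitlines():
        members = line.split()
        if members:  # skip blank lines
            clusters.append(members)
    return clusters


# Illustrative sample only; these lines are not taken from a real cluster file.
sample = "101M_1 102M_1 103M_1\n4HHB_1 2HHB_1\n1ABC_2\n"

clusters = parse_clusters(sample)
sizes = [len(cluster) for cluster in clusters]

# The first identifier on each line is used here as a stand-in
# representative for the cluster (an assumption of this sketch).
representatives = [cluster[0] for cluster in clusters]
```

Because the files are sorted from largest to smallest cluster, `sizes` should be non-increasing when the same function is applied to a real file.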

The Advanced Search Group option also simplifies PDB searching by generating a non-redundant search result set based on sequence identity clustering (as well as UniProt ID and group depositions).

The clustering requires a meaningful overlap between sequences (in addition to their sequence identity). This coverage requirement has been relaxed from 90% to 80%, which is the coverage threshold used by UniRef. This change addresses some unintuitive clusterings, where highly similar sequences were assigned to different clusters.
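To make the effect of the relaxed threshold concrete, here is a small Python sketch of a coverage check. It assumes that coverage is the aligned length divided by the sequence length and that both sequences must meet the threshold. This is a simplification for illustration, not the MMseqs2 implementation, and the function names `coverage` and `meets_coverage` are hypothetical.

```python
def coverage(aligned_len: int, seq_len: int) -> float:
    """Fraction of a sequence covered by the alignment (assumed definition)."""
    return aligned_len / seq_len


def meets_coverage(aligned_len: int, len_a: int, len_b: int, threshold: float) -> bool:
    """Return True if the alignment covers both sequences at or above the threshold.

    Simplification: the same aligned length is used for both sequences,
    which ignores gaps in the alignment.
    """
    lowest = min(coverage(aligned_len, len_a), coverage(aligned_len, len_b))
    return lowest >= threshold


# Hypothetical pair: 170 aligned residues, sequence lengths 200 and 180.
# The longer sequence is covered at 170 / 200 = 0.85.
passes_old = meets_coverage(170, 200, 180, threshold=0.90)
passes_new = meets_coverage(170, 200, 180, threshold=0.80)
```

Under these assumptions the hypothetical pair fails the former 90% requirement but satisfies the new 80% requirement, which is the kind of case where highly similar sequences could previously end up in different clusters.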

Consequently, the sequence clusters offered are slightly larger on average and fewer in number. Some group identifiers have changed. Redundancy-filtered result sets (see example), which collapse similar polymer entities into groups, can now be navigated more efficiently as there are fewer groups to explore.

User guides are available for Grouping Structures and Sequence-based Clustering.
