What's New

The September 29, 2015 release offers the following features:

Validation Track on Protein Feature View

Mapping validation annotations to sequence

Protein Feature View graphically summarizes a full-length protein sequence from UniProt and shows, in separate "tracks", how it corresponds to PDB entries and to annotations from external resources.

A new track shows the quality of a protein chain as described in the wwPDB Validation Report mapped onto the amino acid sequence.

The validation track is color-coded to indicate the number of angle and bond outliers, as well as clashes, for a given residue. A red icon marks RSRZ outliers, which indicate a bad fit to the electron density map. Moving the mouse over the validation track displays further details of the outliers.
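The color-coding described above can be sketched in a few lines. This is only an illustration, not the code behind Protein Feature View: the `ResidueValidation` class, the four-color palette, and the function names are assumptions, and the RSRZ cutoff of 2 is taken from the ">2" criterion used in validation reports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResidueValidation:
    """Per-residue flags, as one might extract them from a validation report."""
    bond_outlier: bool = False
    angle_outlier: bool = False
    clash: bool = False
    rsrz: Optional[float] = None  # None when no RSRZ value is reported


# Hypothetical palette: the more geometric criteria fail, the "hotter" the color.
COLORS = ["green", "yellow", "orange", "red"]
RSRZ_CUTOFF = 2.0  # RSRZ above this value is treated as a density-fit outlier


def track_color(res: ResidueValidation) -> str:
    """Pick a color from the number of failed geometric criteria (0 to 3)."""
    failed = sum([res.bond_outlier, res.angle_outlier, res.clash])
    return COLORS[failed]


def has_rsrz_icon(res: ResidueValidation) -> bool:
    """True when the residue should carry the RSRZ outlier icon."""
    return res.rsrz is not None and res.rsrz > RSRZ_CUTOFF
```

Keeping the geometric color and the RSRZ flag separate mirrors the text: a residue can have clean geometry and still fit the density poorly, or the reverse.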

For example, chain A of PDB ID 4HHB, a hemoglobin protein structure from 1984, is displayed in Protein Feature View. The validation track shows many red residues, indicating numerous geometric outliers.

4HHB validation track

For comparison, here is the validation track of PDB ID 2W72 chain A, a hemoglobin structure from 2008. While fewer angle and bond outliers are shown, this entry has several residues that fit the electron density map poorly (RSRZ > 2).

2W72 validation track

A description of wwPDB validation reports can be found at the wwPDB website and in the recommendations of the wwPDB X-ray Validation Task Force.

An Overview of the Protein Feature View is available.

Mutation Track on Protein Feature View

View differences between PDB and UniProt sequences

The Mutation Track visually summarizes expression tags, cloning artifacts, and many other details about sequence mismatches between the studied protein sequences and the reference UniProt sequences.

4FOH PV view of RuBisCO

In an overlay to the PDB track, new icons represent a number of different sequence modifications that can be observed in PDB files. Here are a few examples:

  • The 'T' icon represents expression tags that have been added to the sequence.
  • The 'E' icon represents an engineered mutation.
  • The '<>' icon represents microheterogeneity.
  • The 'CRO' icon represents chromophores.
Besides these, there are many other icons. For more information about the meaning and exact position of a sequence modification, move the cursor over the icon.

An Overview of the Protein Feature View is available.

Example: Gag-Pol polyprotein - P12497

Drill-down by UniProt Molecule Name

Browse PDB structures organized by the most frequent molecule names

Drilldown for whole archive

A new drill-down identifies the most common UniProt molecule names related to PDB entries. Above is a summary of the drill-down of the whole PDB archive.

Update to Pfam Version 28

Get up-to-date Pfam domain annotations

We updated our Pfam annotation pipeline to use the latest Pfam version 28. Pfam is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models. Because this is a major Pfam update that was several years in the making, some search results have changed. For example, protein kinases can now be identified at the Pfam family level (e.g., Protein kinase domain - PF00069 or Protein tyrosine kinase - PF07714). We currently do not offer grouping of results at the Pfam clan level (e.g., PKinase - CL0016).

When a new structure is released, we perform a search against Pfam using the HMMER web services API. The PDB sequence records are used for this scan. Once the Pfam domain annotations have been calculated, they are mapped onto the PDB-ATOM coordinates (the PDB residue numbers), thereby ensuring that the atomic coordinates are available. You can access these Pfam-PDB annotations via the RCSB PDB RESTful API in the following way:

Details of the Pfam to PDB annotation pipeline are described in an article at the Pfam blog.

Searching and Reporting by Chain Identifier

Chain-based analysis to facilitate accurate searching and reporting

A search by PDBId.ChainId(s) has been added to the Advanced Search system. Users can enter a comma-separated PDBId.ChainId list and get the results for the specified polymer chains. Furthermore, tabular reports can be generated based on these chain identifiers. Chain-based summary reports such as the Sequence Report, Biological Details Report, and Sequence Clusters Report can be retrieved with one click from the Summary Reports drop-down list. Users can also pick the chain-based report fields in the Customizable Table to take advantage of this feature.
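For users who prepare such lists in scripts, the comma-separated PDBId.ChainId format can be checked before it is pasted into the search form. The sketch below is an illustration only: the function name `parse_chain_list` and the validation pattern (a four-character PDB ID starting with a digit, followed by a dot and an alphanumeric chain ID) are assumptions, not part of the Advanced Search system.

```python
import re

# Assumed pattern: 4-character PDB ID starting with a digit, a dot, then the chain ID.
_CHAIN_ID = re.compile(r"^([0-9][A-Za-z0-9]{3})\.([A-Za-z0-9]+)$")


def parse_chain_list(text):
    """Split 'PDBId.ChainId,PDBId.ChainId,...' into (pdb_id, chain_id) pairs.

    Input order is preserved. PDB IDs are upper-cased; chain IDs are left
    untouched because they are case-sensitive.
    """
    pairs = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate trailing commas and empty entries
        match = _CHAIN_ID.match(token)
        if match is None:
            raise ValueError(f"not a PDBId.ChainId: {token!r}")
        pairs.append((match.group(1).upper(), match.group(2)))
    return pairs


pairs = parse_chain_list("4HHB.A, 2cpk.E,3WHM.B")
```

Rejecting malformed tokens early avoids submitting a query that silently drops a chain.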

Search example:

Chain based advanced search example

Get results:

Chain based search result list

Generate chain-based custom report:

Chain based custom report

Sequence Cluster Report

Compare protein chains with similar sequence

A new Sequence Cluster Report has been added to the tabular report system. The report includes sequence identity clusters at several levels, ranging from 100% down to 30%. Sequence clusters contain protein chains grouped by sequence identity. For example, the 90% sequence cluster groups protein sequences that are at least 90% identical. The report also includes UniProt Recommended and Alternative Names, Gene Name, and Macromolecular Name and Synonyms. Below is a complete list of fields in this report.

Sequence Cluster Report field list
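To make the "at least 90% identical" criterion concrete, here is a minimal sketch of a pairwise identity check. It is not the clustering procedure used to build the report: it assumes the two sequences have already been aligned (with '-' for gaps), counts identity only over columns where neither sequence has a gap, and uses hypothetical function names.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0  # nothing to compare
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def same_cluster(aligned_a, aligned_b, threshold=90.0):
    """True when the pair meets the identity threshold of a cluster level."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

Other definitions of identity (for example, dividing by the length of the shorter sequence) give different numbers, so the denominator chosen here is one assumption among several reasonable ones.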

Custom Report Web Services Improvements

Create tabular reports with RESTful Web Services

All queries and reports that can be generated on the website are available using RESTful Web Services, including the Sequence Cluster Report.

To better support user workflows and data analysis requirements, the reports generated by our RESTful services list PDBId and PDBId.ChainId entries in the same order as in the input query string.

The following example generates a custom report with PDBId.ChainId combinations and selected fields from the Sequence Cluster Report. The report lists the PDBId.ChainId entries in the same order as the input query string, and the output is in CSV format.

  • PDBId.ChainId list: 4HHB.A,2CPK.E,3WHM.B,2D5Z.A
  • Fields: Cluster Number for 100% sequence identity, Cluster Number for 90% sequence identity, Cluster Number for 70% sequence identity, UniProt Recommended Name, Gene Names, Taxonomy ID, Taxonomy, Chain Length
  • Generate example report with the above parameters
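The order-preserving behavior described above is easy to reproduce on the client side, for instance when merging report rows from several sources. The sketch below works on local data only and does not call the web service: the field names `chainId` and `clusterNumber100` are hypothetical, and the cluster numbers are made-up placeholders, not real report values.

```python
import csv
import io


def order_like_query(query_ids, rows, key="chainId"):
    """Return rows arranged in the order of query_ids; ids without a row are skipped."""
    by_id = {row[key]: row for row in rows}
    return [by_id[i] for i in query_ids if i in by_id]


def to_csv(rows, fields):
    """Serialize a list of dict rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Placeholder rows in arbitrary order (cluster numbers are invented for illustration).
rows = [
    {"chainId": "3WHM.B", "clusterNumber100": 30},
    {"chainId": "4HHB.A", "clusterNumber100": 10},
    {"chainId": "2D5Z.A", "clusterNumber100": 40},
    {"chainId": "2CPK.E", "clusterNumber100": 20},
]
query = "4HHB.A,2CPK.E,3WHM.B,2D5Z.A".split(",")
report = to_csv(order_like_query(query, rows), ["chainId", "clusterNumber100"])
```

Because the rows come back in query order, the CSV can be joined line by line with any other table that was built from the same input list.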

Custom report Web Services: All tabular report fields | Documentation | Java Example