mutation3D

Overview

mutation3D is a functional prediction and visualization tool for studying the spatial arrangement of amino acid substitutions on protein models and structures. It is intended to be used to identify clusters of amino acid substitutions arising from somatic cancer mutations across many patients in order to identify functional hotspots and fuel downstream hypotheses. It is also useful for clustering other kinds of mutational data, or simply as a tool to quickly assess relative locations of amino acids in proteins.

Cite mutation3D

If you find mutation3D useful in your research, please cite:

Mutation3D: Cancer Gene Prediction Through Atomic Clustering of Coding Variants in the Structural Proteome.
Hum Mutat. 2016 Feb 3. doi: 10.1002/humu.22963. [Epub ahead of print]

Algorithm

The mutation3D clustering procedure is based on complete-linkage clustering, and is performed in 3D using the coordinates of α-carbons in the protein backbones from models and crystal structures. Users can tune the algorithm parameters to find clusters spanning larger or smaller areas of the protein by adjusting the complete-linkage parameter (the maximum cluster diameter).
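
As a rough illustration of the idea, the Python sketch below runs greedy agglomerative complete-linkage clustering on a few 3D points and stops merging once no merge would keep the cluster diameter within the cutoff. It is not the mutation3D source code: the function names and the toy coordinates are assumptions made for this example. Repeated substitutions at one residue share a single α-carbon, so the sketch clusters unique residue positions.

```python
import math
from itertools import combinations

def diameter(points):
    """Largest pairwise distance within a set of 3D points (0.0 for one point)."""
    return max((math.dist(a, b) for a, b in combinations(points, 2)), default=0.0)

def complete_linkage(coords, max_diameter):
    """Greedy agglomerative complete-linkage clustering.

    coords: dict mapping residue number -> (x, y, z) of its alpha-carbon.
    Returns a list of clusters, each a sorted list of residue numbers.
    """
    clusters = [[res] for res in sorted(coords)]
    while True:
        best = None  # (merged diameter, index i, index j)
        for i, j in combinations(range(len(clusters)), 2):
            merged = [coords[r] for r in clusters[i] + clusters[j]]
            d = diameter(merged)
            if d <= max_diameter and (best is None or d < best[0]):
                best = (d, i, j)
        if best is None:
            return clusters  # no merge stays within the cutoff
        _, i, j = best
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]  # j > i, so index i is unaffected

# Toy coordinates in Angstroms (hypothetical, not taken from a real structure).
toy = {12: (0.0, 0.0, 0.0), 13: (3.8, 0.0, 0.0),
       61: (0.0, 10.0, 0.0), 49: (40.0, 0.0, 0.0)}
print(complete_linkage(toy, 15.0))  # [[12, 13, 61], [49]]
```

Lowering the cutoff in this toy example to 5 Å leaves only residues 12 and 13 clustered together, which mirrors how a smaller maximum diameter yields tighter clusters.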

Cluster Significance

Statistical significance (P-value) of found clusters is computed using a bootstrapping approach to account for non-uniformity of protein backbone orientations. In each iteration, a number of random amino acid substitutions equal to the number of observed substitutions is placed at random positions along the protein backbone, and complete-linkage clustering is performed on these randomized substitutions. P-values of clusters in the observed data are then computed empirically from the resulting distribution of randomized cluster diameters.
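
The resampling idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure implemented in mutation3D: positions are drawn with replacement, the "tightest subset" search is a stand-in for full complete-linkage clustering, the P-value is taken as the fraction of random draws that produce an equally large cluster at least as tight as the observed one, and the straight-line backbone is a toy. All function names are hypothetical.

```python
import math
import random
from itertools import combinations

def diameter(points):
    """Largest pairwise distance within a set of 3D points."""
    return max((math.dist(a, b) for a, b in combinations(points, 2)), default=0.0)

def tightest_diameter(points, size):
    """Smallest diameter over all subsets of `size` points (stand-in for clustering)."""
    return min((diameter(s) for s in combinations(points, size)), default=math.inf)

def bootstrap_p_value(backbone, n_subs, cluster_size, observed_diameter,
                      iterations=1000, seed=0):
    """Fraction of random draws with a cluster at least as tight as the observed one."""
    rng = random.Random(seed)
    positions = list(backbone)
    hits = 0
    for _ in range(iterations):
        # Draw as many random positions as there were observed substitutions;
        # sampling with replacement allows repeated hits at one residue.
        draw = [backbone[rng.choice(positions)] for _ in range(n_subs)]
        if tightest_diameter(draw, cluster_size) <= observed_diameter:
            hits += 1
    return hits / iterations

# Toy backbone: 189 residues on a straight line, 3.8 Angstroms apart (not a real fold).
backbone = {i: (3.8 * i, 0.0, 0.0) for i in range(1, 190)}
p = bootstrap_p_value(backbone, n_subs=6, cluster_size=5, observed_diameter=14.3862)
print(p)
```

Fixing the seed makes the sketch reproducible; with a different seed on each run, the estimate varies slightly from run to run.
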

Download & Usage

The software used to compute clusters is available for download as C++ source code (last updated December 2015).

Binary compilation may depend on system architecture.
Example compilation:
g++ -std=c++0x mutation3d.cpp -o mutation3d

General Usage:
./mutation3d <pdb_file> <amino acid substitutions> <CL-distance> <protein length> <number of bootstrapping iterations>

Example Usage, searching for clusters in a PDB model of KRAS. Note that by listing 12 and 13 twice each, we indicate that two mutations were observed at each of these loci.
./mutation3d 4EPV.pdb 12,12,13,13,49,61 15 189 10000

Output:
12,12,13,13,61 14.3862 0.0031

From this tab-separated output, we can see that amino acid residues 12, 12, 13, 13, and 61 form a cluster with a diameter of 14.3862 Angstroms and a P-value of 0.0031. Note that due to the random-sampling approach for calculating statistical significance, P-values will vary slightly each time the program is run. If additional clusters are found, they will be listed on subsequent lines.
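
A line of this output can be split into its three fields with a few lines of Python. The sketch below is not part of the mutation3D distribution: the function name parse_cluster_line and the dictionary keys are made up for this example, and it assumes each line holds exactly three whitespace- or tab-separated fields as in the example above.

```python
def parse_cluster_line(line):
    """Split one output line into residues, cluster diameter, and P-value."""
    residues, diameter, p_value = line.split()  # splits on tabs or spaces
    return {
        "residues": [int(r) for r in residues.split(",")],
        "diameter": float(diameter),  # Angstroms
        "p_value": float(p_value),
    }

cluster = parse_cluster_line("12,12,13,13,61\t14.3862\t0.0031")
print(cluster["residues"], cluster["diameter"], cluster["p_value"])
```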

Data Sources

The data used to build mutation3D have been obtained from the following open-access sources. Please use the links below to visit each individual resource for complete information on their curation guidelines. mutation3D maintains local copies of relevant data from each source and is set to refresh its stored data periodically to include the latest structural, categorical, or mutational data. Complete details on site maintenance are available in the maintenance section.

COSMIC

The Catalogue of Somatic Mutations in Cancer (COSMIC) is a vast repository of published cancer sequencing studies reporting somatic mutations (tumor-normal DNA sequence differences) from small-scale studies as well as whole-genome and whole-exome studies. mutation3D filters COSMIC to find all missense mutations and makes these mutations available for clustering on the advanced clustering page.

ModBase

ModBase is a database of comparative protein structure models produced through the ModPipe comparative modeling pipeline. mutation3D archives all models of verified human proteins (SwissProt) meeting a minimum quality requirement of MPQS ≥ 0.5 (though the default clustering parameter for MPQS is set higher). Additionally, models considered redundant for clustering purposes are not retained by mutation3D. For example, consider two models of the same protein: model A and model B. If model A is of lower quality than model B (as judged by MPQS), covers a section of the modeled protein also covered by model B, and both models are derived from the same template PDB structure, model A will not be retained. If these criteria are not met, mutation3D may use multiple models per protein, especially as most models do not cover entire proteins.
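
The redundancy rule for models A and B can be written as a small filter. The Python sketch below is only an illustration of the three criteria as stated above; the Model fields, the function names, and the example values are hypothetical, "covers a section also covered by model B" is read as any overlap in residue range, and models with equal MPQS are both kept.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    name: str
    start: int      # first covered residue
    end: int        # last covered residue
    template: str   # template PDB structure
    mpqs: float

def overlaps(a, b):
    """True if the two models cover at least one common residue."""
    return a.start <= b.end and b.start <= a.end

def drop_redundant(models):
    """Drop a model if a better one overlaps it and shares its template."""
    kept = []
    for a in models:
        redundant = any(
            b is not a
            and b.mpqs > a.mpqs
            and b.template == a.template
            and overlaps(a, b)
            for b in models
        )
        if not redundant:
            kept.append(a)
    return kept

a = Model("A", 1, 100, "1abc", 0.9)
b = Model("B", 50, 180, "1abc", 1.3)
c = Model("C", 120, 189, "2xyz", 0.7)
print([m.name for m in drop_redundant([a, b, c])])  # ['B', 'C']
```

Model C survives even though it overlaps the higher-scoring model B, because it is built from a different template.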

PDB

The Protein Data Bank (PDB) is a repository containing experimentally determined crystal structures of proteins and other molecules. mutation3D uses sections of these models that align to the verified human proteome based on residue-level mappings provided by SIFTS. When determining 3D distances between amino acids, it is important that all amino acids between the two targets are contained in the wild-type protein, as even a slight change in the angle of an intermediate bond could drastically change the coordinates of distant amino acids in the chain. Thus, mutation3D retains crystal structures with complete and uninterrupted mappings to SwissProt proteins or any section of protein that meets this requirement.

Pfam

The Pfam Protein Families Database contains the predicted locations of functional and/or structural elements of proteins called domains. mutation3D provides these as visual overlays on linear protein models as a courtesy to our users. These annotations are not functionally related to the clustering algorithm.

UniProt

UniProt’s set of verified human proteins (SwissProt) provides the default keys for accessing data and is also used to determine which proteins are contained within structures and models.

Input Formats

Identifier Types

mutation3D is able to accept several types of identifiers including gene/protein symbol, UniProt ID, Ensembl Transcript ID, and GRCh38 genomic coordinates. Below is a table showing examples of equivalent identifiers for several canonical cancer proteins.

Symbol	UniProt	Ensembl
KRAS	P01116	ENST00000256078
TP53	P04637	ENST00000269305
PIK3CA	P42336	ENST00000263967

To convert other types of identifiers to UniProt IDs, please visit UniProt’s website and use their ID Mapping tool.

Mutation Formats

mutation3D accepts several popular formats for denoting sequence positions and mutations. Positions accompanying all identifiers are indexed starting at 1 and may be entered simply as integers separated by any delimiter or by line breaks. mutation3D allows, but does not require, users to supply mutation information along with position information (if no residue change is provided, mutation3D will simply assume a missense mutation has occurred at this location). For either nucleotide or amino acid changes from residue X to Y at position 99 in a sequence indexed starting at 1, the following formats are acceptable: X99Y, 99X>Y, p.X99Y, p.99X>Y, c.X99Y, c.99X>Y. Note that while some of these formats are more commonly used to denote either amino acid changes or nucleotide changes, mutation3D allows their use for either. mutation3D then decides whether a mutation was meant as a nucleotide change or an amino acid change based on the accompanying identifier. For instance, any mutation format accompanying a UniProt ID is assumed to indicate an amino acid change, whereas any mutation format accompanying an Ensembl Transcript ID is assumed to indicate a nucleotide change. Different mutation formats may be mixed, and entries giving a mutation may be mixed with entries giving only a position.
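
For readers preparing their own input, the accepted formats can be recognized with a single regular expression. The Python sketch below is one possible reading of the formats listed above, not the parser used by mutation3D; the names MUTATION and parse_mutation are hypothetical, and single-letter residue codes are assumed.

```python
import re

# One pattern for X99Y, 99X>Y, their p./c.-prefixed forms, and a bare position.
MUTATION = re.compile(
    r"""
    ^(?:[pc]\.)?                                            # optional p. or c. prefix
    (?:
        (?P<ref1>[A-Za-z])(?P<pos1>\d+)(?P<alt1>[A-Za-z])   # X99Y
      | (?P<pos2>\d+)(?P<ref2>[A-Za-z])>(?P<alt2>[A-Za-z])  # 99X>Y
      | (?P<pos3>\d+)                                       # position only
    )$
    """,
    re.VERBOSE,
)

def parse_mutation(token):
    """Return (position, reference, alternate); residues are None if not given."""
    match = MUTATION.match(token.strip())
    if match is None:
        raise ValueError(f"unrecognized mutation format: {token!r}")
    g = match.groupdict()
    position = int(g["pos1"] or g["pos2"] or g["pos3"])
    return position, g["ref1"] or g["ref2"], g["alt1"] or g["alt2"]

for token in ["X99Y", "99X>Y", "p.X99Y", "p.99X>Y", "c.X99Y", "c.99X>Y", "99"]:
    print(token, parse_mutation(token))
```

Whether a parsed change refers to a nucleotide or an amino acid is deliberately left out of the sketch, since that depends on the accompanying identifier rather than on the mutation string.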

Genomic Coordinates

Clustering based on GRCh38 genomic coordinates is available via the Advanced Cluster form with data source selected as “User”. Here, a user may upload a file of genomic coordinates or paste them into the box provided. The coordinates should be in the format of chr22:26070464 or with mutations in the format of chr22:26070464A>T. These can be arranged in any format within the text box or in an uploaded file as long as there are delimiters and/or line breaks separating different positions. Note that as with other mutation formats, it is not necessary to indicate the residue change that occurs at the genomic position. If no change is indicated, mutation3D will assume a missense mutation has occurred due to a nucleotide mutation at this position.
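
A quick way to check a coordinate list before uploading it is to extract every entry that matches the two formats above. The Python sketch below is an assumption-laden illustration rather than the site's own parser: the names GENOMIC and parse_genomic are hypothetical, and only the bases A, C, G, and T are recognized in the optional change.

```python
import re

# chr<name>:<position>, optionally followed by <ref>><alt>.
GENOMIC = re.compile(r"chr([0-9A-Za-z]+):(\d+)(?:([ACGT])>([ACGT]))?")

def parse_genomic(text):
    """Return (chromosome, position, ref, alt) for every coordinate found in text."""
    results = []
    for match in GENOMIC.finditer(text):
        chrom, pos, ref, alt = match.groups()  # ref/alt are None if no change given
        results.append((chrom, int(pos), ref, alt))
    return results

print(parse_genomic("chr22:26070464, chr22:26070464A>T"))
```

Because the pattern is searched for anywhere in the text, any delimiter or line break between entries is tolerated.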

Quick Cluster

The Quick Cluster form is available on the mutation3D home page. Users may input a single identifier (including UniProt, Gene/Protein Symbol, or Ensembl Transcript) in the first box, and a list of corresponding sequence positions and/or positions and mutations based on the formats above in the second box. Searches here will take users directly to a 3D view of their protein of interest.

Advanced Cluster

On the Advanced Cluster page, users may set clustering parameters and select a subset of protein structures and models to include. They may also choose

Batch Search

To initiate a batch search with user data, click on the “User” button under the Data Source select option (selected by default). Provide data in the input box or by uploading a file. Both methods require the same format of one identifier followed by a list of positions and/or mutations per line. The only exception to this rule is genomic positions, which may be provided on any number of lines as long as they are separated by delimiters such as a space or a comma. Users may mix and match identifier types. Below are a few examples of acceptable content for either the input box or as an uploaded file.

ENST00000256078 35G>A;104C>T;183A>C;216G>A
ENST00000369535 183A>T
ENST00000323571 647G>A
ENST00000323571 1040G>A
POLM p.P311R
POLM p.A278V
KMT2C L4219V
KMT2C M3097I
P00519 290M>V
chr7:140477806T>C
chr7:140477807C>G
chr7:140477810T>A
chr7:140453155C>T
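
To make the line format concrete, the Python sketch below groups input like the examples above by identifier and collects genomic coordinates separately. It is an illustration only: the function name parse_batch is hypothetical, and splitting on whitespace, commas, and semicolons is an assumption based on the delimiters that appear in the examples.

```python
import re
from collections import defaultdict

def parse_batch(text):
    """Group positions/mutations by identifier; collect genomic coordinates separately."""
    by_identifier = defaultdict(list)
    genomic = []
    for line in text.splitlines():
        tokens = [t for t in re.split(r"[\s,;]+", line.strip()) if t]
        if not tokens:
            continue  # skip blank lines
        if re.match(r"chr\w+:\d+", tokens[0]):
            genomic.extend(tokens)  # genomic lines carry no separate identifier
        else:
            by_identifier[tokens[0]].extend(tokens[1:])
    return dict(by_identifier), genomic

example = """ENST00000256078 35G>A;104C>T;183A>C;216G>A
POLM p.P311R
POLM p.A278V
P00519 290M>V
chr7:140477806T>C
chr7:140477807C>G"""
by_identifier, genomic = parse_batch(example)
print(by_identifier)
print(genomic)
```

Entries for the same identifier on separate lines, such as the two POLM lines, end up in one list.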

COSMIC Search

A search of available COSMIC data allows a user to analyze data from the literature. To initiate a batch search of mutations from a publication, click on the “COSMIC” button under the Data Source select option. A table of publications will appear. Search for publications based on a combination of the following: PubMed ID, tissue(s) studied, last name of the first author, and journal name. When you have found a publication to analyze, click on the corresponding row and, if the study covers more than one tissue, select a single tissue to analyze in the box below.

ModBase Model Filters

mutation3D supplements its crystal structure coverage of the proteome with homology models from ModBase. ModBase provides many quality metrics for judging the accuracy of each of its models, several of which we have included for filtering purposes. The user may select a subset of models by setting these parameters on the advanced page. By default, all parameters are set to their most relaxed levels except for MPQS. Because this parameter encompasses all of the others, filtering on it alone allows some leniency in the individual parameters (i.e. one low quality parameter can be compensated for by several high quality parameters). ModBase also states that a model should be considered to have a reliable fold assignment if any one of parameters 3-5 (below) falls in an acceptable range. Therefore, setting all parameters to their suggested thresholds may unnecessarily remove models from consideration. However, individual users may still choose to filter on different or additional parameters by setting any combination of the parameters below.

1. Protein Coverage: The fraction of the full-length UniProt protein covered by the model, regardless of the identity or accuracy of the amino acids in the model. There is no suggested cutoff for Protein Coverage as it can vary based on an individual’s requirements.
2. Sequence Identity: The fraction of a model’s amino acid sequence that is identical to the UniProt amino acid sequence, on a scale of 0 to 1. Note that this measure is independent of the Protein Coverage (i.e. Sequence Identity can still equal 1 if all of the amino acids included in the model are identical to amino acids in the covered region of the protein regardless of how large this region is). There is no suggested cutoff for Sequence Identity as it can vary based on an individual’s requirements.
3. e-value: The significance of the alignment between the template PDB structure sequence and the target UniProt sequence as reported by NCBI’s PSI-BLAST program or a similar alignment score calculated by ModPipe2. The ModBase-suggested quality cutoff using e-value alone is e-value < 10⁻⁴.
4. Discrete Optimized Protein Energy (DOPE) score: Also known as zDOPE, it is a derived atomic distance-dependent statistical potential calculated by ModBase from a sample of native structures. The ModBase publication describes this measure in great detail. Lower zDOPE scores indicate a more accurate model. The ModBase-suggested quality cutoff using zDOPE score alone is zDOPE < 0.
5. ModPipe Quality Score (MPQS): A composite score calculated by ModBase comprising several measures including measures 1-4. Since this score covers all previous scores, mutation3D uses the ModBase-suggested threshold for a high quality model of MPQS ≥ 1.1 by default.
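
The effect of the default settings can be illustrated with a simple filter in which every threshold except MPQS starts at its most relaxed value. This Python sketch is not how the site is implemented; the ModelQuality fields, the passes function, and the example scores are all hypothetical, while the cutoffs 10⁻⁴, 0, and 1.1 are the suggested values quoted above.

```python
from dataclasses import dataclass

@dataclass
class ModelQuality:
    coverage: float   # fraction of the full-length protein covered
    identity: float   # sequence identity, 0 to 1
    evalue: float
    zdope: float
    mpqs: float

def passes(model, min_coverage=0.0, min_identity=0.0,
           max_evalue=float("inf"), max_zdope=float("inf"), min_mpqs=1.1):
    """Apply the filters; defaults are fully relaxed except for MPQS."""
    return (model.coverage >= min_coverage
            and model.identity >= min_identity
            and model.evalue < max_evalue
            and model.zdope < max_zdope
            and model.mpqs >= min_mpqs)

model = ModelQuality(coverage=0.4, identity=0.3, evalue=1e-3, zdope=0.5, mpqs=1.3)
print(passes(model))                                # True
print(passes(model, max_evalue=1e-4, max_zdope=0))  # False
```

The example model passes on MPQS alone but is rejected once the e-value and zDOPE cutoffs are also enforced, which is the over-filtering the default settings are meant to avoid.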

Clustering Parameters

These parameters define the properties of an acceptable cluster in mutation3D. Suggested values are pre-set in both the Quick Cluster interface and the advanced page, but a user may choose to change the parameters from the advanced page.

Minimum Number of Substitutions: The minimum number of amino acid substitutions required to form a cluster. This refers to the absolute number of substitutions, regardless of whether they exist at the same amino acid index or arise from the same underlying nucleotide mutation across multiple clinical samples. By default, Minimum Number of Substitutions = 3.
Minimum Number of Unique Substitutions: The minimum number of amino acid substitutions existing at unique amino acid indices required to form a cluster. In this sense, multiple mutations in one codon are counted as a single unique substitution, regardless of whether they give rise to different mutant amino acid residues. By default, Minimum Number of Unique Substitutions = 2.
Maximum Intracluster Distance: The maximum allowable distance in Angstroms between any two amino acid substitutions in a cluster. This is the same as the CL-distance in complete-linkage clustering and can be thought of as the maximum allowable diameter of a sphere containing all points in a cluster. By default, Maximum Intracluster Distance = 15 Å.
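
The three parameters combine into a single acceptance test for a candidate cluster. The Python sketch below shows one way to express that test with the default values listed above; the function name, its arguments, and the toy coordinates are assumptions for illustration and do not come from the mutation3D code.

```python
import math
from itertools import combinations

def is_acceptable_cluster(positions, coords,
                          min_subs=3, min_unique=2, max_distance=15.0):
    """Check a candidate cluster against the three clustering parameters.

    positions: residue numbers, repeated once per observed substitution.
    coords: dict mapping residue number -> (x, y, z) of its alpha-carbon.
    """
    unique = set(positions)
    if len(positions) < min_subs or len(unique) < min_unique:
        return False
    # Largest distance between any two substituted residues in the cluster.
    widest = max((math.dist(coords[a], coords[b])
                  for a, b in combinations(unique, 2)), default=0.0)
    return widest <= max_distance

# Toy coordinates in Angstroms (hypothetical).
coords = {12: (0.0, 0.0, 0.0), 13: (3.8, 0.0, 0.0), 49: (40.0, 0.0, 0.0)}
print(is_acceptable_cluster([12, 12, 13], coords))  # True
print(is_acceptable_cluster([12, 12, 12], coords))  # False
```

The second call fails because three substitutions at one residue count as a single unique substitution.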

Output Formats

A single query executed from the main page will initiate a page of interactive 3D structure displays and linear protein models showing the location of amino acid substitutions on protein backbones. An advanced query of many proteins (either from a batch user query or from COSMIC data) will display a table of proteins containing clusters.

Protein Models and Structures Page

3D Protein Models
3D models show representations of atomic coordinates based on corresponding ModBase models or PDB structures, with locations of amino acid substitutions indicated as spheres in place of the α-carbon of each mutated residue. Red spheres indicate substitutions in the cluster currently being viewed and blue spheres indicate substitutions not in the current cluster (however, they may exist in a different cluster). If more than one cluster is found in the currently displayed model, select alternative clusters to show in red by clicking on them in the panel to the right. Standard molecular viewer mouse and keyboard shortcuts can be used to zoom (Shift + drag, or scroll wheel), pan (Ctrl + click and drag), and rotate (click and drag) the model. Alternate models and structures can be loaded by clicking the Select Model button and selecting one of the linear protein models below. The molecular viewer is based purely in JavaScript and is powered by GLmol, an open source library for displaying molecular structures in WebGL-enabled browsers. Most major browsers, such as Firefox and Chrome, support WebGL automatically. To check if your browser supports WebGL and for more information on how to enable WebGL in other browsers, visit http://get.webgl.org/. PyMOL-compatible structure files are available for download and offline viewing by clicking on the PyMOL Session file in the model details panel below the primary linear model.

Linear Protein Models
The primary linear protein model shows information corresponding to the currently displayed 3D model above. The full length of the linear model represents all amino acids in the protein according to UniProt. The linear region highlighted in green indicates the region of the full-length protein that is represented in the above 3D model, while the grey region is the remainder of the protein not represented in the currently displayed 3D model (as most models and structures only cover a fraction of a full protein). Vertical red lines indicate the sequence position of amino acid substitutions. Hovering over each of these lines will highlight the corresponding amino acid in the 3D model. Protein domains (from Pfam) are indicated as light blue transparent boxes along the length of the primary linear model. Hover over each domain to see its name and the amino acids contained within it. Selecting a new linear model will refresh the 3D model to match. To select a new model to view, press the Select Model button below the currently displayed linear model. A new panel will expand showing all available models as lines corresponding to the section of the protein each model covers. Hovering over each selectable model will show the area of the protein covered in green on the primary linear model. Selectable models shown in red contain clusters, and hovering over them will display their clusters as text in the cluster selection box. Selectable models in black contain no clusters. One selectable model will appear green; this is the currently selected model. Click to select a new model and display it in the 3D box and refresh other statistics on the page.

Tabular Output Page

Advanced clustering results are first displayed as a table with one row for each protein containing one or more clusters based on the parameters received via the Advanced Cluster page. Results are initially sorted by the P-value of the most statistically significant cluster for each protein, but can be reordered by any of the available columns. The final column, CGC, gives the status of the shown protein with reference to the Cancer Gene Census. If the CGC symbol shows up in dark blue, the protein is contained in this set of known cancer genes; a grey shadow of the symbol indicates the protein’s absence from this set. To view the locations of substitutions in clusters, click on the arrow at the far left of any table entry. In the table that appears below the entry, sort the shown clusters on any of the given cluster and structure/model-specific values. To view a graphical representation of a cluster, click on the corresponding link icon.

Maintenance

mutation3D is updated at intervals corresponding with the release of COSMIC, a new version of which is typically released every three months. At this time, mutation3D refreshes all data underlying the site’s functionality, including models from ModBase and structures from the PDB, and pre-calculates clusters in all publications in COSMIC containing missense mutations.

Contact

This project is maintained by the Yu Group at Cornell University. For contact information and information on mutation3D and other tools, please visit us at yulab.org.