The LowMACA package is a simple suite of tools to investigate and
analyse the profile of the somatic mutations provided by the cBioPortal
(via the cgdsr package). LowMACA evaluates the functional impact of somatic
mutations by statistically assessing the number of alterations that accumulate on the same residue
(or on residues that are conserved in Pfam domains).
For example, the known driver mutations G12, G13 and Q61 in KRAS can be found on the
corresponding residues of other proteins in the RAS family (PF00071) like NRAS and HRAS,
but also in less frequently mutated genes like RRAS and RRAS2.
The corresponding residues are identified via multiple sequence alignment.
Thanks to this approach the user can identify
new driver mutations that occur at low frequency at single protein level but emerge
at Pfam level. In addition, the impact of known driver mutations can be transferred to other proteins that share
a high degree of sequence similarity (like in the RAS family example).
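The idea of pooling mutations on corresponding residues can be sketched in base R. This is only an illustration on an invented two-sequence alignment (the peptides, the protein names `protA`/`protB` and the helper `residue_to_column` are hypothetical and are not part of LowMACA): each protein-level position is translated into its alignment column, and mutations are then counted per column.

```r
# Map each residue of a gapped (aligned) sequence to its alignment column.
residue_to_column <- function(aligned_seq) {
  chars <- strsplit(aligned_seq, "", fixed = TRUE)[[1]]
  which(chars != "-")
}

# Toy alignment of two made-up peptides (not real RAS sequences).
aln <- c(protA = "MTEY-KLVV", protB = "--EYRKLV-")
cols <- lapply(aln, residue_to_column)

# Toy mutation list; proteins and positions are invented for illustration.
muts <- data.frame(
  protein  = c("protA", "protA", "protB", "protB", "protB", "protB"),
  position = c(5, 5, 4, 4, 4, 1),
  stringsAsFactors = FALSE
)

# Translate each protein-level position into its alignment column.
muts$aln_col <- mapply(function(p, i) cols[[p]][i],
                       muts$protein, muts$position, USE.NAMES = FALSE)

# Count mutations per alignment column, pooling both proteins.
pooled <- table(muts$aln_col)
print(pooled)
```

Here residue 5 of `protA` and residue 4 of `protB` fall on the same alignment column, so their mutations add up even though neither protein alone carries many of them.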
You can conduct a hypothesis-driven exploratory analysis using our package simply by providing a set of genes and/or pfam domains of interest. The user can choose the kind of tumor and the type of mutations (like missense, nonsense, frameshift etc.). The data are downloaded directly from the largest cancer sequencing projects and aggregated by LowMACA to evaluate the possible functional impact of somatic mutations by spotting the most conserved variations in the cohort of cancer samples. By connecting several proteins that share sequence similarity via consensus alignment, this package is able to statistically assess the occurrence of mutations on the same residue and ultimately see:
LowMACA relies on two external resources to work properly.
Clustal Omega is a fast aligner that can be downloaded from the link above. For both Unix and Windows users, remember to have "clustalo" in your PATH variable. In case you cannot set "clustalo" in the PATH, you can always set the clustalo command from inside R, after creating a LowMACA object:
#Given a LowMACA object 'lm'
lm <- newLowMACA(genes=c("TP53" , "TP63" , "TP73"))
lmParams(lm)$clustal_cmd <- "/your/path/to/clustalo"
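Before running an alignment it can help to check whether R can see the executable at all. A minimal sketch using base R's `Sys.which` follows; the helper name `has_command` is ours, not a LowMACA function.

```r
# TRUE when an executable named `cmd` can be found on the PATH.
has_command <- function(cmd) {
  unname(nzchar(Sys.which(cmd)))
}

if (has_command("clustalo")) {
  message("clustalo found at: ", Sys.which("clustalo"))
} else {
  message("clustalo is not on the PATH; set lmParams(lm)$clustal_cmd ",
          "or use the web service instead.")
}
```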
If you cannot install Clustal Omega, we provide a wrapper around the EBI web service (http://www.ebi.ac.uk/Tools/webservices/services/msa/clustalo_soap). You just need to set your email as explained in the section on setup, but there is a limit of 2000 input sequences, and perl must be installed with the modules LWP and XML::Simple.
Ghostscript is an interpreter for the PostScript language and for PDF files, and it is used by the R library grImport.
More details can be found here: http://pgfe.umassmed.edu/BioconductorGallery/docs/motifStack/motifStack.html
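A quick way to see whether R can locate a Ghostscript executable is `tools::find_gs_cmd()`, which ships with R. This sketch only reports what R finds; whether grImport picks up the same executable on your system is an assumption you should verify.

```r
# Ask R for the Ghostscript executable it would use (empty string if none).
gs_path <- unname(tools::find_gs_cmd())

if (nzchar(gs_path)) {
  message("Ghostscript found at: ", gs_path)
} else {
  message("Ghostscript was not found; install it or add it to the PATH.")
}
```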
LowMACA needs an internet connection to:
First of all, we have to define our target genes or pfam domains that we wish to analyse.
library(LowMACA)
#User Input
Genes <- c("ADNP","ALX1","ALX4","ARGFX","CDX4","CRX"
    ,"CUX1","CUX2","DBX2","DLX5","DMBX1","DRGX"
    ,"DUXA","ESX1","EVX2","HDX","HLX","HNF1A"
    ,"HOXA1","HOXA2","HOXA3","HOXA5","HOXB1","HOXB3"
    ,"HOXD3","ISL1","ISX","LHX8")
Pfam <- "PF00046"

#Construct the object
lm <- newLowMACA(genes=Genes, pfam=Pfam)
str(lm , max.level=3)
Now we have created a LowMACA object. In this case, we want to analyse
the homeodomain fold pfam (PF00046), considering 28 genes that belong to this clan.
If we don't specify the pfam parameter, LowMACA proceeds to analyse
the entire proteins passed by the genes parameter (we map only canonical proteins, one per gene).
Vice versa, if we don't specify the genes parameter, LowMACA
looks for all the proteins that contain the specified pfam and analyses just the portion of the protein assigned to the domain.
A LowMACA object is composed of four slots. The first slot is arguments and is filled when the object is created. It contains information such as the UniProt names of the proteins associated with the genes, the amino acid sequences, the start and end of the selected domains, and the default parameters that can be changed before starting the analysis.
#See default parameters
lmParams(lm)
#Change some parameters
#Accept sequences even with no mutations
lmParams(lm)$min_mutation_number <- 0
#Changing selected tumor types
#Check the available tumor types in cBioPortal
available_tumor_types <- showTumorType()
head(available_tumor_types)
#Select melanoma, stomach adenocarcinoma, uterine cancer, lung adenocarcinoma,
#lung squamous cell carcinoma, colon rectum adenocarcinoma and breast cancer
lmParams(lm)$tumor_type <- c("skcm" , "stad" , "ucec" , "luad"
    , "lusc" , "coadread" , "brca")
lm <- alignSequences(lm)
This method is largely self-explanatory: it aligns the sequences in the object. If you have not installed Clustal Omega yet, you can use the Clustal Omega web service that we wrapped in our R package. The limit is set to 2000 sequences and it is slower than a local installation. Remember to pass your own email address through the mail argument to activate this option, since it is required by the EBI server.
lm <- alignSequences(lm , mail="lowmaca@gmail.com")
#Access to the slot alignment
myAlignment <- lmAlignment(lm)
str(myAlignment , max.level=2 , vec.len=2)
lm <- getMutations(lm)
lm <- mapMutations(lm)
These commands fill the mutation slot with the results retrieved through the cgdsr R package.
#Access to the slot mutations
myMutations <- lmMutations(lm)
str(myMutations , vec.len=3 , max.level=1)
If we want to check what are the most represented genes in terms of number of mutations divided by tumor type, we can simply run:
myMutationFreqs <- myMutations$freq
#Showing the first genes
myMutationFreqs[ , 1:10]
This can be useful for a stratified analysis in the future.
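To make the notion of a count stratified by tumor type concrete, here is a toy sketch in base R on invented data. It does not reproduce the layout of `myMutations$freq`; the data frame `toy_muts` and its `Tumor_Type` column name are assumptions made for illustration.

```r
# Invented long-format mutation table: one row per observed mutation.
toy_muts <- data.frame(
  Gene_Symbol = c("ISX", "ISX", "CRX", "ISX", "HOXA1", "CRX"),
  Tumor_Type  = c("skcm", "skcm", "luad", "brca", "skcm", "luad"),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are tumor types, columns are genes.
counts <- table(toy_muts$Tumor_Type, toy_muts$Gene_Symbol)
print(counts)

# Total mutations per gene, most mutated gene first.
per_gene <- sort(colSums(counts), decreasing = TRUE)
print(per_gene)
```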
To simplify this setup process, you can use the command setup directly to launch alignSequences, getMutations and mapMutations at once:
#Local Installation of clustalo
lm <- setup(lm)
#Web Service
lm <- setup(lm , mail="lowmaca@gmail.com")
If you have your own data and you don't need to rely on the cgdsr package, you can use the getMutations or setup method with the parameter repos, like this:
#Reuse the downloaded data as a toy example
myOwnData <- myMutations$data
#How myOwnData should look like for the argument repos
str(myMutations$data , vec.len=1)
#Read the mutation data repository instead of using cgdsr package
#Following the process step by step
lm <- getMutations(lm , repos=myOwnData)
#Setup in one shot
lm <- setup(lm , repos=myOwnData)
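When preparing your own table for the repos argument, one cautious approach is to compare it against a data frame already known to work (for example the downloaded `myMutations$data`). The helper below is a hypothetical base R sketch, not a LowMACA function, and the column names in the toy data are placeholders rather than the required layout.

```r
# Report columns that are missing from `candidate` or whose class differs
# from the corresponding column in `template`.
check_layout <- function(template, candidate) {
  missing_cols <- setdiff(names(template), names(candidate))
  shared <- intersect(names(template), names(candidate))
  same_class <- vapply(
    shared,
    function(col) class(template[[col]])[1] == class(candidate[[col]])[1],
    logical(1)
  )
  list(missing = missing_cols, wrong_type = shared[!same_class])
}

# Toy template and candidate; column names are illustrative placeholders.
template <- data.frame(Gene_Symbol = "ISX", Position = 4, Tumor_Type = "skcm",
                       stringsAsFactors = FALSE)
candidate <- data.frame(Gene_Symbol = "CRX", Position = "4",
                        stringsAsFactors = FALSE)

res <- check_layout(template, candidate)
print(res)
```

Note that the comparison is strict: an integer column checked against a numeric template column is reported as a mismatch, which may or may not matter for your data.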
In this step we calculate the general statistics for the entire consensus profile
lm <- entropy(lm)
#Global Score
myEntropy <- lmEntropy(lm)
str(myEntropy)
#Per position score (re-read the alignment slot, which entropy has updated)
head(lmAlignment(lm)$df)
With the method entropy, we calculate the entropy score and a p-value against the null hypothesis that the mutations are distributed randomly across our consensus protein. In addition, a test is performed for every position of the consensus and the output is reported in the slot alignment. Position 4 has a conservation score of 0.88 (highly conserved) and the corrected p-value is significant (q-value below 0.01), so there are signs of positive selection at position 4. To retrieve the original mutations that generated that cluster, we can use the function lfm:
significant_muts <- lfm(lm)
#Display original mutations that formed significant clusters (column Multiple_Aln_pos)
head(significant_muts)
#What are the genes mutated in position 4 in the consensus?
genes_mutated_in_pos4 <- significant_muts[ significant_muts$Multiple_Aln_pos==4 , 'Gene_Symbol']
sort(table(genes_mutated_in_pos4))
The position 4 accounts for mutations in 13 different genes. The most represented one is ISX (ISX_HUMAN, Intestine-specific homeobox protein).
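The global entropy test used in this step can be illustrated with a small base R sketch. This is not LowMACA's implementation: it uses plain Shannon entropy and a simple Monte Carlo null in which the same number of mutations is placed uniformly at random along the consensus. The function names and the toy profile are assumptions made for illustration.

```r
# Shannon entropy (in bits) of a vector of non-negative counts.
shannon_entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]  # terms with p = 0 contribute nothing
  -sum(p * log2(p))
}

# Monte Carlo p-value: fraction of uniform random profiles whose entropy is
# at least as low as the observed one (low entropy = clustered mutations).
entropy_pvalue <- function(counts, n_sim = 2000, seed = 1) {
  set.seed(seed)
  n_pos <- length(counts)
  n_mut <- sum(counts)
  observed <- shannon_entropy(counts)
  simulated <- replicate(n_sim, {
    pos <- sample.int(n_pos, n_mut, replace = TRUE)
    shannon_entropy(tabulate(pos, nbins = n_pos))
  })
  (sum(simulated <= observed) + 1) / (n_sim + 1)
}

# Toy consensus of 60 positions: one mutation everywhere, 30 on position 4.
toy_counts <- rep(1, 60)
toy_counts[4] <- 30

toy_entropy <- shannon_entropy(toy_counts)
toy_p <- entropy_pvalue(toy_counts)
print(c(entropy = toy_entropy, p_value = toy_p))
```

A profile dominated by one hotspot has a much lower entropy than random placements of the same mutations, so its p-value is small; a perfectly flat profile has the maximum entropy and is never significant under this null.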
bpAll(lm)
This barplot shows all the mutations reported on the consensus sequence divided by protein/pfam domain
lmPlot(lm)
This four layer plot encompasses:
#This plot is saved as a png image on a temporary file
tmp <- tempfile(pattern = "homeobox_protter" , fileext = ".png")
protter(lm , filename=tmp)
knitr::include_graphics(tmp)
A request to the Protter server is sent and a png file is downloaded with the possible sequence structure of the protein and the significant positions colored in orange and red
An alternative use of LowMACA consists in analysing all the Pfams and single sequences encompassed by a specific set of mutations. For example, it is possible to analyse mutations derived from a cohort of patients to see which Pfams and sets of mutations are enriched, following the LowMACA statistics.
The function allPfamAnalysis takes as input a data.frame or the name of a file that contains the set of mutations, analyses all the Pfams that are represented and reports all the significant mutations as output. Moreover, the function allPfamAnalysis analyses all the mutated genes individually and reports the significant mutations found by this analysis as part of the output.
#Load Homeobox example
data(lmObj)
#Extract the data inside the object as a toy example
myData <- lmMutations(lmObj)$data
#Run allPfamAnalysis on every mutations
significant_muts <- allPfamAnalysis(repos=myData)
#Show the result of alignment based analysis
head(significant_muts$AlignedSequence)
#Show all the genes that harbor significant mutations
unique(significant_muts$AlignedSequence$Gene_Symbol)
#Show the result of the Single Gene based analysis
head(significant_muts$SingleSequence)
#Show all the genes that harbor significant mutations
unique(significant_muts$SingleSequence$Gene_Symbol)
The parameter allLowMACAObjects can be used to specify the name of the file where all the Pfam analyses will be stored (by default this information is not stored, because the resulting file can be huge, depending on the size of the input dataset). In this case, all the analysed Pfams are stored as LowMACA objects and they can be loaded and analysed with the usual LowMACA workflow.
Copy and paste into your R console and perform the entire analysis by yourself. You need Ghostscript to see all the plots.
library(LowMACA)
Genes <- c("ADNP","ALX1","ALX4","ARGFX","CDX4","CRX"
    ,"CUX1","CUX2","DBX2","DLX5","DMBX1","DRGX"
    ,"DUXA","ESX1","EVX2","HDX","HLX","HNF1A"
    ,"HOXA1","HOXA2","HOXA3","HOXA5","HOXB1","HOXB3"
    ,"HOXD3","ISL1","ISX","LHX8")
Pfam <- "PF00046"
lm <- newLowMACA(genes=Genes , pfam=Pfam)
lmParams(lm)$tumor_type <- c("skcm" , "stad" , "ucec" , "luad"
    , "lusc" , "coadread" , "brca")
lmParams(lm)$min_mutation_number <- 0
lm <- setup(lm , mail="lowmaca@gmail.com")
lm <- entropy(lm)
lfm(lm)
bpAll(lm)
lmPlot(lm)
protter(lm)
sessionInfo()