``` {r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  out.width = "100%"
)
```
The rseAnalysis (RNA Structure and Expression Analysis) package includes a series of utility functions: standard file readers, secondary structure prediction, RNA distance calculation, and statistical analysis. rseAnalysis provides an all-in-one solution for correlating gene expression with secondary structure mutations by automating the processing and analysis of gene expression and mutation data from well-established databases such as miRBase and TCGA.
To install rseAnalysis, use the following commands:

``` {r, eval=FALSE}
# install.packages("devtools")
devtools::install_github("JackXu2333/rseAnalysis")
```
The functions vcf2df, fasta2df, and bed2df read files with the vcf, fasta, and bed extensions. The structure of each output is as follows:
fasta object
\itemize{
\item NAME - The name of the RNA sequence
\item SEQ - The original RNA sequence
}
bed object
\itemize{
\item CHROM - The chromosome on which the sequence is located
\item STAPOS - The start location of the sequence
\item ENDPOS - The end location of the sequence
\item DIR - The direction of the sequence
\item TYPE - The type of the sequence
\item ID - ID of the RNA (if available)
\item ALIAS - The alias of the sequence
\item NAME - The name of the sequence
}
vcf object
\itemize{
\item CHROM - The chromosome on which the mutation is located
\item POS - The start position of the mutation
\item ID - ID of the mutation (if available)
\item REF - The reference base(s) at the mutation point
\item ALT - The alternative base(s) that appear at the mutation point
\item CONSEQUENCE - The effect of the mutation as reported in biological studies
\item OCCURRENCE - The associated cancer and its distribution
\item affected_donors - The total number of donors affected
\item project_count - The number of projects associated with the current studies
}
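To make the fasta object layout concrete, the following is a minimal sketch of how FASTA text could be turned into a data frame with NAME and SEQ columns. The helper `read_fasta_sketch` is hypothetical and written in base R for illustration only; it is not the implementation of fasta2df, and it assumes every header line is followed by at least one sequence line.

```r
# Hypothetical helper, not part of rseAnalysis: parse FASTA lines into a
# data frame with the NAME and SEQ columns described above.
read_fasta_sketch <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]            # drop blank lines
  is_header <- startsWith(lines, ">")

  # Each header starts a new record; cumsum gives every line its record index
  record <- cumsum(is_header)

  # Join the sequence lines that belong to the same record
  seqs <- tapply(lines[!is_header], record[!is_header], paste, collapse = "")

  data.frame(
    NAME = sub("^>", "", lines[is_header]),
    SEQ  = unname(as.character(seqs)),
    stringsAsFactors = FALSE
  )
}

# Toy input: the second line pair shows a sequence wrapped over two lines
example_lines <- c(">seq1", "ACGU", "GGCA", ">seq2", "UUAGC")
fasta_df <- read_fasta_sketch(example_lines)
print(fasta_df)
```

The bed and vcf objects follow the same idea: one row per record, with the columns listed above.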
``` {r}
# Source library
library(rseAnalysis)
library(ggplot2)

# Load sample data file
vcf <- rseAnalysis::vcf2df(system.file("extdata", "hsa_GRCh37.vcf", package = "rseAnalysis"))
fasta <- rseAnalysis::fasta2df(system.file("extdata", "hsa_GRCh37.fasta", package = "rseAnalysis"))
bed <- rseAnalysis::bed2df(system.file("extdata", "hsa_GRCh37.bed", package = "rseAnalysis"))

# Inspect the imported file
head(vcf)
head(fasta)
head(bed)
```
Find the mutated RNA sequences based on the fasta, bed, and vcf files:

``` {r mutate, warning=FALSE}
RNA.mutated <- RNA.validate(fasta = fasta, vcf = vcf, bed = bed)
```

Note that the message reports a matching rate of 0.714. This means that only 71.4% of the mutation references match the sequence provided in the fasta file. The remaining 28.6% fail to match because of misalignment, differences in genome assembly between the bed and fasta files, or a potentially different representation of the wild type in the vcf and fasta files. For tutorial purposes, we skip the mismatches and predict the secondary structure for the remaining sequences.

``` {r structure}
# ================== Sample code for RNA secondary structure prediction ========================== #
#
# struct.ori <- suppressMessages(predictStructure(executable.path = "../inst/extdata/exe"
#       , rna.name = RNA.mutated$NAME, rna.seq = RNA.mutated$SEQ))
# struct.alt <- suppressMessages(predictStructure(executable.path = "../inst/extdata/exe"
#       , rna.name = RNA.mutated$NAME, rna.seq = RNA.mutated$MUT.SEQ))

# Read prerun result from the predictStructure
RNA.mutated <- subset(RNA.mutated, MATCH)[1:200,]
struct.ori <- read.csv(system.file("extdata", "vignetteSampleORI.csv", package = "rseAnalysis"))
struct.alt <- read.csv(system.file("extdata", "vignetteSampleALT.csv", package = "rseAnalysis"))

head(struct.ori)
head(struct.alt)
```
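The matching rate reported by RNA.validate can be illustrated with a small base R sketch. This is not the package's implementation: the toy data, and the assumption that POS is already a 1-based offset within each sequence, are made up for illustration. Only the column names MATCH and MUT.SEQ mirror those used in the chunks above.

```r
# Toy data (hypothetical): one mutation per row, POS relative to SEQ
mutations <- data.frame(
  SEQ = c("ACGUACGU", "GGCAUUGC", "UUAGCCGA"),
  POS = c(3, 5, 2),
  REF = c("G", "U", "C"),
  ALT = c("A", "C", "G"),
  stringsAsFactors = FALSE
)

# Base(s) actually present in the sequence at each mutation site
ref_end  <- mutations$POS + nchar(mutations$REF) - 1
observed <- substr(mutations$SEQ, mutations$POS, ref_end)

# A mutation matches when the vcf reference equals the sequence at that site
mutations$MATCH <- observed == mutations$REF
matching_rate <- mean(mutations$MATCH)

# Substitute the alternative base(s) only where the reference matches
mutations$MUT.SEQ <- ifelse(
  mutations$MATCH,
  paste0(
    substr(mutations$SEQ, 1, mutations$POS - 1),
    mutations$ALT,
    substr(mutations$SEQ, ref_end + 1, nchar(mutations$SEQ))
  ),
  NA_character_
)

print(mutations)
print(matching_rate)
```

In this toy example the third mutation lists C as the reference while the sequence has U at that position, so it is flagged as a mismatch and receives no mutated sequence.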
Calculating the RNA distance from the structure prediction results is straightforward. The result helps determine how much the structure differs between the original and the mutated RNA sequence. The executable.path argument can be omitted by Mac or Unix users who have RNAstructure installed; see ?predictDistance for details.
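As rough intuition for what a structural distance measures, here is a sketch of the base-pair distance between two structures. It assumes the structures are dot-bracket strings of equal length, and it is a simpler metric than the "gsc" method passed to predictDistance; the helpers `base_pairs` and `base_pair_distance` are hypothetical and not part of rseAnalysis.

```r
# Hypothetical helper: list the base pairs ("i-j") in a dot-bracket string
base_pairs <- function(dot_bracket) {
  chars <- strsplit(dot_bracket, "")[[1]]
  stack <- integer(0)
  pairs <- character(0)
  for (i in seq_along(chars)) {
    if (chars[i] == "(") {
      stack <- c(stack, i)                   # remember the opening position
    } else if (chars[i] == ")") {
      if (length(stack) == 0) stop("unbalanced structure: ", dot_bracket)
      j <- stack[length(stack)]              # most recent unmatched "("
      stack <- stack[-length(stack)]
      pairs <- c(pairs, paste(j, i, sep = "-"))
    }
  }
  if (length(stack) > 0) stop("unbalanced structure: ", dot_bracket)
  pairs
}

# Base-pair distance: number of pairs present in exactly one structure
base_pair_distance <- function(a, b) {
  pa <- base_pairs(a)
  pb <- base_pairs(b)
  length(setdiff(pa, pb)) + length(setdiff(pb, pa))
}

d_same <- base_pair_distance("((..))", "((..))")  # identical structures
d_diff <- base_pair_distance("((..))", "(....)")  # one pair lost
print(c(d_same, d_diff))
```

A distance of zero means the mutation left the predicted pairing unchanged; larger values mean more base pairs were gained or lost.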
``` {r distance, message=FALSE, warning=FALSE}
RNA.distance <- predictDistance(name = RNA.mutated$NAME,
                                struct.ori = struct.ori$struct.ori,
                                struct.alt = struct.alt$struct.alt,
                                method = "gsc")
```

## RNA distance and gene expression analysis

``` {r analysis, warning=FALSE}
# Load expression data
expression <- read.csv(system.file("extdata", "test.csv", package = "rseAnalysis"), header = TRUE)

# Use only standardized reads
expression <- subset(expression, Read.Type == "reads_per_million_miRNA_mapped")[1:200, ]

result <- Analysis.DISEXP(dis.name = RNA.mutated$NAME,
                          dis.distance = RNA.distance,
                          exp.tumor = expression$Sample,
                          exp.sample = expression$Normal,
                          method = "linear",
                          showPlot = FALSE)

# Display statistical result
result$stats

# Display images from result
# result$plots
```
Analysis.DISEXP models the change in expression as the absolute difference in expression between the tumour and normal samples from BRCA patients. Here we use "reads_per_million_miRNA_mapped" as the input because it is standardized and can be compared across cases. Beware: for the analysis to work, the gene expression data and the distance data must come from the same type of mutation or from the same sample (although a single sample is usually insufficient for retrieving both expression and sequencing information). Otherwise, the outcome of the analysis is affected dramatically.
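One way to guard against mismatched inputs is to pair distance and expression records on a shared identifier before the analysis. The sketch below uses base R's merge on toy data; the data frames and the DISTANCE column are hypothetical, and this step is not something Analysis.DISEXP is known to do internally.

```r
# Toy data (hypothetical): distances and expression for partly different RNAs
distance_df <- data.frame(
  NAME     = c("mir-1", "mir-2", "mir-3"),
  DISTANCE = c(0, 2.5, 1),
  stringsAsFactors = FALSE
)
expression_df <- data.frame(
  NAME   = c("mir-3", "mir-1", "mir-4"),
  Sample = c(120, 80, 15),
  Normal = c(100, 95, 20),
  stringsAsFactors = FALSE
)

# Inner join: keep only RNAs that have both a distance and expression values
paired <- merge(distance_df, expression_df, by = "NAME")
print(paired)
```

After the merge, each row of `paired` carries a distance and the expression values for the same RNA, so the vectors passed on to the analysis are aligned row by row.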
Here, Analysis.DISEXP generates both text and graphical output, with the text output reporting the beta and p_value of the resulting regression model. In this example, the correlation between RNA distance and the change in gene expression is -0.11 with a p-value of 0.12. The graphical output shows the prediction model and its confidence interval on a scatter plot, the RNA distance distribution by RNA type, and boxplots showing potential outliers in the gene expression and RNA distance data sets.
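For readers who want to see where a beta and p-value of this kind come from, the following sketch fits a linear model with base R's lm on simulated data. It only assumes that the "linear" method is an ordinary linear regression of the absolute expression difference on distance; the data are random and the numbers it produces are unrelated to the results quoted above.

```r
set.seed(1)  # simulated data only, not results from rseAnalysis

n <- 50
distance <- runif(n, min = 0, max = 10)   # stand-in for RNA distances
tumor    <- rexp(n, rate = 0.01)          # stand-in for tumour expression
normal   <- rexp(n, rate = 0.01)          # stand-in for normal expression

# Absolute change in expression between tumour and normal samples
expr_diff <- abs(tumor - normal)

# Ordinary least squares: expression change as a function of distance
fit   <- lm(expr_diff ~ distance)
coefs <- summary(fit)$coefficients

beta    <- coefs["distance", "Estimate"]   # slope of the fitted line
p_value <- coefs["distance", "Pr(>|t|)"]   # test of slope == 0

print(c(beta = beta, p_value = p_value))
```

A slope near zero with a large p-value, as in the example reported above, indicates no clear linear relationship between structural distance and expression change in that data set.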
Kozomara, A., & Griffiths-Jones, S. (2011). miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic acids research, 39(Database issue), D152–D157. https://doi.org/10.1093/nar/gkq1027
Wickham, H. and Bryan, J. (2019). R Packages (2nd edition). Newton, Massachusetts: O'Reilly Media. https://r-pkgs.org/
TCGA Research Network: https://www.cancer.gov/tcga.
Zhiwen. T, Sijie Xu (2020) miRNA Motif Analysis https://github.com/Deemolotus/BCB330Y-and-BCB430Y/tree/master/Main
``` {r}
sessionInfo()
```