knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
NxtIRFcore will no longer be supported after Bioconductor version 3.16. Its full functionality (plus heaps more) is replaced by SpliceWiz which will be available on Bioconductor version 3.16 onwards.
This section provides instructions for installation and a quick working example to demonstrate the important functions of NxtIRF. NxtIRFcore is the command line utility for NxtIRF.
For detailed explanations of each step shown here, refer to chapter 2: "Explaining the NxtIRF workflow" in this vignette. For a list of ready-made "recipes" for typical-use NxtIRF in real datasets, refer to chapter 3: "NxtIRF cookbook"
To install NxtIRFcore, start R (version "4.1") and enter:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("NxtIRFcore")
(Optional) For MacOS users, make sure OpenMP libraries are installed
correctly. We recommend users follow this
guide, but the
quickest way to get started is to install libomp
via brew:
```{bash eval=FALSE} brew install libomp
### Loading NxtIRF ```r library(NxtIRFcore)
A NxtIRF reference requires a genome FASTA file (containing genome nucleotide sequences) and a gene annotation GTF file (preferably from Ensembl or Gencode).
NxtIRF provides an example genome and gene annotation which can be accessed via the NxtIRFdata package installed with NxtIRF:
# Provides the path to the example genome: chrZ_genome() # Provides the path to the example gene annotation: chrZ_gtf()
Using these two files, we construct a NxtIRF reference as follows:
ref_path = file.path(tempdir(), "Reference") BuildReference( reference_path = ref_path, fasta = chrZ_genome(), gtf = chrZ_gtf() )
NxtIRF provides an example set of 6 BAM files to demonstrate its use via this vignette.
Firstly, retrieve the BAM files from ExperimentHub using the NxtIRF helper
function NxtIRF_example_bams()
. This makes a copy of the BAM files to the
temporary directory:
bams = NxtIRF_example_bams() bams
Finally, run NxtIRF/IRFinder as follows:
irf_path = file.path(tempdir(), "IRFinder_output") IRFinder( bamfiles = bams$path, sample_names = bams$sample, reference_path = ref_path, output_path = irf_path )
First, collate the IRFinder output files using the helper function
Find_IRFinder_Output()
expr = Find_IRFinder_Output(irf_path)
This creates a 3-column data frame with sample name, IRFinder gzipped text output, and COV files. Compile these output files into a single experiment:
nxtirf_path = file.path(tempdir(), "NxtIRF_output") CollateData( Experiment = expr, reference_path = ref_path, output_path = nxtirf_path )
The NxtSE
is a data structure that inherits SummarizedExperiment
se = MakeSE(nxtirf_path)
colData(se)$condition = rep(c("A", "B"), each = 3) colData(se)$batch = rep(c("K", "L", "M"), 2)
The code below will contrast condition:B in respect to condition:A
# Requires limma to be installed: require("limma") res_limma = limma_ASE( se = se, test_factor = "condition", test_nom = "B", test_denom = "A", ) # Requires DESeq2 to be installed: require("DESeq2") res_deseq = DESeq_ASE( se = se, test_factor = "condition", test_nom = "B", test_denom = "A", ) # Requires DoubleExpSeq to be installed: require("DoubleExpSeq") res_DES = DoubleExpSeq_ASE( se = se, test_factor = "condition", test_nom = "B", test_denom = "A", )
Filter by visibly-different events:
res_limma.filtered = subset(res_limma, abs(AvgPSI_A - AvgPSI_B) > 0.05)
Plot individual samples:
p = Plot_Coverage( se = se, Event = res_limma.filtered$EventName[1], tracks = colnames(se)[c(1,2,4,5)], ) as_egg_ggplot(p)
Display the plotly interactive version of the coverage plot (not shown here)
# Running this will display an interactive plot p$final_plot
Plot by condition:
p = Plot_Coverage( se = se, Event = res_limma.filtered$EventName[1], tracks = c("A", "B"), condition = "condition", stack_tracks = TRUE, t_test = TRUE, ) as_egg_ggplot(p)
# Reset to prevent interfering with next section colData(se)$condition = NULL colData(se)$batch = NULL
This section explains the working example provided in the prior "Quick-Start" section, to demonstrate how to use NxtIRF.
NxtIRF first needs to generate a set of reference files. The NxtIRF reference is used to quantitate IR and alternative splicing, as well as in downstream visualisation tools.
First, load the NxtIRF package:
library(NxtIRFcore)
NxtIRF generates a reference from a user-provided genome FASTA and genome annotation GTF file, and is optimised for Ensembl references but can accept other reference GTF files. Alternatively, NxtIRF accepts AnnotationHub resources, using the record names of AnnotationHub records as input.
We will first demonstrate a runnable example using the included example NxtIRF genome. This was created using 7 example genes (SRSF1, SRSF2, SRSF3, TRA2A, TRA2B, TP53 and NSUN5). The SRSF and TRA family of genes all contain poison exons flanked by retained introns. Additionally, NSUN5 contains an annotated IR event in its terminal intron. Sequences from these 7 genes were aligned into one sequence to create an artificial chromosome Z (chrZ). The gene annotations were modified to only contain the 7 genes with the modified genomic coordinates.
ref_path = file.path(tempdir(), "Reference") GetReferenceResource( reference_path = ref_path, fasta = chrZ_genome(), gtf = chrZ_gtf() ) BuildReference(reference_path = ref_path) # or equivalently, in a one-step process: BuildReference( reference_path = ref_path, fasta = chrZ_genome(), gtf = chrZ_gtf() )
The first function, GetReferenceResource(), requires 3 arguments:
(1) fasta
The file (or url) of the genome FASTA file,
(2) gtf
The file (or url) of the gene annotation GTF file, and
(3) reference_path
The directory (ideally an empty directory) to create
the NxtIRF reference
GetReferenceResource()
processes the given source genome / annotation files,
downloading them where necesssary, and saves a local copy of the GTF file as
well as a compressed version of the FASTA sequence file (as a TwoBitFile which
is a binary compressed version of the genome sequence).
The next function, BuildReference()
, uses the resources to build the NxtIRF
reference. NxtIRF builds its own version of alternative splicing annotations
by comparing differential combinations of splice junctions between the
transcripts of each gene. Events annotated include Skipped Exons (SE),
Mutually Exclusive Exons (MXE), Alternative First / Last Exons (AFE / ALE),
Alternative 5' or 3' Splice Sites (A5SS / A3SS), and Intron Retention (IR).
For IR events, every intron is considered as a potential Intron Retention event,
assuming most IR events are not annotated.
Additionally, BuildReference assesses the coding sequence changes that arise if each individual IR event occurs, annotating the IR events as NMD-inducing if the inclusion of the intron results in premature termination codons (PTCs, as defined by a stop codon located 55 bases upstream of the last exon junction).
NxtIRF adopts the IRFinder algorithm to measure IR in aligned BAM files. The IRFinder algorithm also provides spliced junction counts that is used by NxtIRF to quantitate alternate splicing events.
In this vignette, we provide 6 example BAM files. These were generated based on aligned RNA-seq BAMs of 6 samples from the Leucegene AML dataset (GSE67039). Sequences aligned to hg38 were filtered to only include genes aligned to that used to create the chrZ chromosome. These sequences were then re-aligned to the chrZ reference using STAR.
First, we download the example bam files using the following convenience function:
bam_path = file.path(tempdir(), "bams") # Copies NxtIRF example BAM files to `bam_path` subdirectory; returns BAM paths example_bams(path = bam_path)
Often, alignment pipelines process multiple samples. NxtIRF provides convenience
functions to recursively locate all the BAM files in a given folder, and tries
to ascertain sample names. Often sample names can be gleaned when:
The BAM files are named by their sample names, e.g. "sample1.bam",
"sample2.bam". In this case, level = 0
The BAM files have generic names but are contained inside parent directories
labeled by their sample names, e.g. "sample1/Unsorted.bam",
"sample2/Unsorted.bam". In this case, level = 1
# as BAM file names denote their sample names bams = Find_Bams(bam_path, level = 0) # In the case where BAM files are labelled using sample names as parent # directory names (which oftens happens with the STAR aligner), use level = 1
This convenience function retrieves a data frame with the first and second columns as the sample names and paths of all the BAM files found. We use this to run IRFinder on all the BAM files using the following:
irf_path = file.path(tempdir(), "IRFinder_output") IRFinder( bamfiles = bams$path, sample_names = bams$sample, reference_path = ref_path, output_path = irf_path, n_threads = 2, overwrite = FALSE, run_featureCounts = FALSE )
This runs IRFinder using 2 threads. Multi-threading is performed using OpenMP
(where available). If OpenMP is not installed when NxtIRFcore was compiled,
BiocParallel will be used instead. Setting overwrite = FALSE
ensures
if IRFinder files were generated by a previous run, these would not be
overwritten.
Also (optionally), users can generate gene counts from the same
reference using run_featureCounts = TRUE
(requires the Rsubread package).
The featureCounts output is stored in the "main.FC.Rds" file which can be
retrieved using below:
# Re-run IRFinder without overwrite, and run featureCounts require(Rsubread) IRFinder( bamfiles = bams$path, sample_names = bams$sample, reference_path = ref_path, output_path = irf_path, n_threads = 2, overwrite = FALSE, run_featureCounts = TRUE ) # Load gene counts gene_counts <- readRDS(file.path(irf_path, "main.FC.Rds")) # Access gene counts: gene_counts$counts
IRFinder produces output on individual samples. NxtIRF makes it easy to collate the output from multiple samples. As some splicing events may occur in some samples but not others, it is important to unify the data and organise it into a single data structure. NxtIRF uses a specialized class (called the NxtSE object) to contain multiple arrays of data. This is used to contain Included and Excluded event counts as well as associated quality control data (including intronic "coverage" which is the fraction of the intron covered by RNA-seq reads in IR events).
Having run IRFinder on all individual BAM files of an experiment, we can collate this data using the CollateData() function. Again, we use a convenience function to generate a list of IRFinder output files:
expr <- Find_IRFinder_Output(irf_path)
expr
is a 3-column data frame. The first two columns contain the sample
names and IRFinder output file path (as a ".txt.gz" - gzipped text file).
This contains all of the original output from vanilla IRFinder,
as well as sample QC readouts). Feel free to unzip a file to see what output
is generated.
The third column contains "COV" file paths. COV files are generated by NxtIRF and contains RNA-seq read coverage data in a novel compressed format. COV files compress data much better than BigWig files, and also contains RNA-seq coverage data from individual strands. Currently, only NxtIRF is able to read COV files but we anticipate other packages can use COV files via NxtIRF (in future NxtIRF releases)! We will demonstrate how to use COV files in a later section.
Now that we have an organised list of IRFinder output files, we parse this into
CollateData()
nxtirf_path = file.path(tempdir(), "NxtIRF_output") CollateData( Experiment = expr, reference_path = ref_path, output_path = nxtirf_path, IRMode = "SpliceOverMax", n_threads = 1 )
CollateData extracts IR quantitation and junction counts and organises these counts into unified arrays of data. CollateData also compiles QC parameters for all samples, including read depth and strandedness (directionality).
IRMode
is a parameter that specifies how IRFinder calculates percentage
intron retention (PIR). Previously, IRFinder estimates spliced (or exonic)
abundance by including junction reads that involve either flanking exon
SpliceMax = max(SpliceLeft, SpliceRight)
.
This is done to correct for the possibility
of alternate exons flanking the intron being assessed. NxtIRF extends this
estimate by accounting for the possibility that BOTH flanking exons may be
alternate exons, thereby accounting for splice events that overlap the intron
but does not involve the junctions of either flanking exons. We call this
parameter SpliceOverMax
, which includes SpliceMax
with the
addition of distant splice events.
Now that CollateData()
has created the unified data in the \code{"NxtSE"}
directory, we can retrieve the data as a NxtSE object:
se = MakeSE(nxtirf_path, RemoveOverlapping = TRUE)
Two things of note:
(1) By default, MakeSE constructs the NxtSE object using all the samples in
the collated data. It is possible (and particularly useful in large data sets)
to read only a subset of samples. In this case, construct a data.frame object
with the first column containing the desired sample names and parse this into
the colData
parameter as shown:
subset_samples = colnames(se)[1:4] df = data.frame(sample = subset_samples) se_small = MakeSE(nxtirf_path, colData = df, RemoveOverlapping = TRUE)
(2) In complex transcriptomes including those of human and mouse, alternative
splicing implies that introns are often overlapping. Thus, algorithms run the
risk of over-calling intron retention where overlapping introns are assessed.
NxtIRF removes overlapping introns by considering only introns belonging to
the major splice isoforms. It estimates a list of introns of major isoforms
by assessing the compatible splice junctions of each isoform, and removes
overlapping introns belonging to minor isoforms. To disable this functionality,
set RemoveOverlapping = FALSE
.
Often, the gene annotations contain isoforms for all discovered splicing events. Most annotated transcripts are not expressed, and their inclusion in differential analysis complicates results including adjusting for multiple testing. It is prudent to filter these out using various approaches, akin to removing genes with low gene counts in differential gene analysis. We suggest using the default filters which generally excludes splicing events whereby the total included / excluded event counts less than 20 RNA-seq reads. There are other quality-control filters included but these are out of the scope of this vignette (please see the documentation for details).
To filter the NxtSE object using default filters:
expressed_events <- apply_filters(se, get_default_filters()) se.filtered = se[expressed_events,]
In the above code, get_default_filters()
retrieves the default filters
suggested by NxtIRF. apply_filters
returns a vector containing which
events are included if the filters are applied to the NxtSE object. Finally,
we obtain a NxtSE object by subsetting the original NxtSE using the vector
of expressed_events
.
To view a description of what these filters actually do, simply display them:
get_default_filters()
To construct custom filters, create a NxtFilter object as shown:
f1 = NxtFilter(filterClass = "Data", filterType = "Depth", minimum = 20) f1
For more information about the filters available, refer to the manual for details:
?NxtFilter
Before performing differential analysis, we must annotate our samples. Currently there are no annotations:
colData(se.filtered)
To populate the NxtSE object with annotations, we can modify colData just as we would for a SummarizedExperiment object.
colData(se.filtered)$condition = rep(c("A", "B"), each = 3) colData(se.filtered)$batch = rep(c("K", "L", "M"), 2) # To display what colData is: as.data.frame(colData(se.filtered))
We can use either limma to perform differential alternative splicing analysis
require("limma") # Compare by condition-B vs condition-A res_limma_condition <- limma_ASE( se = se.filtered, test_factor = "condition", test_nom = "B", test_denom = "A", ) # Compare by condition-B vs condition-A, batch-corrected by "batch" res_limma_treatment_batchcorrected <- limma_ASE( se = se.filtered, test_factor = "condition", test_nom = "B", test_denom = "A", batch1 = "batch" )
NB: we can also use DESeq2 or DoubleExpSeq to perform the same analysis using
DESeq_ASE()
or DoubleExpSeq_ASE()
. The differences are as follows:
The following guide demonstrates how to plot standard figures including scatter plots of average percent-spliced-in (PSI) values, volcano plots of differentially expressed IR / AS events, and heatmaps of PSI values.
These functions return a data frame with the differential analysis results. These include average percent-spliced-in values for both conditions, which can be used directly to produce a scatter plot comparing the two conditions:
library(ggplot2) ggplot(res_limma_condition, aes(x = 100 * AvgPSI_B, y = 100 * AvgPSI_A)) + geom_point() + xlim(0, 100) + ylim(0, 100) + labs(title = "PSI values across conditions", x = "PSI of condition B", y = "PSI of condition A")
Note the columns in the results are defined by the names of the conditions "A" and "B".
To filter for specific splicing events (e.g. IR only), simply filter the results data.frame by EventType:
ggplot(subset(res_limma_condition, EventType == "IR"), aes(x = 100 * AvgPSI_B, y = 100 * AvgPSI_A)) + geom_point() + xlim(0, 100) + ylim(0, 100) + labs(title = "PIR values across conditions (IR Only)", x = "PIR of condition B", y = "PIR of condition A")
limma_ASE
and DESeq_ASE
contain direct output from limma or DESeq2.
The P.Value and pvalue of limma and DESeq2 output denote nominal P values,
whereas multiple-correction is performed via the adj.P.Val and padj columns of
limma and DESeq2, respectively. To plot these:
ggplot(res_limma_condition, aes(x = logFC, y = -log10(adj.P.Val))) + geom_point() + labs(title = "Differential analysis - B vs A", x = "Log2-fold change", y = "BH-adjusted P values (-log10)")
NxtIRF provides convenience functions to retrieve the splicing ratios via the NxtSE object. To produce a heatmap, first we create a matrix of splicing ratios of the top 10 differential events:
mat = make_matrix( se.filtered, event_list = res_limma_condition$EventName[1:10], method = "PSI" )
make_matrix()
has the option of supplying values as logit-transformed. This
is useful to contrast where the majority of values are near the 0- or 1-
boundary.
Also, when the dataset contains low-abundance splicing depths, samples with
splicing depths below a certain threshold can be excluded (replaced with NA).
Simply set depth_threshold
to a level below which sample-events will be
converted to NA. Also, events can be excluded from display if a certain fraction
of samples have low coverage (i.e. return NA values). This filter can be set
using na.percent.max
. This is useful as events with high number of NA
values can return errors in heatmap functions that also perform clustering
With the matrix of values, one can produce a heatmap as shown:
library(pheatmap) pheatmap(mat, annotation_col = as.data.frame(colData(se.filtered)))
NxtIRF is able to produce RNA-seq coverage plots of analysed samples. Coverage data is compiled simultaneous to the IR and junction quantitation performed by the IRFinder C++ routine. This data is saved in "COV" files, which is a BGZF compressed and indexed file. COV files show compression and performance gains over BigWig files.
Additionally, NxtIRF performs coverage plots of multiple samples combined based on user-defined experimental conditions. This is a powerful tool to illustrate group-specific differential splicing or IR. NxtIRF does this by normalising the coverage depths of each sample based on transcript depth at the splice junction / intron of interest. By doing so, the coverage depths of constitutively expressed flanking exons are normalised to unity. As a result, the intron depths reflect the fraction of transcripts with retained introns and can be compared across samples.
We will first demonstrate by plotting the RNA-seq coverage of a single gene
by a single sample. Plot_Coverage()
performs the calculations and generates
a compound object containing both static and interactive plots. We can coerce
this to a static plot using as_egg_ggplot()
res = Plot_Coverage( se = se.filtered, Gene = "TP53", tracks = colnames(se.filtered)[1] ) as_egg_ggplot(res)
The interactive plot (not run here) can be displayed by directly calling
the final_plot
element of the object returned by Plot_Coverage
# Running this will display an interactive plot res$final_plot
There are many transcripts in TP53! This is because by default, NxtIRF displays all annotated transcripts. For clarity, one can either collapse the transcripts at a per-gene level, by setting condense_tracks = TRUE:
res = Plot_Coverage( se = se.filtered, Gene = "TP53", tracks = colnames(se.filtered)[1], condense_tracks = TRUE ) as_egg_ggplot(res)
Alternatively, for fine control, one can supply a vector containing the transcript names to be displayed:
res = Plot_Coverage( se = se.filtered, Gene = "TP53", tracks = colnames(se.filtered)[1], selected_transcripts = c("TP53-201", "TP53-204") ) as_egg_ggplot(res)
In the heatmap in the previous section, we can see some retained introns in NSUN5 that are more highly expressed in "02H003" and "02H025". To demonstrate this, we first introduce a new condition that groups these two samples:
colData(se.filtered)$NSUN5_IR = c(rep("High", 2), rep("Low", 4))
Performing differential analysis will confirm this to be the case:
require("limma") res_limma_NSUN5 <- limma_ASE( se = se.filtered, test_factor = "NSUN5_IR", test_nom = "High", test_denom = "Low", ) head(res_limma_NSUN5$EventName)
We can now visualise the top hit, NSUN5 intron 8. To visualise the IR event across conditions, we specify the type of condition to contrast. The names of the tracks will be the names of the nominator (test) and denominator (control) conditions
res = Plot_Coverage( se = se.filtered, Event = res_limma_NSUN5$EventName[1], condition = "NSUN5_IR", tracks = c("High", "Low"), ) as_egg_ggplot(res)
Note the lines represent mean coverages of all samples in the labelled experimental conditions. These are achieved by normalising their individual coverages by their transcript abundance, which is calculated by summing the abundances of spliced and intron-retaining transcripts. Thus, the normalised coverage at the specified splice junction is approximately 1.0.
The grey shading represent 95% confidence intervals. It is possible that the shaded areas are larger as we traverse away from the splice junction. This may be because of differences in 3'-sequencing bias between different samples of the same condition.
Although NSUN5 intron 8 is the most differentially retained in the analysis, the difference isn't very apparent. This is because the difference between the absolute values is very small. In contrast, intron 2 looks more prominently different in the heatmap, so we can explore this as well.
We can further compare the coverage by plotting the two conditions on the same
track. This done by setting stack_tracks = TRUE
. Further, we can add a track
that tests the per-nucleotide statistical difference between the normalised
coverage between the two conditions. To do this, set t_test = TRUE
:
res = Plot_Coverage( se = se.filtered, Event = "NSUN5/ENST00000252594_Intron2/clean", condition = "NSUN5_IR", tracks = c("High", "Low"), stack_tracks = TRUE, t_test = TRUE ) as_egg_ggplot(res)
On the first track, the different coloured traces represent the two conditions. The second track plots the -log10 transformed p values, using Students T-test to test the difference between the coverages between the conditions, at per-nucleotide resolution.
Note that the t-test track does not replace formal statistical analysis using
the limma_ASE
or DESeq_ASE
functions. Instead, they provide a useful
adjunct to assess the adequacy of the normalisation of the group coverages.
There are more complex customisation options with NxtIRF's Plot_Coverage tool which will be explored in subsequent tutorials (TODO).
library(NxtIRFcore)
First, define the path to the directory in which the reference should be stored. This directory will be made by NxtIRF, but its parent directory must exist, otherwise an error will be returned.
ref_path = "./Reference"
Note that setting genome_path = "hg38"
will prompt NxtIRF to use the default
files for nonPolyA and Mappability exclusion references in the generation of its
reference. Valid options for genome_path
are "hg38", "hg19", "mm10" and "mm9".
BuildReference( reference_path = ref_path, fasta = "genome.fa", gtf = "transcripts.gtf", genome_type = "hg38" )
The following will first download the genome and gene annotation files from the online resource and store a local copy of it in a file cache, facilitated by BiocFileCache. Then, it uses the downloaded resource to create the NxtIRF reference.
FTP = "ftp://ftp.ensembl.org/pub/release-94/" BuildReference( reference_path = ref_path, fasta = paste0(FTP, "fasta/homo_sapiens/dna/", "Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"), gtf = paste0(FTP, "gtf/homo_sapiens/", "Homo_sapiens.GRCh38.94.chr.gtf.gz"), genome_type = "hg38" )
AnnotationHub contains Ensembl references for many genomes. To browse what is available:
require(AnnotationHub) ah = AnnotationHub() query(ah, "Ensembl")
For a more specific query:
query(ah, c("Homo Sapiens", "release-94"))
We wish to fetch "AH65745" and "AH64631" which contains the desired FASTA and GTF files, respectively. To build a reference using these resources:
BuildReference( reference_path = ref_path, fasta = "AH65745", gtf = "AH64631", genome_type = "hg38" )
BuildReference
will recognise the inputs of fasta
and gtf
as AnnotationHub
resources as they begin with "AH".
For human and mouse genomes, we highly recommend specifying genome_type
as
the default mappability file is used to exclude intronic regions with repeat
sequences from intron retention analysis. For other species, one could
generate a NxtIRF reference without this reference:
BuildReference( reference_path = ref_path, fasta = "genome.fa", gtf = "transcripts.gtf", genome_type = "" )
Alternatively, if STAR
is available on the computer or server where R/RStudio
is being run, we can use the one-line function BuildReference_Full
. This
function will:
Prepare the resources from the given FASTA and GTF files
Generate a STAR genome
Use the STAR genome and the FASTA file to de-novo calculate and define low
mappability regions
Build the NxtIRF reference using the genome resources and mappability file
BuildReference_Full( reference_path = ref_path, fasta = "genome.fa", gtf = "transcripts.gtf", genome_type = "", use_STAR_mappability = TRUE, n_threads = 4 )
n_threads
specify how many threads should be used to build the STAR reference
and to calculate the low mappability regions
Finally, if STAR
is not available, Rsubread
is available on Bioconductor to
perform mappability calculations. The example code in the manual is displayed
here for convenience, to demonstrate how this would be done:
# (1a) Creates genome resource files ref_path = file.path(tempdir(), "Reference") GetReferenceResource( reference_path = ref_path, fasta = chrZ_genome(), gtf = chrZ_gtf() ) # (1b) Systematically generate reads based on the NxtIRF example genome: Mappability_GenReads( reference_path = ref_path ) # (2) Align the generated reads using Rsubread: # (2a) Build the Rsubread genome index: setwd(ref_path) Rsubread::buildindex(basename = "./reference_index", reference = chrZ_genome()) # (2b) Align the synthetic reads using Rsubread::subjunc() Rsubread::subjunc( index = "./reference_index", readfile1 = file.path(ref_path, "Mappability", "Reads.fa"), output_file = file.path(ref_path, "Mappability", "AlignedReads.bam"), useAnnotation = TRUE, annot.ext = chrZ_gtf(), isGTF = TRUE ) # (3) Analyse the aligned reads in the BAM file for low-mappability regions: Mappability_CalculateExclusions( reference_path = ref_path, aligned_bam = file.path(ref_path, "Mappability", "AlignedReads.bam") ) # (4) Build the NxtIRF reference using the calculated Mappability Exclusions BuildReference(ref_path)
To use STAR
to align FASTQ files, one must be using a system with STAR
installed.
This software is not available in Windows. To check if STAR
is available:
STAR_version()
ref_path = "./Reference" # Ensure genome resources are prepared from genome FASTA and GTF file: if(!file.exists(file.path(ref_path, "resources"))) { GetReferenceResource( reference_path = ref_path, fasta = "genome.fa", gtf = "transcripts.gtf" ) } # Generate a STAR genome reference: STAR_BuildRef( reference_path = ref_path, n_threads = 8 )
STAR_align_fastq( STAR_ref_path = file.path(ref_path, "STAR"), BAM_output_path = "./bams/sample1", fastq_1 = "sample1_1.fastq", fastq_2 = "sample1_2.fastq", n_threads = 8 )
Experiment = data.frame( sample = c("sample_A", "sample_B"), forward = file.path("raw_data", c("sample_A", "sample_B"), c("sample_A_1.fastq", "sample_B_1.fastq")), reverse = file.path("raw_data", c("sample_A", "sample_B"), c("sample_A_2.fastq", "sample_B_2.fastq")) ) STAR_align_experiment( Experiment = Experiment, STAR_ref_path = file.path("Reference_FTP", "STAR"), BAM_output_path = "./bams", n_threads = 8, two_pass = FALSE )
To use two-pass mapping, set two_pass = TRUE
. We recommend disabling this
feature, as one-pass mapping is adequate in typical-use cases.
To conveniently find all BAM files recursively in a given path:
bams = Find_Bams("./bams", level = 1)
This convenience function returns the putative sample names, either from BAM
file names themselves (level = 0
), or from the names of their parent
directories (level = 1
).
To use IRFinder using 4 OpenMP threads:
# assume NxtIRF reference has been generated in `ref_path` IRFinder( bamfiles = bams$path, sample_names = bams$sample, reference_path = ref_path, output_path = "./IRFinder_output", n_threads = 4, Use_OpenMP = TRUE )
Sometimes one may wish to create a COV file from a BAM file without running the IRFinder algorithm. One reason might be because a NxtIRF/IRFinder reference is not available.
To convert a list of BAM files, run BAM2COV()
. This is a function structurally
similar to IRFinder()
but without the need to give the path to the NxtIRF
reference:
BAM2COV( bamfiles = bams$path, sample_names = bams$sample, output_path = "./IRFinder_output", n_threads = 4, Use_OpenMP = TRUE )
Assuming the NxtIRF reference is in ref_path
, after running IRFinder as shown
in the previous section, use the convenience function Find_IRFinder_Output()
to tabulate a list of samples and their corresponding IRFinder outputs:
expr = Find_IRFinder_Output("./IRFinder_output")
This data.frame can be directly used to run CollateData
:
CollateData( Experiment = expr, reference_path = ref_path, output_path = "./NxtIRF_output" )
Then, the collated data can be imported as a NxtSE
object, which is an object
that inherits SummarizedExperiment
and has specialized containers to hold
additional data required by NxtIRF.
se = MakeSE("./NxtIRF_output")
Please refer to chapters 1 and 2 for worked examples using the NxtIRF example dataset.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.