High throughput sequencing RNA-Seq data comes in a multitude of flavours, i.e. even from a single provider, protocol - e.g. strand specific, paired-end - reads characteristics - e.g. read length - will vary.
The easyRNASeq simpleRNASeq method will infer these information based on excerpts sampled from the data. However, it is always best to provide these information, as 1. the inference is done on small excerpt and can fail 2. it is always good to document an analysis
By default easyRNASeq simpleRNASeq will keep the inferred parameters over the user-provided parameters if these do not agree and emit corresponding warnings. The choice to rely on inferred parameters over user-provided one is to enforce user to cross-validate their knowledge of the data characteristics, as these are crucial for an adequate processing. Remember GIGO[^2].
If the automatic inference does fail, please let me know, so that I optimise it. Meanwhile, you can use the override argument to enforce the use of user-passed parameters.
To reproduce the results from Robinson, Delhomme et al. [@Robinson:2014p6362], we first need to download an excerpt of the data.
We first retrieve the file listing and md5 codes
download.file(url=paste0("ftp://ftp.plantgenie.org/Tutorials/RnaSeqTutorial/", "data/star/md5.txt"), destfile="md5.txt")
In this part of the vignette, we will NOT process all the data, albeit it would be possible, but for the sake of brevity, we will only retrieve the six first datasets. We get these from the sample information contained within this package.
data(RobinsonDelhomme2014) lapply(RobinsonDelhomme2014[1:6,"Filename"],function(f){ # BAM files download.file(url=paste0("ftp://ftp.plantgenie.org/Tutorials/", "RnaSeqTutorial/data/star/",f), destfile=as.character(f)) # BAM index files download.file(url=paste0("ftp://ftp.plantgenie.org/Tutorials/", "RnaSeqTutorial/data/star/",f,".bai"), destfile=as.character(paste0(f,".bai"))) })
# THIS IS A subset of the data (chr 19 only) used to speed up # the vignette creation while still testing capabilities data(RobinsonDelhomme2014) lapply(RobinsonDelhomme2014[1:6,"Filename"],function(f){ # BAM files file.copy( dir(vDir,pattern=paste0(as.character(f),"$"),full.names=TRUE) ,file.path(".",f)) # BAM index files file.copy( dir(vDir,pattern=paste0(as.character(f),".bai"),full.names=TRUE) ,file.path(".",paste0(f,".bai"))) })
These six files - as the rest of the dataset - have been sequenced on an Illumina HiSeq 2500 in paired-end mode using a non-strand specific library protocol with a read length of 100 bp. The raw data have been processed as described in the aforementioned guidelines[^1] and as such have been filtered for rRNA sequences, trimmed for adapters and clipped for quality. The resulting reads (of length 50-100bp) have then been aligned using STAR [Dobin:2013p5293].
Using these information, we finally generate the BamParam object.
bamParam <- BamParam(paired = TRUE, stranded = FALSE)
A third parameter yieldSize can be set to speed up the processing on multi-CPU or multi-core computers. It splits and processed the BAM files in chunk of size yieldSize with a default of 1M reads.
Finally, we create the list of BAM files we just retrieved.
bamFiles <- getBamFileList(dir(".","*\\.bam$"), dir(".","*\\.bai$"))
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.