knitr::opts_chunk$set( collapse = TRUE, cache=TRUE, autodep=TRUE, comment = "#>" )
BioData is a R6 R class to handle different kinds of biological data. At the moment a BioData object is for NGS expression data and will e.g. use DEseq2 for statistical analysis; a SingleCells object is meant to store SingleCell NGS data statistics for this type of data is not implemented any more; and finally Microarray is an old implemtaion using limma for the startistics.
BioData is an R6 class and therefore objects are modified by functions. Please keep that in mind when you use it.
``` {r, , echo=FALSE} library(BioData)
```r m = matrix(0,ncol=10, nrow=10) colnames(m) = paste('cell', 1:10) rownames(m) = paste('gene', 1:10) test <- as_BioData( m ) test reduceTo(test, what='row', to=rownames(test$dat)[1:5]) test
BioData uses S4 method dispatching and not the R6 object$method() syntax.
This document is not only for the user, but also for a more advanced user type that wants to understand the mechanics of the package and increase its usability.
In the first example you have seen that you can generate BioData objects from a simple marix objects.
But not only matrix objects can be used: you can create BioData objects from a seurat object, a cellexalvrR object or from simply Rsubread list objects. In addition you can also read sqlite databases as produced by my Chromium_SingleCell_Perl pipeline and the text files from a cellranger run. The function call is always
TM_Heart_10x <- as_BioData( system.file("extdata", "droplet_Heart_TabulaMuris.sqlite", package = "BioData")) TM_Heart_10x$name = 'TM_Heart_10x' TM_Heart_10x
The BioData object is containing many possible interesting datasets that I will document here:
I have two important data access function which actually use the R6 method style: 1. BioData$data() returns the most analyzed data in the obejct 2. BioData$rawData() returns the most primitive data in the dataset
The raw / processed data is stored in $raw, $dat and $zsored as sparse Matrix. There are two important data.tables $samples and $annotation that store the respective data and are always kept in sync to the three data tables.
In addition there are two lists that grow while the data is processed: 1. $stats every time a statistics is run the results get added to this list. The entries of this list are also in sync with the data tables. 2. $usedObj this list contains all kind of objects and will be discussed when needed.
Normally BioData objects contain e.g. one 10X sample run. When you want to compare your data to e.g. a second run of a similar sample or a opublished similar analysis, you need to merge these data parts together.
Used the merge function:
TM_Aorta_facs <- as_BioData( system.file("extdata", "facs_Aorta_TabulaMuris.sqlite", package = "BioData")) TM_Aorta_facs$name = 'TM_Aorta_facs' TM_Aorta_facs merged <- merge( TM_Heart_10x, TM_Aorta_facs ) merged
This will merge the raw data and all entries in the annotation and samples tables. In addition data in the usedObj list is stored. It will drop all normalized and z.scored data as these steps have to be re-calculated for the merged dataset. It will not keep any additional information like stats or the usedObj list of the source objects.
A BioData object can store different types of expression data. The type of data is linked to the sub class name. You can check the class of a BioData object by
class(BioDataObject)
The functions normalize, z.score and createStats applies different functions to the different classes.
```r{ class(merged)
class(merged) = c( 'MicroArray', 'BioData', 'R6')
class(merged) = c( 'SingleCells', 'BioData', 'R6')
We keep the SingleCells class here as that is what this dataset is. ## Normalization and z score Depending on the class of the BioData object different normalization steps are used: 1. SingleCells: Use re-sampling to normalize the reads. This normalization is loosing information as I have experienced, that this is the only way to make single cell data really comparable. 2. BioData: Use DEseq2 to normalize the data 3. MicroArray: Use DEseq2 to normalize the data (missing implementation of anything else) Example for our merged dataset here: ```r merged = TestData normalize(merged) z.score(merged) addCellCyclePhase(merged, gnameCol='gname' ) saveObj(merged) ## required to allow the starts to access the new Phase column runStats_inThread( merged, 'sname' ) runStats_inThread( merged, 'Phase' ) ## wait for a significant time (e.g. 10 min) runStats_inThread( merged, 'sname' ) runStats_inThread( merged, 'Phase' )
The normalization did efficiently remove any influence of the library size on the raw data. This is one of the main differences to e.g. Seurat which only removes the influence from the PCA data (correct me if I am wrong stefan.lang@med.lu.se).
In a normal setting there are multiple samples in one dataset and therefore the most straight forward analysis path is to check for differentail genes in this set.
I have experienced the cell cycle phase as a huge problem and therefore I always access the Seurat Phase and access differentials for this grouping, too:
merged
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.