knitr::opts_chunk$set( fig.path='figure/graphics-', cache.path='cache/graphics-', fig.align='center', external=TRUE, echo=TRUE, warning=FALSE # fig.pos="H" )
Thanks for checking out rTASSEL! In this document, we will go over the functionalities used to work with the TASSEL software via R.
TASSEL is a software package used to evaluate traits associations, evolutionary patterns, and linkage disequilibrium. Strengths of this software include:
The opportunity for a number of new and powerful statistical approaches to association mapping such as a General Linear Model (GLM) and Mixed Linear Model (MLM). MLM is an implementation of the technique which our lab's published Nature Genetics paper - Unified Mixed-Model Method for Association Mapping - which reduces Type I error in association mapping with complex pedigrees, families, founding effects and population structure.
An ability to handle a wide range of indels (insertion & deletions). Most software ignore this type of polymorphism; however, in some species (like maize), this is the most common type of polymorphism.
More information can be found in the following paper:
Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. (2007) TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633-2635.
Detailed documentation and source code can be found on our website:
https://www.maizegenetics.net/tassel
The main goal of developing this package is to construct an R-based front-end to connect to a variety of highly used TASSEL methods and analytical tools. By using R as a front-end, we aim to utilize a unified scripting workflow that exploits the analytical prowess of TASSEL in conjunction with R's popular data handling and parsing capabilities without ever having the user to switch between these two environments.
Due to the experimental nature of this package's lifecycle, end functionalities are prone to change after end-user input is obtained in the near future.
Since TASSEL is written in Java, Java JDK will need to be installed on your
machine. Additionally, for R to communicate with Java, the R package rJava
will need to be installed. Detailed information can be found using the
following links, depending on your OS:
On a MAC: if you previously had rJava
working through RStudio, then you
upgraded your Java and it now longer works, try the following:
At the command line type:
R CMD javareconf
Then check for a left over symbolic link via:
ls -ltr /usr/local/lib/libjvm.dylib
If the link exists, remove it, then create it fresh via these commands:
rm /usr/local/lib/libjvm.dylib sudo ln -s $(/usr/libexec/java_home)/lib/server/libjvm.dylib /usr/local/lib
You should now be able to enter RStudio and setup rJava.
After you have rJava
up and running on your machine, install rTASSEL
by
installing the source code from our BitBucket repository:
if (!require("devtools")) install.packages("devtools") devtools::install_bitbucket( repo = "bucklerlab/rtassel", ref = "master", build_vignettes = TRUE )
After source code has been compiled, the package can be loaded using:
library(rTASSEL)
Or, if you want to use a function without violating your environment you can
use rTASSEL::<function>
, where <function>
is an rTASSEL
function.
Before we begin analyzing data, optional parameters can be set up to make
rTASSEL
more efficient. To prevent your R console from being overloaded with
TASSEL logging information, it is highly recommended that you start a logging
file. This file will house all of TASSEL's logging output which is
beneficial for debugging and tracking the progress of your analytical workflow.
To start a logging file, use the following command:
rTASSEL::startLogger(fullPath = NULL, fileName = NULL)
If the rTASSEL::startLogger()
parameters are set to NULL
, the logging file
will be created in your current working directory. If you are unsure of what
your working directory is in R, use the base getwd()
command.
Since genome-wide association analyses can use up a lot of computational
resources, memory allocation to rTASSEL
can be modified. To change the amount
of memory, use the base options()
function and modify the following parameter:
options(java.parameters = c("-Xmx<memory>", "-Xms<memory>"))
Replace <memory>
with a specified unit of memory. For example, if I want to
allocate a maximum of 6 GB of memory for my operations, I would use the input
"-Xmx6g"
, where g
stands for gigabyte (GB). More information about memory
allocation can be found here.
Additionally, since this is a general walkthrough, certain intricaces of each
function may glossed over. If you would like to study a function in full,
refer to the R documentation by using ?<function>
in the console, where
<function>
is an rTASSEL
-based function.
Like TASSEL, rTASSEL
will read two main types of data:
This data can be read in several different ways. In the following examples,
we will demonstrate various ways genotype and phenotype information can be
loaded into rTASSEL
objects.
Currently, reading in genotype data to rTASSEL
is based off of file
locations as paths. Genotype/sequencing data can be stored in a variety of
formats. rTASSEL
can read and store a wide variety of file types:
To load this genotype data, simply store your file location as a string object
in R. For this example, we will load two toy data sets - one being a VCF file
and the other being a hapmap file. These data sets can be accessed via the
rTASSEL
package itself:
# Load hapmap data genoPathHMP <- system.file( "extdata", "mdp_genotype.hmp.txt", package = "rTASSEL" ) genoPathHMP # Load VCF data genoPathVCF <- system.file( "extdata", "maize_chr9_10thin40000.recode.vcf", package = "rTASSEL" ) genoPathVCF
Now that we have the file paths to this data, we can pass this to TASSEL and
create a formal TasselGenotypePhenotype
class object in R using the
following:
# Load in hapmap file tasGenoHMP <- rTASSEL::readGenotypeTableFromPath( path = genoPathHMP ) # Load in VCF file tasGenoVCF <- rTASSEL::readGenotypeTableFromPath( path = genoPathVCF )
When we call these objects, a summary of the data will be posted to the R console:
tasGenoHMP
This summary details the number of Taxa (Taxa
) and marker positions
(Positions
) within the data set. Additionally, since we can load both
genotype and phenotype information into this object, a helpful check will be
displayed to show what is populating the object ([x] or [ ]
).
In general, this S4 class data object houses "slot" information relating to TASSEL/Java pointers of the respective data.
class(tasGenoHMP) slotNames(tasGenoHMP)
Technically, this object does not contain the full information of the data
represented in R space, but merely contains addresses to the memory store of
the reference TASSEL object ID. For example, if we wanted to extract the
GenotypeTable
with the S4 @
operator, we would get something that looks
like this:
tasGenoHMP@jGenotypeTable
This entity is a rJava
internal identifier. It isn't until we call
downstream rTASSEL
functions where we will bring the TASSEL data into the R
environment.
Similar to reading in genotype data, phenotype data can also be read in via paths. If you already have preconstructed phenotype data in a file, this option will most likely work best for you. One caveat to this is how the data file is constructed in terms of columns and trait data for TASSEL analyses. More information about how these files can be found at this link under the Numerical Data section.
Loading this type of data is very similar to how genotype data is loaded.
here, we will use the rTASSEL::readPhenotypeFromPath()
function:
# Read from phenotype path phenoPath <- system.file("extdata", "mdp_traits.txt", package = "rTASSEL") phenoPath # Load into rTASSEL `TasselGenotypePhenotype` object tasPheno <- rTASSEL::readPhenotypeFromPath( path = phenoPath ) # Inspect object tasPheno
The object output is very similar to the genotype table output with some minor additions to which traits are displayed in the file.
In some cases you might want to first modify your phenotype data set in R and
then load it into the TASSEL environment. If you wish to choose this route, you
will need to use the rTASSEL::readPhenotypeFromDataFrame()
function along
with a couple of parameters. First, we will construct an R data frame and load
it with this function:
# Create phenotype data frame phenoDF <- read.table(phenoPath, header = TRUE) colnames(phenoDF)[1] <- "Taxon" # Inspect first few rows head(phenoDF) # Load into rTASSEL `TasselGenotypePhenotype` object tasPhenoDF <- rTASSEL::readPhenotypeFromDataFrame( phenotypeDF = phenoDF, taxaID = "Taxon", attributeTypes = NULL ) # Inspect new object tasPhenoDF
The phenotypeDF
parameter is for the R data frame object. The taxaID
parameter is needed to determine which column of your data frame is your
TASSEL taxa data. The final parameter (attributeTypes
) is optional. If this
parameter is set to NULL
, all remaining data frame columns will be classified
as TASSEL data
types. If this is not the case for your data (e.g. if you
have covariate or factor data in your experiment), you will need to
specify which columns are what TASSEL data type (i.e. data
, covariate
,
or factor
). This will have to be passed as an R vector of string elements
(e.g. c("data", "factor", "covariate")
). Currently, this data type needs to
be entered in the same order as they are found in the data frame.
In association studies, we are interested in combining our genotype and
phenotype data. To usually run this operation in TASSEL, an intersect
combination between the two data sets is needed. To run this in rTASSEL
, we
can use the rTASSEL::readGenotypePhenotype()
function. The parameter input
needed for this function is, of course, a genotype and phenotype object. For
genotype input, the following can be used:
TasselGenotypePhenotype
object For phenotype input, the following can be used:
TasselGenotypePhenotype
objectFor example, if we wanted to read the prior TasselGenotypePhenotype
genotype and phenotype objects from earlier:
tasGenoPheno <- rTASSEL::readGenotypePhenotype( genoPathOrObj = tasGenoHMP, phenoPathDFOrObj = tasPheno ) tasGenoPheno
We can also use a combination of the above parameter options (e.g. load a
genotype path and a phenotype data frame, etc.). One caveat though, if you
load in a phenotype data frame object with this function, the prior parameters
from the rTASSEL::readPhenotypeFromDataFrame
will be needed (i.e. the
taxaID
and attributeTypes
parameters):
tasGenoPhenoDF <- rTASSEL::readGenotypePhenotype( genoPathOrObj = genoPathHMP, phenoPathDFOrObj = phenoDF, taxaID = "Taxon", attributeTypes = NULL ) tasGenoPhenoDF
If you want to bring in genotype data into the R environment, you can use
the rTASSEL::getSumExpFromGenotypeTable()
function. All this function needs
is a TasselGenotypePhenotype
class object containing a genotype table:
tasSumExp <- rTASSEL::getSumExpFromGenotypeTable( tasObj = tasGenoPheno ) tasSumExp
As you can see above, the gentoype object is returned as a
SummarizedExperiment
class R object. More information about these objects can
be found here.
From this object, we extract taxa information and their respective TASSEL integer locations:
SummarizedExperiment::colData(tasSumExp)
We can also extract the allelic marker data using
SummarizedExperiment::rowData()
:
SummarizedExperiment::rowData(tasSumExp)
And get marker coordinates using SummarizedExperiment::rowRanges()
SummarizedExperiment::rowRanges(tasSumExp)
If you want to bring in phenotype data into the R environment, you can use
the rTASSEL::getPhenotypeDF()
function. All this function needs
is a TasselGenotypePhenotype
class object containing a phenotype table:
tasExportPhenoDF <- rTASSEL::getPhenotypeDF( tasObj = tasGenoPheno ) tasExportPhenoDF
As shown above, an R tibble
-based data frame is exported with converted
data types translated from TASSEL. See the following table what TASSEL data
types are tranlated into within the R environment:
| TASSEL Data | Converted R Data type | |:------------|:----------------------| | taxa | character | | data | numeric | | covariate | numeric | | factor | factor |
Prior to association analyses, filtration of genotype data may be necessary. In TASSEL, this accomplished through the Filter menu using two primary plugins:
In rTASSEL
, this can also be accomplished using the follwing functions:
rTASSEL::filterGenotypeTableSites()
rTASSEL::filterGenotypeTableTaxa()
These objects take a TasselGenotypePhenotype
class object. For example, in
our genotype data set, if we want to remove monomorphic and low coverage sites,
we could use the following parameters in rTASSEL::filterGenotypeTableSites()
:
tasGenoPhenoFilt <- rTASSEL::filterGenotypeTableSites( tasObj = tasGenoPheno, siteMinCount = 150, siteMinAlleleFreq = 0.05, siteMaxAlleleFreq = 1.0 ) tasGenoPhenoFilt
We can then compare this to our original pre-filtered data set:
tasGenoPheno
These functions can work on any TasselGenotypePhenotype
class object that
contains genotypic data, regardless of single or combined TASSEL objects.
In TASSEL, for mixed linear model analyses, a kinship matrix calculated from
genotype data is necessary. This can be accomplished by calculating a kinship
TASSEL object using the function rTASSEL::kinshipMatrix()
. The main
parameter input is a TasselGenotypePhenotype
class object that contains a
genotype data set:
tasKin <- rTASSEL::kinshipMatrix(tasObj = tasGenoPheno)
Since this object is essentially a TASSEL reference object, it is relatively not of use until association analysis. We can coerce this TASSEL object by using the following function:
# Get full R matrix tasKinRMat <- rTASSEL::kinshipToRMatrix(tasKin) # Inspect the first 5 rows and columns tasKinRMat[1:5, 1:5]
As you can see, we can take the initial TASSEL object and convert it into a
standard R object of matrix
class.
Very similar to kinship matrix calculation, a distance matrix can also be calculated using genotype data
tasDist <- rTASSEL::distanceMatrix(tasObj = tasGenoPheno)
And just like the kinship matrix, we can also coerce this into an R matrix
# Get full R matrix tasDistRMat <- rTASSEL::distanceToRMatrix(tasDist) # Inspect the first 5 rows and columns tasDistRMat[1:5, 1:5]
One of TASSEL's most powerful functionalities is its capability of performing a variety of different association modeling techniques. If you have started reading the walkthrough here it is strongly suggested that you read the other components of this walkthrough since the following parameters require what we have previously created!
If you are not familar with these methods, more information about how these operate in base TASSEL can be found at following links:
The rTASSEL::assocModelFitter()
function has several primary components:
tasObj
: a TasselGenotypePhenotype
class R objectformula
: an R-based linear model formulafitMarkers
: a boolean parameter to differentiate between BLUE and GLM
analyseskinship
: a TASSEL kinship objectfastAssociation
: a boolean parameter for data sets that have many traitsProbably the most important concept of this function is formula
parameter.
If you are familar with standard R linear model functions, this concept is
fairly similar. In TASSEL, a linear model is composed of the following scheme:
y ~ A
...where y
is any TASSEL data
type and A
is any TASSEL covariate
and / or factor
types:
<data> ~ <covariate> and/or <factor>
This model can be written out in several ways. With an example phenotype data, we can have the following variables that are represented in TASSEL in the following way:
Taxon
<taxa>
EarHT
<data>
dpoll
<data>
EarDia
<data>
location
<factor>
Q1
<covariate>
Q2
<covariate>
Q3
<covariate>
Using this data, we could write out the following formula in R
list(EarHT, dpoll, EarDia) ~ location + Q1 + Q2 + Q3
In the above example, we use a base list()
function to indicate analysis
on multiple numeric data types. For covariate and factor information, we use
+
operator. One problem with this implementation is that it can become
cumbersome and prone to error if we want to analyze the entirety of a large
data set or all data and/or factor and covariate types.
A work around for this problem is to utilize a special character to indicate
all elements within the model (.
). By using the .
operator we can simplify
the above model into the following:
. ~ .
This indicates we want to analyze the whole data set and leave nothing out. If we want to analyze all data types and only a handful of factor and/or covariates, we can use something like this:
. ~ location + Q1 + Q2
Or vice-versa:
list(EarHT, dpoll) ~ .
Take note we can be very specific with what we want to include in our trait
model! In the above example we have deliberately left out EarDia
from our
model.
Additionally, we can also fit marker and kinship data to our model which can
change our analytical methods. Since these options in TASSEL are binary,
additional parameters are passed for this function. In this case,
genotype/marker data is fitted using the fitMarker
parameter and kinship is
fitted using the kinship
parameter.
Fast Association implements methods described by Shabalin (2012). This method provides an ordinary least squares solution for fixed effect models. For this method to proper work it is necessary that your have:
NOTE: since we are working with "toy" data, empirical insight will not be elucidated upon in the following steps. This is simply to show the user how properly use these functions and the outputs that they give.
In the following examples, we will run example data and in return, obtain
TASSEL association table reports in the form of an R list
object containing
tibble
-based R data frames.
To caclulate best linear unbiased estimates (BLUEs), numeric phenotype data
can be used along with covariate and factor data only if it is intended to
control for field variation. Since genotype data is not needed for this
method, we can leave the fitMarkers
, kinship
, and fastAssociation
to
NULL
or FALSE
:
# Read in phenotype data phenoPathCov <- system.file("extdata", "mdp_phenotype.txt", package = "rTASSEL") tasPhenoCov <- rTASSEL::readPhenotypeFromPath(phenoPathCov) # Calculate BLUEs tasBLUE <- rTASSEL::assocModelFitter( tasObj = tasPhenoCov, formula = . ~ ., # <- All data is used! fitMarkers = FALSE, kinship = NULL, fastAssociation = FALSE ) # Return BLUE output tasBLUE
Similar to BLUEs, we can fit a generalized linear model (GLM) by simply
fitting marker data to our model. For this, we need a genotype data set
combined with our phenotype data in a TasselGenotypePhenotype
class object:
# Calculate GLM tasGLM <- rTASSEL::assocModelFitter( tasObj = tasGenoPheno, # <- our prior TASSEL object formula = list(EarHT, dpoll) ~ ., # <- only EarHT and dpoll are ran fitMarkers = TRUE, # <- set this to TRUE for GLM kinship = NULL, fastAssociation = FALSE ) # Return GLM output tasGLM
Adding to our complexity, we can fit a mixed linear model (MLM) by adding kinship to our analysis. In addition to the prior parameters, we will also need a TASSEL kinship object (see Create a kinship matrix object in the Analysis - Relatedness section):
# Calculate MLM tasMLM <- rTASSEL::assocModelFitter( tasObj = tasGenoPheno, # <- our prior TASSEL object formula = EarHT ~ ., # <- run only EarHT fitMarkers = TRUE, # <- set this to TRUE for GLM kinship = tasKin, # <- our prior kinship object fastAssociation = FALSE ) # Return GLM output tasMLM
Finally, we can run fast association analysis in our GLM model by setting
the fastAssociation
parameter to TRUE
. NOTE: this is only really
effective if you have many phenotype traits:
# Read data - need only non missing data! phenoPathFast <-system.file( "extdata", "mdp_traits_nomissing.txt", package = "rTASSEL" ) # Creat rTASSEL object - use prior TASSEL genotype object tasGenoPhenoFast <- rTASSEL::readGenotypePhenotype( genoPathOrObj = tasGenoHMP, phenoPathDFOrObj = phenoPathFast ) # Calculate MLM tasFAST <- rTASSEL::assocModelFitter( tasObj = tasGenoPhenoFast, # <- our prior TASSEL object formula = . ~ ., # <- run all of the phenotype data fitMarkers = TRUE, # <- set this to TRUE for GLM kinship = NULL, fastAssociation = TRUE # <- set this to TRUE for fast assoc. ) # Return GLM output tasFAST
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.