We have created a resource having the VCF file of our study on asthma as previously described. The name of the resource is brge_vcf
the phenotypes are available in another resource called brge
that is a .txt file. The GWAS analysis is then perform as follows
We first start by preparing login data
builder <- newDSLoginBuilder() builder$append(server = "study1", url = "https://opal-demo.obiba.org", user = "dsuser", password = "password", resource = "RSRC.brge_vcf") logindata <- builder$build() conns <- datashield.login(logins = logindata, assign = TRUE, symbol = "res")
In this case we have to assign to different resources. One for the VCF (obesity_vcf) and another one for the phenotypic data (obesity). To this end, the datashield.assign.resource
function is required before assigning any object to the specific resource. Notice that the VCF resource can be load into R as a GDS thanks to our extension of existing resources in the r BiocStyle::CRANpkg("resourcer")
datashield.assign.resource(conns, symbol = "vcf.res", resource = list(study1 = "RSRC.brge_vcf")) datashield.assign.expr(conns, symbol = "gds", expr = quote(as.resource.object(vcf.res))) datashield.assign.resource(conns, symbol = "covars.res", resource = list(study1 = "RSRC.brge")) datashield.assign.expr(conns, symbol = "covars", expr = quote(as.resource.data.frame(covars.res)))
These are the objects available in the Opal server
ds.ls()
We can use r Githubpkg("datashield/dsBaseClient")
functions to inspect the variables that are in the covars
data.frame. The variables are
ds.colnames("covars")
The asthma
variable has this number of individuals at each level (0: controls, 1: cases)
ds.table("covars$asthma")
There may be interest in only studying certain genes, for that matter, the loaded VCF resource can be subsetting as follows
genes <- c("A1BG","A2MP1") ds.getSNPSbyGen("gds", genes)
The previous code will over-write the VCF with the SNPs corresponding to the selected genes, if the intention is to perform studies with both the complete VCF and a subsetted one, the argument name
can be used to create a new object on the server with the subsetted VCF, preserving the complete one.
genes <- c("A1BG","A2MP1") ds.getSNPSbyGen("gds", genes = genes, name = "subset.vcf")
Then, an object of class GenotypeData
must be created at the server side to perform genetic data analyses. This is a container defined in the r Biocpkg("GWASTools")
package for storing genotype and phenotypic data from genetic association studies. By doing that we will also verify whether individuals in the GDS (e.g VCF) and covariates files have the same individuals and are in the same order. This can be performed by
ds.GenotypeData(x='gds', covars = 'covars', columnId = 1, sexId = 2, male_encoding = "2", female_encoding = "1", newobj.name = 'gds.Data')
Before performing the association analyses, quality control (QC) can be performed to the loaded data. Three methodologies are available; 1) Principal Component Analysis (PCA) of the genomic data, 2) Hardy-Weinberg Equilibrium (HWE) testing and 3) Allelic frequency estimates. The QC methods 2 and 3 have as inputs a GenotypeData object, created with a covariates file that has a gender column; while method 1 has as input a VCF.
To perform the PCA, a pruning functionality is built inside so that redundant SNPs are discarted (there is an extra argument ld.threshold
which controls the pruning, more information about it at the SNPRelate documentation), speeding up the execution time
pca <- ds.PCASNPS("gds", prune = TRUE) plotPCASNPS(pca)
To perform QC methodologies 2 and 3, the name of the gender column as well as the keys to describe male or female have to be provided. Remember that we can visualize the names of the variables from our data by executing ds.colnames("covars")
. In our case, this variable is called "gender", and the levels of this variable are 1 for male and 2 for female as we can see here (NOTE: we cannot use ds.levels
since gender variable is not a factor):
ds.table1D("covars$gender")$counts
The HWE test can be performed to selected chromosomes using the argument chromosomes
, only the autosomes can be selected when performing a HWE test, the encoding of the autosomes present on the object can be fetched with
ds.genoDimensions('gds.Data')$chromosomes
Therefore, HWE can be performed by:
ds.exactHWE("gds.Data", chromosome = "20")
The same test can be conducted only for the control individuals (on this example taking the asthma covariates columnas as a cas/control) by:
ds.exactHWE("gds.Data", chromosome = "20", controls = TRUE, controls_column = "asthma")
Similarly, allele frequencies estimates can be estimated by:
ds.alleleFrequency("gds.Data")
In the future, more functions will be created to perform quality control (QC) for both, SNPs and inviduals.
Association analysis for a given SNP is performed by simply
ds.glmSNP(snps.fit = "rs11247693", model = asthma ~ gender + age, genoData='gds.Data')
The analysis of all available SNPs is performed when the argument snps.fit
is missing. The function performs the analysis of the selected SNPs in a single repository or in multiple repositories as performing pooled analyses (it uses ds.glm
DataSHIELD function). As in the case of transcriptomic data, analyzing all the SNPs in the genome (e.g GWAS) will be high time-consuming. We can adopt a similar approach as the one adopted using the r Biocpkg("limma")
at each server. That is, we run GWAS at each repository using specific and scalable packages available in R/Bioc. In that case we use the r Biocpkg("GWASTools")
and r Biocpkg("GENESIS")
packages. The complete pipeline is implemented in this function
ans.bioC <- ds.GWAS('gds.Data', model=asthma~age+country)
This close the DataSHIELD session
datashield.logout(conns)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.