In progenetix/pgxRpi: R wrapper for Progenetix

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)

Progenetix is an open data resource that provides curated individual cancer copy number variation (CNV) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette provides a comprehensive guide on accessing genomic variant data within the Progenetix database.

If your focus lies in cancer cell lines, you can access data from cancercelllines.org by setting the domain parameter to "https://cancercelllines.org" in pgxLoader function. This data repository originates from CNV profiling data of cell lines initially collected as part of Progenetix and currently includes additional types of genomic mutations.

Load library

library(pgxRpi)
library(SummarizedExperiment) # for pgxmatrix data

`pgxLoader` function

This function loads various data from Progenetix database via the Beacon v2 API with some extensions (BeaconPlus).

The parameters of this function used in this tutorial:

type: A string specifying output data type. "g_variants" and "cnv_fraction" are used in this tutorial.
output: A string specifying output file format. The available options depend on the type parameter. When type is "g_variants", available options are NULL (default), "pgxseg", or "seg"; when type is "cnv_fraction", available options are NULL (default) or "pgxmatrix".
biosample_id: Identifiers used in the query database for identifying biosamples.
individual_id: Identifiers used in the query database for identifying individuals.
filters: Identifiers used in public repositories, bio-ontology terms, or custom terms such as c("NCIT:C7376", "PMID:22824167"). For more information about filters, see the documentation.
codematches: A logical value determining whether to exclude samples from child concepts of specified filters in the ontology tree. If TRUE, only samples exactly matching the specified filters will be included. Do not use this parameter when filters include ontology-irrelevant filters such as PMID and cohort identifiers. Default is FALSE.
limit: Integer to specify the number of returned profiles. Default is 0 (return all).
skip: Integer to specify the number of skipped profiles. E.g. if skip = 2, limit=500, the first 2*500 =1000 profiles are skipped and the next 500 profiles are returned. Default is NULL (no skip).
dataset: A string specifying the dataset to query from the Beacon response. Default is NULL, which includes results from all datasets.
save_file: A logical value determining whether to save variant data as a local file instead of direct return. Only used when the parameter type is "g_variants". Default is FALSE.
filename: A string specifying the path and name of the file to be saved. Only used if the parameter save_file is TRUE. Default is "variants" in current work directory.
num_cores: Integer to specify the number of cores used for the variant query. Only used when the parameter type is "g_variants". Default is 1.
domain: A string specifying the domain of the query data resource. Default is "http://progenetix.org".
entry_point: A string specifying the entry point of the Beacon v2 API. Default is "beacon", resulting in the endpoint being "http://progenetix.org/beacon".

Retrieve segment variants

Because of a time-out issue, segment variant data can only be accessed by biosample id instead of filters. To speed up this process, you can set the num_cores parameter for parallel processing. For more information about filters and how to get biosample ids, see the vignette Introduction_1_loadmetadata.

Get biosample id

# get 2 samples for demonstration
biosamples <- pgxLoader(type="biosamples", filters = "PMID:20229506", limit=2)

biosample_id <- biosamples$biosample_id

biosample_id

There are three output formats.

The first output format (by default)

The default output format extracts variant data from the Beacon v2 response, containing variant id and associated analysis id, biosample id and individual id. The CNV data is represented as copy number change class following the GA4GH Variation Representation Specification (VRS).

variant_1 <- pgxLoader(type="g_variants", biosample_id = biosample_id)
head(variant_1)

The second output format (`output` = "pgxseg")

This format accesses data from Progenetix API services. The '.pgxseg' file format contains segment mean values (in log2 column), which are equal to log2(copy number of measured sample/copy number of control sample (usually 2)). A few variants are point mutations represented by columns reference_bases and alternate_bases.

variant_2 <- pgxLoader(type="g_variants", biosample_id = biosample_id,output = "pgxseg")
head(variant_2)

The third output format (`output` = "seg")

This format accesses data from Progenetix API services. This format is similar to the general '.seg' file format and compatible with IGV tool for visualization. The only difference between this file format and the general '.seg' file format is the fifth column. It represents variant type in this format while in the general '.seg' file format, it represents number of probes or bins covered by the segment.

variant_3 <- pgxLoader(type="g_variants", biosample_id = biosample_id,output = "seg")
head(variant_3)

Export variants data for visualization

Setting save_file to TRUE in the pgxLoader function will save the retrieved variants data to a file rather than returning it directly. By default, the data will be saved in the current working directory, but you can specify a different path using the filename parameter. This export functionality is only available for variants data (when type='g_variants').

Upload 'pgxseg' file to Progenetix website

The following command creates a '.pgxseg' file with the name "variants.pgxseg" in "~/Downloads/" folder.

pgxLoader(type="g_variants", output="pgxseg", biosample_id=biosample_id, save_file=TRUE, 
          filename="~/Downloads/variants.pgxseg")

To visualize the '.pgxseg' file, you can either upload it to this link or use the byconaut package for local visualization when dealing with a large number of samples.

Upload '.seg' file to IGV

The following command creates a special '.seg' file with the name "variants.seg" in "~/Downloads/" folder.

pgxLoader(type="g_variants", output="seg", biosample_id=biosample_id, save_file=TRUE, 
          filename="~/Downloads/variants.seg")

You can upload this '.seg' file to IGV tool for visualization.

Retrive CNV fraction of biosamples

Because segment variants are not harmonized across samples, Progenetix provides processed CNV features, known as CNV fractions. These fractions represent the proportion of genomic regions overlapping one or more CNVs of a given type, facilitating sample-wise comparisons. The following query is based on filters, but biosample id and individual id are also available for sample-specific CNV fraction queries. For more information about filters, biosample id and individual id, as well as the use of parameters skip, limit, and codematches, see the vignette Introduction_1_loadmetadata.

Across chromosomes or the whole genome

cnv_fraction <- pgxLoader(type="cnv_fraction", filters = "NCIT:C2948")

This includes CNV fraction across chromosome arms, whole chromosomes, or the whole genome.

names(cnv_fraction)

The CNV fraction across chromosomal arms looks like this

head(cnv_fraction$arm_cnv_frac)[,c(1:4, 49:52)]

The row names are analyses ids from samples that belong to the input filter NCIT:C2948. There are 96 columns. The first 48 columns are duplication fraction across chromosomal arms, followed by deletion fraction. CNV fraction across whole chromosomes is similar, with the only difference in columns.

The CNV fraction across the genome (hg38) looks like this

head(cnv_fraction$genome_cnv_frac)

The first column is the total called fraction, followed by duplication fraction and deletion fraction.

Across genomic bins

cnvfrac_matrix <- pgxLoader(type="cnv_fraction", output="pgxmatrix", filters = "NCIT:C2948")

The returned data is stored in RangedSummarizedExperiment object, which is a matrix-like container where rows represent ranges of interest (as a GRanges object) and columns represent analyses derived from biosamples. The data looks like this

cnvfrac_matrix

You could get the interval information by rowRanges function from SummarizedExperiment package.

rowRanges(cnvfrac_matrix)

To access the CNV fraction matrix, use assay accesssor from SummarizedExperiment package

assay(cnvfrac_matrix)[1:3,1:3]

The matrix has 6212 rows (genomic regions) and 47 columns (analysis profiles derived from samples belonging to the input filter). The rows comprised 3106 intervals with “gain status” plus 3106 intervals with “loss status”.

The value is the fraction of the binned interval overlapping with one or more CNVs of the given type (DUP/DEL). For example, if the value in the second row, the first column is 0.2, it means that one or more duplication events overlapped with 20% of the genomic bin located in chromosome 1: 400000-1400000 in the first analysis profile.

To get associated biosample id and filters for analyses, use colData function from SummarizedExperiment package:

colData(cnvfrac_matrix)

analysis_id is the identifier for individual analysis, biosample_id is the identifier for individual biosample. It is noted that the number of analysis profiles does not necessarily equal the number of samples. One biosample id may correspond to multiple analysis ids. group_id corresponds to the meaning of filters.

Session Info

sessionInfo()

progenetix/pgxRpi documentation built on Nov. 4, 2024, 11:31 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

progenetix/pgxRpi
R wrapper for Progenetix

In progenetix/pgxRpi: R wrapper for Progenetix

Load library

`pgxLoader` function

Retrieve segment variants

Get biosample id

The first output format (by default)

The second output format (`output` = "pgxseg")

The third output format (`output` = "seg")

Export variants data for visualization

Upload 'pgxseg' file to Progenetix website

Upload '.seg' file to IGV

Retrive CNV fraction of biosamples

Across chromosomes or the whole genome

Across genomic bins

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

progenetix/pgxRpi R wrapper for Progenetix

In progenetix/pgxRpi: R wrapper for Progenetix

Load library

pgxLoader function

Retrieve segment variants

Get biosample id

The first output format (by default)

The second output format (output = "pgxseg")

The third output format (output = "seg")

Export variants data for visualization

Upload 'pgxseg' file to Progenetix website

Upload '.seg' file to IGV

Retrive CNV fraction of biosamples

Across chromosomes or the whole genome

Across genomic bins

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

progenetix/pgxRpi
R wrapper for Progenetix

`pgxLoader` function

The second output format (`output` = "pgxseg")

The third output format (`output` = "seg")