knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)

Progenetix is an open data resource that provides curated individual cancer copy number variation (CNV) profiles along with associated metadata sourced from published oncogenomic studies and various data repositories. This vignette offers a comprehensive guide on accessing and visualizing CNV frequency data within the Progenetix database. CNV frequency is pre-calculated based on CNV segment data in Progenetix and reflects the CNV pattern in a cohort. It is defined as the percentage of samples showing a CNV for a genomic region (1MB-sized genomic bins in this case) over the total number of samples in a cohort specified by filters.

If your focus lies in cancer cell lines, you can access data from cancercelllines.org by setting the domain parameter to "https://cancercelllines.org" in pgxLoader function. This data repository originates from CNV profiling data of cell lines initially collected as part of Progenetix and currently includes additional types of genomic mutations.

Load library

library(pgxRpi)
library(SummarizedExperiment) # for pgxmatrix data
library(GenomicRanges) # for pgxfreq data

pgxLoader function

This function loads various data from Progenetix database via the Beacon v2 API with some extensions (BeaconPlus).

The parameters of this function used in this tutorial:

Retrieve CNV frequency

The first output format (output = "pgxfreq")

freq_pgxfreq <- pgxLoader(type="cnv_frequency", output ="pgxfreq",
                         filters=c("NCIT:C3058","NCIT:C3493"))

freq_pgxfreq

The returned data is stored in GRangesList container which consists of multiple GRanges objects. Each GRanges object stores CNV frequency from samples pecified by a particular filter. Within each GRanges object, there are annotation columns that provide information on CNV frequencies for each row, expressed as percentage values across samples (%). These percentages represent level-specific gains and losses that overlap the corresponding genomic intervals.

These genomic intervals are derived from the partitioning of the entire genome (GRCh38). Most of these bins have a size of 1MB, except for a few bins located near the telomeres. In total, there are 3106 intervals encompassing the genome.

To access the CNV frequency data from specific filters, you could access like this

freq_pgxfreq[["NCIT:C3058"]]

To get metadata such as count of samples used to calculate frequency, use mcols function from GenomicRanges package:

mcols(freq_pgxfreq)

The second output format (output = "pgxmatrix")

Choose 8 NCIt codes of interests that correspond to different tumor types

code <-c("C3059","C3716","C4917","C3512","C3493","C3771","C4017","C4001")
# add prefix for query
code <- sub(".",'NCIT:C',code)

load data with the specified codes

freq_pgxmatrix <- pgxLoader(type="cnv_frequency",output ="pgxmatrix",filters=code)
freq_pgxmatrix

The returned data is stored in RangedSummarizedExperiment object, which is a matrix-like container where rows represent ranges of interest (as a GRanges object) and columns represent filters.

To get metadata such as count of samples used to calculate frequency, use colData function from SummarizedExperiment package:

colData(freq_pgxmatrix)

To access the CNV frequency matrix, use assay accesssor from SummarizedExperiment package

head(assay(freq_pgxmatrix,"lowlevel_cnv_frequency"))

The matrix has 6212 rows (genomic regions) and 8 columns (filters). The rows comprised 3106 intervals with “gain status” plus 3106 intervals with “loss status”.

The value is the percentage of samples from the corresponding filter having one or more CNV events in the specific genomic intervals. For example, if the value in the second row and first column is 8.65, it means that 8.65% samples from the corresponding filter NCIT:C3059 having one or more duplication events in the genomic interval in chromosome 1: 400000-1400000.

You could get the interval information by rowRanges function from SummarizedExperiment package.

rowRanges(freq_pgxmatrix)

Note: it is different from CNV fraction matrix introduced in Introduction_2_loadvariants. Value in this matrix is percentage (%) of samples having one or more CNVs overlapped with the binned interval while the value in CNV fraction matrix is fraction in individual samples to indicate how much the binned interval overlaps with one or more CNVs in one sample.

Calculate CNV frequency

segtoFreq function

This function computes the binned CNV frequency from segment data.

The parameters of this function:

Suppose you have segment data from several biosamples:

# access variant data
variants <- pgxLoader(type="g_variants",biosample_id = c("pgxbs-kftvhmz9", "pgxbs-kftvhnqz","pgxbs-kftvhupd"),output="pgxseg")
# only keep segment cnv data
segdata <- variants[variants$variant_type %in% c("DUP","DEL"),]
head(segdata)

You can then calculate the CNV frequency for this cohort based on the specified CNV calls. The output will be stored in the "pgxfreq" format.

In the following example, the CNV frequency is calculated based on "DUP" and "DEL" calls:

segfreq1 <- segtoFreq(segdata,cnv_column_idx = 6, cohort_name="c1")
segfreq1

For level-specific CNV calls, set the cnv_column_idx parameter to the column containing CNV states represented as EFO codes.

segfreq2 <- segtoFreq(segdata,cnv_column_idx = 9,cohort_name="c1")
segfreq2

Visualize CNV frequency

pgxFreqplot function

This function provides CNV frequency plots by genome or chromosomes as you request.

The parameters of this function:

' If specified, the frequencies are plotted with one panel for each chromosome. Default is NULL.

CNV frequency plot by genome

Input is pgxfreq object

pgxFreqplot(freq_pgxfreq, filters="NCIT:C3058")

Input is pgxmatrix object

pgxFreqplot(freq_pgxmatrix, filters = "NCIT:C3493")

CNV frequency plot by chromosomes

pgxFreqplot(freq_pgxfreq, filters='NCIT:C3058',chrom=c(7,9), layout = c(2,1))  

CNV frequency circos plot

The circos plot supports multiple group comparison

pgxFreqplot(freq_pgxmatrix,filters= c("NCIT:C3493","NCIT:C3512"),circos = TRUE) 

Session Info

sessionInfo()


progenetix/pgxRpi documentation built on Nov. 4, 2024, 11:31 p.m.