knitr::opts_chunk$set(
  comment = "##"
)

Installation

We recommend installing the stable release version of PhIPData in Bioconductor. This can be done using BiocManager:

if (!require("BiocManager"))
    install.packages("BiocManager")

BiocManager::install("PhIPData")

To load the package:

library(PhIPData)

Introduction

The PhIPData class is used to store experimental results from phage-immunoprecipitation sequencing (PhIP-set) experiments in a matrix-like container.

Building on the RangedSummarizedExperiment class, PhIPData contains all of functionality of SummarizedExperiments and includes additional operations to facilitate analysis with PhIP-seq data. Like SummarizedExperiments, a key feature of PhIPData is the coordination of metadata when subsetting PhIPData objects. For example, if you wanted to examine experimental data for peptides from one particular virus, you can subset the experimental data and the associated peptide annotation with one command. This ensures all metadata (for samples, peptides, etc.) remain synced with the experimental data throughout analysis.

Components of a PhIPData Object

As reflected in the figure below, the structure of a PhIPData object is nearly identical to the structure of a SummarizedExperiment/RangedSummarizedExperiment object.

Each object contains at least three assays of data. These assays are:

Though counts typically contain integer values for the number of reads aligned to each peptide, PhIPData only requires that stored values are non-negative numeric values. Pseudocounts or non-integer count values can also be stored in the counts assay.

The rows of a PhIPData object represent peptides of interest and the columns represent samples. Sample and peptide metadata are stored in DataFrames. Each row of the metadata DataFrame specifies the peptide/sample, and the columns represent different features associated with the peptides/samples.

In addition to sample- and peptide-specific metadata, experimental metadata such as associated papers, experimental parameters, sequencing dates, etc. are stored in a list-like component named metadata.

knitr::include_graphics("extras/PhIPData.png")

Creating a PhIPData object

library(dplyr)
library(readr)

To demonstrate the PhIPData class and functions, we will use a simulated example data set. Suppose we have PhIP-seq data from 5 individuals for 1288 peptides derived from known human viruses.

set.seed(20210120)

# Read in peptide metadata -------------
virscan_file <- system.file("extdata", "virscan.tsv", package = "PhIPData")
virscan_info <- readr::read_tsv(virscan_file,
                                col_types = readr::cols(
                                  pep_id = readr::col_character(),
                                  pro_id = readr::col_character(),
                                  pos_start = readr::col_double(),
                                  pos_end = readr::col_double(),
                                  UniProt_acc = readr::col_character(),
                                  pep_dna = readr::col_character(),
                                  pep_aa = readr::col_character(),
                                  pro_len = readr::col_double(),
                                  taxon_id = readr::col_double(),
                                  species = readr::col_character(),
                                  genus = readr::col_character(),
                                  product = readr::col_character()
                                )) %>%
  as.data.frame()

# Simulate data -------------
n_samples <- 5L
n_peps <- nrow(virscan_info)

counts_dat <- matrix(sample(1:1e6, n_samples*n_peps, replace = TRUE),
                 nrow = n_peps)
logfc_dat <- matrix(rnorm(n_samples*n_peps, mean = 0, sd = 10), 
                    nrow = n_peps)
prob_dat <- matrix(rbeta(n_samples*n_peps, shape1 = 1, shape2 = 1), 
                   nrow = n_peps)

# Sample metadata -------------
sample_meta <- data.frame(sample_name = paste0("sample", 1:n_samples),
                          gender = sample(c("M", "F"), n_samples, TRUE),
                          group = sample(c("ctrl", "trt", "beads"), n_samples, TRUE))

# Set row/column names -------------
rownames(counts_dat) <- rownames(logfc_dat) <-
  rownames(prob_dat) <- rownames(virscan_info) <-
  paste0("pep_", 1:n_peps)

colnames(counts_dat) <- colnames(logfc_dat) <-
  colnames(prob_dat) <- rownames(sample_meta) <-
  paste0("sample_", 1:n_samples)

# Experimental metadata -------------
exp_meta <- list(date_run = as.Date("2021/01/20"), 
                 reads_per_sample = colSums(counts_dat))

To create a PhIPData object, we will use the homonymous constructor PhIPData().

phip_obj <- PhIPData(counts_dat, logfc_dat, prob_dat, 
                     virscan_info, sample_meta, 
                     exp_meta)

phip_obj

The PhIPData() constructor is quite flexible; mismatched dimension names across assays and metadata are automatically corrected, and missing assays are initialized with empty matrices of the same dimensions. For more details on the constructor, type help(PhIPData).

Accessing and modifying components of PhIPData object

Assays

Assays store matrix-like data. For PhIPData objects, the assays counts, logfc, and prob are required. If any of these matrices were missing from the constructor, they are initialized with empty matrices of the same dimensions. Experimental data can be accessed via assays(phip_obj) or assay(phip_obj, i) command. assays(phip_obj) returns a list of all assays in the object, and list items can be accessed using the $ or [[ operators.

assays(phip_obj)

head(assays(phip_obj)$counts)   # Returns the same as assays(phip_obj)[["counts"]]

While assays(phip_obj) returns a list of all assays in the PhIPData object, assay(phip_obj, i) returns a matrix of the specified assay. If i is missing, assay(phip_obj) defaults to the first assay (counts). i can be a character specifying the assay name or a numeric index of the assay.

head(assay(phip_obj, "logfc"))   # Returns the same as assay(phip_obj, 2)

Since all PhIPData objects must contain the counts, logfc, and prob assays, we have defined three homonyous function to conveniently access and modify these assays.

head(counts(phip_obj))
head(logfc(phip_obj))
head(prob(phip_obj))

After a PhIPData object has been created, data for new and existing assays can be set using <-. Dimension names of the replacement assays are automatically corrected to be identical to the object names. As we expect assays to contain homogenous data, replacement assays are coerced into matrices. Replacement assays must also be on the same dimension of the existing object.

replacement_dat <- matrix(1, nrow = n_peps, ncol = n_samples)

# Replace the counts matrix -------------
head(counts(phip_obj))
counts(phip_obj) <- replacement_dat
head(counts(phip_obj))

# Add a new assay -----------
head(assay(phip_obj, "new_assay"))
assay(phip_obj, "new_assay") <- replacement_dat
head(assay(phip_obj, "new_assay"))

# Returns and error because `counts`, `logfc`, and `prob` must be in the 
# assays of a PhIPData object
assays(phip_obj) <- list(new_assay1 = replacement_dat, 
                         new_assay2 = replacement_dat)

Peptide metadata

Information associated with peptides can be accessed using peptideInfo(phip_obj) or the inherited rowRanges(phip_obj) function. Both functions return a GRanges object. GRanges objects behave similar to matrices and can be subsetted using the usual 2-dimensional methods. More information about GRanges objects can be found here.

Information about peptide positions in the protein sequence are stored as IRanges in the GRanges object. These are specified by pos_start and pos_end columns in the peptide metadata in the constructor. If these columns do not exist, the start and end positions are set to 0 by default.

# Only showing 2 columns for easier viewing
peptideInfo(phip_obj)[, 8:9]

Sample metadata

Sample metadata describing the samples can be extracted from the PhIPData object using sampleInfo(phip_obj) or the inherited function colData(phip_obj). Both functions return a DataFrame where each row corresponds to a sample and each column corresponds to some sample description.

sampleInfo(phip_obj)

Like in SummarizedExperiments/RangedSummarizedExperiments, sample metadata can be accessed using the $ operator from the object. As demonstrated in the Subsetting section, this makes subetting data for a subgroup of samples very easy.

phip_obj$group

Experimental metadata

Data associated with the experiment such as papers, date the samples were run, etc. can be accessed via the metadata(phip_obj) function. Experimental metadata is stored as a list, so it can be used to store any type of data.

metadata(phip_obj)

Common operations on PhIPData objects

Subsetting

Like subsetting matrices and dataframes, [ can be used for two-dimensional subsetting of PhIPData objects.

phip_obj[1:10, 1:2]

As described in the Sample metadata section, $ operates on the sample metadata column names, so we can also use $ to select samples of a particular subgroups.

phip_obj[, phip_obj$group == "beads"]

In addition to subsetting by row indices, PhIPData supports subsetting rows using peptide metadata information. subset(phip_obj, row_condition, column_condition) returns a PhIPData object with rows where the specified condition holds.

ebv_sub <- subset(phip_obj, grepl("Epstein-Barr virus", species))

peptideInfo(ebv_sub)[, "species"]

To subset all beads-only samples from a PhIPData object, we can use the convenient wrapper function subsetBeads():

subsetBeads(phip_obj)

PhIPData summaries

To assess the quality of the data, we are often interested in the number of reads per sample. This can be done using the librarySize() function. Names can be removed by setting the withDimnames parameter to FALSE.

librarySize(phip_obj)

The proportion of sample reads pulled by each peptide can also be obtained via propReads(). Like librarySize(), names can also be removed by setting withDimnames to FALSE.

head(propReads(phip_obj))

Using template libraries

Rather than re-importing peptide annotations, PhIPData allows the user to create and reuse existing libraries. To save a library for future use, we can use makeLibrary(object, name_of_library). The peptide metadata should be in a matrix-like form such as a DataFrame or data.frame.

# Save virscan_info as the human_virus library
makeLibrary(virscan_info, "human_virus")

The stored library can be accessed using getLibrary(library_name). We can then use the stored library to construct a new PhIPData object as follows.

PhIPData(counts = counts_dat, 
         sampleInfo = sample_meta,
         peptideInfo = getLibrary("human_virus"))

Libraries can be removed with the removeLibrary(library_name) command.

removeLibrary("human_virus")

Using aliases

Peptides are often derived from species with long virus names. To quickly search up all viruses with "[Hh]uman immunodeficiency virus" in the species, you can create an alias of "HIV" to encode for the corresponding regex of interest. Alias's can be managed with:

# Create alias for HIV ----------
setAlias("hiv", "[Hh]uman immunodeficiency virus")

# Use alias ----------
hiv_sub <- subset(phip_obj, grepl(getAlias("hiv"), species))
peptideInfo(hiv_sub)[, "species"]

# Remove alias from database -----------
deleteAlias("hiv")

# The following command returns an error that the virus does
# not exist in the alias database. 
subset(phip_obj, grepl(getAlias("hiv"), species))

Alias-pattern combinations are case senstive, so an entry with key "hiv" would differ from an entry with key "HIV." Like libraries, by default the location of the alias database is in the extdata folder of the PhIPData package. The location to an .rda file with the key-pattern values in a dataframe called alias can be retreived and specified using the getAliasPath() and setAliasPath() functions, respectively.

Coercion from PhIPData to other containers

As packages for identifying differential experession are also commonly used in analyzing PhIP-seq data, the PhIPData package supports coercion from PhIPData objects to Lists, lists, DataFrame, and DGELists. The function as(phip_obj, "object_type") converts a PhIPData object to a object of object type.

For DGELists, the group slot is automatically populated with the group column in the sample metadata, if such a column exists.

# PhIPData to DGEList
as(phip_obj, "DGEList")
as(phip_obj, "DataFrame")

While PhIPData objects can be roughly constructed from Lists, lists, and DGELists, the coercion functions only construct a bare bones PhIPData object (with only counts, logfc, and prob assays and sample and peptide information). Any additional list/object components may be discarded.

sessionInfo()

sessionInfo()


athchen/PhIPData documentation built on Feb. 10, 2022, 1:34 a.m.