map_data: Convert single-cell data

View source: R/map_data.R

map_data    R Documentation

Convert single-cell data

Description

Convert a single-cell data object across species (via gene orthologs) or within species (via gene synonyms).

Usage

map_data(
  obj,
  gene_map = NULL,
  input_col = "input_gene",
  output_col = "ortholog_gene",
  standardise_genes = FALSE,
  input_species = NULL,
  output_species = input_species,
  method = c("homologene", "gprofiler", "babelgene"),
  drop_nonorths = TRUE,
  non121_strategy = "drop_both_species",
  agg_fun = "sum",
  mthreshold = Inf,
  as_sparse = TRUE,
  as_delayedarray = FALSE,
  sort_rows = FALSE,
  test_species = NULL,
  chunk_size = NULL,
  verbose = TRUE,
  ...
)

Arguments

obj

A single-cell data object belonging to one of the following classes:

  • SummarizedExperiment

  • SingleCellExperiment

  • Seurat

  • AnnData

  • Matrix or data.frame or data.table

gene_map

A data.frame that maps the current gene names to new gene names. This function's behaviour will adapt to different situations as follows:

  • gene_map=<data.frame> :
    When a data.frame containing the gene key:value columns (specified by input_col and output_col, respectively) is provided, this will be used to perform aggregation/expansion.

  • gene_map=NULL and input_species!=output_species :
    A gene_map is automatically generated by map_orthologs to perform inter-species gene aggregation/expansion.

  • gene_map=NULL and input_species==output_species :
    A gene_map is automatically generated by map_genes to perform within-species gene symbol standardization and aggregation/expansion.
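
To make the key:value idea concrete, here is a minimal base-R sketch of what "aggregation" means when several input genes map onto one output gene. It is not the package's internal code. The toy matrix, the gene symbols, and the use of rowsum() are assumptions chosen for illustration; only the column names input_gene and ortholog_gene come from the defaults above.

```r
# Toy expression matrix: rows are genes, columns are cells (illustrative values).
X <- matrix(
  c(1, 2,
    3, 4,
    5, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Gene_a", "Gene_b", "Gene_c"), c("cell1", "cell2"))
)

# Hypothetical mapping table using the default input_col / output_col names.
gene_map <- data.frame(
  input_gene    = c("Gene_a", "Gene_b", "Gene_c"),
  ortholog_gene = c("GENE1",  "GENE1",  "GENE2"),
  stringsAsFactors = FALSE
)

# Look up the new name for each row of X.
new_names <- gene_map$ortholog_gene[match(rownames(X), gene_map$input_gene)]

# Sum the rows that share a new name (many-to-one mappings collapse into one row).
X_mapped <- rowsum(X, group = new_names)
print(X_mapped)
```

Here Gene_a and Gene_b both map to GENE1, so their rows are summed into a single row, which mirrors the behaviour described for agg_fun = "sum".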

input_col

Column name within gene_map containing gene names that match the row names of X (the gene names of the input data).

output_col

Column name within gene_map with gene names that you wish to map the row names of X onto.

standardise_genes

If TRUE AND gene_output="columns", a new column "input_gene_standard" will be added to gene_df containing standardised HGNC symbols identified by gorth.

input_species

Name of the input species (e.g., "mouse","fly"). Use map_species to return a full list of available species.

output_species

Name of the output species (e.g. "human","chicken"). Use map_species to return a full list of available species.

method

R package to use for gene mapping:

  • "gprofiler" : Slower but more species and genes.

  • "homologene" : Faster but fewer species and genes.

  • "babelgene" : Faster but fewer species and genes. Also gives consensus scores for each gene mapping based on several different data sources.

drop_nonorths

Drop genes that don't have an ortholog in the output_species.

non121_strategy

How to handle genes that don't have 1:1 mappings between input_species:output_species. Options include:

  • "drop_both_species" or "dbs" or 1 :
    Drop genes that have duplicate mappings in either the input_species or output_species
    (DEFAULT).

  • "drop_input_species" or "dis" or 2 :
    Only drop genes that have duplicate mappings in the input_species.

  • "drop_output_species" or "dos" or 3 :
    Only drop genes that have duplicate mappings in the output_species.

  • "keep_both_species" or "kbs" or 4 :
    Keep all genes regardless of whether they have duplicate mappings in either species.

  • "keep_popular" or "kp" or 5 :
    Return only the most "popular" interspecies ortholog mappings. This procedure tends to yield a greater number of returned genes but at the cost of many of them not being true biological 1:1 orthologs.

  • "sum","mean","median","min" or "max" :
    When gene_df is a matrix and gene_output="rownames", these options will aggregate many-to-one gene mappings (input_species-to-output_species) after dropping any duplicate genes in the output_species.
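
The drop strategies above can be illustrated with a small base-R sketch. This is one reading of the option descriptions, not the package's implementation: the toy gene_map, the helper is_dup(), and the pairing of each filter with a strategy name are assumptions made for illustration.

```r
# Toy mapping with both kinds of non-1:1 relationships (hypothetical symbols).
gene_map <- data.frame(
  input_gene    = c("A1", "A2", "B1", "C1", "C1", "D1"),
  ortholog_gene = c("X",  "X",  "Y",  "Z1", "Z2", "W"),
  stringsAsFactors = FALSE
)

# TRUE wherever a value occurs more than once in the vector.
is_dup <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

dup_in  <- is_dup(gene_map$input_gene)     # one input gene, several output genes
dup_out <- is_dup(gene_map$ortholog_gene)  # several input genes, one output gene

# Keep only strict 1:1 pairs (analogous to "drop_both_species").
strict <- gene_map[!dup_in & !dup_out, ]

# Drop only rows duplicated on the input side (analogous to "drop_input_species").
no_dup_input <- gene_map[!dup_in, ]

# Drop only rows duplicated on the output side (analogous to "drop_output_species").
no_dup_output <- gene_map[!dup_out, ]

print(strict)
```

In this toy table only B1:Y and D1:W survive the strictest filter, while the two looser filters each retain one kind of duplicate for later aggregation or expansion.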

agg_fun

Aggregation function passed to aggregate_mapped_genes (default: "sum"). Set to NULL to skip the aggregation step.

mthreshold

Maximum number of ortholog names per gene to show. Passed to gorth. Only used when method="gprofiler" (DEFAULT : Inf).

as_sparse

Convert gene_df to a sparse matrix. Only works if gene_df is one of the following classes:

  • matrix

  • Matrix

  • data.frame

  • data.table

  • tibble

If gene_df is a sparse matrix to begin with, it will be returned as a sparse matrix (so long as gene_output= "rownames" or "colnames").

as_delayedarray

Convert aggregated matrix to DelayedArray.

sort_rows

Sort gene_df rows alphanumerically.

test_species

Which species to test for matches against. If set to NULL, will default to a list of human and 5 common model organisms. If test_species is set to one of the following options, it will automatically pull all species from that respective package and test against each of them:

  • "homologene" : 20+ species (default)

  • "gprofiler" : 700+ species

  • "babelgene" : 19 species

chunk_size

An integer indicating the number of cells to include per chunk. For on-disk data formats like AnnData, aggregating in chunks can be more memory-efficient and scalable than reading the entire matrix into memory at once (default: NULL).
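
The idea behind chunking can be sketched in base R as follows. This is an in-memory illustration under stated assumptions, not how the package reads on-disk data: the toy matrix, the groups vector, and the use of rowsum() are all hypothetical stand-ins for the real aggregation step.

```r
set.seed(1)

# Toy gene-by-cell count matrix with 4 genes and 10 cells.
X <- matrix(
  rpois(4 * 10, lambda = 2), nrow = 4,
  dimnames = list(c("g1", "g2", "g3", "g4"), paste0("cell", 1:10))
)

# Hypothetical new gene name for each row (g1 and g2 collapse into "A").
groups <- c("A", "A", "B", "C")
chunk_size <- 3

# Split the cell (column) indices into consecutive chunks of at most chunk_size.
idx <- seq_len(ncol(X))
chunks <- split(idx, ceiling(idx / chunk_size))

# Aggregate genes within each chunk of cells, then bind the chunks column-wise.
parts <- lapply(chunks, function(j) rowsum(X[, j, drop = FALSE], group = groups))
X_chunked <- do.call(cbind, unname(parts))

# The chunked result matches aggregating the whole matrix at once.
X_full <- rowsum(X, group = groups)
print(isTRUE(all.equal(X_chunked, X_full)))
```

Because gene aggregation is applied per cell, processing cells in chunks gives the same result as processing them all together, while only one chunk needs to be held in memory at a time.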

verbose

Print messages.

...

Additional arguments to be passed to gorth or homologene.

NOTE: To return only the most "popular" interspecies ortholog mappings, supply mthreshold=1 here AND set method="gprofiler" above. This procedure tends to yield a greater number of returned genes but at the cost of many of them not being true biological 1:1 orthologs.

For more details, please see here.

Value

An aggregated/expanded version of the input single-cell data object.

Examples

obj <- example_obj("ad")
obj2 <- map_data(obj = obj,
                 input_species = "human",
                 output_species = "mouse")

bschilder/scKirby documentation built on Oct. 2, 2024, 10:16 p.m.