Aim

The current duplicated() method for MTuples objects is a two-headed beast. When m $< 3$ it uses the duplicated() method for GRanges objects, GenomicRanges:::.duplicated.GenomicRanges, which is very fast because it uses custom written C code. However, when m $\geq 3$, it basically uses base::duplicated.array with MARGIN = 1, which is very slow when there are a large number of rows.

What I would like is a fast method that finds duplicate rows of an (integer) matrix, regardless of the value of m.

Some notes on software

Details on accessing the Bioconductor SVN server are available here.

Many of the methods currently defined in the IRanges package are currently being moved to the S4Vectors package. I link to the current versions in the S4Vectors package in this document.

Reproducible example

Example data

## Create some test data (a matrix)
# The matrix, x, has at least 3 columns, all of which contain integers.
# The first column is an integer-encoding of a chromosome and so there are approximately 20-30 unique values. 
# The second column is an integer-encoding of a genomic strand and so there are at most 3 unique values (representing positive, negative or unknown/irrelevant).
# The remaining columns are genomic positions, which are integers in the range of approximately 1-250,000,000.
# sim_data adds 'd' duplicates as the last 'd' rows
# n is the number of rows
# m + 2 is the number of columns. m = 1 is the minimum.
# d is the number of duplicates added to the end of the matrix
# sim_strand is whether the strand is simulated (column 2 of the matrix)
# NOTE: There might be additional duplicates in rows 1 to (n - d)
sim_data <- function(n, m, d, sim_strand = FALSE){
  if (d >= n){
    stop("Require d < n")
  }
  i <- sample(n - d, d)
  chromosome <- sample(x = 24L, size = n - d, replace = TRUE)
  chromosome <- c(chromosome, chromosome[i])
  if (sim_strand){
    strand <- sample(x = 3L, size = n - d, replace = TRUE)
  } else{
    strand <- rep(x = 3L, times = n - d)
  }
  strand <- c(strand, strand[i])
  pos <- matrix(sort(sample(x = 250000000L, size = (n - d) * m, replace = TRUE)), ncol = m)
  pos <- rbind(pos, pos[i, , drop = FALSE], deparse.level = 0)

  cbind(chromosome, strand, pos, deparse.level = 0)
}

n <- 2000000
x_1 <- sim_data(n = n, m = 1, d = 100)
x_2 <- sim_data(n = n, m = 2, d = 100)
x_3 <- sim_data(n = n, m = 3, d = 100)
x_8 <- sim_data(n = n, m = 8, d = 100)

library(GenomicRanges) # Available from BioConductor
matrix2GR <- function(x){
  if (ncol(x) == 3L){
    GRanges(seqnames = x[, 1], ranges = IRanges(start = x[, 3], width = 1), strand = ifelse(x[, 2] == 1L, '+', ifelse(x[, 2] == 2L, '-', '*')))
  } else if (ncol(x) == 4L){
    GRanges(seqnames = x[, 1], ranges = IRanges(start = x[, 3], width = x[, 4]), strand = ifelse(x[, 2] == 1L, '+', ifelse(x[, 2] == 2L, '-', '*')))
  } else{
    stop("Can't convert when m > 2")
  }
}

y_1 <- matrix2GR(x_1)
y_2 <- matrix2GR(x_2)

base::duplicated.array(x, MARGIN = 1)

This method gives the correct results and will be used as the gold standard.

# Get the gold standard calls and benchmark
system.time(z_1 <- duplicated(x_1, MARGIN = 1))
system.time(z_2 <- duplicated(x_2, MARGIN = 1))
system.time(z_3 <- duplicated(x_3, MARGIN = 1))
system.time(z_8 <- duplicated(x_8, MARGIN = 1))

GenomicRanges:::.duplicated.GenomicRanges(x) hack

This method is very fast and gives identical results to the gold standard. However, it only works when m = 1 or m = 2.

# Check identical to gold standard
identical(duplicated(y_1), z_1)
identical(duplicated(y_2), z_2)

# Benchmark
system.time(duplicated(y_1))
system.time(duplicated(y_2))

rowSumsHash(x)

This uses a simple hash of each row, namely the rowSums, to identify candidate duplicates. These candidates are then checked using the rigorous (but slow) base::duplicated.arraay method.

# rowSums-based hash method
rowSumsHash <- function(x){
  z <- rowSums(x)
  d <- duplicated(z, fromLast = FALSE) | duplicated(z, fromLast = TRUE) # Need both fromLast = FALSE and fromLast = TRUE to ensure all duplicates get flagged
  d[d] <- duplicated(x[d, ])
  return(d)
}

# Check identical to gold standard
identical(z_1, rowSumsHash(x_1))
identical(z_2, rowSumsHash(x_2))
identical(z_3, rowSumsHash(x_3))
identical(z_8, rowSumsHash(x_8))

# Benchmark
system.time(rowSumsHash(x_1))
system.time(rowSumsHash(x_2))
system.time(rowSumsHash(x_3))
system.time(rowSumsHash(x_8))

rowSumsHashCpp(x)

An alternative implementation of rowSumsHash that does part of the work using Rcpp. Currently, this method still requires a call to base::duplicated.array, although it might be possible to change this behaviour for an additional (small) speed-up.

rowSumsHashCpp <- function(x){
  d <- rowSumsHashInternalCpp(x)
  d[d] <- duplicated(x[d, ])
  return(d)
}

# Check identical to gold standard
identical(z_1, rowSumsHashCpp(x_1))
identical(z_2, rowSumsHashCpp(x_2))
identical(z_3, rowSumsHashCpp(x_3))
identical(z_8, rowSumsHashCpp(x_8))

# Benchmark
system.time(rowSumsHashCpp(x_1))
system.time(rowSumsHashCpp(x_2)) 
system.time(rowSumsHashCpp(x_3))
system.time(rowSumsHashCpp(x_8))

TODOs

sessionInfo()

sessionInfo()

Notes on GenomicRanges:::.duplicated.GenomicRanges

GenomicRanges:::.duplicated.GenomicRanges finds duplicate "quads", which can be thought of as rows of a 4-column matrix (although the matrix isn't explicitly formed, I think). GenomicRanges:::.duplicated.GenomicRanges calls the function IRanges:::duplicatedIntegerQuads, which in turn calls custom C routines, the default being Integer_selfmatch4_hash. I don't fully understand how this C routine works. But, from what I can tell, it hashes each "quad" to identify duplicates. Unfortunately, the C-level routine is hard-coded to work with "quads" and I can't see a simple way to generalise it.

There is a related function IRanges:::duplicatedIntegerPairs, and set of associated C-level routines, which suggests to me that the approach taken in the C code isn't readily generalisable to an arbitrary number of "columns". Otherwise, I would expect that both IRanges:::duplicatedIntegerPairs and IRanges:::duplicatedIntegerQuads would be special cases of the same routine.

I hoped that I might be able to make sense of this fast C-level routine and extend it to work with matrices of arbitrary dimensions. I'm not very good at C programming, but, from what I can tell, it hashes each "quad" to identify duplicates. Unfortunately, the C-level routine is hard-coded to work with "quads" and I can't see a simple way to generalise it.

Current status

Posted question to the R and C++ Google Group on 11/05/2014. Received very helpful reply from Gabor Csardi https://groups.google.com/forum/#!topic/r-and-cpp/M8ySxwAgoBE. I basically used his idea of computing rowSums as a hash and then looking more carefully and duplicate hashes.



PeteHaitch/cometh documentation built on May 8, 2019, 1:32 a.m.