The current duplicated()
method for MTuples
objects is a two-headed beast. When m
$< 3$ it uses the duplicated()
method for GRanges
objects, GenomicRanges:::.duplicated.GenomicRanges
, which is very fast because it uses custom written C code. However, when m
$\geq 3$, it basically uses base::duplicated.array
with MARGIN = 1
, which is very slow when there are a large number of rows.
What I would like is a fast method that finds duplicate rows of an (integer) matrix, regardless of the value of m
.
Details on accessing the Bioconductor SVN server are available here.
Many of the methods currently defined in the IRanges
package are currently being moved to the S4Vectors
package. I link to the current versions in the S4Vectors
package in this document.
## Create some test data (a matrix) # The matrix, x, has at least 3 columns, all of which contain integers. # The first column is an integer-encoding of a chromosome and so there are approximately 20-30 unique values. # The second column is an integer-encoding of a genomic strand and so there are at most 3 unique values (representing positive, negative or unknown/irrelevant). # The remaining columns are genomic positions, which are integers in the range of approximately 1-250,000,000. # sim_data adds 'd' duplicates as the last 'd' rows # n is the number of rows # m + 2 is the number of columns. m = 1 is the minimum. # d is the number of duplicates added to the end of the matrix # sim_strand is whether the strand is simulated (column 2 of the matrix) # NOTE: There might be additional duplicates in rows 1 to (n - d) sim_data <- function(n, m, d, sim_strand = FALSE){ if (d >= n){ stop("Require d < n") } i <- sample(n - d, d) chromosome <- sample(x = 24L, size = n - d, replace = TRUE) chromosome <- c(chromosome, chromosome[i]) if (sim_strand){ strand <- sample(x = 3L, size = n - d, replace = TRUE) } else{ strand <- rep(x = 3L, times = n - d) } strand <- c(strand, strand[i]) pos <- matrix(sort(sample(x = 250000000L, size = (n - d) * m, replace = TRUE)), ncol = m) pos <- rbind(pos, pos[i, , drop = FALSE], deparse.level = 0) cbind(chromosome, strand, pos, deparse.level = 0) } n <- 2000000 x_1 <- sim_data(n = n, m = 1, d = 100) x_2 <- sim_data(n = n, m = 2, d = 100) x_3 <- sim_data(n = n, m = 3, d = 100) x_8 <- sim_data(n = n, m = 8, d = 100) library(GenomicRanges) # Available from BioConductor matrix2GR <- function(x){ if (ncol(x) == 3L){ GRanges(seqnames = x[, 1], ranges = IRanges(start = x[, 3], width = 1), strand = ifelse(x[, 2] == 1L, '+', ifelse(x[, 2] == 2L, '-', '*'))) } else if (ncol(x) == 4L){ GRanges(seqnames = x[, 1], ranges = IRanges(start = x[, 3], width = x[, 4]), strand = ifelse(x[, 2] == 1L, '+', ifelse(x[, 2] == 2L, '-', '*'))) } else{ stop("Can't convert when m > 2") } } y_1 <- matrix2GR(x_1) y_2 <- matrix2GR(x_2)
base::duplicated.array(x, MARGIN = 1)
This method gives the correct results and will be used as the gold standard.
# Get the gold standard calls and benchmark system.time(z_1 <- duplicated(x_1, MARGIN = 1)) system.time(z_2 <- duplicated(x_2, MARGIN = 1)) system.time(z_3 <- duplicated(x_3, MARGIN = 1)) system.time(z_8 <- duplicated(x_8, MARGIN = 1))
GenomicRanges:::.duplicated.GenomicRanges(x)
hackThis method is very fast and gives identical results to the gold standard. However, it only works when m
= 1 or m = 2
.
# Check identical to gold standard identical(duplicated(y_1), z_1) identical(duplicated(y_2), z_2) # Benchmark system.time(duplicated(y_1)) system.time(duplicated(y_2))
rowSumsHash(x)
This uses a simple hash of each row, namely the rowSums
, to identify candidate duplicates. These candidates are then checked using the rigorous (but slow) base::duplicated.arraay
method.
# rowSums-based hash method rowSumsHash <- function(x){ z <- rowSums(x) d <- duplicated(z, fromLast = FALSE) | duplicated(z, fromLast = TRUE) # Need both fromLast = FALSE and fromLast = TRUE to ensure all duplicates get flagged d[d] <- duplicated(x[d, ]) return(d) } # Check identical to gold standard identical(z_1, rowSumsHash(x_1)) identical(z_2, rowSumsHash(x_2)) identical(z_3, rowSumsHash(x_3)) identical(z_8, rowSumsHash(x_8)) # Benchmark system.time(rowSumsHash(x_1)) system.time(rowSumsHash(x_2)) system.time(rowSumsHash(x_3)) system.time(rowSumsHash(x_8))
rowSumsHashCpp(x)
An alternative implementation of rowSumsHash
that does part of the work using Rcpp. Currently, this method still requires a call to base::duplicated.array
, although it might be possible to change this behaviour for an additional (small) speed-up.
rowSumsHashCpp <- function(x){ d <- rowSumsHashInternalCpp(x) d[d] <- duplicated(x[d, ]) return(d) } # Check identical to gold standard identical(z_1, rowSumsHashCpp(x_1)) identical(z_2, rowSumsHashCpp(x_2)) identical(z_3, rowSumsHashCpp(x_3)) identical(z_8, rowSumsHashCpp(x_8)) # Benchmark system.time(rowSumsHashCpp(x_1)) system.time(rowSumsHashCpp(x_2)) system.time(rowSumsHashCpp(x_3)) system.time(rowSumsHashCpp(x_8))
seqnames
, strand
, pos
(rather than x
) to function that computes duplicatessessionInfo()
sessionInfo()
GenomicRanges:::.duplicated.GenomicRanges
GenomicRanges:::.duplicated.GenomicRanges
finds duplicate "quads", which can be thought of as rows of a 4-column matrix (although the matrix isn't explicitly formed, I think). GenomicRanges:::.duplicated.GenomicRanges
calls the function IRanges:::duplicatedIntegerQuads
, which in turn calls custom C routines, the default being Integer_selfmatch4_hash
. I don't fully understand how this C routine works. But, from what I can tell, it hashes each "quad" to identify duplicates. Unfortunately, the C-level routine is hard-coded to work with "quads" and I can't see a simple way to generalise it.
There is a related function IRanges:::duplicatedIntegerPairs
, and set of associated C-level routines, which suggests to me that the approach taken in the C code isn't readily generalisable to an arbitrary number of "columns". Otherwise, I would expect that both IRanges:::duplicatedIntegerPairs
and IRanges:::duplicatedIntegerQuads
would be special cases of the same routine.
I hoped that I might be able to make sense of this fast C-level routine and extend it to work with matrices of arbitrary dimensions. I'm not very good at C programming, but, from what I can tell, it hashes each "quad" to identify duplicates. Unfortunately, the C-level routine is hard-coded to work with "quads" and I can't see a simple way to generalise it.
Posted question to the R and C++ Google Group on 11/05/2014. Received very helpful reply from Gabor Csardi https://groups.google.com/forum/#!topic/r-and-cpp/M8ySxwAgoBE. I basically used his idea of computing rowSums as a hash and then looking more carefully and duplicate hashes.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.