#' @title Grouping of sorted values into sets with smallest differences
#'
#' @description
#'
#' `groupConsecutive` groups **sorted** values in `x` for which the difference
#' is smaller than `maxDiff`. As a result, the mean difference between the
#' groups will always be larger than `maxDiff`, but difference between
#' individual values within the same group (e.g. between the first and last)
#' can be larger `maxDiff`.
#'
#' In detail, from the sorted `x`, the function starts from the smallest value
#' defining the first group as the one containing all values in `x` with a
#' difference to this first value which is `<= maxDiff`.
#' The next group is the defined based on the next larger value that is not part
#' of the first group and includes all values with a difference `<= maxDiff` to
#' this value. For values fulfilling this criteria but being already part of
#' a previous group, the differences to the mean value of the current group
#' and to the mean of previous groups are compared and values are assigned to
#' the group to which they have the smallest difference.
#'
#' Example: values `1.1, 1.9, 2.2` should be grouped with a `maxDiff = 1`. The
#' first group is defined to include all values for which the difference to the
#' first value (`1.1`) is smaller `maxDiff`. Thus, the first group is defined
#' to contain values `1.1 and 1.9`. Then the next group is defined based on the
#' next larger value not part of any group, `2.2`. This group contains values
#' `1.9` and `2.2` with the value `1.9` being already assigned to the first
#' group. The difference between this value `1.9` and the mean of the
#' current group (`mean(c(1.9, 2.2)`) is then compared to the difference of
#' `1.9` to the mean value of the group `1.9` is already part of
#' (which is `mean(c(1.1, 1.9))`). Since the difference to the second group is
#' smaller, `1.9` is removed from the first group and assigned to the second
#' one.
#'
#' @note
#'
#' The difference between consecutive (ordered) values within a defined group
#' is always `<= maxDiff`, but the difference between e.g. the first and the
#' last of the (ordered) values can be larger than `maxDiff`. See
#' [groupClosest()] for a more stringent grouping function.
#'
#' @param x `numeric` of values that should be grouped.
#'
#' @param maxDiff `numeric(1)` defining the threshold for difference between
#' values in `x` to be grouped into the same group.
#'
#' @return `integer` with the group assignment (values grouped together have
#' the same return value).
#'
#' @author Johannes Rainer
#'
#' @family grouping operations
#'
#' @export
#'
#' @examples
#'
#' ## The example described above
#' x <- c(1.1, 1.9, 2.2)
#' groupConsecutive(x)
#'
#' x <- c(1.1, 1.5, 1.7, 2.3, 2.7, 4.3, 4.4, 4.9, 5.2, 5.4, 5.8, 6, 7,
#' 9, 9.5, 15)
#'
#' groupConsecutive(x)
#' ## value 5.2 was initially grouped with 4.3 (because their difference is
#' ## smaller 1, but then re-grouped together with 5.4 because the difference
#' ## between 5.4 (the next value outside the group of 4.3) and 5.2 is smaller
#' ## than its difference to the mean value of the group for value 4.3
#'
#' ## Example for a case in which values are NOT grouped into the same group
#' ## even if the difference between them is <= maxDiff
#' a <- c(4.9, 5.2, 5.4)
#' groupConsecutive(a, maxDiff = 0.3)
groupConsecutive <- function(x, maxDiff = 1) {
if (is.unsorted(x)) {
idx <- order(x)
x <- x[idx]
} else idx <- integer()
x_len <- length(x)
x_groups <- rep(NA_integer_, x_len)
i <- 1
group_id <- 1
while(any(is.na(x_groups))) {
grp <- which(abs(x - x[i]) <= maxDiff)
## Check if they are already part of a previous group
not_in_prev_grp <- is.na(x_groups[grp])
in_prev_grp <- grp[!not_in_prev_grp]
## grp <- grp[not_in_prev_grp]
if (length(in_prev_grp)) {
## compare difference to current x[i] to mean of previous group(s)
## they are part of and assign them to the group with the closest
## difference
## i_diff <- abs(x[in_prev_grp] - x[i])
## Compare to the average of the current group.
i_diff <- abs(x[in_prev_grp] - mean(x[grp]))
prev_grp <- x_groups[in_prev_grp]
to_rem <- rep(FALSE, length(in_prev_grp))
for (j in unique(prev_grp)) {
j_diff <- abs(x[in_prev_grp] - mean(x[which(x_groups == j)]))
to_rem <- to_rem | j_diff < i_diff
}
grp <- c(in_prev_grp[!to_rem], grp[not_in_prev_grp])
}
x_groups[grp] <- group_id
group_id <- group_id + 1
i <- which.max(is.na(x_groups))
}
x_groups[idx] <- x_groups
x_groups
}
#' @title Group rows of a diagonal matrix using a threshold
#'
#' @description
#'
#' This function groups elements (rows or columns) of a diagonal matrix, such as
#' a pairwise correlation matrix or similarity matrix, with a value `>=
#' threshold`. This creates clusters of elements in which **all** elements have
#' a value `>= threshold` with **any** other element in that cluster. On a
#' correlation matrix (such as created with `cor`) it will generate small
#' clusters of highly correlated elements. Note however that single elements in
#' one cluster could also have a correlation `>= threshold` to another element
#' in another cluster. The average similarity to its own cluster will however
#' be higher to that of the other.
#'
#' @details
#'
#' The algorithm is defined as follows:
#' - all pairs of values in `x` which are `>= threshold` are identified and
#' sorted decreasingly.
#' - starting with the pair with the highest correlation, groups are defined:
#' - if none of the two is in a group, both are put into the same new group.
#' - if one of the two is already in a group, the other is put into the same
#' group if **all** correlations of it to that group are `>= threshold`
#' (and are not `NA`).
#' - if both are already in the same group nothing is done.
#' - if both are in different groups: an element is put into the group of the
#' other if a) all correlations of it to members of the other's group
#' are not `NA` and `>= threshold` **and** b) the average correlation to the
#' other group is larger than the average correlation to its own group.
#'
#' This ensures that groups are defined in which all elements have a correlation
#' `>= threshold` with each other and the correlation between members of the
#' same group is maximized.
#'
#' @param x symmetrix `numeric` `matrix`.
#'
#' @param threshold `numeric(1)` above which rows in `x` should be grouped.
#'
#' @param full `logical(1)` whether the full matrix should be considered, or
#' just the upper triangular matrix (including the diagonal).
#'
#' @param ... ignored.
#'
#' @return `integer` same length than `nrow(x)`, grouped elements (rows) defined
#' by the same value.
#'
#' @author Johannes Rainer
#'
#' @family grouping operations
#'
#' @export groupSimilarityMatrix
#'
#' @examples
#'
#' x <- rbind(
#' c(1, 0.9, 0.6, 0.8, 0.5),
#' c(0.9, 1, 0.7, 0.92, 0.8),
#' c(0.6, 0.7, 1, 0.91, 0.7),
#' c(0.8, 0.92, 0.91, 1, 0.9),
#' c(0.5, 0.8, 0.7, 0.9, 1)
#' )
#'
#' groupSimilarityMatrix(x, threshold = 0.9)
#'
#' groupSimilarityMatrix(x, threshold = 0.1)
#'
#' ## Add also a correlation between 3 and 2
#' x[2, 3] <- 0.9
#' x[3, 2] <- 0.9
#' x
#' groupSimilarityMatrix(x, threshold = 0.9)
#'
#' ## Add a higher correlation between 4 and 5
#' x[4, 5] <- 0.99
#' x[5, 4] <- 0.99
#' x
#' groupSimilarityMatrix(x, threshold = 0.9)
#'
#' ## Increase correlation between 2 and 3
#' x[2, 3] <- 0.92
#' x[3, 2] <- 0.92
#' x
#' groupSimilarityMatrix(x, threshold = 0.9) ## Don't break previous cluster!
groupSimilarityMatrix <- function(x, threshold = 0.9, full = TRUE, ...) {
nr <- nrow(x)
if (nr != ncol(x))
stop("'x' should be a symmetric matrix")
if (!full)
x[lower.tri(x)] <- NA
res <- rep(NA_integer_, nr)
sl <- seq_len(nr)
x[cbind(sl, sl)] <- NA
idx_pairs <- which(x >= threshold, arr.ind = TRUE)
idx_pairs <- idx_pairs[order(x[idx_pairs], decreasing = TRUE), ,
drop = FALSE]
grp_id <- 1
the_other <- c(2, 1)
for (i in seq_len(nrow(idx_pairs))) {
got_grp <- res[idx_pairs[i, ]]
nas <- is.na(got_grp)
if (any(nas)) {
## at least one of them is not in a group
if (sum(nas) == 2) {
## none of the two is in a group yet. Check:
## correlation above threshold for both with any other
## complete group?
grps <- unique(res[!is.na(res)])
mean_cor_to_grp <- integer(length(grps))
names(mean_cor_to_grp) <- grps
idx <- idx_pairs[i, ]
for (grp in grps) {
idx_grp <- which(res == grp)
cor_to_grp <- x[idx, idx_grp]
if (full)
cor_to_grp <- c(cor_to_grp, x[idx_grp, idx])
if (!(any(is.na(cor_to_grp)) ||
any(cor_to_grp < threshold)))
mean_cor_to_grp[grp] <- mean(cor_to_grp)
}
mean_cor_to_grp <- mean_cor_to_grp[mean_cor_to_grp > 0]
if (length(mean_cor_to_grp)) {
## yes: put them into the group with which both have the
## highest correlation.
res[idx] <- as.integer(
names(sort(mean_cor_to_grp, decreasing = TRUE)))[1L]
} else {
## no: add them as new group
res[idx] <- grp_id
grp_id <- grp_id + 1
}
} else {
## One is not in a group. Put that into the group of the
## other if a) no cor is NA and b) cor to all of the group
## are >= threshold.
idx <- idx_pairs[i, nas]
idx_grp <- which(res == got_grp[!nas])
cor_to_grp <- x[idx, idx_grp]
if (full)
cor_to_grp <- c(cor_to_grp, x[idx_grp, idx])
if (!(any(is.na(cor_to_grp)) || any(cor_to_grp < threshold)))
res[idx] <- got_grp[!nas]
}
} else {
## both are in a group
if (length(unique(got_grp)) > 1) {
grp_1 <- which(res == got_grp[1])
grp_2 <- which(res == got_grp[2])
cor_1_1 <- x[idx_pairs[i, 1], grp_1]
cor_1_2 <- x[idx_pairs[i, 1], grp_2]
cor_2_1 <- x[idx_pairs[i, 2], grp_1]
cor_2_2 <- x[idx_pairs[i, 2], grp_2]
if (full) {
cor_1_1 <- c(cor_1_1, x[grp_1, idx_pairs[i, 1]])
cor_1_2 <- c(cor_1_2, x[grp_2, idx_pairs[i, 1]])
cor_2_1 <- c(cor_2_1, x[grp_1, idx_pairs[i, 2]])
cor_2_2 <- c(cor_2_2, x[grp_2, idx_pairs[i, 2]])
}
mcor_1_1 <- mean(cor_1_1, na.rm = TRUE)
mcor_1_2 <- mean(cor_1_2)
mcor_2_1 <- mean(cor_2_1)
mcor_2_2 <- mean(cor_2_2, na.rm =TRUE)
## Put the elements into the group of the other, if its
## correlation to its group is larger than to its own group
if (is.finite(mcor_1_2) && is.finite(mcor_1_1) &&
!any(cor_1_2 < threshold) && mcor_1_2 >= mcor_1_1)
res[idx_pairs[i, 1]] <- got_grp[2]
else if (is.finite(mcor_2_1) && is.finite(mcor_2_2) &&
!any(cor_2_1 < threshold) && mcor_2_1 >= mcor_2_2)
res[idx_pairs[i, 2]] <- got_grp[1]
} # else nothing to do - they are already in the same group
}
}
nas <- is.na(res)
if (any(nas))
res[nas] <- seq(grp_id, length.out = sum(nas))
res
}
#' @title Group values with differences below threshold
#'
#' @description
#'
#' Group values with a difference between them being smaller than a user
#' defined threshold. This function uses the [groupSimilarityMatrix()] function
#' to create groups with smallest differences between its members. Differences
#' between **all** members of one group are below the user defined threshold
#' `maxDiff`. This is a more stringent grouping than what [groupConsecutive()]
#' performs leading thus to smaller groups (with smaller differences between
#' its members).
#'
#' @param x `numeric` of values that should be grouped.
#'
#' @param maxDiff `numeric(1)` defining the threshold for difference between
#' values in `x` to be grouped into the same group.
#'
#' @return `integer` with the group assignment (values grouped together have
#' the same return value).
#'
#' @author Johannes Rainer
#'
#' @family grouping operations
#'
#' @export
#'
#' @importFrom stats dist
#'
#' @examples
#'
#' x <- c(1.1, 1.9, 2.2)
#' groupClosest(x)
#' ## Although the difference between the 1st and 2nd element would be smaller
#' ## than the threshold, they are not grouped because the difference between
#' ## the 2nd and 3rd element is even smaller. The first element is also not
#' ## put into the same group, because it has a difference > diffRt to the 3rd
#' ## element.
#'
#' x <- c(1.1, 1.5, 1.7, 2.3, 2.7, 4.3, 4.4, 4.9, 5.2, 5.4, 5.8, 6, 7,
#' 9, 9.5, 15)
#'
#' groupClosest(x)
groupClosest <- function(x, maxDiff = 1) {
dists <- as.matrix(dist(x, method = "manhattan"))
groupSimilarityMatrix(-dists, threshold = -maxDiff, full = TRUE)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.