Short R Tutorial: Sequence Analysis Typologies for Large Databases'
In WeightedCluster: Clustering of Weighted Data

Introduction

knitr::knit_hooks$set(time_it = local({
  now <- NULL
  function(before, options) {
    if (before) {
      # record the current time before each chunk
      now <<- Sys.time()
    } else {
      # calculate the time difference after a chunk
      res <- difftime(Sys.time(), now, units = "secs")
      # return a character string to show the time
      paste("Time for this code chunk to run:", round(res,
        2), "seconds")
    }
  }
}))
knitr::opts_chunk$set(dev = "png", dev.args = list(type = "cairo-png"), time_it=TRUE)

This document provides an introduction to the R code to create typologies of trajectories in large databases using the algorithm provided by the WeightedCluster R library [@Studer2013]. It also presents methods to evaluate the quality of the resulting clustering in large databases. Readers interested by the methods and looking for more information on the interpretation of the results are referred to:

Studer, M., R. Sadeghi and L. Tochon (2024). Sequence Analysis for Large Databases. LIVES Working Papers 104. https://doi.org/10.12682/LIVES.2296-1658.2024.104

You are kindly asked to cite the above reference if you use the methods presented in this document.

Warning!! To avoid lengthy computations (and overloading the CRAN server), we restricted the number of iterations and the sample size. We strongly recommend using much higher values.

We start by loading the required library and setting the seed.

set.seed(1)
library(WeightedCluster)

Data Preparation

We rely on the biofam dataset to illustrate the use of the WeightedCluster package. This public dataset is distributed with the TraMineR R package and code family trajectories of a subsample of the Swiss Household Panel [@SHPGroup2024].

First, we need to prepare the data by creating a state sequence object using the seqdef command [@GabadinhoRitschardMullerStuder2011JSS]. This object stores all the information about our trajectories, including the data and associated characteristics, such as state labels. During this step, we further define proper state labels as the original data are coded using numerical values. We then plot the sequences using, for instance, a chronogram.

data(biofam) #load illustrative data
## Defining the new state labels 
statelab <- c("Parent", "Left", "Married", "Left/Married",  "Child", 
            "Left/Child", "Left/Married/Child", "Divorced")
## Creating the state sequence object,
biofam.seq <- seqdef(biofam[,10:25], alphabet=0:7, states=statelab)
seqdplot(biofam.seq, legend.prop=0.2)

CLARA Clustering

CLARA for SA is available in the seqclararange function. It is used as follows:

bfclara <- seqclararange(biofam.seq, R = 50, sample.size = 100, kvals = 2:10, 
                         seqdist.args = list(method = "HAM"), parallel=TRUE, 
                         stability=TRUE)

The function requires us to specify the sequence to cluster (our biofam.seq sequence object), the number of iterations (argument R, here set to 50 to avoid long computation time, larger values are recommended) and the subsample size (argument sample.size, here again set to the low value of 100 for illustrative purpose). The number of groups in our typology is set using the kvals arguments, here between 2 and ten groups solutions. We directly specify a range of values that will be considered later on. Finally, we need to specify how to compute the distance between sequences through the seqdist.args argument as a list object. All the arguments specified here will be directly passed to the seqdist function. Therefore, any distance measures available in seqdist can be used here. Finally, we set stability=TRUE to estimate the stability of the clustering among the subsamples.

Setting parallel=TRUE, a default parallel back-end is set up using the future framework [@Bengtsson2021]. However, any parallel back-end previously defined with the plan function will be used when parallel=FALSE. The parallel protocol can then be adapted to specific environments, for instance some High Performance Computing (HPC) server relies on specific protocols (MPI,...). As implied by the name, setting progressbar=TRUE shows information (and estimated computation time) on the progress of the computations.

Cluster Quality Indices

The values of the medoid-based CQI are shown when the result is printed, or can be plotted using the plot command. When plotting the CQI, standardizing the values makes it easier to identify the best solution [@Studer2013].

bfclara
plot(bfclara, norm="range")

Except the PBM index, all indices favor a five-cluster solution. The resulting clustering is stored in the clustering element of the results. It can for instance be used to represent the sequences in each cluster as follows.

The stability of the clustering can also be plotted using either the average value, or the number of recoveries of a similar solution.

plot(bfclara, stat="stabmean")

Here again, the five-cluster solution shows the highest CQI values. However, the absolute number of recoveries is low for more than seven groups. A higher number of iterations is therefore recommended. This is not surprising as we used a low number of iterations.

plot(bfclara, stat="stability")

The bootclustrange function can be used to bootstrap the cluster quality measures. It has the following arguments. The first object is the clustering to be evaluated, as a seqclararange object, a data.frame or a vector. Second, the sequences that were used to create the typology. We should further specify the distance measure to use (as before), the number of bootstraps (R argument), and the subsample size (here 100). We would generally use higher values for this last two arguments. Finally, the parallel and progressbar arguments could be used as before. The resulting object is then printed and the standardized values of the CQI are plotted. The results lead to the same conclusion as for the medoid-based CQI.

    bCQI <- bootclustrange(bfclara, biofam.seq, seqdist.args = list(method = "HAM"), R = 50, sample.size = 100,  parallel=TRUE)
    bCQI
  plot(bCQI, norm="zscore")

Plotting the Typology

Once a clustering in a given number of groups has been selected, we can plot the sequences by cluster to give a better interpretation.

seqdplot(biofam.seq, group=bfclara$clustering$cluster5)

Fuzzy Clustering

By setting method="fuzzy" in the seqclararange function, the fuzzy version of the algorithm is used. It should be noted that the computations are generally longer than for crisp clustering. The CQI values are printed and plotted using the same commands.

bfclaraf <- seqclararange(biofam.seq, R = 50, sample.size = 100, kvals = 2:10, method="fuzzy",
                            seqdist.args = list(method = "HAM"), parallel=TRUE)
bfclaraf
plot(bfclaraf, norm="zscore")

The XB and AMS index favor a four or six cluster solution. Several plots are available to describe fuzzy clustering of trajectories, see @Studer2018 and the fuzzyseqplot function. Here, we use sequence index plot ordered by membership strength, with typical trajectories of each cluster represented at the top, leaving aside trajectories with low membership. In this kind of graphic, hybrid trajectories are represented at the bottom.

fuzzyseqplot(biofam.seq, group=bfclaraf$clustering$cluster4, type="I", sortv="membership", membership.threashold=0.4)

Any scripts or data that you put into this service are public.

WeightedCluster documentation built on April 24, 2025, 3:01 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

WeightedCluster
Clustering of Weighted Data

Short R Tutorial: Sequence Analysis Typologies for Large Databases'
In WeightedCluster: Clustering of Weighted Data

Introduction

Data Preparation

CLARA Clustering

Cluster Quality Indices

Plotting the Typology

Fuzzy Clustering

Try the WeightedCluster package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

WeightedCluster Clustering of Weighted Data

Short R Tutorial: Sequence Analysis Typologies for Large Databases' In WeightedCluster: Clustering of Weighted Data

Introduction

Data Preparation

CLARA Clustering

Cluster Quality Indices

Plotting the Typology

Fuzzy Clustering

Try the WeightedCluster package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

WeightedCluster
Clustering of Weighted Data

Short R Tutorial: Sequence Analysis Typologies for Large Databases'
In WeightedCluster: Clustering of Weighted Data