knitr::knit_hooks$set(time_it = local({
  now <- NULL
  function(before, options) {
    if (before) {
      # record the current time before each chunk
      now <<- Sys.time()
    } else {
      # calculate the time difference after a chunk
      res <- difftime(Sys.time(), now, units = "secs")
      # return a character string to show the time
      paste("Time for this code chunk to run:", round(res, 2), "seconds")
    }
  }
}))
knitr::opts_chunk$set(dev = "png", dev.args = list(type = "cairo-png"), time_it = TRUE)
This document provides a very quick introduction to the R code needed to estimate the quality of a typology that is used in a subsequent regression, or when the relationship between the typology and a covariate is of key interest. Readers interested in the methods and the exact interpretation of the results are referred to:
You are kindly asked to cite the above reference if you use the methods presented in this document.
Let's start by setting the seed for reproducible results.
set.seed(1)
For this example, we use the mvad dataset. Let's start with the creation of the state sequence object.
## Loading the TraMineR library
library(TraMineR)
## Loading the data
data(mvad)
## State properties
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.lab <- c("employment", "further education", "higher education", "joblessness", "school", "training")
mvad.shortlab <- c("EM", "FE", "HE", "JL", "SC", "TR")
## Creating the state sequence object
mvad.seq <- seqdef(mvad, 17:86, alphabet = mvad.alphabet, states = mvad.shortlab, labels = mvad.lab, xtstep = 6)
We will now create a typology using cluster analysis. Readers interested in more detail are referred to the WeightedCluster library manual (also available as a vignette), which goes into the details of the creation and computation of cluster quality measures.
We start by computing dissimilarities with the seqdist function using the longest common subsequence (LCS) distance. We then use Ward clustering to create a typology of the trajectories. For this step, we recommend the use of the fastcluster library [@R-fastcluster], which considerably speeds up the computations.
## Using fastcluster for hierarchical clustering
library(fastcluster)
## Distance computation
diss <- seqdist(mvad.seq, method = "LCS")
## Hierarchical clustering
hc <- hclust(as.dist(diss), method = "ward.D")
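To make the dissimilarity used above more concrete, here is a minimal base R sketch of an LCS-based distance between two short sequences. It assumes the usual definition, in which the distance equals the sum of the two sequence lengths minus twice the length of their longest common subsequence. The helper names lcs_length and lcs_distance and the two toy sequences are hypothetical and only serve as an illustration; seqdist remains the function to use on real data.

```r
# Length of the longest common subsequence of two state vectors
# (standard dynamic programming; illustrative helper, not part of TraMineR)
lcs_length <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # tab[i + 1, j + 1] holds the LCS length of x[1:i] and y[1:j]
  tab <- matrix(0L, nrow = n + 1, ncol = m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      tab[i + 1, j + 1] <- if (x[i] == y[j]) {
        tab[i, j] + 1L
      } else {
        max(tab[i, j + 1], tab[i + 1, j])
      }
    }
  }
  tab[n + 1, m + 1]
}

# Assumed definition of the LCS distance: |x| + |y| - 2 * LCS(x, y)
lcs_distance <- function(x, y) {
  length(x) + length(y) - 2 * lcs_length(x, y)
}

# Two toy trajectories using the short state labels of this document
a <- c("SC", "SC", "FE", "EM", "EM")
b <- c("SC", "FE", "FE", "EM", "JL")
lcs_length(a, b)    # 3 (SC, FE, EM)
lcs_distance(a, b)  # 4
```

Two identical sequences get a distance of zero, and the distance grows with the number of states that cannot be matched in order.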
We can now compute several cluster quality indices with the as.clustrange function, for solutions ranging from two to ten groups.
# Loading the WeightedCluster library
library(WeightedCluster)
# Computing cluster quality measures.
clustqual <- as.clustrange(hc, diss = diss, ncluster = 10)
clustqual
clustassoc function

In this example, we will focus on the association between the father's unemployment status (funemp variable) and our school-to-work trajectories. The clustassoc function provides several indicators of the quality of a typology for studying this association.
The function takes a clustrange object as its first argument. The diss argument specifies the distance matrix used for clustering, covar the covariate of interest, and weights an optional vector of case weights.
cla <- clustassoc(clustqual, diss = diss, covar = mvad$funemp)
cla
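The call above does not use the optional weights argument. The following sketch shows what a weighted variant of the whole workflow might look like. It assumes that the weight column of mvad holds the case weights and that the same weights should be used for the clustering, the cluster quality measures and clustassoc. The object names ending in .w are hypothetical, and the state labels are left out of seqdef for brevity.

```r
library(TraMineR)
library(WeightedCluster)

data(mvad)
# Same sequence columns as in this document; state labels omitted for brevity
mvad.seq <- seqdef(mvad, 17:86)
diss <- seqdist(mvad.seq, method = "LCS")

# Assumption: mvad$weight holds the case weights
w <- mvad$weight

# Weighted Ward clustering: 'members' passes the weights to hclust
hc.w <- hclust(as.dist(diss), method = "ward.D", members = w)
clustqual.w <- as.clustrange(hc.w, diss = diss, weights = w, ncluster = 10)

# Same call as above, with the optional weights argument
cla.w <- clustassoc(clustqual.w, diss = diss, covar = mvad$funemp, weights = w)
cla.w
```

The results may differ from the unweighted ones, so the weighted and unweighted tables should not be mixed when choosing the number of groups.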
The resulting object presents three indicators. The Unaccounted column shows the share of the direct association between the trajectories and the covariate that is not accounted for by the typology. This computation is based on the discrepancy analysis framework [@StuderRitschardGabadinhoMuller2011SMR]. A low value means that the typology carries most of the information that is relevant for studying the association between our covariate and the trajectories.
The Remaining column presents the share of the overall variability of the trajectories that is not accounted for by the typology. A low value indicates that little variation is left unexplained by the typology. Note that this value is usually very low. The value presented in the "No clustering" row (the first) is equivalent to the pseudo-$R^2$ of a discrepancy analysis between the trajectories and the covariate.
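To illustrate what a pseudo-$R^2$ from a discrepancy analysis measures, here is a minimal base R sketch computed directly from a dissimilarity matrix. It assumes the usual definition of the discrepancy as the sum of all pairwise dissimilarities divided by the number of cases, with the dissimilarities playing the role of squared Euclidean distances. The function name pseudo_r2 and the toy data are hypothetical; this is not the code used internally by clustassoc.

```r
# Pseudo-R2 of a grouping variable, computed from a dissimilarity matrix
# (illustrative helper, not part of TraMineR or WeightedCluster)
pseudo_r2 <- function(d, group) {
  d <- as.matrix(d)
  # Discrepancy of a set of cases: sum over pairs of d[i, j], divided by n.
  # sum(d[idx, idx]) counts each pair twice, hence the division by 2.
  ss <- function(idx) sum(d[idx, idx]) / (2 * length(idx))
  all_cases <- seq_len(nrow(d))
  ss_total <- ss(all_cases)
  ss_within <- sum(vapply(split(all_cases, group), ss, numeric(1)))
  # Share of the total discrepancy explained by the grouping
  1 - ss_within / ss_total
}

# Toy check: with squared Euclidean distances on a numeric variable,
# the pseudo-R2 coincides with the R2 of a one-way ANOVA
set.seed(1)
x <- c(rnorm(15, mean = 0), rnorm(15, mean = 2))
g <- rep(c("a", "b"), each = 15)
d <- as.matrix(dist(x))^2
pseudo_r2(d, g)
summary(lm(x ~ g))$r.squared
```

With sequence data, d would be the matrix returned by seqdist and group the covariate, which gives the share of the variability of the trajectories explained by that covariate.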
The BIC column presents the Bayesian Information Criterion for the association between the typology and the covariate (again, the lower the better). While the first column provides the most reliable information, the BIC might be useful when parsimony is of key interest.
The general idea is to select a cluster solution with a low value on Unaccounted and, only if relevant, on BIC.
The results can be plotted to make it easier to find the minimum.
plot(cla, main="Unaccounted")
According to the plot, at least 6 groups are required. However, around one fifth of the association is still not reproduced by the clustering. It might be interesting to compare the 5- and 6-cluster solutions to better understand the association.
seqdplot(mvad.seq, group=clustqual$clustering$cluster5, border=NA)
seqdplot(mvad.seq, group=clustqual$clustering$cluster6, border=NA)
We can notice that the 6-cluster solution contains a new joblessness cluster, which is found to be important for studying the association between the father's unemployment and the school-to-work trajectories.
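Besides the plots, two cross-tabulations can help to see how a finer solution relates to a coarser one and to the covariate. The sketch below uses only base R. The helper name compare_solutions and the toy vectors are hypothetical and purely illustrative.

```r
# Compare a coarser and a finer cluster solution, and relate the finer one
# to a covariate (illustrative helper, not part of WeightedCluster)
compare_solutions <- function(coarse, fine, covar) {
  list(
    # Shows which coarse group is split to form the additional fine group
    split = table(coarse = coarse, fine = fine),
    # Row percentages: distribution of the covariate within each fine group
    covar_by_group = round(
      100 * prop.table(table(group = fine, covar = covar), margin = 1), 1
    )
  )
}

# Toy data: group 1 of the coarse solution is split into groups 1 and 3
coarse <- c(1, 1, 1, 1, 2, 2, 2, 2)
fine   <- c(1, 1, 3, 3, 2, 2, 2, 2)
covar  <- c("no", "no", "yes", "yes", "no", "yes", "no", "no")

res <- compare_solutions(coarse, fine, covar)
res$split
res$covar_by_group
```

With the objects of this document, the corresponding call would be compare_solutions(clustqual$clustering$cluster5, clustqual$clustering$cluster6, mvad$funemp).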
The presented method might lead to different recommendations than the usual cluster quality indices, because it focuses on the relationship with a covariate. It often suggests a higher number of groups.
knitr::write_bib(file = 'packages.bib')