knitr::knit_hooks$set(time_it = local({ now <- NULL function(before, options) { if (before) { # record the current time before each chunk now <<- Sys.time() } else { # calculate the time difference after a chunk res <- difftime(Sys.time(), now, units = "secs") # return a character string to show the time paste("Time for this code chunk to run:", round(res, 2), "seconds") } } })) knitr::opts_chunk$set(dev = "png", dev.args = list(type = "cairo-png"), time_it=TRUE)
This document provides an introduction to the R
code to create typologies of trajectories using fuzzy and property-based clustering in sequence analysis using the cluster
and WeightedCluster R
library [@Studer2013]. Readers interested by the methods and looking for more information on the interpretation of the results are referred to:
You are kindly asked to cite the above reference if you use the methods presented in this document.
We start by loading the required library and setting the seed.
set.seed(1) library(WeightedCluster)
We rely on the biofam
dataset to illustrate the use of the WeightedCluster
package. This public dataset is distributed with the TraMineR R
package and code family trajectories of a subsample of the Swiss Household Panel [@SHPGroup2024].
First, we need to prepare the data by creating a state sequence object using the seqdef
command [@GabadinhoRitschardMullerStuder2011JSS]. This object stores all the information about our trajectories, including the data and associated characteristics, such as state labels. During this step, we further define proper state labels as the original data are coded using numerical values. We then plot the sequences using, for instance, a chronogram.
data(biofam) #load illustrative data ## Defining the new state labels statelab <- c("Parent", "Left", "Married", "Left/Married", "Child", "Left/Child", "Left/Married/Child", "Divorced") ## Creating the state sequence object, biofam.seq <- seqdef(biofam[,10:25], alphabet=0:7, states=statelab) seqdplot(biofam.seq, legend.prop=0.2)
The clustering algorithms discussed here relies on a distance matrix. We compute one using the "LCS"
distance.
diss <- seqdist(biofam.seq, method="LCS")
In fuzzy clustering, the sequences belong to each cluster with an estimated membership strength. This approach is particularly useful when some sequences are thought to lie in between two (or more) types of sequences (i.e., hybrid-type sequences) or when only weak structure is found in the data. Here, we use the fanny
fuzzy clustering algorithm [@Kaufman1990,@R-cluster], which is available in the cluster R
library. It is used as follows:
library(cluster) ## Loading the library fclust <- fanny(diss, k=5, diss=TRUE, memb.exp=1.5)
The fanny
function requires us to specify the distance matrix (our diss
object), the number of groups (argument k=5
), diss=TRUE
specifies that we directly provided a distance matrix, and memb.exp=1.5
the fuzziness parameters (values between 1.5 and 2 are often recommended).
Descriptive statistics of membership strength to each cluster provide useful information. The mean can be interpreted as a relative frequency of each cluster if sequences are weighted according to their membership strength.
summary(fclust$membership)
A graphical representation of the typology can be obtained by weighting each sequences according to its membership strength [@Studer2018]. Using this strategy, one can represent distribution plots (see the seqplot
function of the TraMineR
package). The fuzzyseqplot
function from the WeightedCluster
can be used to do so.
## Displaying the resulting clustering with membership threshold of 0.4 fuzzyseqplot(biofam.seq, group=fclust$membership, type="d")
Following a similar strategy, one can further order the sequences according to the membership strength (argument sortv="membership"
) in an index plot (argument type="I"
) and remove sequences with low membership strengths (argument membership.threashold =0.4
). In this plot, each sequence is represented using a thin line. The sequences at the top are the one with the highest membership strength. They therefore describe the best the full cluster membership.
## Displaying the resulting clustering with membership threshold of 0.4 fuzzyseqplot(biofam.seq, group=fclust$membership, type="I", membership.threashold =0.4, sortv="membership")
Fuzzy clustering membership matrix can be directly included in subsequent regression as an independent variable. To use it as a dependent variable, you mus use Dirichlet or Beta regression (see the full article for more explanation).
Dirichlet regressions are available in the DirichletReg
package [@R-DirichletReg]. Here is a sample regression with sex and birth year as independent variables.
library(DirichletReg) ##Estimation of Dirichlet Regression ##Dependent variable formatting fmember <- DR_data(fclust$membership) ## Estimation bdirig <- DirichReg(fmember~sex+birthyr|1, data=biofam, model="alternative") ## Displaying results of Dirichlet regression. summary(bdirig)
Beta regression is available in the betareg
package [@R-betareg]. To estimate a beta regression for the third fuzzy type, one can use.
library(betareg) ## Estimation of beta regression breg1 <- betareg(fclust$membership[, 3]~sex+birthyr, data=biofam) ## Displaying results summary(breg1)
The aim of property-based clustering is to build a sequence typology by identifying well-defined clustering rules that are based on the most relevant properties of the analyzed object. Here, we present the algorithm discussed in [@Studer2018], an extension of the algorithms of @Chavent.etal2007 and @Piccarreta2007.
Property-based clustering works in two steps. First, sequence properties are extracted. Second, the clustering is created based on the extracted properties. The following table presents the states sequences properties that can be extracted.
Name | Description
----------------|-----------------------------------------------
"state"
| The state in which an individual is found, at each time position $t$.
"spell.age"
| The age at the beginning of each spell of a given type.
"spell.dur"
| The duration of each of the spells presented above.
"duration"
| The total time spent in each state.
"pattern"
| Count of the frequent subsequences of states in the DSS.
"AFpattern"
| Age at the first occurrence of the above frequent subsequence.
"transition"
| Count of the frequent subsequence of events in each sequence, where each transition is considered another event.
"AFtransition"
| Age at the first occurrence of the above frequent subsequence.
"Complexity"
| Complexity index, number of transitions, turbulence.
Second, the clustering algorithm is run. In R
, the two steps are regrouped in the same function seqpropclust
. The first argument specifies the state sequence object. Then, the diss
argument specifies the distance matrix, maxcluster
the maximum number of clusters to consider, and properties
the list of properties to consider using the names in the above table.
pclust <- seqpropclust(biofam.seq, diss=diss, maxcluster=5, properties=c("state", "duration")) pclust
If GraphViz is installed on the computer, one can use seqtreedisplay
for a graphical representation of the results (not presented here).
seqtreedisplay(pclust, type="d", border=NA, showdepth=TRUE)
The cluster quality indices can be computed using the as.clustrange()
function as for hierarchical clustering algorithms.
pclustqual <- as.clustrange(pclust, diss=diss, ncluster=5) pclustqual
The clustering solution can then be accessed and plotted using the same procedure as for any cluster algorithms, e.g.:
seqdplot(biofam.seq, pclustqual$clustering$cluster4)
knitr::write_bib(file = 'packages.bib')
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.