seqdist {TraMineR}  R Documentation 
Computes pairwise dissimilarities between sequences or dissimilarities from a reference sequence. Several dissimilarity measures can be chosen, including optimal matching (OM) and many of its variants, distances based on the count of common attributes, and distances between sequence state distributions.
seqdist(seqdata, method, refseq = NULL, norm = "none", indel = 1.0, sm = NULL, with.missing = FALSE, full.matrix = TRUE, kweights = rep(1.0, ncol(seqdata)), tpow = 1.0, expcost = 0.5, context, link = "mean", h = 0.5, nu, transindel = "constant", otto, previous = FALSE, add.column = TRUE, breaks = NULL, step = 1, overlap = FALSE, weighted = TRUE, global.pdotj = NULL, prox = NULL)
seqdata 
State Sequence Object.
The sequence data to use.
It can be created with the seqdef function.
method 
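String.
The dissimilarity measure to use.
It can be "OM", "OMloc", "OMslen", "OMspell", "OMstran", "HAM", "DHD", "CHI2", "EUCLID", "LCS", "LCP", "RLCP", "NMS", "NMSMST", "SVRspell", or "TWED".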
refseq 
Default: NULL.
The most frequent sequence when set to 0.
An external sequence when a state sequence object.
norm 
String.
Default: "none".
indel 
Double or Vector of Doubles.
Default: 1.0.
The single state-independent insertion/deletion cost when a double.
The state-dependent insertion/deletion costs when a vector of doubles.

sm 
Default: NULL.
The substitution-cost matrix when a matrix.
The series of the substitution-cost matrices when an array.
The name of a cost method, such as "TRATE" or "CONSTANT", when a string.
Note: With 
with.missing 
Logical.
Default: FALSE.
full.matrix 
Logical.
Default: TRUE.
kweights 
Vector of Doubles.
Default: vector of 1.0s of length ncol(seqdata).
tpow 
Double.
Default: 1.0.
expcost 
Double.
Default: 0.5.
context 
Double.
Default: 
link 
String.
Default: "mean".
h 
Double.
Default: 0.5.
The exponential weight of spell length when method = "OMslen".
The gap penalty when method = "TWED".
nu 
Double.
Stiffness when method = "TWED".
transindel 
String.
Default: "constant".
otto 
Double.
The origin-transition trade-off weight when method = "OMstran".
previous 
Logical.
Default: FALSE.
add.column 
Logical.
Default: TRUE.
breaks 

step 
Integer.
Default: 1.
overlap 
Logical.
Default: FALSE.
weighted 
Logical.
Default: TRUE.
global.pdotj 
Numerical vector, 
prox 

The seqdist function returns a matrix of distances between sequences or a vector of distances from the reference sequence when refseq is set.
The available metrics (see the method option) include:
Edit distances: optimal matching ("OM"), localized OM ("OMloc"), spell-length-sensitive OM ("OMslen"), OM of spell sequences ("OMspell"), OM of transition sequences ("OMstran"), Hamming ("HAM"), dynamic Hamming ("DHD"), and the time warp edit distance ("TWED").
Metrics based on counts of common attributes: distances based on the longest common subsequence ("LCS"), on the longest common prefix ("LCP"), on the longest common suffix ("RLCP"), on the number of matching subsequences ("NMS"), on the number of matching subsequences weighted by the minimum shared time ("NMSMST"), and the subsequence vectorial representation distance ("SVRspell").
Distances between state distributions: Euclidean ("EUCLID") and Chi-squared ("CHI2").
See Studer and Ritschard (2014, 2016) for a description and comparison of the above dissimilarity measures, except "TWED", for which we refer to Marteau (2009) and Halpin (2014).
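The idea behind the distances between state distributions can be sketched in base R for the simplest case, a single interval covering the whole sequence. This is only an illustration and not TraMineR's implementation: the helper names state_dist, euclid_dist and chi2_dist are hypothetical, and the use of proportions of time spent in each state, with the Chi-squared weights given by an overall state distribution pdotj, is an assumption based on the usual definition of these distances.

```r
# Proportion of positions spent in each state of `alphabet` within x.
# (Hypothetical helper, not part of TraMineR.)
state_dist <- function(x, alphabet) {
  as.numeric(table(factor(x, levels = alphabet))) / length(x)
}

# Euclidean distance between the two state distributions.
euclid_dist <- function(x, y, alphabet) {
  sqrt(sum((state_dist(x, alphabet) - state_dist(y, alphabet))^2))
}

# Chi-squared distance: each squared difference is divided by the overall
# proportion of the state (pdotj), so differences on rare states weigh more.
# Assumption: states with pdotj == 0 are ignored.
chi2_dist <- function(x, y, alphabet, pdotj) {
  px <- state_dist(x, alphabet)
  py <- state_dist(y, alphabet)
  keep <- pdotj > 0
  sqrt(sum((px[keep] - py[keep])^2 / pdotj[keep]))
}

# Toy sequences over a two-state alphabet (illustrative data only).
alphabet <- c("A", "B")
x <- c("A", "A", "B", "B")
y <- c("A", "B", "B", "B")

# Overall state distribution pooled over both sequences.
pdotj <- state_dist(c(x, y), alphabet)

d_euclid <- euclid_dist(x, y, alphabet)
d_chi2   <- chi2_dist(x, y, alphabet, pdotj)
```

With all weights set to 1, chi2_dist reduces to euclid_dist, which shows that in this sketch the two measures differ only in how states are weighted.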
Each method can be controlled with the following parameters:
method          parameters
--------------  --------------------------------------------------
OM              sm, indel, norm, refseq
OMloc           sm, expcost, context, refseq
OMslen          sm, indel, link, h, refseq
OMspell         sm, indel, tpow, expcost, refseq
OMstran         sm, indel, transindel, otto, previous, add.column
HAM, DHD        sm, norm, refseq
CHI2            breaks, step, overlap, norm, weighted, global.pdotj
EUCLID          breaks, step, overlap, norm
LCS, LCP, RLCP  norm, refseq
NMS             prox, kweights, refseq
NMSMST          kweights, tpow, refseq
SVRspell        prox, kweights, tpow, refseq
TWED            sm, indel, h, nu, refseq
"LCS" is "OM" with a substitution cost of 2 (sm = "CONSTANT", cval = 2) and an indel of 1.0. "HAM" is "OM" without indels. "DHD" is "HAM" with specific substitution costs at each position.
"HAM" and "DHD" apply only to sequences of equal length.
When sm = NULL, the substitution-cost matrix is automatically created for "HAM" with a single substitution cost of 1 and for "DHD" with the costs derived from the transition rates at the successive positions.
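The relations between "OM", "LCS" and "HAM" stated above can be checked on a toy case with a minimal base-R sketch. The functions om_dist, lcs_length and ham_dist are hypothetical helpers written for this illustration, not TraMineR functions, and the "no indels" case is emulated here by an indel cost large enough that indels are never chosen.

```r
# Optimal matching (edit) distance with a constant substitution cost and a
# single indel cost. x and y are character vectors of states.
om_dist <- function(x, y, sub_cost = 2, indel = 1) {
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] is the cost of aligning the first i states of x
  # with the first j states of y.
  d <- matrix(0, nrow = n + 1, ncol = m + 1)
  d[, 1] <- (0:n) * indel
  d[1, ] <- (0:m) * indel
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (x[i] == y[j]) 0 else sub_cost
      d[i + 1, j + 1] <- min(
        d[i, j] + s,          # match or substitution
        d[i, j + 1] + indel,  # deletion from x
        d[i + 1, j] + indel   # insertion into x
      )
    }
  }
  d[n + 1, m + 1]
}

# Length of the longest common subsequence (standard dynamic program).
lcs_length <- function(x, y) {
  n <- length(x)
  m <- length(y)
  l <- matrix(0, nrow = n + 1, ncol = m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      l[i + 1, j + 1] <- if (x[i] == y[j]) {
        l[i, j] + 1
      } else {
        max(l[i, j + 1], l[i + 1, j])
      }
    }
  }
  l[n + 1, m + 1]
}

# Hamming distance: cost of position-wise mismatches, equal lengths only.
ham_dist <- function(x, y, sub_cost = 1) {
  stopifnot(length(x) == length(y))
  sum(x != y) * sub_cost
}

# Toy sequences (illustrative data only).
x <- c("A", "A", "B", "C", "C")
y <- c("A", "B", "B", "C", "D")

# LCS distance expressed through the LCS length ...
d_lcs <- length(x) + length(y) - 2 * lcs_length(x, y)
# ... equals OM with substitution cost 2 and indel 1.
d_om_as_lcs <- om_dist(x, y, sub_cost = 2, indel = 1)

# OM with a prohibitive indel cost equals the Hamming distance.
d_ham <- ham_dist(x, y)
d_om_as_ham <- om_dist(x, y, sub_cost = 1, indel = 100)
```

The equivalence with "LCS" holds because a substitution at cost 2 is never cheaper than one deletion plus one insertion at cost 1 each.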
Some distances can optionally be normalized by means of the norm argument. If set to "auto", Elzinga's normalization (similarity divided by the geometric mean of the two sequence lengths) is applied to "LCS", "LCP" and "RLCP" distances, while Abbott's normalization (distance divided by the length of the longer sequence) is used for "OM", "HAM" and "DHD". Elzinga's method can be forced with "gmean" and Abbott's rule with "maxlength". With "maxdist" the distance is normalized by its maximal possible value. For more details, see Gabadinho et al. (2009, 2011). Finally, "YujianBo" is the normalization proposed by Yujian and Bo (2007) that preserves the triangle inequality. The squares of the "CHI2" and "EUCLID" distances are normalized by the number of intervals and by the maximal distance on each interval. Note that for "CHI2" the maximal distance on each interval depends on the state distribution on the interval.
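The two rules named above can be written down directly from their verbal definitions. The sketch below is an illustration in base R, not TraMineR's code: norm_maxlength and norm_gmean are hypothetical helper names, and turning Elzinga's normalized similarity into a distance by subtracting it from 1 is an assumption made here.

```r
# Abbott's rule: distance divided by the length of the longer sequence.
norm_maxlength <- function(d, len_x, len_y) {
  d / max(len_x, len_y)
}

# Elzinga's rule: similarity divided by the geometric mean of the two
# sequence lengths. Assumption: the normalized distance is one minus
# that ratio.
norm_gmean <- function(similarity, len_x, len_y) {
  1 - similarity / sqrt(len_x * len_y)
}

# Toy values (illustrative only): a Hamming distance of 2 between two
# sequences of length 5, and an LCS of length 3 between sequences of
# lengths 4 and 9.
ham_normalized <- norm_maxlength(2, 5, 5)
lcs_normalized <- norm_gmean(3, 4, 9)
```

Here the similarity passed to norm_gmean would be the length of the longest common subsequence, prefix or suffix, depending on the method.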
When sequences contain gaps and the gaps = NA option was passed to seqdef (i.e. when there are non-deleted missing values), the with.missing argument should be set to TRUE. If left as FALSE, the function stops when it encounters a gap. This is to make the user aware that there are gaps in the sequences. For methods that need an sm value, seqdist expects a substitution-cost matrix with a row and a column entry for the missing state (symbol defined with the nr option of seqdef). Substitution-cost matrices returned by seqcost (and so seqsubm) include these additional entries when the function is called with with.missing = TRUE. More details on how to compute distances with sequences containing gaps can be found in Gabadinho et al. (2009).
When refseq is NULL (default), the function returns the whole matrix of pairwise distances between sequences or, if full.matrix = FALSE, the corresponding dist object. Otherwise, it returns a vector of distances between the sequences in the state sequence object and the reference sequence specified with refseq.
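The full matrix and the dist object carry the same pairwise information. A minimal base-R sketch, using a small hypothetical distance matrix in place of actual seqdist output, shows how to move between the two forms with as.dist and as.matrix.

```r
# Hypothetical symmetric distance matrix for three sequences, standing in
# for what seqdist returns with full.matrix = TRUE (values are made up).
ids <- c("s1", "s2", "s3")
full <- matrix(c(0, 2, 4,
                 2, 0, 6,
                 4, 6, 0),
               nrow = 3, byrow = TRUE,
               dimnames = list(ids, ids))

# Compact form, same class as the result with full.matrix = FALSE:
# only the lower triangle is stored.
compact <- as.dist(full)

# Back to the full square matrix.
restored <- as.matrix(compact)

# One row of the full matrix has the same shape as the vector returned
# when a reference sequence is given (here: distances from s1).
from_s1 <- full["s1", ]
```

The compact form can be passed directly to base functions that expect a dist object, such as hclust.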
Matthias Studer, Pierre-Alexandre Fonta, Alexis Gabadinho, Nicolas S. Müller, Gilbert Ritschard.
Studer, M. and G. Ritschard (2016). "What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures". Journal of the Royal Statistical Society, Series A, 179(2), 481–511. DOI: 10.1111/rssa.12125
Studer, M. and G. Ritschard (2014). "A Comparative Review of Sequence Dissimilarity Measures". LIVES Working Papers, 33. NCCR LIVES, Switzerland. DOI: 10.12682/lives.2296-1658.2014.33
Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software 40(4), 1–37.
Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.
Halpin, B. (2014). Three Narratives of Sequence Analysis, in Blanchard, P., Bühlmann, F. and Gauthier, J.-A. (Eds.) Advances in Sequence Analysis: Theory, Method, Applications, Vol 2 of Series Life Course Research and Social Policies, pages 75–103, Heidelberg: Springer. DOI: 10.1007/978-3-319-04969-4_5
Marteau, P.-F. (2009). Time Warp Edit Distance with Stiffness Adjustment for Time Series Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 306–318. DOI: 10.1109/TPAMI.2008.76
Yujian, L. and Bo, L. (2007). A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), 1091–1095. DOI: 10.1109/TPAMI.2007.1078
See also all references in Studer and Ritschard (2014, 2016).
seqcost, seqsubm, seqdef, and for multichannel distances seqdistmc.
## ========================
## Example without missings
## ========================

library(TraMineR)

## Defining a sequence object with columns 10 to 25 of a
## subset of the 'biofam' data set
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])

## OM distances with a substitution-cost matrix derived
## from transition rates
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = "TRATE")

## OM distances using the vector of estimated indels and
## substitution costs derived from the estimated indels
costs <- seqcost(biofam.seq, method = "INDELSLOG")
biofam.om <- seqdist(biofam.seq, method = "OM",
                     indel = costs$indel, sm = costs$sm)

## Normalized LCP distances
biofam.lcp.n <- seqdist(biofam.seq, method = "LCP", norm = "auto")

## Normalized LCS distances to the most frequent sequence
biofam.dref1 <- seqdist(biofam.seq, method = "LCS",
                        refseq = 0, norm = "auto")

## LCS distances to an external sequence
ref <- seqdef(as.matrix("(0,5)-(3,5)-(4,6)"), informat = "SPS",
              alphabet = alphabet(biofam.seq))
biofam.dref2 <- seqdist(biofam.seq, method = "LCS", refseq = ref)

## Chi-squared distance over the full observed timeframe
biofam.chi.full <- seqdist(biofam.seq, method = "CHI2",
                           step = max(seqlength(biofam.seq)))

## Chi-squared distance over successive overlapping
## intervals of length 4
biofam.chi.ostep <- seqdist(biofam.seq, method = "CHI2",
                            step = 4, overlap = TRUE)

## =====================
## Example with missings
## =====================
data(ex1)
ex1.seq <- seqdef(ex1[, 1:13])

## OM with substitution costs based on transition
## probabilities and indel set as half the maximum
## substitution cost
costs.tr <- seqcost(ex1.seq, method = "TRATE", with.missing = TRUE)
ex1.om <- seqdist(ex1.seq, method = "OM",
                  indel = costs.tr$indel, sm = costs.tr$sm,
                  with.missing = TRUE)

## Localized OM
ex1.omloc <- seqdist(ex1.seq, method = "OMloc",
                     indel = costs.tr$indel, sm = costs.tr$sm,
                     with.missing = TRUE)

## OM of spells
ex1.omspell <- seqdist(ex1.seq, method = "OMspell",
                       sm = costs.tr$sm, indel = costs.tr$indel,
                       with.missing = TRUE)

## Distance based on number of matching subsequences
ex1.nms <- seqdist(ex1.seq, method = "NMS", with.missing = TRUE)

## Using the sequence vectorial representation metric
costs.fut <- seqcost(ex1.seq, method = "FUTURE", lag = 4,
                     proximities = TRUE, with.missing = TRUE)
ex1.svr <- seqdist(ex1.seq, method = "SVRspell",
                   prox = costs.fut$prox, with.missing = TRUE)