seqdist {TraMineR} | R Documentation |
Distances (dissimilarities) between sequences
Description
Computes pairwise dissimilarities between sequences or dissimilarity from a reference sequence. Several dissimilarity measures can be chosen, including optimal matching (OM) and many of its variants, distance based on the count of common attributes, and distances between state distributions within sequences.
Usage
seqdist(seqdata, method, refseq = NULL, norm = "none", indel = "auto", sm = NULL,
        with.missing = FALSE, full.matrix = TRUE, kweights = rep(1.0, ncol(seqdata)),
        tpow = 1.0, expcost = 0.5, context, link = "mean", h = 0.5, nu,
        transindel = "constant", otto, previous = FALSE, add.column = TRUE,
        breaks = NULL, step = 1, overlap = FALSE, weighted = TRUE,
        global.pdotj = NULL, prox = NULL, check.max.size = TRUE,
        opt.args = list())
Arguments
seqdata |
State sequence object of class |
method |
String.
The dissimilarity measure to use.
It can be |
refseq |
When an integer, the index of a sequence in seqdata. When a state sequence object, it must contain a single sequence and have the same alphabet as seqdata. When a list, it must be a list of two sets of indexes of |
norm |
String.
Default: |
indel |
Double, Vector of Doubles, or String.
Default: "auto". The single state-independent insertion/deletion cost when a double. The state-dependent insertion/deletion costs when a vector of doubles. The vector should contain an indel cost by state in the order of the alphabet. When |
sm |
The substitution-cost matrix when a matrix and The series of the substitution-cost matrices when an array and
One of the strings
Note: With |
with.missing |
Logical.
Default: |
full.matrix |
Logical.
Default: |
kweights |
Double or vector of doubles.
Default: vector of |
tpow |
Double.
Default: |
expcost |
Double.
Default: |
context |
Double.
Default: |
link |
String.
Default: |
h |
Double.
Default: 0.5. The exponential weight of spell length when method = "OMslen". The gap penalty when method = "TWED". |
nu |
Double.
Stiffness when |
transindel |
String.
Default: |
otto |
Double.
The origin-transition trade-off weight when |
previous |
Logical.
Default: |
add.column |
Logical.
Default: |
breaks |
List of ordered pairs of integers.
Default: |
step |
Integer.
Default: |
overlap |
Logical.
Default: |
weighted |
Logical.
Default: |
global.pdotj |
Numerical vector, |
prox |
|
check.max.size |
Logical. Should |
opt.args |
List. List of additional non-documented arguments for development usage. |
Details
The seqdist function returns a matrix of distances between sequences or a vector of distances from the reference sequence when refseq is set.
The available metrics (see the method option) include:
- Edit distances: optimal matching ("OM"), localized OM ("OMloc"), spell-length-sensitive OM ("OMslen"), OM of spell sequences ("OMspell"), OM of transition sequences ("OMstran"), Hamming ("HAM"), dynamic Hamming ("DHD"), and the time warp edit distance ("TWED").
- Metrics based on counts of common attributes: distance based on the longest common subsequence ("LCS"), on the longest common prefix ("LCP"), on the longest common suffix ("RLCP"), on the number of matching subsequences ("NMS"), on the number of matching subsequences weighted by the minimum shared time ("NMSMST"), and the subsequence vectorial representation distance ("SVRspell").
- Distances between state distributions: Euclidean ("EUCLID"), Chi-squared ("CHI2").
See Studer and Ritschard (2014, 2016) for a description and comparison of the above dissimilarity measures, except "TWED", for which we refer to Marteau (2009) and Halpin (2014).
Each method can be controlled with the following parameters:
method             | parameters                                              |
------------------ | ------------------------------------------------------- |
OM                 | sm, indel, norm                                         |
OMloc              | sm, expcost, context, norm                              |
OMslen             | sm, indel, link, h, norm                                |
OMspell            | sm, indel, tpow, expcost, norm                          |
OMstran            | sm, indel, transindel, otto, previous, add.column, norm |
HAM, DHD           | sm, norm                                                |
CHI2               | breaks, step, overlap, weighted, global.pdotj, norm     |
EUCLID             | breaks, step, overlap, norm                             |
LCS, LCP, RLCP     | norm                                                    |
NMS                | prox, kweights                                          |
NMSMST             | kweights, tpow                                          |
SVRspell           | prox, kweights, tpow                                    |
TWED               | sm, (indel), h, nu, norm                                |
------------------ | ------------------------------------------------------- |
"LCS" is "OM" with a substitution cost of 2 (sm = "CONSTANT", cval = 2) and an indel of 1.0. "HAM" is "OM" without indels. "DHD" is "HAM" with specific substitution costs at each position.
"HAM" and "DHD" apply only to sequences of equal length.
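The relations between "LCS", "OM" and "HAM" stated above can be checked by hand on toy sequences. The following base-R sketch is for illustration only: it is not TraMineR's implementation, the helper names om_dist, lcs_length and ham_dist are hypothetical, and a single constant substitution cost is assumed.

```r
# Didactic sketch in base R (not TraMineR code).
# OM with one constant substitution cost `sub` and one indel cost `indel`.
om_dist <- function(a, b, sub = 2, indel = 1) {
  p <- length(a); q <- length(b)
  D <- matrix(0, nrow = p + 1, ncol = q + 1)
  D[, 1] <- (0:p) * indel   # cost of deleting the first i states of a
  D[1, ] <- (0:q) * indel   # cost of inserting the first j states of b
  for (i in seq_len(p)) {
    for (j in seq_len(q)) {
      cost <- if (a[i] == b[j]) 0 else sub
      D[i + 1, j + 1] <- min(D[i, j] + cost,       # match or substitution
                             D[i, j + 1] + indel,  # deletion
                             D[i + 1, j] + indel)  # insertion
    }
  }
  D[p + 1, q + 1]
}

# Length of the longest common subsequence of a and b.
lcs_length <- function(a, b) {
  p <- length(a); q <- length(b)
  L <- matrix(0, nrow = p + 1, ncol = q + 1)
  for (i in seq_len(p)) {
    for (j in seq_len(q)) {
      L[i + 1, j + 1] <- if (a[i] == b[j]) {
        L[i, j] + 1
      } else {
        max(L[i, j + 1], L[i + 1, j])
      }
    }
  }
  L[p + 1, q + 1]
}

# Hamming: position-wise mismatches, equal lengths required.
ham_dist <- function(a, b, sub = 1) {
  stopifnot(length(a) == length(b))
  sum(a != b) * sub
}

a <- c("A", "B", "C", "D")
b <- c("A", "C", "D", "D")

lcs.d <- length(a) + length(b) - 2 * lcs_length(a, b)  # LCS-based distance
om.d  <- om_dist(a, b, sub = 2, indel = 1)             # OM with sub = 2, indel = 1
ham.d <- ham_dist(a, b)                                # no indels allowed

stopifnot(lcs.d == om.d)
```

With a substitution cost of 2 and an indel of 1, a substitution never beats a deletion followed by an insertion, which is why the OM result coincides with the LCS-based distance in this sketch.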
For "TWED", the (single) indel serves only for empty sequences. The distance to an empty sequence is set as n*indel, where n is the length of the non-empty sequence. By default (indel = "auto"), indel is set as 2 * max(sm) + nu + h.
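As a worked instance of the default "TWED" indel, the arithmetic can be sketched in base R. The substitution-cost matrix and the values of nu, h and n below are arbitrary assumptions chosen for illustration, not package defaults.

```r
# Toy substitution-cost matrix (assumed values, illustration only).
sm <- matrix(c(0, 2, 1,
               2, 0, 2,
               1, 2, 0), nrow = 3, byrow = TRUE)
nu <- 0.5   # stiffness (assumed value)
h  <- 0.5   # gap penalty (assumed value)

# Default indel as described above: 2 * max(sm) + nu + h
twed.indel <- 2 * max(sm) + nu + h

# Distance from a sequence of length n to an empty sequence: n * indel
n <- 4
d.empty <- n * twed.indel
```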
When sm = NULL, the substitution-cost matrix is automatically created for "HAM" with a single substitution cost of 1 and for "DHD" with the costs derived from the transition rates at the successive positions, i.e. with sm = "TRATE".
Some distances can optionally be normalized by means of the norm argument. Let d be the distance, m the maximum possible distance given the lengths p and q of the two sequences, and k the length of the longer sequence. Normalization "maxlength" is d/k (Abbott's normalization), "gmean" is 1-(m-d)/(p*q)^.5 (Elzinga's normalization), "maxdist" is d/m, and "YujianBo" is 2*d/(m+d). For more details, see Gabadinho et al. (2009, 2011). In practice, to avoid negative outcomes, the lengths p, q, and k are set as (max) indel times the corresponding length. For some distances, m is only an upper bound that may not be reachable.
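The four normalizations can be written out literally as a small base-R helper. This is a sketch that transcribes the formulas as stated in the text; normalize_dist is a hypothetical name, the input values are arbitrary, and the package's internal computation may differ (for instance through the rescaling of the lengths by the indel mentioned above).

```r
# Literal transcription of the normalization formulas (illustration only).
# d: raw distance, m: maximum possible distance,
# p, q: lengths of the two sequences.
normalize_dist <- function(d, m, p, q,
                           norm = c("none", "maxlength", "gmean",
                                    "maxdist", "YujianBo")) {
  norm <- match.arg(norm)
  k <- max(p, q)                            # length of the longer sequence
  switch(norm,
         none      = d,
         maxlength = d / k,                       # Abbott
         gmean     = 1 - (m - d) / sqrt(p * q),   # Elzinga
         maxdist   = d / m,
         YujianBo  = 2 * d / (m + d))             # Yujian and Bo (2007)
}

# Arbitrary illustrative values: d = 2, m = 4, p = q = 4
res <- sapply(c("maxlength", "gmean", "maxdist", "YujianBo"),
              function(nm) normalize_dist(2, 4, 4, 4, norm = nm))
res
```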
When norm = "auto", "gmean" is applied to "LCS", "LCP" and "RLCP" distances, "maxlength" is applied to "OM", "HAM" and "DHD", and the "YujianBo" normalization of Yujian and Bo (2007), which preserves the triangle inequality, is used in the other cases except "CHI2" and "EUCLID". For the latter two, the squared distances are normalized by the number of intervals and the maximal distance on each interval. Note that for "CHI2" the maximal distance on each interval depends on the state distribution on the interval.
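The rule applied by norm = "auto" can be summarized as a lookup. The base-R sketch below only restates the paragraph above; auto_norm is a hypothetical helper, not a TraMineR function, and it returns NA for "CHI2" and "EUCLID" because their interval-based normalization is handled separately.

```r
# Which normalization norm = "auto" selects, as described in the text.
# Hypothetical helper for illustration; not part of TraMineR.
auto_norm <- function(method) {
  if (method %in% c("LCS", "LCP", "RLCP")) {
    "gmean"
  } else if (method %in% c("OM", "HAM", "DHD")) {
    "maxlength"
  } else if (method %in% c("CHI2", "EUCLID")) {
    NA_character_   # interval-based normalization of squared distances
  } else {
    "YujianBo"
  }
}

sapply(c("LCS", "OM", "OMspell", "TWED", "CHI2"), auto_norm)
```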
When sequences contain gaps and the left = NA, gaps = NA, or right = NA option was passed to seqdef (i.e. when there are non-deleted missing values), the with.missing argument should be set to TRUE. If left as FALSE, the function stops when it encounters a gap. This is to make the user aware that there are gaps in the sequences. For methods that need an sm value, seqdist expects a substitution-cost matrix with a row and a column entry for the missing state (symbol defined with the nr option of seqdef). Substitution-cost matrices returned by seqcost (and so seqsubm) include these additional entries when the function is called with with.missing = TRUE. More details on how to compute distances with sequences containing gaps can be found in Gabadinho et al. (2009).
Value
When refseq is NULL (default), the whole matrix of pairwise distances between sequences or, if full.matrix = FALSE, the corresponding dist object of pairwise distances between sequences.
When refseq is a list of two sets of indexes, the matrix of distances from the first set of sequences (rows) to the second set (columns).
Otherwise, a vector with distances from the sequences in the state sequence object to the reference sequence specified with refseq.
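The shapes of the returned objects can be inspected on a small subset. This is a minimal sketch, assuming TraMineR is installed and that the installed version supports a list for refseq as documented here; the object names and the choice of the "LCS" method are illustrative.

```r
library(TraMineR)

data(biofam)
seqs <- seqdef(biofam[501:510, 10:25])   # 10 sequences

## refseq = NULL (default): full matrix of pairwise distances
d.full <- seqdist(seqs, method = "LCS")

## Same distances as a 'dist' object
d.dist <- seqdist(seqs, method = "LCS", full.matrix = FALSE)

## Integer refseq: vector of distances to the first sequence
d.ref <- seqdist(seqs, method = "LCS", refseq = 1)

## List refseq: distances from sequences 1-3 (rows) to 4-8 (columns)
d.sets <- seqdist(seqs, method = "LCS", refseq = list(1:3, 4:8))

dim(d.full)
class(d.dist)
length(d.ref)
dim(d.sets)
```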
Author(s)
Matthias Studer, Gilbert Ritschard, Pierre-Alexandre Fonta, Alexis Gabadinho, Nicolas S. Müller.
References
Studer, M. and G. Ritschard (2016), "What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures", Journal of the Royal Statistical Society, Series A. 179(2), 481-511, doi:10.1111/rssa.12125
Studer, M. and G. Ritschard (2014). "A Comparative Review of Sequence Dissimilarity Measures". LIVES Working Papers, 33. NCCR LIVES, Switzerland, doi:10.12682/lives.2296-1658.2014.33
Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software 40(4), 1–37.
Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva.
Halpin, B. (2014). Three Narratives of Sequence Analysis, in Blanchard, P., Bühlmann, F. and Gauthier, J.-A. (Eds.) Advances in Sequence Analysis: Theory, Method, Applications, Vol 2 of Series Life Course Research and Social Policies, pages 75–103, Heidelberg: Springer. doi:10.1007/978-3-319-04969-4_5
Marteau, P.-F. (2009). Time Warp Edit Distances with Stiffness Adjustment for Time Series Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 306–318. doi:10.1109/TPAMI.2008.76
Yujian, L. and Bo, L. (2007). A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), 1091–1095. doi:10.1109/TPAMI.2007.1078
See also all references in Studer and Ritschard (2014, 2016).
See Also
seqcost, seqsubm, seqdef, and seqMD for multidomain (multichannel) distances using the cost additive trick.
Examples
## =========================
## Examples without missings
## =========================
## Defining a sequence object with columns 10 to 25
## of a subset of the 'biofam' data set
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])

## OM distances using the vector of indels and substitution
## costs derived from the estimated state frequencies
costs <- seqcost(biofam.seq, method = "INDELSLOG")
biofam.om <- seqdist(biofam.seq, method = "OM",
                     indel = costs$indel, sm = costs$sm)

## OM between sequences of transitions
biofam.omstran <- seqdist(biofam.seq, method = "OMstran",
                          indel = costs$indel, sm = costs$sm,
                          otto = .3, transindel = "subcost")

## Normalized LCP distances
biofam.lcp.n <- seqdist(biofam.seq, method = "LCP",
                        norm = "auto")

## Normalized LCS distances to the most frequent sequence
biofam.dref1 <- seqdist(biofam.seq, method = "LCS",
                        refseq = 0, norm = "auto")

## LCS distances to an external sequence
ref <- seqdef(as.matrix("(0,5)-(3,5)-(4,6)"), informat = "SPS",
              alphabet = alphabet(biofam.seq))
biofam.dref2 <- seqdist(biofam.seq, method = "LCS",
                        refseq = ref)

## LCS distances between two subsets of sequences
## (stored under a new name so that biofam.dref2 is not overwritten)
set1 <- 1:10
set2 <- 31:36
biofam.dref3 <- seqdist(biofam.seq, method = "LCS",
                        refseq = list(set1, set2))

## Chi-squared distance over the full observed timeframe
biofam.chi.full <- seqdist(biofam.seq, method = "CHI2",
                           step = max(seqlength(biofam.seq)))

## Chi-squared distance over successive overlapping
## intervals of length 4
biofam.chi.ostep <- seqdist(biofam.seq, method = "CHI2",
                            step = 4, overlap = TRUE)

## ======================
## Examples with missings
## ======================
data(ex1)
## Ignore empty row 7
ex1.seq <- seqdef(ex1[1:6, 1:13])

## OM with indel and substitution costs based on
## log of inverse state frequencies
costs.ex1 <- seqcost(ex1.seq, method = "INDELSLOG",
                     with.missing = TRUE)
ex1.om <- seqdist(ex1.seq, method = "OM",
                  indel = costs.ex1$indel, sm = costs.ex1$sm,
                  with.missing = TRUE)

## Localized OM
ex1.omloc <- seqdist(ex1.seq, method = "OMloc",
                     sm = costs.ex1$sm, expcost = .1, context = .4,
                     with.missing = TRUE)

## OM of spells (OMspell) with a scalar indel
indel <- max(costs.ex1$indel)
ex1.omspell <- seqdist(ex1.seq, method = "OMspell",
                       indel = indel, sm = costs.ex1$sm,
                       with.missing = TRUE)

## Distance based on number of matching subsequences
ex1.nms <- seqdist(ex1.seq, method = "NMS",
                   with.missing = TRUE)

## Using the sequence vectorial representation metric
costs.fut <- seqcost(ex1.seq, method = "FUTURE", lag = 4,
                     proximities = TRUE, with.missing = TRUE)
ex1.svr <- seqdist(ex1.seq, method = "SVRspell",
                   prox = costs.fut$prox, with.missing = TRUE)