TraMineR

seqdist {TraMineR}

R Documentation

Distances (dissimilarities) between sequences

Description

Computes pairwise dissimilarities between sequences or dissimilarity from a reference sequence. Several dissimilarity measures can be chosen, including optimal matching (OM) and many of its variants, distance based on the count of common attributes, and distances between state distributions within sequences.

Usage

seqdist(seqdata, method, refseq = NULL, norm = "none", indel = "auto", sm = NULL,
  with.missing = FALSE, full.matrix = TRUE, kweights = rep(1.0, ncol(seqdata)),
  tpow = 1.0, expcost = 0.5, context, link = "mean", h = 0.5, nu,
  transindel = "constant", otto, previous = FALSE, add.column = TRUE,
  breaks = NULL, step = 1, overlap = FALSE, weighted = TRUE,
  global.pdotj = NULL, prox = NULL, check.max.size=TRUE,
  opt.args = list())

Arguments

`seqdata`	State sequence object of class `stslist`. The sequence data to use. Use `seqdef` to create such an object.
`method`	String. The dissimilarity measure to use. It can be `"OM"`, `"OMloc"`, `"OMslen"`, `"OMspell"`, `"OMstran"`, `"HAM"`, `"DHD"`, `"CHI2"`, `"EUCLID"`, `"LCS"`, `"LCP"`, `"RLCP"`, `"NMS"`, `"NMSMST"`, `"SVRspell"`, or `"TWED"`. See the Details section.
`refseq`	`NULL`, Integer, State Sequence Object, or List. Default: `NULL`. The baseline sequence to compute the distances from. When an integer, the index of a sequence in `seqdata` or `0` for the most frequent sequence. When a state sequence object, it must contain a single sequence and have the same alphabet as `seqdata`. When a list, it must be a list of two sets of indexes of `seqdata` rows.
`norm`	String. Default: `"none"`. The normalization to use when `method` is one of `"OM"`, `"OMloc"`, `"OMslen"`, `"OMspell"`, `"OMstran"`, `"TWED"`, `"HAM"`, `"DHD"`, `"LCS"`, `"LCP"`, `"RLCP"`, `"CHI2"`, `"EUCLID"`. It can be `"none"`, `"auto"`, or, except for `"CHI2"` and `"EUCLID"`, `"maxlength"`, `"gmean"`, `"maxdist"`, or `"YujianBo"`. `"auto"` is equivalent to `"maxlength"` when `method` is one of `"OM"`, `"HAM"`, or `"DHD"`, to `"gmean"` when `method` is one of `"LCS"`, `"LCP"`, or `"RLCP"`, to `YujianBo` when `method` is one of `"OMloc"`, `"OMslen"`, `"OMspell"`, `"OMstran"`, `"TWED"`. See the Details section.
`indel`	Double, Vector of Doubles, or String. Default: `"auto"`. Insertion/deletion cost(s). Applies when `method` is one of `"OM"`, `"OMslen"`, `"OMspell"`, or `"OMstran"`. The single state-independent insertion/deletion cost when a double. The state-dependent insertion/deletion costs when a vector of doubles. The vector should contain an indel cost by state in the order of the alphabet. When `"auto"`, the indel is set as `max(sm)/2` when `sm` is a matrix and is computed by means of `seqcost` when `sm` is a string specifying a cost method.
`sm`	`NULL`, Matrix, Array, or String. Substitution costs. Default: `NULL`. The substitution-cost matrix when a matrix and `method` is one of `"OM"`, `"OMloc"`, `"OMslen"`, `"OMspell"`, `"OMstran"`, `"HAM"`, or `"TWED"`. The series of the substitution-cost matrices when an array and `method = "DHD"`. They are grouped in a 3-dimensional array with the third index referring to the position in the sequence. One of the strings `"CONSTANT"`, `"INDELS"`, `"INDELSLOG"`, or `"TRATE"`. Designates a `seqcost` method to build `sm`. `"CONSTANT"` is not relevant for `"DHD"`. `sm` is mandatory when `method` is one of `"OM"`, `"OMloc"`, `"OMslen"`, `"OMspell"`, `"OMstran"`, or `"TWED"`. `sm` is autogenerated when `method` is one of `"HAM"` or `"DHD"` and `sm = NULL`. See the Details section. Note: With `method = "NMS"` or `method = "SVRspell"`, use `prox` instead.
`with.missing`	Logical. Default: `FALSE`. Should the non-deleted missing value be added to the alphabet as an additional state? If `FALSE` and `seqdata` or `refseq` contains such gaps, an error is raised.
`full.matrix`	Logical. Default: `TRUE`. When `refseq = NULL`, if `TRUE`, the full distance matrix is returned, if `FALSE`, an object of class `dist` is returned, that is, a vector containing only values from the lower triangle of the distance matrix. Objects of class `dist` are smaller and can be passed directly as arguments to most clustering functions.
`kweights`	Double or vector of doubles. Default: vector of `1`s. The weights applied to subsequences when `method` is one of `"NMS"`, `"NMSMST"`, or `"SVRspell"`. It contains at position `k` the weight applied to the subsequences of length `k`. It must be positive. Its length should be equal to the number of columns of `seqdata`. If shorter, longer subsequences are ignored. If a scalar, it is transformed into `rep(kweights,ncol(sedata))`.
`tpow`	Double. Default: `1.0`. The exponential weight of spell length when `method` is one of `"OMspell"`, `"NMSMST"`, or `"SVRspell"`.
`expcost`	Double. Default: `0.5`. The cost of spell length transformation when `method = "OMloc"` or `method = "OMspell"`. It must be positive. The exact interpretation is distance-dependent.
`context`	Double. Default: `1-2*expcost`. The cost of local insertion when `method = "OMloc"`. It must be positive.
`link`	String. Default: `"mean"`. The function used to compute substitution costs when `method = "OMslen"`. One of `"mean"` (arithmetic average) or `"gmean"` (geometric mean as in the original proposition of Halpin 2010).
`h`	Double. Default: `0.5`. It must be greater than or equal to 0. The exponential weight of spell length when `method = "OMslen"`. The gap penalty when `method = "TWED"`. It corresponds to the lambda in Halpin (2014), p 88. It is usually chosen in the range [0,1]
`nu`	Double. Stiffness when `method = "TWED"`. It must be strictly greater than 0 and is usually less than 1. See Halpin (2014), p 88.
`transindel`	String. Default: `"constant"`. Method for computing transition indel costs when `method = "OMstran"`. One of `"constant"` (single indel of 1.0), `"subcost"` (based on substitution costs), or `"prob"` (based on transition probabilities).
`otto`	Double. The origin-transition trade-off weight when `method = "OMstran"`. It must be in [0, 1].
`previous`	Logical. Default: `FALSE`. When `method = "OMstran"`, should we also account for the transition from the previous state?
`add.column`	Logical. Default: `TRUE`. When `method = "OMstran"`, should the last column (and also the first column when `previous = TRUE`) be duplicated? When sequences have different lengths, should the last (first) valid state be duplicated.
`breaks`	List of ordered pairs of integers. Default: `NULL`. The list of the possibly overlapping intervals when `method = "CHI2"` or `method = "EUCLID"`. Each interval is defined by the pair `c(t1,t2)` of the start `t1` and end `t2` positions of the interval.
`step`	Integer. Default: `1`. The length of the intervals when `method = "CHI2"` or `method = "EUCLID"` and `breaks = NULL`. It must be positive. It must also be even when `overlap = TRUE`.
`overlap`	Logical. Default: `FALSE`. When `method = "CHI2"` or `method = "EUCLID"` and `breaks = NULL`, should the intervals overlap?
`weighted`	Logical. Default: `TRUE`. When `method` is `"CHI2"` or when `sm` is a string (method), should the distributions of the states account for the sequence weights in `seqdata`? See `seqdef`.
`global.pdotj`	Numerical vector, `"obs"`, or `NULL`. Default: `NULL`. Only for `method = "CHI2"`. The vector of state proportions to be used as marginal distribution. When `NULL`, the state distribution on the corresponding interval is used. When `"obs"`, the overall state distribution in `seqdata` is used for all intervals. When a vector of proportions, it is used as marginal distribution for all intervals.
`prox`	`NULL` or Matrix. Default: `NULL`. The matrix of state proximities when `method = "NMS"` or `method = "SVRspell"`.
`check.max.size`	Logical. Should `seqdist` stop when maximum allowed number of unique sequences is exceeded? Caution, setting `FALSE` may produce unexpected results or even crash R.
`opt.args`	List. List of additional non-documented arguments for development usage.

Details

The seqdist function returns a matrix of distances between sequences or a vector of distances from the reference sequence when refseq is set. The available metrics (see method option) include:

Edit distances: optimal matching ("OM"), localized OM ("OMloc"), spell-length-sensitive OM ("OMslen"), OM of spell sequences ("OMspell"), OM of transition sequences ("OMstran"), Hamming ("HAM"), dynamic Hamming ("DHD"), and the time warp edit distance ("TWED").
Metrics based on counts of common attributes: distance based on the longest common subsequence ("LCS"), on the longest common prefix ("LCP"), on the longest common suffix ("RLCP"), on the number of matching subsequences ("NMS"), on the number of matching subsequences weighted by the minimum shared time ("NMSMST") and, the subsequence vectorial representation distance ("SVRspell").
Distances between state distributions: Euclidean ("EUCLID"), Chi-squared ("CHI2").

See Studer and Ritschard (2014, 2016) for a description and the comparison of the above dissimilarity measures except "TWED" for which we refer to Marteau (2009) and Halpin (2014).

Each method can be controlled with the following parameters:

method	parameters
------------------	---------------------------------
`⁠OM⁠`	`⁠sm, indel, norm⁠`
`⁠OMloc⁠`	`⁠sm, expcost, context, norm⁠`
`⁠OMslen⁠`	`⁠sm, indel, link, h, norm⁠`
`⁠OMspell⁠`	`⁠sm, indel, norm, tpow, expcost, norm⁠`
`⁠OMstran⁠`	`⁠sm, indel, transindel, otto, previous, add.column, norm⁠`
`⁠HAM, DHD⁠`	`⁠sm, norm⁠`
`⁠CHI2⁠`	`⁠breaks, step, overlap, norm, weighted, global.pdotj, norm⁠`
`⁠EUCLID⁠`	`⁠breaks, step, overlap, norm⁠`
`⁠LCS, LCP, RLCP⁠`	`⁠norm⁠`
`⁠NMS⁠`	`⁠prox, kweights⁠`
`⁠NMSMST⁠`	`⁠kweights, tpow⁠`
`⁠SVRspell⁠`	`⁠prox, kweights, tpow⁠`
`⁠TWED⁠`	`⁠sm, (indel), h, nu, norm⁠`
------------------	---------------------------------

"LCS" is "OM" with a substitution cost of 2 (sm = "CONSTANT", cval = 2) and an indel of 1.0. "HAM" is "OM" without indels. "DHD" is "HAM" with specific substitution costs at each position.

"HAM" and "DHD" apply only to sequences of equal length.

For "TWED", the (single) indel serves only for empty sequences. The distance to an empty sequence is set as n*indel, where n is the length of the non empty sequence. By default (indel="auto"), indel is set as 2 * max(sm) + nu + h.

When sm = NULL, the substitution-cost matrix is automatically created for "HAM" with a single substitution cost of 1 and for "DHD" with the costs derived from the transition rates at the successive positions, i.e. with sm = "TRATE".

Some distances can optionally be normalized by means of the norm argument. Let d be the distance, m the maximum possible of the distance given the lengths p and q of the two sequences, and k the length of the longer sequence. Normalization "maxlength" is d/k (Abbott's normalization), "gmean" is 1-(m-d)/(p*q)^.5 (Elzinga's normalization), "maxdist" is d/m, and "YujianBo" is 2*d/(m+d). For more details, see Gabadinho et al. (2009, 2011). Actually, to avoid negative outcomes, the length p, q, and k are set as (max) indel times the corresponding length. For some distances, m is only a possibly non-reachable upper bound.

When norm="auto", "gmean" is applied to "LCS", "LCP" and "RLCP" distances, "maxlength" is applied to "OM", "HAM" and "DHD", and the normalization "YujianBo" of Yujian and Bo (2007) that preserves the triangle inequality is used in the other cases except "CHI2" and "EUCLID". For the latter two, the square of the distances are normalized by the number of intervals and the maximal distance on each interval. Note that for 'CHI2' the maximal distance on each interval depends on the state distribution on the interval.

When sequences contain gaps and the left = NA, gaps = NA, or right = NA option was passed to seqdef (i.e. when there are non deleted missing values), the with.missing argument should be set as TRUE. If left as FALSE the function stops when it encounters a gap. This is to make the user aware that there are gaps in the sequences. For methods that need an sm value, seqdist expects a substitution-cost matrix with a row and a column entry for the missing state (symbol defined with the nr option of seqdef). Substitution-cost matrices returned by seqcost (and so seqsubm) include these additional entries when the function is called with with.missing = TRUE. More details on how to compute distances with sequences containing gaps can be found in Gabadinho et al. (2009).

Value

When refseq is NULL (default), the whole matrix of pairwise distances between sequences or, if full.matrix = FALSE, the corresponding dist object of pairwise distances between sequences.

When refseq is a list of two sets of indexes, the matrix of distances from the first set of sequences (rows) to the second set (columns).

Otherwise, a vector with distances from the sequences in the state sequence object to the reference sequence specified with refseq.

Author(s)

Matthias Studer, Gilbert Ritschard, Pierre-Alexandre Fonta, Alexis Gabadinho, Nicolas S. Müller.

References

Studer, M. and G. Ritschard (2016), "What matters in differences between life trajectories: A comparative review of sequence dissimilarity measures", Journal of the Royal Statistical Society, Series A. 179(2), 481-511, doi:10.1111/rssa.12125

Studer, M. and G. Ritschard (2014). "A Comparative Review of Sequence Dissimilarity Measures". LIVES Working Papers, 33. NCCR LIVES, Switzerland, doi:10.12682/lives.2296-1658.2014.33

Gabadinho, A., G. Ritschard, N. S. Müller and M. Studer (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software 40(4), 1–37.

Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Mining Sequence Data in R with the TraMineR package: A user's guide. Department of Econometrics and Laboratory of Demography, University of Geneva

Halpin, B. (2014). Three Narratives of Sequence Analysis, in Blanchard, P., Bühlmann, F. and Gauthier, J.-A. (Eds.) Advances in Sequence Analysis: Theory, Method, Applications, Vol 2 of Series Life Course Research and Social Policies, pages 75–103, Heidelberg: Springer. doi:10.1007/978-3-319-04969-4_5

Marteau, P.-F. (2009). Time Warp Edit Distances with Stiffness Adjustment for Time Series Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 306–318. doi:10.1109/TPAMI.2008.76

Yujian, L. and Bo, L. (2007). A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), 1091–1095. doi:10.1109/TPAMI.2007.1078

See also all references in Studer and Ritschard (2014, 2016)

Examples

## =========================
## Examples without missings
## =========================

## Defining a sequence object with columns 10 to 25
## of a subset of the 'biofam' data set
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])

## OM distances using the vector of indels and substitution
## costs derived from the estimated state frequencies
costs <- seqcost(biofam.seq, method = "INDELSLOG")
biofam.om <- seqdist(biofam.seq, method = "OM",
                     indel = costs$indel, sm = costs$sm)

## OM between sequences of transitions
biofam.omstran <- seqdist(biofam.seq, method = "OMstran",
                     indel = costs$indel, sm = costs$sm,
                     otto=.3, transindel="subcost")

## Normalized LCP distances
biofam.lcp.n <- seqdist(biofam.seq, method = "LCP",
                        norm = "auto")

## Normalized LCS distances to the most frequent sequence
biofam.dref1 <- seqdist(biofam.seq, method = "LCS",
                        refseq = 0, norm = "auto")

## LCS distances to an external sequence
ref <- seqdef(as.matrix("(0,5)-(3,5)-(4,6)"), informat = "SPS",
              alphabet = alphabet(biofam.seq))
biofam.dref2 <- seqdist(biofam.seq, method = "LCS",
                        refseq = ref)

## LCS distances between two subsets of sequences
set1 <- 1:10
set2 <- 31:36
biofam.dref2 <- seqdist(biofam.seq, method = "LCS",
                        refseq = list(set1,set2))


## Chi-squared distance over the full observed timeframe
biofam.chi.full <- seqdist(biofam.seq, method = "CHI2",
                           step = max(seqlength(biofam.seq)))

## Chi-squared distance over successive overlapping
## intervals of length 4
biofam.chi.ostep <- seqdist(biofam.seq, method = "CHI2",
                            step = 4, overlap = TRUE)


## ======================
## Examples with missings
## ======================
data(ex1)
## Ignore empty row 7
ex1.seq <- seqdef(ex1[1:6, 1:13])

## OM with indel and substitution costs based on
## log of inverse state frequencies
costs.ex1 <- seqcost(ex1.seq, method = "INDELSLOG",
                     with.missing = TRUE)
ex1.om <- seqdist(ex1.seq, method = "OM",
                  indel = costs.ex1$indel, sm = costs.ex1$sm,
                  with.missing = TRUE)

## Localized OM
ex1.omloc <- seqdist(ex1.seq, method = "OMloc",
                     sm = costs.ex1$sm, expcost=.1, context = .4,
                     with.missing = TRUE)

## OMspell with a scalar indel
indel <- max(costs.ex1$indel)
## OM of spells
ex1.omspell <- seqdist(ex1.seq, method = "OMspell",
                       indel = indel, sm = costs.ex1$sm,
                       with.missing = TRUE)

## Distance based on number of matching subsequences
ex1.nms <- seqdist(ex1.seq, method = "NMS",
                   with.missing = TRUE)

## Using the sequence vectorial representation metric
costs.fut <- seqcost(ex1.seq, method = "FUTURE", lag = 4,
                     proximities = TRUE, with.missing = TRUE)
ex1.svr <- seqdist(ex1.seq, method = "SVRspell",
                   prox = costs.fut$prox, with.missing = TRUE)

[Package TraMineR version 2.2-12 Index]