Enter a numerical matrix, set of vectors, or set of matrices to calculate similarity per vector.
Usage
lma_simets(a, b = NULL, metric = NULL, group = NULL, lag = 0,
agg = TRUE, agg.mean = TRUE, pairwise = TRUE, symmetrical = FALSE,
mean = FALSE, return.list = FALSE)
Arguments
- a
A vector or matrix. If a vector, b must also be provided. If a matrix and b is missing, the rows of a are compared with one another. If a matrix and b is not missing, each row will be compared with b or each row of b.
- b
A vector or matrix to be compared with a or rows of a.
- metric
A character or vector of characters at least partially matching one of the available metric names (or 'all' to explicitly include all metrics), or a number or vector of numbers indicating the metric by index:
jaccard: sum(a & b) / sum(a | b)
euclidean: 1 / (1 + sqrt(sum((a - b) ^ 2)))
canberra: mean(1 - abs(a - b) / (a + b))
cosine: sum(a * b) / sqrt(sum(a ^ 2) * sum(b ^ 2))
pearson: (mean(a * b) - (mean(a) * mean(b))) / sqrt(mean(a ^ 2) - mean(a) ^ 2) / sqrt(mean(b ^ 2) - mean(b) ^ 2)
- group
If b is missing and a has multiple rows, this will be used to make comparisons between rows of a, as modified by agg and agg.mean.
- lag
Amount to adjust the b index; either rows if b has multiple rows (e.g., for lag = 1, a[1, ] is compared with b[2, ]), or values otherwise (e.g., for lag = 1, a[1] is compared with b[2]). If b is not supplied, b is a copy of a, resulting in lagged self-comparisons or autocorrelations.
- agg
Logical: if FALSE, only the boundary rows between groups will be compared; see the Examples.
- agg.mean
Logical: if FALSE, aggregated rows are summed instead of averaged.
- pairwise
Logical: if FALSE and a and b are matrices with the same number of rows, only paired rows are compared. Otherwise (and if only a is supplied), all pairwise comparisons are made.
- symmetrical
Logical: if TRUE and pairwise comparisons between a rows were made, the results in the lower triangle are copied to the upper triangle.
- mean
Logical: if TRUE, a single mean for each metric is returned per row of a.
- return.list
Logical: if TRUE, a list-like object will always be returned, with an entry for each metric, even when only one metric is requested.
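The metric formulas listed under metric can be checked by hand. The following is a minimal base R sketch, not the package's implementation: the helper functions are hypothetical names written directly from the listed formulas, and treating a 0 / 0 term in canberra as "no difference" is an assumption, chosen because it reproduces the values printed in the Examples. The two vectors are rows 1 and 2 of the example dtm.

```r
# Hypothetical helpers transcribed from the formulas in the metric argument.
jaccard <- function(a, b) sum(a & b) / sum(a | b)
euclidean <- function(a, b) 1 / (1 + sqrt(sum((a - b)^2)))
canberra <- function(a, b) {
  ratio <- abs(a - b) / (a + b)
  # Assumption: terms where both values are 0 (0 / 0) count as no difference.
  ratio[is.nan(ratio)] <- 0
  mean(1 - ratio)
}
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
pearson <- function(a, b) {
  (mean(a * b) - mean(a) * mean(b)) /
    sqrt(mean(a^2) - mean(a)^2) / sqrt(mean(b^2) - mean(b)^2)
}

# Rows 1 and 2 of the dtm shown in the Examples.
a <- c(1, 0, 0, 0, 1, 1, 1)
b <- c(1, 0, 1, 1, 0, 1, 1)

sims <- c(
  jaccard = jaccard(a, b), euclidean = euclidean(a, b),
  canberra = canberra(a, b), cosine = cosine(a, b), pearson = pearson(a, b)
)
print(round(sims, 7))
```

Under these assumptions, sims should agree with the [2, 1] entries of the matrices printed in the Examples (0.5, 0.3660254, 0.5714286, 0.6708204, and 0.09128709).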
Value
Output varies based on the dimensions of a and b:
- Out: A vector with a value per metric.
  In: Only when a and b are both vectors.
- Out: A vector with a value per row.
  In: Any time a single value is expected per row: a or b is a vector, a and b are matrices with the same number of rows and pairwise = FALSE, a group is specified, or mean = TRUE, and only one metric is requested.
- Out: A data.frame with a column per metric.
  In: When multiple metrics are requested in the previous case.
- Out: A sparse matrix with a metric attribute with the metric name.
  In: Pairwise comparisons within an a matrix or between an a and b matrix, when only 1 metric is requested.
- Out: A list with a sparse matrix per metric.
  In: When multiple metrics are requested in the previous case.
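To illustrate the difference between the per-row vector and the pairwise matrix described above, here is a base R sketch using cosine similarity only. It does not call lma_simets: the cosine helper and the dense dtm matrix (copied from the Examples) are stand-ins, and the dense lower-triangular matrix only approximates the sparse matrix the function returns.

```r
# Hypothetical helper, written from the cosine formula in the metric argument.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Dense copy of the document-term matrix printed in the Examples.
dtm <- matrix(c(
  1, 0, 0, 0, 1, 1, 1,
  1, 0, 1, 1, 0, 1, 1,
  0, 1, 1, 0, 0, 1, 1,
  0, 1, 1, 1, 0, 1, 1
), nrow = 4, byrow = TRUE, dimnames = list(
  NULL, c("a", "b", "from", "more", "of", "speaker", "words")
))

# Matrix against a single vector: one value per row of the matrix.
per_row <- apply(dtm, 1, cosine, b = colMeans(dtm))

# Rows against each other: fill only the lower triangle, with 1s on the
# diagonal, mirroring the unitriangular layout shown in the Examples.
n <- nrow(dtm)
pairwise <- diag(n)
for (i in seq_len(n)) {
  for (j in seq_len(i - 1)) {
    pairwise[i, j] <- cosine(dtm[i, ], dtm[j, ])
  }
}

print(round(per_row, 7))
print(round(pairwise, 7))
```

The first result has one entry per row, while the second has one entry per pair of rows; copying the lower triangle to the upper one would correspond to what symmetrical = TRUE is documented to do.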
Details
Use setThreadOptions to change parallelization options; e.g., run RcppParallel::setThreadOptions(4) before a call to lma_simets to set the number of CPU threads to 4.
Examples
text <- c(
"words of speaker A", "more words from speaker A",
"words from speaker B", "more words from speaker B"
)
(dtm <- lma_dtm(text))
#> 4 x 7 sparse Matrix of class "dgCMatrix"
#> a b from more of speaker words
#> [1,] 1 . . . 1 1 1
#> [2,] 1 . 1 1 . 1 1
#> [3,] . 1 1 . . 1 1
#> [4,] . 1 1 1 . 1 1
# compare each entry
lma_simets(dtm)
#> $jaccard
#> 4 x 4 sparse Matrix of class "dtCMatrix" (unitriangular)
#>
#> [1,] I . . .
#> [2,] 0.5000000 I . .
#> [3,] 0.3333333 0.5000000 I .
#> [4,] 0.2857143 0.6666667 0.8 I
#>
#> $euclidean
#> 4 x 4 sparse Matrix of class "dtCMatrix" (unitriangular)
#>
#> [1,] I . . .
#> [2,] 0.3660254 I . .
#> [3,] 0.3333333 0.3660254 I .
#> [4,] 0.3090170 0.4142136 0.5 I
#>
#> $canberra
#> 4 x 4 sparse Matrix of class "dtCMatrix" (unitriangular)
#>
#> [1,] I . . .
#> [2,] 0.5714286 I . .
#> [3,] 0.4285714 0.5714286 I .
#> [4,] 0.2857143 0.7142857 0.8571429 I
#>
#> $cosine
#> 4 x 4 sparse Matrix of class "dtCMatrix" (unitriangular)
#>
#> [1,] I . . .
#> [2,] 0.6708204 I . .
#> [3,] 0.5000000 0.6708204 I .
#> [4,] 0.4472136 0.8000000 0.8944272 I
#>
#> $pearson
#> 4 x 4 sparse Matrix of class "dtCMatrix" (unitriangular)
#>
#> [1,] I . . .
#> [2,] 0.09128709 I . .
#> [3,] -0.16666667 0.09128709 I .
#> [4,] -0.54772256 0.30000000 0.7302967 I
#>
#> attr(,"time")
#> simets
#> 0
# compare each entry with the mean of all entries
lma_simets(dtm, colMeans(dtm))
#> jaccard euclidean canberra cosine pearson
#> 1 0.5714286 0.4220645 0.4380952 0.7484552 0.1964186
#> 2 0.7142857 0.5166852 0.5986395 0.9128709 0.6454972
#> 3 0.5714286 0.5166852 0.5034014 0.8845380 0.7463905
#> 4 0.7142857 0.5166852 0.5986395 0.9128709 0.6454972
# compare by group (corresponding to speakers and turns in this case)
speaker <- c("A", "A", "B", "B")
## by default, consecutive rows from the same group are averaged:
lma_simets(dtm, group = speaker)
#> jaccard euclidean canberra cosine pearson
#> 1, 2 <-> 3, 4 0.5714286 0.3874259 0.5238095 0.6888467 -0.1324532
## with agg = FALSE, only the rows at the boundary between
## groups (rows 2 and 3 in this case) are used:
lma_simets(dtm, group = speaker, agg = FALSE)
#> jaccard euclidean canberra cosine pearson
#> 2 <-> 3 0.5 0.3660254 0.5714286 0.6708204 0.09128709
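The two group comparisons above can be reproduced by hand for the cosine and euclidean metrics. This base R sketch is an illustration under stated assumptions, not the package's code: cosine and euclidean are hypothetical helpers written from the formulas in the Arguments section, the dense dtm is copied from the printed matrix, and rows are selected by speaker label, which matches "consecutive rows from the same group" here only because each speaker has a single run of rows.

```r
# Hypothetical helpers, written from the formulas in the metric argument.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
euclidean <- function(a, b) 1 / (1 + sqrt(sum((a - b)^2)))

# Dense copy of the document-term matrix printed above.
dtm <- matrix(c(
  1, 0, 0, 0, 1, 1, 1,
  1, 0, 1, 1, 0, 1, 1,
  0, 1, 1, 0, 0, 1, 1,
  0, 1, 1, 1, 0, 1, 1
), nrow = 4, byrow = TRUE)
speaker <- c("A", "A", "B", "B")

# Default behavior as documented (agg = TRUE, agg.mean = TRUE):
# average each speaker's rows, then compare the two averages.
mean_a <- colMeans(dtm[speaker == "A", , drop = FALSE])
mean_b <- colMeans(dtm[speaker == "B", , drop = FALSE])
agg_sims <- c(
  euclidean = euclidean(mean_a, mean_b),
  cosine = cosine(mean_a, mean_b)
)

# agg = FALSE as documented: compare only the rows at the group boundary
# (the last row of speaker A with the first row of speaker B).
boundary_sims <- c(
  euclidean = euclidean(dtm[2, ], dtm[3, ]),
  cosine = cosine(dtm[2, ], dtm[3, ])
)

print(round(agg_sims, 7))
print(round(boundary_sims, 7))
```

Under these assumptions, agg_sims should agree with the euclidean and cosine columns of the "1, 2 <-> 3, 4" row, and boundary_sims with those of the "2 <-> 3" row.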