Clusters columns in a matrix based on a very minimal algorithm:
Start with the column with the biggest sum / smallest other-cluster weight.
Calculate its correlation with all other columns, and use that to define a cluster.
Repeat with all unassigned columns.
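The steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the name greedy_cluster_sketch is hypothetical, the seed is chosen by column sum only (the other-cluster weight is ignored), and a column is assumed to join a cluster when its correlation with the seed is strictly above the co quantile of all correlations with that seed.

```r
# Hypothetical helper, not part of the package: a greedy sketch of the
# described steps using base R only.
greedy_cluster_sketch <- function(m, k = nrow(m), minterm = 2, co = 0.975) {
  m <- as.matrix(m)        # assumes a small matrix that fits in memory densely
  pool <- colnames(m)      # columns that are not yet assigned to a cluster
  clusters <- list()
  while (length(clusters) < k && length(pool) >= minterm) {
    sub <- m[, pool, drop = FALSE]
    # Seed: the unassigned column with the biggest sum (first one on ties).
    seed <- pool[which.max(colSums(sub))]
    # Correlation of every unassigned column with the seed column.
    r <- cor(sub, sub[, seed])[, 1]
    r[is.na(r)] <- 0       # assumption: constant columns count as uncorrelated
    # Assumed membership rule: strictly above the co quantile.
    members <- pool[r > quantile(r, co, names = FALSE)]
    if (length(members) < minterm) break   # too small to count as a cluster
    clusters[[length(clusters) + 1]] <- members
    pool <- setdiff(pool, members)         # remove assigned columns from the pool
  }
  clusters
}

# Same columns as the matrix in the Examples section, as a base matrix.
m <- cbind(
  cluster1_term1 = c(1, 1, 0, 0),
  cluster1_term2 = c(1, 0, 0, 0),
  cluster2_term2 = c(0, 0, 0, 1),
  cluster2_term1 = c(0, 0, 1, 1),
  cluster3_term1 = c(1, 0, 0, 1),
  cluster4_term1 = c(0, 1, 1, 0)
)
res <- greedy_cluster_sketch(m, co = 0.6)
```

With co = 0.6 on this matrix, the sketch groups the two cluster1 columns, then the two cluster2 columns, and stops because the next candidate cluster would contain a single column, which is below minterm.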
taffyInf differs in that it will not eliminate columns from the pool, but will base the selection of an initial term on the weight across all previously assigned clusters.
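The variant that keeps every column in the pool might look like the sketch below. It is an illustration only: the name weight_cluster_sketch is hypothetical, and the rule for picking each later seed (the column with the smallest summed weight across the clusters built so far) is an assumption about what "weight across all previously assigned clusters" means, not something the documentation confirms.

```r
# Hypothetical helper, not part of the package: every column receives a
# weight in every cluster, so columns can belong to more than one cluster.
weight_cluster_sketch <- function(m, k = 2) {
  m <- as.matrix(m)
  w <- matrix(
    0, ncol(m), k,
    dimnames = list(colnames(m), as.character(seq_len(k)))
  )
  seed <- which.max(colSums(m))          # first seed: biggest column sum
  for (i in seq_len(k)) {
    # Weights for cluster i: correlation of each column with the seed column.
    w[, i] <- cor(m, m[, seed])[, 1]
    # Assumption: the next seed is the column with the smallest summed
    # weight across the clusters defined so far.
    seed <- which.min(rowSums(w[, seq_len(i), drop = FALSE]))
  }
  w
}

# Same columns as the matrix in the Examples section, as a base matrix.
m <- cbind(
  cluster1_term1 = c(1, 1, 0, 0),
  cluster1_term2 = c(1, 0, 0, 0),
  cluster2_term2 = c(0, 0, 0, 1),
  cluster2_term1 = c(0, 0, 1, 1),
  cluster3_term1 = c(1, 0, 0, 1),
  cluster4_term1 = c(0, 1, 1, 0)
)
w <- weight_cluster_sketch(m, k = 2)
```

Here the first cluster is seeded by cluster1_term1 and the second by cluster2_term1, the column most negatively weighted in the first cluster, so the two weight columns are mirror images of each other for this matrix.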
Usage
taffy(m, k = nrow(m), minterm = 2, co = 0.975)
taffyInf(m, k = 2)
Arguments
- m
A numeric matrix with column names.
- k
Number of clusters to look for. This will be the maximum number for taffy, which will stop when the number of columns in the cluster is less than minterm. For taffyInf this will always be the number returned, and columns may repeat between clusters.
- minterm
Minimum number of columns a cluster must have to be considered a cluster.
- co
Quantile-based cutoff used to assign columns to a cluster.
Value
A list with vectors of column names (taffy), or a matrix of weights, with a column for each cluster and a row for each column (taffyInf).
Examples
library(Matrix) # provides Matrix()
m <- Matrix(as.matrix(data.frame(
cluster1_term1 = c(1, 1, 0, 0),
cluster1_term2 = c(1, 0, 0, 0),
cluster2_term2 = c(0, 0, 0, 1),
cluster2_term1 = c(0, 0, 1, 1),
cluster3_term1 = c(1, 0, 0, 1),
cluster4_term1 = c(0, 1, 1, 0)
)))
taffy(m, co = .6)
#> [[1]]
#> [1] "cluster1_term1" "cluster1_term2"
#>
#> [[2]]
#> [1] "cluster2_term2" "cluster2_term1"
#>
taffyInf(m)
#>                         1          2
#> cluster1_term1  1.0000000 -1.0000000
#> cluster1_term2  0.5773503 -0.5773503
#> cluster2_term2 -0.5773503  0.5773503
#> cluster2_term1 -1.0000000  1.0000000
#> cluster3_term1  0.0000000  0.0000000
#> cluster4_term1  0.0000000  0.0000000