Simple Clustering — taffy • lusilab

Clusters columns in a matrix based on a very minimal algorithm:

1. Start with the column that has the largest sum or the smallest other-cluster weight.
2. Calculate its correlation with all other columns, and use those correlations to define a cluster.
3. Repeat with the columns that remain unassigned.
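
The steps above can be sketched in base R as follows. This is only an illustrative sketch, not the lusilab source: the name simple_cluster_sketch is hypothetical, seeds are picked by column sum alone, and cluster membership is assumed to mean a correlation strictly above the co quantile of the seed's correlations.

```r
# Hypothetical sketch of the greedy procedure described above;
# NOT the actual taffy implementation.
simple_cluster_sketch <- function(m, k = nrow(m), minterm = 2, co = 0.975) {
  m <- as.matrix(m)
  pool <- colnames(m)   # columns not yet assigned to a cluster
  clusters <- list()
  while (length(clusters) < k && length(pool) >= minterm) {
    sub <- m[, pool, drop = FALSE]
    # Step 1: seed with the unassigned column that has the largest sum.
    seed <- pool[which.max(colSums(sub))]
    # Step 2: correlate the seed with every column still in the pool.
    r <- suppressWarnings(cor(sub[, seed], sub))[1, ]
    r[is.na(r)] <- 0    # constant columns have no defined correlation
    # Assumption: members are columns strictly above the co quantile.
    members <- pool[r > quantile(r, co, names = FALSE)]
    if (length(members) < minterm) break
    clusters[[length(clusters) + 1]] <- members
    # Step 3: remove assigned columns and repeat.
    pool <- setdiff(pool, members)
  }
  clusters
}

# Same values as the matrix in the Examples section, as a base matrix.
m <- cbind(
  cluster1_term1 = c(1, 1, 0, 0),
  cluster1_term2 = c(1, 0, 0, 0),
  cluster2_term2 = c(0, 0, 0, 1),
  cluster2_term1 = c(0, 0, 1, 1),
  cluster3_term1 = c(1, 0, 0, 1),
  cluster4_term1 = c(0, 1, 1, 0)
)
res <- simple_cluster_sketch(m, co = 0.6)
```

On this small matrix the sketch groups the two cluster1 columns and the two cluster2 columns, then stops because the next candidate cluster would have fewer than minterm columns. Agreement with taffy on other inputs is not implied.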

taffyInf differs in that it does not remove columns from the pool; instead, it selects each initial term based on its weight across all previously assigned clusters.
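
A non-eliminating variant along these lines might look like the sketch below. It is an assumption-laden illustration, not the taffyInf source: the name soft_cluster_sketch is hypothetical, and the seed score (column sum minus the column's total weight in earlier clusters) is one guess at how "weight across all previously assigned clusters" could enter the selection.

```r
# Hypothetical sketch of a non-eliminating variant;
# NOT the actual taffyInf implementation.
soft_cluster_sketch <- function(m, k = 2) {
  m <- as.matrix(m)
  weights <- matrix(
    0, ncol(m), k,
    dimnames = list(colnames(m), as.character(seq_len(k)))
  )
  for (i in seq_len(k)) {
    # Assumed scoring rule: favour large column sums, penalise columns
    # that already carry weight in previously built clusters.
    prior <- rowSums(weights[, seq_len(i - 1), drop = FALSE])
    seed <- which.max(colSums(m) - prior)
    # Every column stays in the pool and receives a weight in each
    # cluster: its correlation with that cluster's seed column.
    r <- suppressWarnings(cor(m[, seed], m))[1, ]
    r[is.na(r)] <- 0    # constant columns have no defined correlation
    weights[, i] <- r
  }
  weights
}

# Same values as the matrix in the Examples section, as a base matrix.
m <- cbind(
  cluster1_term1 = c(1, 1, 0, 0),
  cluster1_term2 = c(1, 0, 0, 0),
  cluster2_term2 = c(0, 0, 0, 1),
  cluster2_term1 = c(0, 0, 1, 1),
  cluster3_term1 = c(1, 0, 0, 1),
  cluster4_term1 = c(0, 1, 1, 0)
)
w <- soft_cluster_sketch(m, k = 2)
```

Because no column is removed, the result always has k weight columns, and a column can carry substantial weight in more than one cluster.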

Usage

taffy(m, k = nrow(m), minterm = 2, co = 0.975)

taffyInf(m, k = 2)

Arguments

m: A numeric matrix with column names.
k: Number of clusters to look for. For taffy, this is the maximum; it stops early when a cluster has fewer than minterm columns. For taffyInf, exactly this many clusters are always returned, and columns may appear in more than one cluster.
minterm: Minimum number of columns a cluster must have to be considered a cluster.
co: Quantile-based cutoff used to assign columns to a cluster.

Value

A list of vectors of column names (taffy), or a matrix of weights with a column for each cluster and a row for each column of m (taffyInf).

Examples

library(Matrix) # provides Matrix(); may already be attached alongside lusilab
m <- Matrix(as.matrix(data.frame(
  cluster1_term1 = c(1, 1, 0, 0),
  cluster1_term2 = c(1, 0, 0, 0),
  cluster2_term2 = c(0, 0, 0, 1),
  cluster2_term1 = c(0, 0, 1, 1),
  cluster3_term1 = c(1, 0, 0, 1),
  cluster4_term1 = c(0, 1, 1, 0)
)))
taffy(m, co = .6)
#> [[1]]
#> [1] "cluster1_term1" "cluster1_term2"
#> 
#> [[2]]
#> [1] "cluster2_term2" "cluster2_term1"
#> 
taffyInf(m)
#>                         1          2
#> cluster1_term1  1.0000000 -1.0000000
#> cluster1_term2  0.5773503 -0.5773503
#> cluster2_term2 -0.5773503  0.5773503
#> cluster2_term1 -1.0000000  1.0000000
#> cluster3_term1  0.0000000  0.0000000
#> cluster4_term1  0.0000000  0.0000000