Walks through the process of creating and assessing a dictionary.

Built with R 4.3.2


Background

Dictionaries in this context are sets of term lists that each represent a category. Categories could range from concepts or topics (e.g., “furniture” or “shopping”) to more abstract aspects of the text (e.g., “emotionality”). Terms may be single literal words (e.g., “term”, “terms”), glob-like or fuzzy words (e.g., “term*”, where the asterisk matches any number of additional characters), or arbitrary patterns (literal or regular expressions, e.g., “a phrase” or “an? (?:person|object)”).
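
As a rough illustration of how these three kinds of terms behave, the sketch below uses only base R rather than lingmatch's own matching (which may differ in its details). The `tokens` and `phrases` vectors are made up for the example.

```r
# hypothetical tokens and phrases to match against
tokens <- c("term", "terms", "terminal", "determine")
phrases <- c("a person", "an object", "a thing")

# literal terms only match identical tokens
literal_hits <- tokens %in% c("term", "terms")

# a glob-like term can be translated to a regular expression;
# "term*" then matches any token that starts with "term"
glob_pattern <- utils::glob2rx("term*")
glob_hits <- grepl(glob_pattern, tokens)

# an arbitrary pattern is applied as a regular expression;
# perl = TRUE is used here for the non-capturing group
regex_hits <- grepl("an? (?:person|object)", phrases, perl = TRUE)

data.frame(tokens, literal_hits, glob_hits)
data.frame(phrases, regex_hits)
```

Note that the glob-like term also catches “terminal”, which is the kind of over-extension that assessment is meant to surface.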

The most straightforward way to make a dictionary is to manually assign terms to categories, but dictionaries can also be created with data, either by extracting cluster-like structures and assigning them category names (using unsupervised learning methods; see the introduction to word vectors article), or by training a classifier to distinguish between provided tags (using supervised learning methods; see the introduction to text classification article).

Implementation

In lingmatch, dictionaries are ultimately implemented as either lists or data.frames.

As lists, categories are named entries with terms as character vectors:

(dict_unweighted <- list(
  a = c("aa", "ab", "ac"),
  b = c("ba", "bb", "bc")
))
#> $a
#> [1] "aa" "ab" "ac"
#> 
#> $b
#> [1] "ba" "bb" "bc"

As data.frames, terms are stored in a single column, and categories are defined by columns containing weights:

(dict_dataframe <- data.frame(
  term = c("aa", "ab", "ac", "ba", "bb", "bc"),
  a = c(1, .9, .8, 0, 0, 0),
  b = c(0, 0, 0, .9, 1, .8)
))
#>   term   a   b
#> 1   aa 1.0 0.0
#> 2   ab 0.9 0.0
#> 3   ac 0.8 0.0
#> 4   ba 0.0 0.9
#> 5   bb 0.0 1.0
#> 6   bc 0.0 0.8

Lists can also store weights in named numeric vectors:

(dict_list <- list(
  a = c("aa" = 1, "ab" = .9, "ac" = .8),
  b = c("ba" = .9, "bb" = 1, "bc" = .8)
))
#> $a
#>  aa  ab  ac 
#> 1.0 0.9 0.8 
#> 
#> $b
#>  ba  bb  bc 
#> 0.9 1.0 0.8

Weighted dictionaries can be discretized using the read.dic function:

read.dic(dict_dataframe, as.weighted = FALSE)
#> $a
#> [1] "aa" "ab" "ac"
#> 
#> $b
#> [1] "ba" "bb" "bc"

And un-weighted dictionaries can be converted to a weighted format with binary weights:

read.dic(dict_unweighted, as.weighted = TRUE)
#>   term a b
#> 1   aa 1 0
#> 2   ab 1 0
#> 3   ac 1 0
#> 4   ba 0 1
#> 5   bb 0 1
#> 6   bc 0 1
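
To make the relationship between the two formats concrete, here is a minimal base R sketch of the same conversions. It is not how read.dic is implemented; it only assumes that a non-zero weight means a term belongs to a category.

```r
dict_unweighted <- list(
  a = c("aa", "ab", "ac"),
  b = c("ba", "bb", "bc")
)

# list -> data.frame with binary weights
terms <- unique(unlist(dict_unweighted, use.names = FALSE))
weights <- vapply(
  dict_unweighted,
  function(category_terms) as.numeric(terms %in% category_terms),
  numeric(length(terms))
)
dict_binary <- data.frame(term = terms, weights)

# data.frame -> list, keeping terms with non-zero weights
dict_discrete <- lapply(
  dict_binary[, -1],
  function(weight) dict_binary$term[weight != 0]
)

dict_binary
dict_discrete
```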

Dictionaries can also be read in from LIWC’s .dic format:

# this can also be written to and read from a file
raw_dic <- write.dic(dict_unweighted, save = FALSE)
cat(raw_dic)
#> %
#> 1    a
#> 2    b
#> %
#> aa   1
#> ab   1
#> ac   1
#> ba   2
#> bb   2
#> bc   2
read.dic(raw = raw_dic)
#> $a
#> [1] "aa" "ab" "ac"
#> 
#> $b
#> [1] "ba" "bb" "bc"
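
The layout printed above (a header of category IDs and names between two `%` lines, followed by terms with their category IDs) can be parsed by hand. The following is only a sketch in base R, assuming whitespace-separated fields and terms without spaces; read.dic should be preferred for real files.

```r
# a small .dic-style string, written out by hand for this example
raw <- "%\n1\ta\n2\tb\n%\naa\t1\nab\t1\nac\t1\nba\t2\nbb\t2\nbc\t2"
lines <- strsplit(raw, "\n", fixed = TRUE)[[1]]

# the two "%" lines delimit the header
marks <- which(trimws(lines) == "%")
header <- strsplit(lines[(marks[1] + 1):(marks[2] - 1)], "\\s+")
ids <- vapply(header, function(entry) entry[1], "")
category_names <- vapply(header, function(entry) entry[2], "")

# each body line is a term followed by one or more category IDs
body <- strsplit(lines[(marks[2] + 1):length(lines)], "\\s+")
parsed <- lapply(stats::setNames(ids, category_names), function(id) {
  in_category <- vapply(body, function(entry) id %in% entry[-1], logical(1))
  vapply(body[in_category], function(entry) entry[1], "")
})

parsed
```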

For use in web applications, JavaScript Object Notation (JSON) may be a useful format, and lists can be converted to it directly:

cat(jsonlite::toJSON(dict_unweighted, pretty = TRUE))
#> {
#>   "a": ["aa", "ab", "ac"],
#>   "b": ["ba", "bb", "bc"]
#> }

.dic or JSON dictionaries can be used in the adicat highlighter (from the menu: Dictionary > load/create/edit > load), which can be used to see word matches, or process files.

Assessment

Dictionary categories can be thought of as measures of the construct identified by the category’s name. Constructs can range from being well defined by the text itself, to being more subtly embedded in the text.

Some questions we might ask when assessing a dictionary category are:

  1. How well will each term capture the word or words we have in mind?
    • That is, are there unaccounted for variants, or unintended wildcard matches?
  2. How well does this set of terms cover the possible instances of the construct?
    • That is, might the construct appear in a way that isn’t covered?
  3. How confident would we be that the resulting score reflects the construct?
    • That is, how much room is there for false positives due to varying contexts or word senses?

For instance, consider this small dictionary:

dict <- list(
  a_words = "a*",
  self_reference = c("i", "i'*", "me", "my", "mine", "myself"),
  furniture = c("table", "chair", "desk*", "couch*", "sofa*"),
  well_adjusted = c("happy", "bright*", "friend*", "she", "he", "they")
)

Character Variants

The a_words category is defined only by characters, so its scores can be expected to align perfectly with any other reliable means of scoring (such as a human counter). The only threat to this category (assuming texts are lowercased) is special characters that should count as “a”s, but converting such characters could be part of pre-processing:

clean <- lma_dict("special", as.function = gsub)
lma_process(clean("Àn apple and à potato ærosol."), dict = dict[1], meta = FALSE)
#>                             text a_words
#> 1 an apple and a potato aerosol.       5
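
The same idea can be sketched without lingmatch. This is only a rough base R approximation, assuming a hand-written replacement table (`from` and `to` are made up for this example and cover just the three special characters in the text above) and a simple tokenizer that splits on anything other than letters and apostrophes.

```r
# the example text, with special characters written as Unicode escapes
text <- "\u00c0n apple and \u00e0 potato \u00e6rosol."

# hypothetical replacement table for this example only
from <- c("\u00c0", "\u00e0", "\u00e6")
to <- c("A", "a", "ae")

cleaned <- text
for (i in seq_along(from)) {
  cleaned <- gsub(from[i], to[i], cleaned, fixed = TRUE)
}

# lowercase, split into tokens, and count those starting with "a"
tokens <- strsplit(tolower(cleaned), "[^a-z']+")[[1]]
tokens <- tokens[tokens != ""]
a_count <- sum(startsWith(tokens, "a"))

cleaned
a_count
```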

Use Variants

The self_reference category is made up of words, so in addition to possible character variants, there are spelling and formatting variants to account for. Here, “i'*” is particularly vulnerable, since the apostrophe may be curly or omitted. A new danger introduced in this category is false positives due to alternative uses of “i” (e.g., as a list item label) and alternate senses of “mine”. These issues can make the automatic score differ from the one a human would theoretically calculate:

lma_process(
  c("I) A mine.", "Mmeee! idk how but imma try!"),
  dict = dict[2], meta = FALSE
)
#>                           text self_reference
#> 1                   I) A mine.              2
#> 2 Mmeee! idk how but imma try!              0
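
The apostrophe issue in particular can be shown in isolation. This sketch uses base R on three made-up tokens, and assumes that straightening curly apostrophes is an acceptable pre-processing step for the texts at hand. It does nothing for the omitted-apostrophe case.

```r
# tokens with a straight, curly, and omitted apostrophe
tokens <- c("i'm", "i\u2019m", "im")
pattern <- utils::glob2rx("i'*")

# only the straight-apostrophe form matches as-is
hits_raw <- grepl(pattern, tokens)

# replacing curly apostrophes recovers the second form
tokens_straight <- gsub("\u2019", "'", tokens, fixed = TRUE)
hits_normalized <- grepl(pattern, tokens_straight)

rbind(hits_raw, hits_normalized)
```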

Coverage

Term Variants

The furniture and well_adjusted categories introduce two main additional considerations. First, they use broader wildcards, which are probably intended simply to catch plural forms, but are in danger of over-extending. We can use the report_term_matches function to check this:

report <- report_term_matches(dict[3:4], space_dir = "~/Latent Semantic Spaces")
#> preparing dict (0)
#> extracting matches (0)
#> preparing results (0.16)
#> done (0.16)

knitr::kable(report[, c("term", "categories", "variants", "matches")])
term categories variants matches
bright* well_adjusted 63 bright, brights, brighter, brighton, brighten, brightly, brightway, brightley, brightest, brightens, brightful, brighttag, brightman, brighting, brightener, brightling, brightstar, brightwell, brighteyes, brightwork, brightbill, brightmail, brightside, brightened, brightness, brightwood, brightline, bright-red, brightmoor, brightview, brightroll, brightcove, brightkite, brightspot, bright-eyed, brightfield, brightpoint, brightwater, brightwells, brightening, brighthouse, brightscope, brighteners, brightsolid, brightlight, bright-blue, brightblack, brightspark, brightstone, brightworks, bright-field, brightnesses, bright-green, brightwaters, brightsource, brightly-lit, brightonians, bright-yellow, bright-orange, brightlingsea, bright-colored, brightly-colored, brightly-coloured
friend* well_adjusted 38 friend, friende, friends, friendz, friendy, friendo, frienda, friended, friendly, friendlly, friendlys, friending, friendless, friendlier, friendship, friendster, friendfeed, friendlily, friendlies, friendzone, friendlist, friendcode, friendships, friendliest, friendzoned, friend-zone, friendswood, friendstream, friendlyness, friendliness, friendsville, friendraising, friendly-fire, friendsgiving, friends/family, friendlessness, friendsreunited, friend-of-the-court
desk* furniture 23 desk, deska, desko, desks, deskew, deskop, desker, deskset, deskman, desking, deskpro, desktop, deskjet, deskins, desk-top, desktops, deskside, deskstar, deskilled, deskbound, desk-bound, deskilling, desktop-publishing
couch* furniture 18 couch, couche, couches, coucher, couched, couchdb, couchie, couchois, couchers, couchman, couchant, couching, couchette, couchiching, couchpotato, couch-potato, couchsurfers, couchsurfing
sofa* furniture 9 sofa, sofas, sofar, sofast, sofaer, sofala, sofabed, sofamor, sofa-bed
table furniture 1 table
chair furniture 1 chair
happy well_adjusted 1 happy
she well_adjusted 1 she
he well_adjusted 1 he
they well_adjusted 1 they

By default, this searches for matches in a large set of common words found across latent semantic spaces (embeddings), but it can also be run on sets of text.
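
The core of this kind of report can be approximated in a few lines of base R. The sketch below is not the report_term_matches implementation; it assumes a small, hand-picked `vocabulary` (a few of the matches listed above) in place of the words drawn from latent semantic spaces.

```r
# hypothetical vocabulary standing in for a large set of common words
vocabulary <- c(
  "desk", "desks", "desktop", "deskjet",
  "couch", "couches", "couchdb",
  "sofa", "sofas", "sofar",
  "table", "chair"
)
furniture <- c("table", "chair", "desk*", "couch*", "sofa*")

# expand each term into the vocabulary entries it would match
expanded <- lapply(stats::setNames(furniture, furniture), function(term) {
  grep(utils::glob2rx(term), vocabulary, value = TRUE)
})
variants <- lengths(expanded)

variants
expanded[["desk*"]]
```

Running the same expansion over the tokens of a target corpus would show which unintended variants actually occur in the texts being scored.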

Category Coverage

Term Coverage

The second consideration is that these categories are trying to cover broad concepts, so there are likely to be obvious but overlooked terms to include. One thing we could do to improve this sort of coverage is search for similar words within a latent semantic space:

meta <- dictionary_meta(
  dict[3:4],
  suggest = TRUE, space_dir = "~/Latent Semantic Spaces"
)
#> preparing terms (0)
#> expanding terms (0.61)
#> loading space (0.8)
#> calculating term similarities (18.91)
#> identifying potential additions (31.61)
#> preparing results (34.03)
#> done (34.27)
meta$suggested
#> $furniture
#>    armchairs     loveseat       wallia    banquette       taiaha     cabriole 
#>   0.06596930   0.06543237   0.06279725   0.06192136   0.06116122   0.05991823 
#>      armless      settees        roset       eyalet        aeron   sporophila 
#>   0.05951132   0.05949731   0.05883960   0.05879392   0.05864686   0.05832045 
#>        futon  upholstered  workstation    siphuncle       kosuth       settee 
#>   0.05830094   0.05736860   0.05715411   0.05704506   0.05700139   0.05680740 
#>      chaises        phyfe workstations       beylik    collinder   satavahana 
#>   0.05649333   0.05624031   0.05619739   0.05596926   0.05586822   0.05566318 
#>        ahmes 
#>   0.05548118 
#> 
#> $well_adjusted
#>       hope      smile       glad     smiles       glow       love   cheerful 
#> 0.09415846 0.08316053 0.07921073 0.07897790 0.07880737 0.07644744 0.07598900 
#>       eyes      shine   thankful    excited        way     wishes     coming 
#> 0.07564743 0.07539943 0.07492050 0.07490184 0.07400350 0.07392510 0.07339015 
#>       wish   youthful    shining  celebrate     loving       know     joyful 
#> 0.07336494 0.07320441 0.07310913 0.07309839 0.07264139 0.07229958 0.07185999 
#>      hopes      bring       life     shines     hoping   facebook     honest 
#> 0.07158810 0.07142593 0.07138101 0.07119462 0.07084740 0.07082714 0.07076935 
#>    welcome      loved        say       tell      young       face      merry 
#> 0.07073722 0.07052697 0.07044166 0.07035872 0.07023679 0.07019802 0.07013483 
#>       sure     joyous    promise   sunshine   grateful       grow       come 
#> 0.07003938 0.06991035 0.06983251 0.06950521 0.06929901 0.06924857 0.06920617 
#>        sad     better        fun      shone        let  promising       turn 
#> 0.06918161 0.06916552 0.06908122 0.06876386 0.06872918 0.06868631 0.06855629 
#>     seeing     summer        eye       meet     hearts        'll    vibrant 
#> 0.06853710 0.06853068 0.06850545 0.06840545 0.06837274 0.06829185 0.06823692 
#>     helped      thank    blessed      light      faces      sweet   remember 
#> 0.06808839 0.06788728 0.06779164 0.06777502 0.06749256 0.06747807 0.06745387 
#>     little   sparkles       look      alive    glowing       past    sincere 
#> 0.06740187 0.06738964 0.06736148 0.06735414 0.06733943 0.06733193 0.06727468 
#>      proud       miss        sky        're       make     thanks      earth 
#> 0.06724195 0.06719751 0.06713858 0.06711578 0.06707854 0.06703660 0.06700477 
#>        day         's       last     caring      cheer   positive        new 
#> 0.06686473 0.06682727 0.06682465 0.06678533 0.06674122 0.06668531 0.06661365 
#>       true    helping    younger  wonderful    hopeful  surprised       knew 
#> 0.06658897 0.06655671 0.06655004 0.06654981 0.06650599 0.06631459 0.06616553 
#>     chance       fair       give    attract       ones    believe    growing 
#> 0.06615762 0.06605126 0.06586211 0.06578738 0.06568497 0.06562074 0.06556792 
#>    healthy   befriend    looking      night 
#> 0.06545284 0.06540390 0.06540297 0.06536912

We can also use the space to assess category cohesiveness by looking at summaries of pairwise cosine similarities between terms:

knitr::kable(meta$summary[, -1], digits = 3)
              n_terms n_expanded sim.space   sim.min sim.q1 sim.median sim.mean sim.q3 sim.max
furniture           5         34 glove_crawl  -0.034  0.015      0.035    0.052  0.085   0.152
well_adjusted       6         47 glove_crawl  -0.014  0.017      0.077    0.081  0.137   0.183
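
For reference, the quantities being summarized are pairwise cosine similarities between term vectors. The sketch below computes them in base R on made-up three-dimensional vectors; real spaces have hundreds of dimensions, and dictionary_meta may differ in how it prepares the vectors.

```r
# made-up vectors standing in for rows of a latent semantic space
vectors <- rbind(
  table = c(.9, .1, .2),
  chair = c(.8, .2, .1),
  desktop = c(.1, .9, .3)
)

# cosine similarity: dot products divided by the product of vector lengths
norms <- sqrt(rowSums(vectors^2))
similarity <- tcrossprod(vectors) / tcrossprod(norms)

# each pair appears once in the upper triangle
pairwise <- similarity[upper.tri(similarity)]

round(similarity, 3)
summary(pairwise)
```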

Or look at those similarities within categories and expanded terms:

knitr::kable(
  meta$terms[meta$terms$category == "furniture", ],
  digits = 3, row.names = FALSE
)
category term match sim.term sim.category
furniture table table 1.000 0.520
furniture chair chair 1.000 0.644
furniture desk* desk-top 0.218 0.020
furniture desk* desk 1.000 0.543
furniture desk* desking 0.229 0.215
furniture desk* deskpro 0.067 -0.030
furniture desk* deskilled -0.050 -0.139
furniture desk* desktop 0.481 0.197
furniture desk* desktops 0.266 0.120
furniture desk* desks 0.706 0.488
furniture desk* deskjet 0.067 0.014
furniture desk* deskbound -0.033 0.040
furniture desk* deskins -0.074 -0.093
furniture desk* deskilling -0.111 -0.127
furniture desk* desker -0.088 -0.101
furniture desk* deskside 0.102 -0.018
furniture desk* deskstar 0.032 -0.078
furniture couch* couchdb 0.089 0.048
furniture couch* couchant 0.040 0.002
furniture couch* couchsurfing 0.116 0.046
furniture couch* couche 0.059 0.043
furniture couch* couchette 0.156 0.139
furniture couch* couchman 0.004 0.021
furniture couch* couching 0.098 0.026
furniture couch* couched 0.009 -0.023
furniture couch* coucher 0.054 0.136
furniture couch* couches 0.615 0.643
furniture couch* couch 1.000 0.778
furniture sofa* sofaer -0.175 -0.175
furniture sofa* sofabed 0.493 0.493
furniture sofa* sofala -0.017 -0.017
furniture sofa* sofas 0.738 0.738
furniture sofa* sofar -0.095 -0.095
furniture sofa* sofa 1.000 1.000

And we can visualize this together with the most similar suggested terms as a network:

library(visNetwork)

display_network <- function(meta, cat = 1, n = 10, min = .1, seed = 2080) {
  cat_name <- meta$summary$category[[cat]]
  top_suggested <- meta$suggested[[cat_name]][seq_len(n)]
  terms <- meta$expanded[[cat_name]]
  nodes <- data.frame(
    id = c(terms, names(top_suggested)),
    label = c(terms, names(top_suggested)),
    group = rep(
      c("original", "suggested"),
      c(length(terms), length(top_suggested))
    ),
    shape = "box"
  )
  suggested_sim <- lma_simets(lma_lspace(
    nodes$id,
    space = meta$summary$sim.space[[1]]
  ), metric = "cosine")
  nodes$size <- rowMeans(suggested_sim)
  edges <- data.frame(
    from = rep(colnames(suggested_sim), each = nrow(suggested_sim)),
    to = rep(rownames(suggested_sim), nrow(suggested_sim)),
    value = as.numeric(suggested_sim),
    title = as.numeric(suggested_sim)
  )

  visNetwork(
    nodes, within(
      edges[edges$value > min & edges$value < 1, ], value <- (value * 10)^4
    )
  ) |>
    visEdges("title", smooth = FALSE, color = list(opacity = .6)) |>
    visLegend(width = .07) |>
    visLayout(randomSeed = seed) |>
    visPhysics("barnesHut", timestep = .1) |>
    visInteraction(
      dragNodes = TRUE, dragView = TRUE, hover = TRUE, hoverConnectedEdges = TRUE,
      selectable = TRUE, tooltipDelay = 100, tooltipStay = 100
    )
}

display_network(meta)

Or we could look at terms across categories within a dimensionally-reduced version of the space:

library(plotly)

display_reduced_space <- function(
    meta, space_name = "glove_crawl", method = "umap", dim_prop = 1,
    color_seeds = c("#25cb1a", "#c8fd9e", "#1b85ed", "#91b8fb")) {
  suggestions <- unlist(unname(meta$suggested))
  terms <- rbind(meta$terms[, c("category", "match", "sim.category")], data.frame(
    category = rep(
      paste0(names(meta$suggested), "_suggested"), vapply(meta$suggested, length, 0)
    ),
    match = names(suggestions),
    sim.category = suggestions
  ))
  space <- lma_lspace(terms$match, space_name)
  space <- space[, Reduce(unique, lapply(meta$expanded, function(l) {
    order(-colMeans(space[l, ]))[seq_len(min(ncol(space), ncol(space) * dim_prop))]
  }))]
  st <- proc.time()[[3]]
  m <- as.data.frame(if (method == "umap") {
    uwot::umap(space, 15, 3, metric = "cosine")
  } else if (method == "taffy") {
    m <- lusilab::taffyInf(lma_simets(space, metric = "cosine"), 3)
    colnames(m) <- c("V1", "V2", "V3")
    m
  } else if (method == "kmeans") {
    m <- t(kmeans(lma_simets(space, metric = "cosine"), 3)$centers)
    colnames(m) <- c("V1", "V2", "V3")
    m
  } else if (method == "svd") {
    m <- svd(lma_simets(space, metric = "cosine"), 3)$u
    rownames(m) <- rownames(space)
    m
  } else {
    m <- prcomp(lma_simets(space, metric = "cosine"))$rotation[, 1:3]
    dimnames(m) <- list(rownames(space), paste0("V", 1:3))
    m
  })
  message(
    "reduced space via ", method, " method in ",
    round(proc.time()[[3]] - st, 4), " seconds"
  )
  d <- cbind(terms, m)
  d$color <- color_seeds[as.numeric(as.factor(d$category))]
  ds <- split(d, d$category)
  p <- plot_ly(
    do.call(rbind, ds),
    x = ~V1, y = ~V2, z = ~V3, textfont = list(size = 9)
  ) |>
    layout(
      showlegend = TRUE, paper_bgcolor = "#000000", font = list(color = "#ffffff"),
      margin = list(r = 0, b = 0, l = 0)
    )
  for (cat in names(ds)) {
    p <- p |> add_text(
      data = ds[[cat]], text = ~match, name = cat, textfont = list(color = ~color)
    )
  }
  p
}

display_reduced_space(meta)
#> reduced space via umap method in 1.36 seconds

Here it seems the suggested terms strengthen cores of related terms within categories, leaving unrelated terms to group together between categories.

Instead of looking at pairwise comparisons, it might also make sense to compare with category centroids:

meta_centroid <- dictionary_meta(
  dict[3:4],
  pairwise = FALSE, suggest = TRUE, space_dir = "~/Latent Semantic Spaces"
)
#> preparing terms (0)
#> expanding terms (0.27)
#> loading space (0.44)
#> calculating term similarities (18.74)
#> identifying potential additions (22.04)
#> preparing results (24.35)
#> done (24.71)

meta_centroid$suggested
#> $furniture
#>    armchairs     loveseat        futon      armless  upholstered  workstation 
#>    0.2517359    0.2474119    0.2238488    0.2218516    0.2198026    0.2195312 
#>    banquette      settees workstations       settee       chaise        aeron 
#>    0.2191410    0.2169368    0.2152506    0.2146026    0.2130485    0.2088532 
#>      chaises     recliner     cabriole     credenza       seater 
#>    0.2070558    0.2039399    0.2018048    0.2013844    0.1996919 
#> 
#> $well_adjusted
#>        hope       smile        glad        glow      smiles    thankful 
#>   0.2833616   0.2475967   0.2413412   0.2372855   0.2366536   0.2309778 
#>      wishes    youthful     excited        love    cheerful        eyes 
#>   0.2256275   0.2254221   0.2249745   0.2235930   0.2231660   0.2230449 
#>      honest       shine   celebrate      joyful     shining      joyous 
#>   0.2227368   0.2219152   0.2209088   0.2190795   0.2177220   0.2174133 
#>      loving      shines        wish       hopes      coming         way 
#>   0.2168939   0.2162090   0.2156694   0.2155521   0.2136403   0.2128437 
#>       shone    facebook        know         sad     sincere   promising 
#>   0.2127467   0.2123793   0.2121608   0.2120186   0.2111781   0.2108967 
#>      hoping       merry       loved    sparkles         say     promise 
#>   0.2105333   0.2101729   0.2101528   0.2099162   0.2094993   0.2093688 
#>    grateful        life     hopeful     blessed        tell    befriend 
#>   0.2085683   0.2080989   0.2079336   0.2078590   0.2070684   0.2064690 
#>      seeing      hearts    positive    feelings        face       young 
#>   0.2059093   0.2049330   0.2045796   0.2038124   0.2037544   0.2035204 
#>    sunshine     attract         eye       alive        grow      helped 
#>   0.2027419   0.2025035   0.2024249   0.2022952   0.2022153   0.2022134 
#>      caring     glowing     vibrant       proud   surprised   attracted 
#>   0.2020372   0.2019306   0.2018866   0.2013347   0.2009025   0.2008401 
#>        knew       thank       cheer        sure  beginnings        miss 
#>   0.2007124   0.2006891   0.2002314   0.2001261   0.1999750   0.1999670 
#>      better       bring        fair        past     younger    thrilled 
#>   0.1994686   0.1987255   0.1985127   0.1984401   0.1982115   0.1981868 
#>       earth     welcome       bless      summer       faces      afraid 
#>   0.1981317   0.1980738   0.1978137   0.1976383   0.1976157   0.1973665 
#>  prosperous        true       sweet         fun        grew   greetings 
#>   0.1970473   0.1970078   0.1968552   0.1968065   0.1967096   0.1967075 
#>  complexion  optimistic    hometown        turn    remember   fortunate 
#>   0.1966697   0.1963466   0.1963115   0.1961961   0.1961387   0.1960371 
#>      thanks         let     believe      chance        hurt  positively 
#>   0.1958364   0.1957286   0.1957258   0.1953343   0.1952308   0.1947304 
#>  remembered        hear encouraging        dear  enthusiasm 
#>   0.1943526   0.1942415   0.1939956   0.1939174   0.1938838
knitr::kable(meta_centroid$summary[, -1], digits = 3)
              n_terms n_expanded sim.space   sim.min sim.q1 sim.median sim.mean sim.q3 sim.max
furniture           5         34 glove_crawl  -0.093  0.018      0.096    0.146   0.19   0.675
well_adjusted       6         47 glove_crawl  -0.170  0.018      0.191    0.199   0.36   0.651
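
The centroid comparison can be sketched in the same way. This assumes the centroid is simply the mean of the category's term vectors (including the term being compared), which may not match what dictionary_meta does internally. The vectors are again made up.

```r
# made-up vectors standing in for rows of a latent semantic space
vectors <- rbind(
  table = c(.9, .1, .2),
  chair = c(.8, .2, .1),
  desktop = c(.1, .9, .3)
)

# the category centroid is taken to be the mean vector
centroid <- colMeans(vectors)

# cosine similarity between two vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# similarity of each term to the centroid
sim_centroid <- apply(vectors, 1, cosine, b = centroid)

round(sim_centroid, 3)
```

Terms that sit far from the centroid (here, “desktop”) are candidates for closer inspection.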

The previous examples looked at terms within a single space, but we can also aggregate across multiple spaces, which might result in more reliable comparisons:

meta_multi <- dictionary_meta(
  dict[3:4], "multi",
  suggest = TRUE, space_dir = "~/Latent Semantic Spaces"
)
#> preparing terms (0)
#> expanding terms (0.26)
#> loading spaces (0.45)
#> calculating term similarities (79.7)
#> identifying potential additions (163.71)
#> preparing results (166.01)
#> done (166.15)

meta_multi$suggested
#> $furniture
#>       cabriole       roundoff           rsha  contemporaine          tansu 
#>      0.1482792      0.1339643      0.1319545      0.1316325      0.1313030 
#>            kvm             qn     documentos    automorphic         spinet 
#>      0.1298259      0.1275958      0.1269086      0.1262035      0.1254962 
#>          seiza        trafico           mmse           parl       benchtop 
#>      0.1242864      0.1241114      0.1239376      0.1237238      0.1228883 
#>          vises       castling           iiie        treadle     secretaire 
#>      0.1224446      0.1216169      0.1214022      0.1209836      0.1206658 
#>           ortf     katholieke          bombe         univac          kilim 
#>      0.1206569      0.1204230      0.1202421      0.1200579      0.1200574 
#>          ahmes           muet            bdb         kosuth           banc 
#>      0.1197782      0.1194817      0.1187592      0.1187553      0.1187366 
#>       gasifier         thonet      hollerith     cleanrooms            nle 
#>      0.1183109      0.1182349      0.1179614      0.1179571      0.1173060 
#>       bentwood       cableway        arcinfo        iseries         cahier 
#>      0.1172149      0.1169206      0.1168743      0.1168297      0.1166965 
#>    hepplewhite    guillotines        hassock      assistent     dispositif 
#>      0.1163738      0.1163604      0.1161890      0.1160527      0.1159809 
#>         reeded satisfiability         penser      exercices         digiti 
#>      0.1159393      0.1156503      0.1155371      0.1155054      0.1153704 
#>       pushdown        tonneau      microform 
#>      0.1152019      0.1151824      0.1148661 
#> 
#> $well_adjusted
#>      merry   thrilled     joyous      amity  esperanza      cheer   thankful 
#>  0.1323126  0.1257853  0.1206332  0.1194965  0.1188206  0.1175747  0.1158717 
#>  attracted    wooster      proud     cheers      windy       love blossoming 
#>  0.1148290  0.1147411  0.1134269  0.1123962  0.1119232  0.1118704  0.1115397 
#>     loving    excited   branford     joyful generosity       avon     selena 
#>  0.1112393  0.1108994  0.1097011  0.1095772  0.1095450  0.1094822  0.1093923 
#>  greetings  delighted       hope     ramona  affection   youthful       glad 
#>  0.1085926  0.1078952  0.1077104  0.1073784  0.1068428  0.1063933  0.1062241 
#>     wilmer      loyal     unkind     stormy     scouts    dawning    hopeful 
#>  0.1060294  0.1058156  0.1056248  0.1055818  0.1052770  0.1052233  0.1048291 
#>    attract     thrive   feelings enthusiasm 
#>  0.1043037  0.1042586  0.1042461  0.1042268
knitr::kable(meta_multi$summary[, -1], digits = 3)
              n_terms n_expanded sim.space                                                                sim.min sim.q1 sim.median sim.mean sim.q3 sim.max
furniture           5         33 glove_crawl, paragram_sl999, paragram_ws353, sensembed, CoNLL17_skipgram   0.138  0.223      0.259    0.259  0.304   0.369
well_adjusted       6         45 glove_crawl, paragram_sl999, paragram_ws353, sensembed, CoNLL17_skipgram   0.141  0.200      0.261    0.249  0.296   0.357

Text Coverage

Suggested terms can help improve the theoretical coverage of these categories in themselves, but another type of coverage is how much of the category is covered by the text it’s scoring. Low coverage of this sort isn’t inherently an issue, but it puts more pressure on the covered terms to be unambiguous. For instance, compare the score versus coverage in these texts:

texts <- c(
  furniture = "There is a chair positioned in the intersection of a desk and table.",
  still_furniture = "I'm selling this chair, since my new chair replaced that chair.",
  business = "The chair took over from the former chair to introduce the new chair.",
  business_mixed = "The chair sat down at their desk to table the discussion."
)
lma_termcat(texts, dict[3], coverage = TRUE)
#>                 furniture coverage_furniture
#> furniture               3                  3
#> still_furniture         3                  1
#> business                3                  1
#> business_mixed          3                  3
#> attr(,"WC")
#> [1] 13 11 13 11
#> attr(,"time")
#>     dtm termcat 
#>       0       0 
#> attr(,"type")
#> [1] "count"

These examples illustrate how this sort of coverage could relate to score validity (i.e., how much the category is actually reflected in the text), but also how it is not a perfect indicator. Generally, a smaller variety of term hits within a category should make us less confident in the category score.
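
The distinction between score and coverage can be reproduced with a short base R sketch. It is not the lma_termcat implementation; it assumes a simple tokenizer that splits on anything other than letters and apostrophes, and treats coverage as the number of distinct dictionary terms with at least one hit.

```r
texts <- c(
  furniture = "There is a chair positioned in the intersection of a desk and table.",
  still_furniture = "I'm selling this chair, since my new chair replaced that chair.",
  business = "The chair took over from the former chair to introduce the new chair.",
  business_mixed = "The chair sat down at their desk to table the discussion."
)
terms <- c("table", "chair", "desk*", "couch*", "sofa*")
patterns <- utils::glob2rx(terms)

score_text <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z']+")[[1]]
  # number of tokens matched by each term
  hits <- vapply(patterns, function(p) sum(grepl(p, tokens)), integer(1))
  c(score = sum(hits), coverage = sum(hits > 0))
}

result <- t(vapply(texts, score_text, integer(2)))
result
```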

Applied Example

For a more realistic example, we might look at the Agency and Communion dictionary (Pietraszkiewicz et al., 2019; osf.io/p7fzb), which has been rather thoroughly developed, but from a different perspective than presented here.

Our first step will be to read in the dictionary:

original <- read.dic("https://osf.io/download/62txv")

To start assessing this dictionary, we’ll need to unpack the fuzzy terms. Often, this is done manually, based on a reasonable set of target terms (such as the root word with common variants added). This would be valid if the terms were pre-expanded to such a limited set, but when kept fuzzy, we should be considering a broader set of terms, which is where the report_term_matches function comes in:

matches <- report_term_matches(original)
#> preparing dict (0)
#> extracting matches (0)
#> preparing results (6.27)
#> done (6.37)
knitr::kable(matches[2:10, c("term", "categories", "variants", "matches")])
term categories variants matches
2 love* communion 173 love, loveu, loved, loven, lovel, lover, lovem, loves, lovey, lovex, lovee, lovez, loverz, loveli, lovely, lovett, lovers, lovery, lovell, loveth, lovech, lovera, lovest, lovein, lovece, loveed, lovess, loveis, loveys, lovern, loveme, loveit, loveya, lovegod, loveand, lovelyy, lovelly, lovetta, lovebox, lovells, loveley, loveall, lovelys, loverro, loveday, lovejoy, loveman, lovelle, love-in, lovesey, lovenox, loveyou, lovecat, lovebug, loverde, loveher, loveing, loveitt, lovesac, loverly, lovette, love.you, lovehate, lovethis, lovebite, lovesome, lovefest, lovelace, loveseat, lovedale, lovefilm, lovelife, lovelies, loveless, lovebird, lovecats, lovemore, lovelorn, lovesong, lovesick, lovesexy, lovethem, lovethat, loverboy, lovelady, loveable, loveluck, lovejoys, loverman, lovegood, lovelier, lovedrug, lovefool, lovegame, loveland, loveleen, lovebugs, loveline, lovelock, lovering, lovetown, loverdos, lovewell, lovemark, lovelove, loveably, lovelocks, lovedolls, lovenotes, lovespell, lovedrunk, love-song, lovegrass, lovechild, lovestory, lover-boy, lovebites, loveville, lovecraft, love-fest, lovergirl, lovebirds, loveliest, loveridge, lovebytes, lovegrove, love-hate, lovesongs, loverance, loveshack, lovelight, lovehoney, lovestone, love-life, loved-one, love/hate, love-sick, loveseats, lovescopes, love-child, lovemaster, lovestruck, love-match, loveliness, lovesounds, lovemakers, lovemaking, loveletter, lovestoned, lovestrong, love-story, lovellette, loveladies, lovemachine, lovehandles, love-affair, lovey-dovey, love-struck, lovenkrands, love-letter, love-making, love-stories, love-starved, lovelovelove, lovecraftian, lovelessness, lovesickness, lovettsville, love-letters, love-triangle, lovelaceville, lovelovelovelove, lovesliescrushing
3 commun* communion 98 commun, communs, communi, commune, communty, communed, communal, communio, communic, communes, communis, communter, communial, communaut, communard, communica, communing, community, communize, communism, communist, communion, communiqu, communtiy, communista, communtity, communcate, communisim, communitys, communally, communards, communique, communties, communists, communauto, communites, communions, communitas, communipaw, communiste, communitty, communizing, communistes, communalist, communiques, communalism, communistic, communicate, communicant, communality, communities, communicado, communiquer, communispace, communiction, communicatio, communicated, communicable, communityone, communicants, communicator, communicatin, communicates, communcation, communisation, communization, communicative, communicatory, communist-era, communitarian, communication, communicators, communicating, communautaire, communicaiton, communist-led, communitywide, communiversity, communications, community-wide, communitarians, communicability, communalisation, communist-style, community-based, communalization, communicational, community-owned, communicatively, communitychannel, communitarianism, community-driven, community-college, communicativeness, community-service, community-building, community-oriented, communist-dominated
4 socia* communion 85 socia, socias, social, sociate, socialy, socialt, sociaux, socials, sociais, sociale, sociaty, socialis, socially, sociales, sociable, socialst, sociably, socialbro, sociality, socialise, socialism, socialize, socialnet, socialite, socialcam, sociables, socialist, socialble, sociation, socialiser, socialwork, socialflow, socialable, socialness, socialcast, socializer, socialised, socialisme, socialites, socialista, socialiste, socialtext, socialismo, socialists, socialized, socializes, socialises, socialisms, social-work, sociability, socializers, socialising, socialistes, socialmedia, socializing, socialisten, socialscope, socialistic, socialistas, social-class, socialengine, socialogical, socialnomics, socialnetwork, socialisation, social-policy, socialstudies, socialization, social-service, social-justice, socialnetworks, socialsecurity, social-welfare, social-science, socialistically, social-democrat, social-economic, social-climbing, social-security, social-political, socialnetworking, social-democrats, social-democracy, social-networking, social-democratic
5 know* agency 79 know, knowm, knowi, knowl, knowe, knowz, knows, known, knoww, know.i, knowwe, knowme, knowle, knowns, knowem, knower, knowin, knowed, knowen, knowne, knowth, knowled, knoweth, knowles, knowhow, knowest, knowing, knowers, knowyou, knowwhen, knowthis, knowbody, knowings, knowwhat, knowlton, knowable, knowlege, knowling, knowland, knowshon, knowsley, knowning, knowhere, knowedge, know-all, knowlson, know-how, know-who, knowldge, knowlage, knowitall, knowingly, knowlegde, knowledge, know-what, knoweldge, knowbility, knowledges, knowledged, knowledege, knowlingly, knowingness, know-it-all, knowability, knowlesville, knowledgably, knowledgable, know-nothing, knowledgeble, knowledgement, knowledgelake, know-nothings, knowledgeably, knowledgeable, knowledgebase, knowledgeworks, knowledge-base, knowledge-based, knowledgeability
6 human* communion 79 human, humanx, humanz, humans, humand, humano, humani, humane, humann, humana, humanit, humanos, humanly, humanic, humanus, humanas, humanae, humanics, humanize, humanoid, humanist, humanity, humanely, humanite, humanism, humanise, humanitys, humanhood, humanness, humanists, humanitas, humanized, humanitie, humanoids, humanidad, humanised, humankind, humanizes, humanlike, humanises, humankinds, humanlight, humanscale, humanizing, humaneness, humaniores, humanistic, humanities, humanbeing, humanidade, humanising, human-like, human-made, humansdorp, human-being, humansville, humanidades, human-sized, humanbeings, humanrights, human-beings, humanisation, humanitarian, human-rights, human-caused, humanization, humancentipad, humanitarians, human-induced, human-powered, human-centred, human-resource, humanistically, human-centered, human-computer, human-readable, human-to-human, humanresources, humanitarianism
7 talk* communion 63 talk, talko, talky, talke, talks, talkd, talki, talkn, talkk, talkin, talked, talkng, talken, talkie, talker, talkes, talkto, talkig, talkman, talking, talkign, talkbox, talkies, talkers, talkinq, talkboy, talkest, talketh, talkings, talktime, talktalk, talkpage, talktome, talkleft, talkfest, talkshow, talkshoe, talkshop, talkback, talkiing, talkable, talkbacks, talkgroup, talk-show, talkradio, talksport, talkshows, talkative, talk-back, talkabout, talkathon, talkhouse, talkeetna, talkspace, talking-to, talkington, talk-radio, talkability, talked-about, talking-head, talkativeness, talking-point, talkingpointsmemo
8 frat* communion 50 frat, frato, frats, frati, fratz, frata, frate, fratus, fratty, frater, fratto, frates, fratboy, frattin, fratrum, fratton, fraters, fratkin, fratres, frat-boy, frattini, fratello, frattali, fratelli, fratianne, frathouse, fratellis, fraternal, fratianni, fratantoni, fratercula, frat-house, frattaroli, fratricide, fraticelli, fraternity, fraternize, fraternite, fraternise, fraternised, fraternized, fratricidal, fraternidad, fraternally, fraternizing, fraternising, fraternities, fraternalism, fraternization, fraternisation
9 help* communion 47 help, helpp, helpd, helpt, helps, helpy, helpe, helppp, helpme, helpin, helper, helpfu, helped, helpus, helpen, helpng, helpyou, helpout, helping, helpers, helpful, helpeth, helpage, helprin, helpern, helpmeet, helperby, helpline, helpless, helpdesk, helptext, helpfull, helpings, helpmate, helpmann, helpston, helpmeets, helpdesks, helpfully, helpmates, helplines, help-desk, helplessly, helpfulness, helpfullness, helplessness, help-yourself
10 compet* agency 46 compet, competi, compete, competir, competin, competed, competes, competit, competant, competing, competion, competent, competely, competive, competente, competeing, competions, competency, competitor, competence, compettive, competiton, competitve, competetion, competiting, competitive, competitors, competition, competences, competetive, competencia, competative, competently, competitiion, competencias, competencies, competitions, competitiors, competizione, competiveness, competitively, competetively, competitivity, competitividad, competitiveness, competetiveness

Right away, this highlights some terms that are likely over-matching, which makes it difficult to take the next steps in assessing the dictionary.