Introduction to Dictionary Creation
Source: vignettes/dictionary_creation.Rmd
Walks through the process of creating and assessing a dictionary.
Built with R 4.3.3
See the dictionary builder for a way to make dictionaries interactively.
Background
Dictionaries in this context are sets of term lists that each represent a category. Categories could range from concepts or topics (e.g., “furniture” or “shopping”) to more abstract aspects of a text (e.g., “emotionality”). Terms may be single literal words (e.g., “term”, “terms”), glob-like or fuzzy words (e.g., “term*”, where the asterisk matches any number of additional characters), or arbitrary patterns (literal or regular expressions, e.g., “a phrase” or “an? (?:person|object)”).
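As a rough illustration of how these term types differ, the following base R sketch matches a literal term, a glob-like term, and a regular expression against a few made-up tokens. It assumes a trailing asterisk behaves like a prefix match, which `utils::glob2rx` approximates; lingmatch's own matching may differ in its details.

```r
# Hypothetical tokens, used only for illustration
tokens <- c("term", "terms", "terminal", "determine")

# Literal term: exact match only
literal_hits <- tokens == "term"

# Glob-like term: convert "term*" to an anchored regular expression
glob_hits <- grepl(utils::glob2rx("term*"), tokens)

# Arbitrary pattern: a regular expression applied to whole phrases
phrases <- c("a person", "an object", "a thing")
regex_hits <- grepl("an? (?:person|object)", phrases, perl = TRUE)

tokens[glob_hits]
phrases[regex_hits]
```

Note that the glob-like term also catches “terminal”, which is the kind of over-extension the assessment section returns to.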
The most straightforward way to make a dictionary is to manually assign terms to categories, but dictionaries can also be created with data, either by extracting cluster-like structures and assigning them category names (using unsupervised learning methods; see the introduction to word vectors article), or by training a classifier to distinguish between provided tags (using supervised learning methods; see the introduction to text classification article).
Implementation
In lingmatch, dictionaries are ultimately implemented as either lists or data.frames.
As lists, categories are named entries with terms as character vectors:
(dict_unweighted <- list(
a = c("aa", "ab", "ac"),
b = c("ba", "bb", "bc")
))
#> $a
#> [1] "aa" "ab" "ac"
#>
#> $b
#> [1] "ba" "bb" "bc"
As data.frames, terms are stored in a single column, and categories are defined by columns containing weights:
(dict_dataframe <- data.frame(
term = c("aa", "ab", "ac", "ba", "bb", "bc"),
a = c(1, .9, .8, 0, 0, 0),
b = c(0, 0, 0, .9, 1, .8)
))
#> term a b
#> 1 aa 1.0 0.0
#> 2 ab 0.9 0.0
#> 3 ac 0.8 0.0
#> 4 ba 0.0 0.9
#> 5 bb 0.0 1.0
#> 6 bc 0.0 0.8
Lists can also store weights in named numeric vectors:
(dict_list <- list(
a = c("aa" = 1, "ab" = .9, "ac" = .8),
b = c("ba" = .9, "bb" = 1, "bc" = .8)
))
#> $a
#> aa ab ac
#> 1.0 0.9 0.8
#>
#> $b
#> ba bb bc
#> 0.9 1.0 0.8
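To make the role of weights concrete, here is a minimal base R sketch of how a weighted category score could be computed: term counts multiplied by term weights and summed within each category. The token vector is made up, and this is only a conceptual illustration, not necessarily how lingmatch computes scores internally.

```r
dict_dataframe <- data.frame(
  term = c("aa", "ab", "ac", "ba", "bb", "bc"),
  a = c(1, .9, .8, 0, 0, 0),
  b = c(0, 0, 0, .9, 1, .8),
  stringsAsFactors = FALSE
)

# Hypothetical tokenized text; "zz" is not in the dictionary
tokens <- c("aa", "ab", "ab", "bb", "zz")

# Count how often each dictionary term appears in the text
counts <- table(factor(tokens, levels = dict_dataframe$term))

# Weighted category scores: counts times weights, summed by category
weights <- as.matrix(dict_dataframe[, c("a", "b")])
scores <- drop(as.numeric(counts) %*% weights)
scores
```

With binary weights, the same calculation reduces to a simple count of matched terms.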
Weighted dictionaries can be discretized using the read.dic function:
read.dic(dict_dataframe, as.weighted = FALSE)
#> $a
#> [1] "aa" "ab" "ac"
#>
#> $b
#> [1] "ba" "bb" "bc"
And un-weighted dictionaries can be converted to a weighted format with binary weights:
read.dic(dict_unweighted, as.weighted = TRUE)
#> term a b
#> 1 aa 1 0
#> 2 ab 1 0
#> 3 ac 1 0
#> 4 ba 0 1
#> 5 bb 0 1
#> 6 bc 0 1
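For intuition, the two conversions can be approximated in base R. This sketch assumes any non-zero weight counts as category membership when discretizing, which may not match the exact rule `read.dic` applies.

```r
dict_dataframe <- data.frame(
  term = c("aa", "ab", "ac", "ba", "bb", "bc"),
  a = c(1, .9, .8, 0, 0, 0),
  b = c(0, 0, 0, .9, 1, .8),
  stringsAsFactors = FALSE
)

# Weighted to un-weighted: keep terms with a non-zero weight (an assumption)
as_list <- lapply(dict_dataframe[-1], function(w) dict_dataframe$term[w != 0])

# Un-weighted to weighted: one binary column per category
terms <- unique(unlist(as_list, use.names = FALSE))
as_frame <- data.frame(
  term = terms,
  lapply(as_list, function(cat_terms) as.integer(terms %in% cat_terms)),
  stringsAsFactors = FALSE
)

as_list
as_frame
```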
Dictionaries can also be read in from LIWC’s .dic format:
# this can also be written to and read from a file
raw_dic <- write.dic(dict_unweighted, save = FALSE)
cat(raw_dic)
#> %
#> 1 a
#> 2 b
#> %
#> aa 1
#> ab 1
#> ac 1
#> ba 2
#> bb 2
#> bc 2
read.dic(raw = raw_dic)
#> $a
#> [1] "aa" "ab" "ac"
#>
#> $b
#> [1] "ba" "bb" "bc"
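The layout of this format is simple enough to parse by hand, which can help when debugging a malformed file. The sketch below assumes the structure shown above: a header of numbered categories between two `%` lines, followed by one term per line with its category numbers, all separated by whitespace. The `raw_text` string is a hand-written stand-in for the `write.dic` output, and real .dic files may contain features this does not handle.

```r
# Hand-written stand-in for the raw .dic text (assumed tab-separated)
raw_text <- "%\n1\ta\n2\tb\n%\naa\t1\nab\t1\nac\t1\nba\t2\nbb\t2\nbc\t2"

lines <- strsplit(raw_text, "\n", fixed = TRUE)[[1]]
lines <- trimws(lines[nzchar(trimws(lines))])

# The first two "%" lines delimit the category header
bounds <- which(lines == "%")[1:2]
header <- strsplit(lines[seq(bounds[1] + 1, bounds[2] - 1)], "\\s+")
ids <- vapply(header, `[`, "", 1)
category_names <- vapply(header, `[`, "", 2)

# Each remaining line is a term followed by one or more category ids
entries <- strsplit(lines[-seq_len(bounds[2])], "\\s+")
parsed <- setNames(lapply(ids, function(id) {
  unlist(lapply(entries, function(entry) if (id %in% entry[-1]) entry[1]))
}), category_names)

parsed
```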
For use in web applications, JavaScript Object Notation (JSON) may be a useful format, and lists can be converted to it directly:
cat(jsonlite::toJSON(dict_unweighted, pretty = TRUE))
#> {
#> "a": ["aa", "ab", "ac"],
#> "b": ["ba", "bb", "bc"]
#> }
.dic or JSON dictionaries can be used in the adicat highlighter (from the menu: Dictionary > load/create/edit > load), which can be used to see word matches or process files.
Any format can also be read into the dictionary builder (from the left-side menu: New > File), which can be used to assess and edit the dictionary.
Assessment
Dictionary categories can be thought of as measures of the construct identified by the category’s name. Constructs can range from being well defined by the text itself, to being more subtly embedded in the text.
Some questions we might ask when assessing a dictionary category are:
- How well will each term capture the word or words we have in mind?
  - That is, are there unaccounted-for variants, or unintended wildcard matches?
- How well does this set of terms cover the possible instances of the construct?
  - That is, might the construct appear in a way that isn’t covered?
- How confident would we be that the resulting score reflects the construct?
  - That is, how much room is there for false positives due to varying contexts or word senses?
For instance, consider this small dictionary:
dict <- list(
a_words = "a*",
self_reference = c("i", "i'*", "me", "my", "mine", "myself"),
furniture = c("table", "chair", "desk*", "couch*", "sofa*"),
well_adjusted = c("happy", "bright*", "friend*", "she", "he", "they")
)
Character Variants
The a_words category is defined only by characters, so its scores can be expected to align perfectly with any other reliable means of scoring (such as a human counter). The only threat to this category (assuming texts are lowercased) is special characters that should count as “a”s, but converting such characters could be a part of pre-processing:
clean <- lma_dict("special", as.function = gsub)
lma_process(clean("Àn apple and à potato ærosol."), dict = dict[1], meta = FALSE)
#> text a_words
#> 1 an apple and a potato aerosol. 5
Use Variants
The self_reference category is made up of words, so in addition to possible character variants, there are spelling and formatting variants to try to account for. Here “i’*” is particularly vulnerable, since the apostrophe may be curly or omitted. A new danger introduced in this category is false positives due to alternative uses of “i” (e.g., as a list item label) and alternate senses of “mine”. These issues make for possible differences between the automatic score and the score a human would theoretically calculate:
lma_process(
c("I) A mine.", "Mmeee! idk how but imma try!"),
dict = dict[2], meta = FALSE
)
#> text self_reference
#> 1 I) A mine. 2
#> 2 Mmeee! idk how but imma try! 0
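One way to reduce the apostrophe problem is to normalize curly apostrophes before matching. The following base R sketch uses made-up texts and `utils::glob2rx` as a stand-in for wildcard matching; it does not address omitted apostrophes, which still go unmatched.

```r
# Hypothetical texts: curly apostrophe, straight apostrophe, no apostrophe
texts <- c("I\u2019m here", "I'm here", "Im here")

# Replace left and right curly single quotes with a straight apostrophe
normalized <- tolower(gsub("[\u2018\u2019]", "'", texts))

# Count tokens matching the glob-like term "i'*" in each text
pattern <- utils::glob2rx("i'*")
tokens <- strsplit(normalized, " ", fixed = TRUE)
hits <- vapply(tokens, function(tk) sum(grepl(pattern, tk)), 0)
hits
```

The third text scores 0 because “im” has no apostrophe to match; catching it would require an additional term, at the cost of more potential false positives.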
Coverage
Term Variants
The furniture and well_adjusted categories introduce two main additional considerations. First, they use broader wildcards, which are probably intended to simply catch plural forms, but are in danger of over-extending. We can use the report_term_matches function to check this:
report <- report_term_matches(dict[3:4], space_dir = "~/Latent Semantic Spaces")
#> preparing dict (0)
#> extracting matches (0)
#> preparing results (0.16)
#> done (0.16)
knitr::kable(report[, c("term", "categories", "variants", "matches")])
term | categories | variants | matches |
---|---|---|---|
bright* | well_adjusted | 63 | bright, brights, brighter, brighton, brighten, brightly, brightway, brightley, brightest, brightens, brightful, brighttag, brightman, brighting, brightener, brightling, brightstar, brightwell, brighteyes, brightwork, brightbill, brightmail, brightside, brightened, brightness, brightwood, brightline, bright-red, brightmoor, brightview, brightroll, brightcove, brightkite, brightspot, bright-eyed, brightfield, brightpoint, brightwater, brightwells, brightening, brighthouse, brightscope, brighteners, brightsolid, brightlight, bright-blue, brightblack, brightspark, brightstone, brightworks, bright-field, brightnesses, bright-green, brightwaters, brightsource, brightly-lit, brightonians, bright-yellow, bright-orange, brightlingsea, bright-colored, brightly-colored, brightly-coloured |
friend* | well_adjusted | 38 | friend, friende, friends, friendz, friendy, friendo, frienda, friended, friendly, friendlly, friendlys, friending, friendless, friendlier, friendship, friendster, friendfeed, friendlily, friendlies, friendzone, friendlist, friendcode, friendships, friendliest, friendzoned, friend-zone, friendswood, friendstream, friendlyness, friendliness, friendsville, friendraising, friendly-fire, friendsgiving, friends/family, friendlessness, friendsreunited, friend-of-the-court |
desk* | furniture | 23 | desk, deska, desko, desks, deskew, deskop, desker, deskset, deskman, desking, deskpro, desktop, deskjet, deskins, desk-top, desktops, deskside, deskstar, deskilled, deskbound, desk-bound, deskilling, desktop-publishing |
couch* | furniture | 18 | couch, couche, couches, coucher, couched, couchdb, couchie, couchois, couchers, couchman, couchant, couching, couchette, couchiching, couchpotato, couch-potato, couchsurfers, couchsurfing |
sofa* | furniture | 9 | sofa, sofas, sofar, sofast, sofaer, sofala, sofabed, sofamor, sofa-bed |
table | furniture | 1 | table |
chair | furniture | 1 | chair |
happy | well_adjusted | 1 | happy |
she | well_adjusted | 1 | she |
he | well_adjusted | 1 | he |
they | well_adjusted | 1 | they |
By default, this searches for matches in a large set of common words found across latent semantic spaces (embeddings), but it can also be run on sets of text.
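If the wildcard was only meant to catch plurals, a narrower pattern avoids most of these unintended matches. As a small base R sketch, using a handful of the “desk*” matches reported above, compare a prefix match with a pattern that only allows an optional “s”:

```r
# A few of the matches reported for "desk*"
matches <- c("desk", "desks", "desktop", "deskjet", "deskilled")

# Prefix match, roughly what the wildcard does
broad <- grepl("^desk", matches)

# Narrower alternative: the word itself with an optional plural "s"
narrow <- grepl("^desks?$", matches)

matches[broad]
matches[narrow]
```

In a dictionary, the same effect could be had by listing “desk” and “desks” as separate literal terms in place of “desk*”.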
Category Coverage
Term Coverage
The second consideration is that these categories are trying to cover broad concepts, so there are likely to be obvious but overlooked terms to include. One thing we could do to improve this sort of coverage is search for similar words within a latent semantic space:
meta <- dictionary_meta(
dict[3:4],
suggest = TRUE, space_dir = "~/Latent Semantic Spaces"
)
#> preparing terms (0)
#> expanding terms (0.64)
#> loading space (0.79)
#> calculating term similarities (18.53)
#> identifying potential additions (33.79)
#> preparing results (36.51)
#> done (36.84)
meta$suggested
#> $furniture
#> armchairs loveseat wallia banquette taiaha cabriole
#> 0.06596930 0.06543237 0.06279725 0.06192136 0.06116122 0.05991823
#> armless settees roset eyalet aeron sporophila
#> 0.05951132 0.05949731 0.05883960 0.05879392 0.05864686 0.05832045
#> futon upholstered workstation siphuncle kosuth settee
#> 0.05830094 0.05736860 0.05715411 0.05704506 0.05700139 0.05680740
#> chaises phyfe workstations beylik collinder satavahana
#> 0.05649333 0.05624031 0.05619739 0.05596926 0.05586822 0.05566318
#> ahmes
#> 0.05548118
#>
#> $well_adjusted
#> hope smile glad smiles glow love cheerful
#> 0.09415846 0.08316053 0.07921073 0.07897790 0.07880737 0.07644744 0.07598900
#> eyes shine thankful excited way wishes coming
#> 0.07564743 0.07539943 0.07492050 0.07490184 0.07400350 0.07392510 0.07339015
#> wish youthful shining celebrate loving know joyful
#> 0.07336494 0.07320441 0.07310913 0.07309839 0.07264139 0.07229958 0.07185999
#> hopes bring life shines hoping facebook honest
#> 0.07158810 0.07142593 0.07138101 0.07119462 0.07084740 0.07082714 0.07076935
#> welcome loved say tell young face merry
#> 0.07073722 0.07052697 0.07044166 0.07035872 0.07023679 0.07019802 0.07013483
#> sure joyous promise sunshine grateful grow come
#> 0.07003938 0.06991035 0.06983251 0.06950521 0.06929901 0.06924857 0.06920617
#> sad better fun shone let promising turn
#> 0.06918161 0.06916552 0.06908122 0.06876386 0.06872918 0.06868631 0.06855629
#> seeing summer eye meet hearts 'll vibrant
#> 0.06853710 0.06853068 0.06850545 0.06840545 0.06837274 0.06829185 0.06823692
#> helped thank blessed light faces sweet remember
#> 0.06808839 0.06788728 0.06779164 0.06777502 0.06749256 0.06747807 0.06745387
#> little sparkles look alive glowing past sincere
#> 0.06740187 0.06738964 0.06736148 0.06735414 0.06733943 0.06733193 0.06727468
#> proud miss sky 're make thanks earth
#> 0.06724195 0.06719751 0.06713858 0.06711578 0.06707854 0.06703660 0.06700477
#> day 's last caring cheer positive new
#> 0.06686473 0.06682727 0.06682465 0.06678533 0.06674122 0.06668531 0.06661365
#> true helping younger wonderful hopeful surprised knew
#> 0.06658897 0.06655671 0.06655004 0.06654981 0.06650599 0.06631459 0.06616553
#> chance fair give attract ones believe growing
#> 0.06615762 0.06605126 0.06586211 0.06578738 0.06568497 0.06562074 0.06556792
#> healthy befriend looking night
#> 0.06545284 0.06540390 0.06540297 0.06536912
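To give a sense of what similarity-based suggestion involves, here is a minimal base R sketch that ranks candidate words by their mean cosine similarity to a category's terms. The vectors and the `cosine_to` helper are hypothetical, and this is only one plausible ranking rule; `dictionary_meta` may select and score candidates differently.

```r
# Hypothetical helper: cosine similarity between rows of two matrices
cosine_to <- function(targets, references) {
  unit_targets <- targets / sqrt(rowSums(targets^2))
  unit_references <- references / sqrt(rowSums(references^2))
  tcrossprod(unit_targets, unit_references)
}

# Made-up two-dimensional vectors standing in for a real embedding space
seeds <- rbind(chair = c(1, 0.2), sofa = c(0.9, 0.4))
candidates <- rbind(armchair = c(1, 0.3), river = c(-0.2, 1))

# Rank candidates by average similarity to the category's terms
suggested <- sort(rowMeans(cosine_to(candidates, seeds)), decreasing = TRUE)
suggested
```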
We can also use the space to assess category cohesiveness by looking at summaries of pairwise cosine similarities between terms:
knitr::kable(meta$summary[, -1], digits = 3)
category | n_terms | n_expanded | sim.space | sim.min | sim.q1 | sim.median | sim.mean | sim.q3 | sim.max |
---|---|---|---|---|---|---|---|---|---|
furniture | 5 | 34 | glove_crawl | -0.034 | 0.015 | 0.035 | 0.052 | 0.085 | 0.152 |
well_adjusted | 6 | 47 | glove_crawl | -0.014 | 0.017 | 0.077 | 0.081 | 0.137 | 0.183 |
Or look at those similarities within categories and expanded terms:
knitr::kable(
meta$terms[meta$terms$category == "furniture", ],
digits = 3, row.names = FALSE
)
category | term | match | sim.term | sim.category |
---|---|---|---|---|
furniture | table | table | 1.000 | 0.520 |
furniture | chair | chair | 1.000 | 0.644 |
furniture | desk* | desk-top | 0.218 | 0.020 |
furniture | desk* | desk | 1.000 | 0.543 |
furniture | desk* | desking | 0.229 | 0.215 |
furniture | desk* | deskpro | 0.067 | -0.030 |
furniture | desk* | deskilled | -0.050 | -0.139 |
furniture | desk* | desktop | 0.481 | 0.197 |
furniture | desk* | desktops | 0.266 | 0.120 |
furniture | desk* | desks | 0.706 | 0.488 |
furniture | desk* | deskjet | 0.067 | 0.014 |
furniture | desk* | deskbound | -0.033 | 0.040 |
furniture | desk* | deskins | -0.074 | -0.093 |
furniture | desk* | deskilling | -0.111 | -0.127 |
furniture | desk* | desker | -0.088 | -0.101 |
furniture | desk* | deskside | 0.102 | -0.018 |
furniture | desk* | deskstar | 0.032 | -0.078 |
furniture | couch* | couchdb | 0.089 | 0.048 |
furniture | couch* | couchant | 0.040 | 0.002 |
furniture | couch* | couchsurfing | 0.116 | 0.046 |
furniture | couch* | couche | 0.059 | 0.043 |
furniture | couch* | couchette | 0.156 | 0.139 |
furniture | couch* | couchman | 0.004 | 0.021 |
furniture | couch* | couching | 0.098 | 0.026 |
furniture | couch* | couched | 0.009 | -0.023 |
furniture | couch* | coucher | 0.054 | 0.136 |
furniture | couch* | couches | 0.615 | 0.643 |
furniture | couch* | couch | 1.000 | 0.778 |
furniture | sofa* | sofaer | -0.175 | -0.175 |
furniture | sofa* | sofabed | 0.493 | 0.493 |
furniture | sofa* | sofala | -0.017 | -0.017 |
furniture | sofa* | sofas | 0.738 | 0.738 |
furniture | sofa* | sofar | -0.095 | -0.095 |
furniture | sofa* | sofa | 1.000 | 1.000 |
And we can visualize this together with the most similar suggested terms as a network:
library(visNetwork)
display_network <- function(meta, cat = 1, n = 10, min = .1, seed = 2080) {
cat_name <- meta$summary$category[[cat]]
top_suggested <- meta$suggested[[cat_name]][seq_len(n)]
terms <- meta$expanded[[cat_name]]
nodes <- data.frame(
id = c(terms, names(top_suggested)),
label = c(terms, names(top_suggested)),
group = rep(
c("original", "suggested"),
c(length(terms), length(top_suggested))
),
shape = "box"
)
suggested_sim <- lma_simets(lma_lspace(
nodes$id,
space = meta$summary$sim.space[[1]]
), metric = "cosine")
nodes$size <- rowMeans(suggested_sim)
edges <- data.frame(
from = rep(colnames(suggested_sim), each = nrow(suggested_sim)),
to = rep(rownames(suggested_sim), nrow(suggested_sim)),
value = as.numeric(suggested_sim),
title = as.numeric(suggested_sim)
)
visNetwork(
nodes, within(
edges[edges$value > min & edges$value < 1, ], value <- (value * 10)^4
)
) |>
visEdges("title", smooth = FALSE, color = list(opacity = .6)) |>
visLegend(width = .07) |>
visLayout(randomSeed = seed) |>
visPhysics("barnesHut", timestep = .1) |>
visInteraction(
dragNodes = TRUE, dragView = TRUE, hover = TRUE, hoverConnectedEdges = TRUE,
selectable = TRUE, tooltipDelay = 100, tooltipStay = 100
)
}
display_network(meta)
Or we could look at terms across categories within a dimensionally-reduced version of the space:
library(plotly)
display_reduced_space <- function(
meta, space_name = "glove_crawl", method = "umap", dim_prop = 1,
color_seeds = c("#25cb1a", "#c8fd9e", "#1b85ed", "#91b8fb")) {
suggestions <- unlist(unname(meta$suggested))
terms <- rbind(meta$terms[, c("category", "match", "sim.category")], data.frame(
category = rep(
paste0(names(meta$suggested), "_suggested"), vapply(meta$suggested, length, 0)
),
match = names(suggestions),
sim.category = suggestions
))
space <- lma_lspace(terms$match, space_name)
space <- space[, Reduce(unique, lapply(meta$expanded, function(l) {
order(-colMeans(space[l, ]))[seq_len(min(ncol(space), ncol(space) * dim_prop))]
}))]
st <- proc.time()[[3]]
m <- as.data.frame(if (method == "umap") {
uwot::umap(space, 15, 3, metric = "cosine")
} else if (method == "taffy") {
m <- lusilab::taffyInf(lma_simets(space, metric = "cosine"), 3)
colnames(m) <- c("V1", "V2", "V3")
m
} else if (method == "kmeans") {
m <- t(kmeans(lma_simets(space, metric = "cosine"), 3)$centers)
colnames(m) <- c("V1", "V2", "V3")
m
} else if (method == "svd") {
m <- svd(lma_simets(space, metric = "cosine"), 3)$u
rownames(m) <- rownames(space)
m
} else {
m <- prcomp(lma_simets(space, metric = "cosine"))$rotation[, 1:3]
dimnames(m) <- list(rownames(space), paste0("V", 1:3))
m
})
message(
"reduced space via ", method, " method in ",
round(proc.time()[[3]] - st, 4), " seconds"
)
d <- cbind(terms, m)
d$color <- color_seeds[as.numeric(as.factor(d$category))]
ds <- split(d, d$category)
p <- plot_ly(
do.call(rbind, ds),
x = ~V1, y = ~V2, z = ~V3, textfont = list(size = 9)
) |>
layout(
showlegend = TRUE, paper_bgcolor = "#000000", font = list(color = "#ffffff"),
margin = list(r = 0, b = 0, l = 0)
)
for (cat in names(ds)) {
p <- p |> add_text(
data = ds[[cat]], text = ~match, name = cat, textfont = list(color = ~color)
)
}
p
}
display_reduced_space(meta)
#> reduced space via umap method in 1.39 seconds
Here it seems the suggested terms strengthen cores of related terms within categories, leaving unrelated terms to group together between categories.
Instead of looking at pairwise comparisons, it might also make sense to compare with category centroids:
meta_centroid <- dictionary_meta(
dict[3:4],
pairwise = FALSE, suggest = TRUE, space_dir = "~/Latent Semantic Spaces"
)
#> preparing terms (0)
#> expanding terms (0.29)
#> loading space (0.46)
#> calculating term similarities (18.86)
#> identifying potential additions (21.66)
#> preparing results (24.22)
#> done (24.54)
meta_centroid$suggested
#> $furniture
#> armchairs loveseat futon armless upholstered workstation
#> 0.2517359 0.2474119 0.2238488 0.2218516 0.2198026 0.2195312
#> banquette settees workstations settee chaise aeron
#> 0.2191410 0.2169368 0.2152506 0.2146026 0.2130485 0.2088532
#> chaises recliner cabriole credenza seater
#> 0.2070558 0.2039399 0.2018048 0.2013844 0.1996919
#>
#> $well_adjusted
#> hope smile glad glow smiles thankful
#> 0.2833616 0.2475967 0.2413412 0.2372855 0.2366536 0.2309778
#> wishes youthful excited love cheerful eyes
#> 0.2256275 0.2254221 0.2249745 0.2235930 0.2231660 0.2230449
#> honest shine celebrate joyful shining joyous
#> 0.2227368 0.2219152 0.2209088 0.2190795 0.2177220 0.2174133
#> loving shines wish hopes coming way
#> 0.2168939 0.2162090 0.2156694 0.2155521 0.2136403 0.2128437
#> shone facebook know sad sincere promising
#> 0.2127467 0.2123793 0.2121608 0.2120186 0.2111781 0.2108967
#> hoping merry loved sparkles say promise
#> 0.2105333 0.2101729 0.2101528 0.2099162 0.2094993 0.2093688
#> grateful life hopeful blessed tell befriend
#> 0.2085683 0.2080989 0.2079336 0.2078590 0.2070684 0.2064690
#> seeing hearts positive feelings face young
#> 0.2059093 0.2049330 0.2045796 0.2038124 0.2037544 0.2035204
#> sunshine attract eye alive grow helped
#> 0.2027419 0.2025035 0.2024249 0.2022952 0.2022153 0.2022134
#> caring glowing vibrant proud surprised attracted
#> 0.2020372 0.2019306 0.2018866 0.2013347 0.2009025 0.2008401
#> knew thank cheer sure beginnings miss
#> 0.2007124 0.2006891 0.2002314 0.2001261 0.1999750 0.1999670
#> better bring fair past younger thrilled
#> 0.1994686 0.1987255 0.1985127 0.1984401 0.1982115 0.1981868
#> earth welcome bless summer faces afraid
#> 0.1981317 0.1980738 0.1978137 0.1976383 0.1976157 0.1973665
#> prosperous true sweet fun grew greetings
#> 0.1970473 0.1970078 0.1968552 0.1968065 0.1967096 0.1967075
#> complexion optimistic hometown turn remember fortunate
#> 0.1966697 0.1963466 0.1963115 0.1961961 0.1961387 0.1960371
#> thanks let believe chance hurt positively
#> 0.1958364 0.1957286 0.1957258 0.1953343 0.1952308 0.1947304
#> remembered hear encouraging dear enthusiasm
#> 0.1943526 0.1942415 0.1939956 0.1939174 0.1938838
knitr::kable(meta_centroid$summary[, -1], digits = 3)
category | n_terms | n_expanded | sim.space | sim.min | sim.q1 | sim.median | sim.mean | sim.q3 | sim.max |
---|---|---|---|---|---|---|---|---|---|
furniture | 5 | 34 | glove_crawl | -0.093 | 0.018 | 0.096 | 0.146 | 0.19 | 0.675 |
well_adjusted | 6 | 47 | glove_crawl | -0.170 | 0.018 | 0.191 | 0.199 | 0.36 | 0.651 |
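The difference between the pairwise and centroid comparisons can be sketched in base R with a few made-up vectors. Here the centroid is taken as the mean of unit-length term vectors, which is an assumption for illustration and not necessarily how `dictionary_meta` defines it.

```r
# Hypothetical two-dimensional term vectors (real spaces are much larger)
vectors <- rbind(chair = c(1, 0), table = c(1, 1), sofa = c(0, 1))

# Scale each row to unit length so dot products are cosine similarities
unit <- vectors / sqrt(rowSums(vectors^2))

# Pairwise: similarity between every pair of terms
sims <- tcrossprod(unit)
pairwise <- sims[lower.tri(sims)]

# Centroid: similarity between each term and the mean vector
centroid <- colMeans(unit)
to_centroid <- drop(unit %*% centroid) / sqrt(sum(centroid^2))

summary(pairwise)
round(to_centroid, 3)
```

Similarities to a centroid tend to be higher than pairwise similarities because the centroid partly contains each term, which is consistent with the larger values in the centroid-based summary above.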
The previous examples looked at terms within a single space, but we can also aggregate across multiple spaces, which might result in more reliable comparisons:
meta_multi <- dictionary_meta(
dict[3:4], "multi",
suggest = TRUE, space_dir = "~/Latent Semantic Spaces"
)
#> preparing terms (0)
#> expanding terms (0.25)
#> loading spaces (0.44)
#> calculating term similarities (80.21)
#> identifying potential additions (180.6)
#> preparing results (182.94)
#> done (183.08)
meta_multi$suggested
#> $furniture
#> cabriole roundoff rsha contemporaine tansu
#> 0.1482792 0.1339643 0.1319545 0.1316325 0.1313030
#> kvm qn documentos automorphic spinet
#> 0.1298259 0.1275958 0.1269086 0.1262035 0.1254962
#> seiza trafico mmse parl benchtop
#> 0.1242864 0.1241114 0.1239376 0.1237238 0.1228883
#> vises castling iiie treadle secretaire
#> 0.1224446 0.1216169 0.1214022 0.1209836 0.1206658
#> ortf katholieke bombe univac kilim
#> 0.1206569 0.1204230 0.1202421 0.1200579 0.1200574
#> ahmes muet bdb kosuth banc
#> 0.1197782 0.1194817 0.1187592 0.1187553 0.1187366
#> gasifier thonet hollerith cleanrooms nle
#> 0.1183109 0.1182349 0.1179614 0.1179571 0.1173060
#> bentwood cableway arcinfo iseries cahier
#> 0.1172149 0.1169206 0.1168743 0.1168297 0.1166965
#> hepplewhite guillotines hassock assistent dispositif
#> 0.1163738 0.1163604 0.1161890 0.1160527 0.1159809
#> reeded satisfiability penser exercices digiti
#> 0.1159393 0.1156503 0.1155371 0.1155054 0.1153704
#> pushdown tonneau microform
#> 0.1152019 0.1151824 0.1148661
#>
#> $well_adjusted
#> merry thrilled joyous amity esperanza cheer thankful
#> 0.1323126 0.1257853 0.1206332 0.1194965 0.1188206 0.1175747 0.1158717
#> attracted wooster proud cheers windy love blossoming
#> 0.1148290 0.1147411 0.1134269 0.1123962 0.1119232 0.1118704 0.1115397
#> loving excited branford joyful generosity avon selena
#> 0.1112393 0.1108994 0.1097011 0.1095772 0.1095450 0.1094822 0.1093923
#> greetings delighted hope ramona affection youthful glad
#> 0.1085926 0.1078952 0.1077104 0.1073784 0.1068428 0.1063933 0.1062241
#> wilmer loyal unkind stormy scouts dawning hopeful
#> 0.1060294 0.1058156 0.1056248 0.1055818 0.1052770 0.1052233 0.1048291
#> attract thrive feelings enthusiasm
#> 0.1043037 0.1042586 0.1042461 0.1042268
knitr::kable(meta_multi$summary[, -1], digits = 3)
category | n_terms | n_expanded | sim.space | sim.min | sim.q1 | sim.median | sim.mean | sim.q3 | sim.max |
---|---|---|---|---|---|---|---|---|---|
furniture | 5 | 33 | glove_crawl, paragram_sl999, paragram_ws353, sensembed, CoNLL17_skipgram | 0.138 | 0.223 | 0.259 | 0.259 | 0.304 | 0.369 |
well_adjusted | 6 | 45 | glove_crawl, paragram_sl999, paragram_ws353, sensembed, CoNLL17_skipgram | 0.141 | 0.200 | 0.261 | 0.249 | 0.296 | 0.357 |
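As a conceptual sketch of aggregating across spaces, the simplest approach is to average each term's similarity over the spaces. The similarity values and space names below are invented for illustration, and simple averaging is an assumption; the aggregation `dictionary_meta` actually performs may differ.

```r
# Hypothetical similarities of two matched terms to their category,
# as measured in two different spaces
sims_by_space <- cbind(
  space_one = c(sofa = 0.8, sofar = -0.1),
  space_two = c(sofa = 0.6, sofar = 0.1)
)

# Average across spaces, one value per term
aggregated <- rowMeans(sims_by_space)
aggregated
```

A term that only looks related in one space is pulled toward zero by the others, which is one reason aggregated comparisons might be more reliable.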
Text Coverage
Suggested terms can help improve the theoretical coverage of these categories in themselves, but another type of coverage is how much of the category is covered by the text it’s scoring. Low coverage of this sort isn’t inherently an issue, but it puts more pressure on the covered terms to be unambiguous. For instance, compare the score versus coverage in these texts:
texts <- c(
furniture = "There is a chair positioned in the intersection of a desk and table.",
still_furniture = "I'm selling this chair, since my new chair replaced that chair.",
business = "The chair took over from the former chair to introduce the new chair.",
business_mixed = "The chair sat down at their desk to table the discussion."
)
lma_termcat(texts, dict[3], coverage = TRUE)
#> furniture coverage_furniture
#> furniture 3 3
#> still_furniture 3 1
#> business 3 1
#> business_mixed 3 3
#> attr(,"WC")
#> [1] 13 11 13 11
#> attr(,"time")
#> dtm termcat
#> 0 0
#> attr(,"type")
#> [1] "count"
These examples illustrate how this sort of coverage could relate to score validity (i.e., how much the category is actually reflected in the text), but also how it is not a perfect indicator. Generally, a smaller variety of term hits within a category should make us less confident in the category score.
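For reference, the score and coverage values above can be reproduced with a short base R sketch, treating the score as the total number of term hits and coverage as the number of distinct terms with at least one hit. It uses a simplified tokenizer and `utils::glob2rx` in place of lingmatch's own processing, so results could differ on less tidy texts.

```r
furniture <- c("table", "chair", "desk*", "couch*", "sofa*")
texts <- c(
  furniture = "There is a chair positioned in the intersection of a desk and table.",
  still_furniture = "I'm selling this chair, since my new chair replaced that chair.",
  business = "The chair took over from the former chair to introduce the new chair.",
  business_mixed = "The chair sat down at their desk to table the discussion."
)

# Convert each glob-like term to an anchored regular expression
patterns <- vapply(furniture, utils::glob2rx, "")

score_coverage <- t(vapply(texts, function(text) {
  # Simplified tokenization: drop punctuation, lowercase, split on spaces
  tokens <- strsplit(tolower(gsub("[[:punct:]]", "", text)), " +")[[1]]
  term_hits <- vapply(patterns, function(p) sum(grepl(p, tokens)), 0)
  c(score = sum(term_hits), coverage = sum(term_hits > 0))
}, c(score = 0, coverage = 0)))

score_coverage
```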