Reduces the dimensions of a document-term matrix by dictionary-based categorization.
Usage
lma_termcat(dtm, dict, term.weights = NULL, bias = NULL,
  bias.name = "_intercept", escape = TRUE, partial = FALSE,
  glob = TRUE, term.filter = NULL, term.break = 20000,
  to.lower = FALSE, dir = getOption("lingmatch.dict.dir"),
  coverage = FALSE)
Arguments
- dtm
A matrix with terms as column names.
- dict
The name of a provided dictionary (osf.io/y6g5b/wiki) or of a file found in dir, or a list object with named character vectors as word lists, or the path to a file to be read in by read.dic.
- term.weights
A list object with named numeric vectors lining up with the character vectors in dict, used to weight the terms in each dict vector. If a category in dict is not specified in term.weights, or the dict and term.weights vectors aren't the same length, the weights for that category will be 1.
- bias
A list or named vector specifying a constant to add to the named category. If a term matching bias.name is included in a category, its associated weight will be used as the bias for that category.
- bias.name
A character specifying a term to be used as a category bias; default is '_intercept'.
- escape
Logical indicating whether the terms in dict should be treated as plain text (including asterisk wild cards). If TRUE, regular expression related characters are escaped. Set to TRUE if you get PCRE compilation errors.
- partial
Logical; if TRUE, terms are partially matched (not padded by ^ and $).
- glob
Logical; if TRUE (default), will convert initial and terminal asterisks to partial matches.
- term.filter
A regular expression string used to format the text of each term (passed to gsub). For example, if terms are part-of-speech tagged (e.g., 'a_DT'), '_.*' would remove the tag.
- term.break
If a category has more than term.break characters, it will be processed in chunks. Reduce from 20000 if you get a PCRE compilation error.
- to.lower
Logical; if TRUE, will lowercase dictionary terms. Otherwise, dictionary terms will be converted to match the terms if they are single-cased. Set to FALSE to always keep dictionary terms as entered.
- dir
Path to a folder in which to look for dict; will look in '~/Dictionaries' by default. Set a session default with options(lingmatch.dict.dir = 'desired/path').
- coverage
Logical; if TRUE, will calculate coverage (number of unique term matches) for each category.
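The interaction of escape, partial, and glob can be easier to see as patterns. The following is a minimal base R sketch of the behavior described above, not the package's implementation; term_to_regex is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper for illustration; not part of lingmatch.
term_to_regex <- function(terms, escape = TRUE, partial = FALSE, glob = TRUE) {
  # With glob, a leading or trailing asterisk opens that end of the match
  open_start <- partial | (glob & startsWith(terms, "*"))
  open_end <- partial | (glob & endsWith(terms, "*"))
  if (glob) terms <- gsub("^\\*|\\*$", "", terms)
  # With escape, regular expression metacharacters become literal text
  if (escape) terms <- gsub("([][(){}*.^$+?|\\\\])", "\\\\\\1", terms, perl = TRUE)
  # Closed ends are anchored with ^ and $ so the whole term must match
  paste0(ifelse(open_start, "", "^"), terms, ifelse(open_end, "", "$"))
}

patterns <- term_to_regex(c("cat", "dog", "pet*"))
print(patterns) # "^cat$" "^dog$" "^pet"

# Combine a category's terms into one pattern and test some dtm column names
category_pattern <- paste(patterns, collapse = "|")
is_match <- grepl(category_pattern, c("cat", "dog", "petlike", "carpet"), perl = TRUE)
print(is_match) # TRUE TRUE TRUE FALSE
```

Under these assumptions, "pet*" matches "petlike" but not "carpet", because only the terminal end of the term is left open.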
Value
A matrix with a row per dtm row and columns per dictionary category (with added coverage_ versions if coverage is TRUE), and a WC attribute with original word counts.
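As a rough illustration of the returned structure, the base R sketch below builds a comparable matrix by hand from a small made-up dtm; it mirrors the description above (category counts, coverage as the number of unique matched terms, and a WC attribute) but is not the package's code. The dtm values and the matching pattern are assumptions chosen for the example.

```r
# A made-up document-term matrix: two documents, four terms
dtm <- matrix(
  c(2, 0, 0, 1,
    1, 1, 1, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("cat", "dog", "petlike", "really"))
)

# Columns belonging to the category (terms cat, dog, and pet*)
matched <- grepl("^cat$|^dog$|^pet", colnames(dtm))
in_category <- dtm[, matched, drop = FALSE]

scores <- cbind(
  category = rowSums(in_category),              # total matches per document
  coverage_category = rowSums(in_category != 0) # unique terms matched
)
attr(scores, "WC") <- rowSums(dtm) # original word counts

# Word counts make it possible to convert counts to proportions
proportions <- scores[, "category"] / attr(scores, "WC")
print(scores)
print(proportions)
```

Dividing by the WC attribute is one common way to make scores comparable across texts of different lengths.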
See also
For applying pattern-based dictionaries (to raw text) see lma_patcat().
Other Dictionary functions: dictionary_meta(), download.dict(), lma_patcat(), read.dic(), report_term_matches(), select.dict()
Examples
dict <- list(category = c("cat", "dog", "pet*"))
lma_termcat(c(
  "cat, cat, cat, cat, cat, cat, cat, cat",
  "a cat, dog, or anything petlike, really",
  "petite petrochemical petitioned petty peter for petrified petunia petals"
), dict, coverage = TRUE)
#>      category coverage_category
#> [1,]        8                 1
#> [2,]        3                 3
#> [3,]        8                 8
#> attr(,"WC")
#> [1] 8 7 9
#> attr(,"time")
#>     dtm termcat
#>    0.02    0.02
#> attr(,"type")
#> [1] "count"
if (FALSE) { # \dontrun{
  # Score texts with the NRC Affect Intensity Lexicon
  dict <- readLines("https://saifmohammad.com/WebDocs/NRC-AffectIntensity-Lexicon.txt")
  dict <- read.table(
    text = dict[-seq_len(grep("term\tscore", dict, fixed = TRUE)[[1]])],
    col.names = c("term", "weight", "category")
  )

  text <- c(
    angry = paste(
      "We are outraged by their hateful brutality,",
      "and by the way they terrorize us with their hatred."
    ),
    fearful = paste(
      "The horrific torture of that terrorist was tantamount",
      "to the terrorism of terrorists."
    ),
    joyous = "I am jubilant to be celebrating the bliss of this happiest happiness.",
    sad = paste(
      "They are nearly suicidal in their mourning after",
      "the tragic and heartbreaking holocaust."
    )
  )

  emotion_scores <- lma_termcat(text, dict)
  if (require("splot")) splot(emotion_scores ~ names(text), leg = "out")

  ## or use the standardized version (which includes more categories)
  emotion_scores <- lma_termcat(text, "nrc_eil", dir = "~/Dictionaries")
  emotion_scores <- emotion_scores[, c("anger", "fear", "joy", "sadness")]
  if (require("splot")) splot(emotion_scores ~ names(text), leg = "out")
} # }
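The lexicon example above passes a table with term, weight, and category columns. The list forms described under dict and term.weights could be built from such a table roughly as follows; this is a base R sketch with made-up terms and weights (not values from the NRC lexicon), and the object names are arbitrary.

```r
# Made-up long-format lexicon; the weights are illustrative only
lexicon <- data.frame(
  term = c("outraged", "hateful", "jubilant", "bliss"),
  weight = c(0.9, 0.8, 0.9, 0.7),
  category = c("anger", "anger", "joy", "joy"),
  stringsAsFactors = FALSE
)

# One character vector of terms per category
dict <- split(lexicon$term, lexicon$category)

# One numeric vector per category, named by term so it lines up with dict
term_weights <- lapply(
  split(lexicon, lexicon$category),
  function(rows) setNames(rows$weight, rows$term)
)

str(dict)
str(term_weights)
# These lists have the shapes the dict and term.weights arguments describe,
# e.g. lma_termcat(text, dict, term.weights = term_weights)
```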