Map a document-term matrix onto a latent semantic space, extract terms from a
latent semantic space (if dtm is a character vector, or map.space = FALSE),
or perform a singular value decomposition of a document-term matrix (if dtm
is a matrix and space is missing).
Usage
lma_lspace(dtm = "", space, map.space = TRUE, fill.missing = FALSE,
  term.map = NULL, dim.cutoff = 0.5, keep.dim = FALSE,
  use.scan = FALSE, dir = getOption("lingmatch.lspace.dir"))
Arguments
- dtm
A matrix with terms as column names, or a character vector of terms to be extracted from a specified space. If this is of length 1 and space is missing, it will be treated as space.
- space
A matrix with terms as row names. If missing, this will be the right singular vectors of a singular value decomposition of dtm. If a character, a file matching the character will be searched for in dir (e.g., space = 'google'). If a file is not found and the character matches one of the available spaces, you will be given the option to download it, as handled by download.lspace. If dtm is missing, the entire space will be loaded and returned.
- map.space
Logical: if FALSE, the original vectors of space for terms found in dtm are returned. Otherwise dtm %*% space is returned, excluding uncommon columns of dtm and rows of space.
- fill.missing
Logical: if TRUE and terms are being extracted from a space, includes terms not found in the space as rows of 0s, such that the returned matrix will have a row for every requested term.
- term.map
A matrix with space as a column name, terms as row names, and indices of the terms in the given space as values, or a numeric vector of indices with terms as names, or a character vector of terms corresponding to rows of the space. This is used instead of reading in an "_terms.txt" file corresponding to a space entered as a character (the name of a space file).
- dim.cutoff
If a space is calculated, this will be used to decide on the number of dimensions to be retained: cumsum(d) / sum(d) < dim.cutoff, where d is a vector of singular values of dtm (i.e., svd(dtm)$d). The default is .5; lower cutoffs result in fewer dimensions.
- keep.dim
Logical: if TRUE, and a space is being calculated from the input, a matrix in the same dimensions as dtm is returned. Otherwise, a matrix with terms as rows and dimensions as columns is returned.
- use.scan
Logical: if TRUE, reads in the rows of space with scan.
- dir
Path to a folder containing spaces. Set a session default with options(lingmatch.lspace.dir = 'desired/path').
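As an illustration of the dim.cutoff criterion described above, the following base R sketch applies cumsum(d) / sum(d) < dim.cutoff to a small made-up matrix. The toy counts, the variable names, and the choice to keep at least one dimension are assumptions for illustration; how lma_lspace treats the dimension that crosses the cutoff is not stated here and may differ.

```r
# Toy document-term matrix: 4 documents by 5 terms (made-up counts)
dtm <- matrix(
  c(
    2, 1, 0, 0, 1,
    1, 2, 0, 0, 1,
    0, 0, 3, 1, 0,
    0, 0, 1, 2, 1
  ),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("cat", "kitten", "car", "wheel", "best"))
)

s <- svd(dtm)

# cumulative share of the summed singular values
share <- cumsum(s$d) / sum(s$d)

# count the dimensions meeting the documented criterion
dim_cutoff <- 0.5
k <- sum(share < dim_cutoff)

# assumption for this sketch: always keep at least one dimension
k <- max(k, 1)

# terms as rows, retained dimensions as columns
space <- s$v[, seq_len(k), drop = FALSE]
rownames(space) <- colnames(dtm)
```

Lowering dim_cutoff can only shrink k, which matches the statement that lower cutoffs result in fewer dimensions.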
Value
A matrix or sparse matrix with either (a) a row per term and column per latent dimension (a latent
space, either calculated from the input, or retrieved when map.space = FALSE), (b) a row per document
and column per latent dimension (when a dtm is mapped to a space), or (c) a row per document and
column per term (when a space is calculated and keep.dim = TRUE).
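To make case (a) concrete, here is a base R sketch of extracting term vectors from a space, with and without zero rows for terms that are not found (the behavior fill.missing is described as enabling). The small space matrix and all variable names are hypothetical, and the row order lma_lspace actually returns is not specified here, so treat this as a picture of the shapes involved rather than of the function's internals.

```r
# Hypothetical space: 3 terms by 2 latent dimensions
space <- matrix(
  c(
    0.9, 0.1,
    0.1, 0.9,
    0.5, 0.5
  ),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("cat", "car", "pet"), c("dim1", "dim2"))
)

# requested terms; "zebra" is not in the space
terms <- c("car", "zebra", "cat")
found <- terms[terms %in% rownames(space)]

# without filling: one row per term that was found
part <- space[found, , drop = FALSE]

# with filling: one row per requested term, 0s where the term is missing
filled <- matrix(
  0, length(terms), ncol(space),
  dimnames = list(terms, colnames(space))
)
filled[found, ] <- space[found, ]
```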
Note
A traditional latent semantic space is a selection of right singular vectors from the singular
value decomposition of a dtm (svd(dtm)$v[, 1:k], where k is the selected number of
dimensions, decided here by dim.cutoff).
Mapping a new dtm into a latent semantic space consists of multiplying common terms:
dtm[, ct] %*% space[ct, ], where ct = colnames(dtm)[colnames(dtm) %in% rownames(space)]
(the terms common between the dtm and the space). This results in a matrix with documents
as rows, and dimensions as columns, replacing terms.
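The mapping step can be written out directly in base R. The two small matrices below are made up for illustration; only the ct and %*% expressions come from the note above.

```r
# Made-up dtm: 2 documents by 3 terms
dtm <- matrix(
  c(
    1, 0, 2,
    0, 1, 1
  ),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("cat", "car", "zebra"))
)

# Made-up space: 3 terms by 2 latent dimensions
space <- matrix(
  c(
    0.9, 0.1,
    0.1, 0.9,
    0.5, 0.5
  ),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("cat", "car", "pet"), c("dim1", "dim2"))
)

# terms common between the dtm and the space
ct <- colnames(dtm)[colnames(dtm) %in% rownames(space)]

# documents as rows, latent dimensions as columns
mapped <- dtm[, ct, drop = FALSE] %*% space[ct, , drop = FALSE]
```

Here "zebra" (only in the dtm) and "pet" (only in the space) drop out, so mapped is a 2 by 2 matrix built from "cat" and "car" alone.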
See also
Other Latent Semantic Space functions: download.lspace(), select.lspace(), standardize.lspace()
Examples
text <- c(
  paste(
    "Hey, I like kittens. I think all kinds of cats really are just the",
    "best pet ever."
  ),
  paste(
    "Oh year? Well I really like cars. All the wheels and the turbos...",
    "I think that's the best ever."
  ),
  paste(
    "You know what? Poo on you. Cats, dogs, rabbits -- you know, living",
    "creatures... to think you'd care about anything else!"
  ),
  paste(
    "You can stick to your opinion. You can be wrong if you want. You know",
    "what life's about? Supercharging, diesel guzzling, exhaust spewing,",
    "piston moving ignitions."
  )
)
dtm <- lma_dtm(text)
# calculate a latent semantic space from the example text
lss <- lma_lspace(dtm)
# show that document similarities between the truncated and full space are the same
spaces <- list(
  full = lma_lspace(dtm, keep.dim = TRUE),
  truncated = lma_lspace(dtm, lss)
)
sapply(spaces, lma_simets, metric = "cosine")
#> $full
#> 4 x 4 sparse Matrix of class "dtCMatrix" (unitriangular)
#>
#> [1,] I . . .
#> [2,] 0.999420475 I . .
#> [3,] 0.140738442 0.10695580 I .
#> [4,] 0.001947292 -0.03209365 0.990319 I
#>
#> $truncated
#> 4 x 4 sparse Matrix of class "dtCMatrix" (unitriangular)
#>
#> [1,] I . . .
#> [2,] 0.999420475 I . .
#> [3,] 0.140738442 0.10695580 I .
#> [4,] 0.001947292 -0.03209365 0.990319 I
#>
if (FALSE) { # \dontrun{
  # specify a directory containing spaces,
  # or where you would like to download spaces
  space_dir <- "~/Latent Semantic Spaces"
  # map to a pretrained space
  ddm <- lma_lspace(dtm, "100k", dir = space_dir)
  # load the matching subset of the space
  # without mapping
  lss_100k_part <- lma_lspace(colnames(dtm), "100k", dir = space_dir)
  ## or
  lss_100k_part <- lma_lspace(dtm, "100k", map.space = FALSE, dir = space_dir)
  # load the full space
  lss_100k <- lma_lspace("100k", dir = space_dir)
  ## or
  lss_100k <- lma_lspace(space = "100k", dir = space_dir)
} # }