Retrieve information about, and links to, latent semantic spaces (sets of word vectors/embeddings) available at osf.io/489he, and optionally download their term mappings (osf.io/xr7jv).
Usage
select.lspace(query = NULL, dir = getOption("lingmatch.lspace.dir"),
  terms = NULL, get.map = FALSE, check.md5 = TRUE, mode = "wb")
Arguments
- query
A character vector used to select spaces, based on names or other features. If its length is over 1, get.map is set to TRUE. Use terms alone to select spaces based on term coverage.
- dir
Path to a directory containing lma_term_map.rda and downloaded spaces; will look in getOption('lingmatch.lspace.dir') and '~/Latent Semantic Spaces' by default.
- terms
A character vector of terms to search for in the downloaded term map, to calculate coverage of spaces, or select by coverage if query is not specified.
- get.map
Logical; if TRUE and lma_term_map.rda is not found in dir, the term map (lma_term_map.rda) is downloaded and decompressed.
- check.md5
Logical; if TRUE (default), retrieves the MD5 checksum from OSF, and compares it with that calculated from the downloaded file to check its integrity.
- mode
Passed to download.file when downloading the term map.
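The integrity check described for check.md5 can be illustrated with base R. This is a minimal sketch and not the package's own implementation: the file is a temporary stand-in for a download, and the expected checksum is computed locally in place of the value OSF would report.

```r
# Hypothetical stand-in for a downloaded file
path <- tempfile(fileext = ".txt")
writeLines(c("term_a", "term_b"), path)

# In practice the expected checksum would come from OSF; here it is computed
# from the same file, purely for illustration
expected_md5 <- unname(tools::md5sum(path))

# Recompute the checksum of the file on disk and compare the two values
observed_md5 <- unname(tools::md5sum(path))
intact <- identical(observed_md5, expected_md5)
intact
```

A mismatch between the two values would suggest the file was truncated or altered during download.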
Value
A list with varying entries:
- info: The version of osf.io/9yzca stored internally; a data.frame with spaces as row names, and information about each space in columns:
  - terms: number of terms in the space
  - corpus: corpus(es) on which the space was trained
  - model: model used to train the space
  - dimensions: number of dimensions in the model (columns of the space)
  - model_info: some parameter details about the model
  - original_max: maximum value used to normalize the space; the original space would be (vectors * original_max) / 100
  - osf_dat: OSF id for the .dat files; the URL would be https://osf.io/{osf_dat}, with {osf_dat} replaced by the id
  - osf_terms: OSF id for the _terms.txt files; the URL would be https://osf.io/{osf_terms}, with {osf_terms} replaced by the id
  - wiki: link to the wiki for the space
  - downloaded: path to the .dat file if downloaded, and '' otherwise.
- selected: A subset of info selected by query.
- term_map: If get.map is TRUE or lma_term_map.rda is found in dir, a copy of osf.io/xr7jv, which has space names as column names, terms as row names, and indices as values, with 0 indicating the term is not present in the associated space.
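The sketch below works through three of the entries described above using made-up data: coverage computed from a term map, rescaling vectors with original_max, and building a URL from an OSF id. The space names, values, and the id "abc12" are placeholders, and the coverage calculation is one plausible reading (proportion of requested terms with a non-zero index); the package's own calculation may differ.

```r
# Toy term map in the layout described above (hypothetical data):
# terms as row names, spaces as column names, 0 = term absent
term_map <- matrix(
  c(
    1L, 2L, 0L, 3L, # space_a
    1L, 0L, 0L, 0L  # space_b
  ),
  nrow = 4,
  dimnames = list(
    c("brexit", "i/o", "debuffs", "part-time"),
    c("space_a", "space_b")
  )
)

# Proportion of the requested terms found in each space
terms <- c("brexit", "debuffs", "part-time")
coverage <- colMeans(term_map[terms, , drop = FALSE] != 0)

# Undo the normalization described for original_max (hypothetical values)
vectors <- matrix(c(100, -50, 25, 0), nrow = 2)
original_max <- 4
original <- (vectors * original_max) / 100

# Build a URL from an OSF id (placeholder id, not a real file)
osf_dat <- "abc12"
dat_url <- paste0("https://osf.io/", osf_dat)
```

Here space_a contains two of the three requested terms and space_b contains one, so space_a would rank higher when selecting by coverage.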
See also
Other Latent Semantic Space functions: download.lspace(), lma_lspace(), standardize.lspace()
Examples
# just retrieve information about available spaces
spaces <- select.lspace()
spaces$info[1:10, c("terms", "dimensions", "original_max")]
#>                   terms dimensions original_max
#> 100k              99188        300  129.8630950
#> 100k_cbow         99186        300    7.3177360
#> 100k_lsa          99188        300 1144.5951390
#> blogs             27277        300    0.4780927
#> CoNLL17_skipgram 459818        100    4.1434050
#> dcp_cbow         215142        400    1.8966740
#> dcp_svd          215142        500    0.6560584
#> eigenwords       159908        200    1.0000000
#> eigenwords_tscca  55879        200    1.5136493
#> facebook_crawl    81653        300    4.5530000
# retrieve all spaces that used word2vec
w2v_spaces <- select.lspace("word2vec")$selected
w2v_spaces[, c("terms", "dimensions", "original_max")]
#>                   terms dimensions original_max
#> 100k_cbow         99186        300    7.3177360
#> CoNLL17_skipgram 459818        100    4.1434050
#> dcp_cbow         215142        400    1.8966740
#> google           345655        300    2.8437500
#> nasari           151776        300    0.7317708
#> paragram_sl999   456295        300    4.0129000
#> paragram_ws353   456295        300    4.0129000
#> sensembed        409078        400    1.7170870
#> ukwac_cbow       171187        400    6.6162160
if (FALSE) { # \dontrun{
# select spaces by terms
select.lspace(terms = c(
  "part-time", "i/o", "'cause", "brexit", "debuffs"
))$selected[, c("terms", "coverage")]
} # }