
These examples assume the lingmatch package is loaded:
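
library(lingmatch)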

If the text you want to analyze is already in R, you can process it:

# individual words
data = lma_process(texts)

# with a dictionary
data = lma_process(texts, dict = 'inquirer', dir = '~/Dictionaries')

# with a latent semantic space
data = lma_process(texts, space = 'glove', dir = '~/Latent Semantic Spaces')

Or, you can just calculate similarity:

# pairwise cosine similarity in terms of all words
sims = lingmatch(texts)$sim

# pairwise Canberra similarity in terms of function word categories
sims = lingmatch(texts, type = 'lsm')$sim

# pairwise cosine similarity in terms of latent semantic space dimensions
sims = lingmatch(texts, type = 'lss')$sim

Or, if you have processed data (such as LIWC output), you can enter that:

# if all dictionary categories are found in the input, only those variables
# will be used
sims = lingmatch(data, dict = 'function')$sim

# otherwise, enter just the columns you want to be part of the comparison
sims = lingmatch(data[, c('cat1', 'cat2', 'catn')])$sim

Read on for more about loading text into R, processing texts, and measuring similarity, or see the comparisons guide for more about defining comparisons.

Loading your texts

You will need a path to the file containing your texts; the helper functions mentioned here are demonstrated after this list. You could…

  • Select interactively with file.choose(), which returns the path.
  • Give the full path (you can use / or \\ as separators), e.g.,
    • Windows: 'c:/users/name/documents/texts.txt'
    • Linux: '/home/Name/Documents/texts.txt'
    • Mac: '/users/name/documents/texts.txt'
  • Give a relative path (use normalizePath('example') to see the full path):
    • with a tilde (see what that starts at with path.expand('~')), e.g., '~/texts.txt'
    • from the working directory (see what that is with getwd(), set it with setwd()), e.g., 'texts.txt'
    • from the parent of the working directory, e.g., '../texts.txt'
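
For example, here is how those helper functions look in use (a sketch; 'path/to/folder' and 'texts.txt' are placeholders):

# interactively select a file; returns its full path
path = file.choose()

# see the full path of a relative path
normalizePath('texts.txt')

# see what a tilde starts at
path.expand('~')

# see and set the working directory
getwd()
setwd('path/to/folder')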

In the following examples, only the relative path to the file is shown, as if the working directory were set to the folder containing the files.

From plain-text files

When there is one entry per line:

texts = readLines('texts.txt')

When you want to segment a single file:

# with multiple lines between entries
segs = read.segments('texts.txt')

# into 5 even segments
segs = read.segments('texts.txt', 5)

# into 100-word chunks
segs = read.segments('texts.txt', segment.size = 100)

# then get texts from segs
texts = segs$text

When you want to read in multiple files from a folder:

texts = read.segments('foldername')$text

When your files are just text, you could also enter the path into lingmatch functions without first loading them:

results = lingmatch('texts.txt')

From a delimited plain-text file

When texts are in a column of a spreadsheet, stored in a plain-text file:

# comma delimited
data = read.csv('data.csv')

# tab delimited (sometimes with extension .tsv)
data = read.delim('data.txt')

# other delimiters; define with the sep argument
# (you might also need to change the quote or other arguments,
# depending on your file's format)
data = read.delim('data.txt', sep = 'delimiting character')

# then get texts from data
texts = data$name_of_text_column

From a Microsoft Office file

Install and load the readtext package:
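
install.packages('readtext')
library(readtext)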

From a .doc or .docx file:

texts = readtext('texts.docx')$text

# this returns all lines in one text, so you could
# use read.segments to split it up if needed
texts = read.segments(texts)$text

From a .xls or .xlsx file:

texts = readtext('data.xlsx')$name_of_text_column

Processing text

Processing texts represents them numerically, and this representation defines what counts as matching between them.

For example, matching on structural features (e.g., number of words and their average length) gives a sense of how similar texts are in form:

structural_features = lma_meta(texts)

You could also look at exact matching between words by making a document-term matrix:

# all words
dtm = lma_dtm(texts)

# excluding stop words (function words) and rare words (those appearing in
# fewer than 3 texts)
dtm = lma_dtm(texts, exclude = 'function', dc.min = 2)
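
As a quick check, you could inspect the resulting matrix (lma_dtm should return a sparse matrix with a row per text and a column per term):

# dimensions: number of texts by number of unique terms
dim(dtm)

# a few of the included terms
head(colnames(dtm))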

In the next examples, raw texts are processed with the lma_dtm function using its defaults, but you could also enter a document-term matrix in place of texts, processed separately as in the previous examples.
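
For instance, the premade dtm could stand in for the raw texts (a sketch using the dtm made above):

# e.g., the next example's categories, scored from the premade matrix
lma_termcat(dtm, lma_dict())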

Function word categories are another style-related set of features, which would give a sense of how stylistically similar texts are:

function_cats = lma_termcat(texts, lma_dict())

To get at similarity in something like tone, you could use a sentiment dictionary:

sentiment = lma_termcat(texts, 'huliu', dir = '~/Dictionaries')

To get at similarity in overall meaning, you could use a content-analysis-focused dictionary like the General Inquirer:

inquirer_cats = lma_termcat(texts, 'inquirer', dir = '~/Dictionaries')

Or a set of embeddings:

glove_dimensions = lma_lspace(
  lma_dtm(texts), 'glove', dir = '~/Latent Semantic Spaces'
)

Measuring matching

Once you have processed texts, you can measure matching between them.

You could calculate similarity between each pair of texts with different metrics:

# Inverse Canberra distance
can_sims = lma_simets(function_cats, metric = 'canberra')

# Cosine similarity
cos_sims = lma_simets(function_cats, metric = 'cosine')

Or between each text and the average across texts, with all available metrics:

sims_to_mean = lma_simets(function_cats, colMeans(function_cats))

Or just between the first and second text:

lma_simets(function_cats[1, ], function_cats[2, ])