Read and Segment Multiple Texts — read.segments • lingmatch

Split texts by word count or specific characters. Input texts directly, or read them in from files.

Usage

read.segments(path = ".", segment = NULL, ext = ".txt", subdir = FALSE,
  segment.size = -1, bysentence = FALSE, end_in_quotes = TRUE,
  preclean = FALSE, text = NULL)

Arguments

path: Path to a folder containing files, or a vector of paths to files. If no folders or files are recognized in path, it is treated as text.
segment: Specifies how the text of each file should be segmented. If a character, split at that character; '\n' by default. If a number, texts will be broken into that many segments, each with a roughly equal number of words.
ext: The extension of the files you want to read in. '.txt' by default.
subdir: Logical; if TRUE, files in folders in path will also be included.
segment.size: Number (-1 by default); if specified, segment will be ignored, and texts will be broken into segments containing roughly segment.size words each.
bysentence: Logical; if TRUE, and segment is a number or segment.size is specified, sentences will be kept together, rather than potentially being broken across segments.
end_in_quotes: Logical; if FALSE, sentence-ending marks (.?!) will not be considered when immediately followed by a quotation mark. For example, '"Word." Word.' would be considered one sentence.
preclean: Logical; if TRUE, text will be cleaned with lma_dict(special) before segmentation.
text: A character vector with text to be split, used in place of path. Each entry is treated as a file.
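
To make the numeric form of segment concrete, the following is a minimal base R sketch of splitting one text into n segments with a roughly equal number of words. The helper split_even is hypothetical and is not part of lingmatch; it only illustrates the idea and does not reproduce how read.segments handles files, sentences, or cleaning.

```r
# Hypothetical helper (an assumption for illustration, not lingmatch code):
# split a single text into n segments of roughly equal word count.
split_even <- function(text, n) {
  # tokenize on whitespace
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # assign each word to one of n consecutive groups of near-equal size
  group <- ceiling(seq_along(words) * n / length(words))
  segs <- split(words, group)
  data.frame(
    segment = seq_along(segs),
    WC = lengths(segs),
    text = vapply(segs, paste, "", collapse = " "),
    row.names = NULL
  )
}

res <- split_even("split this text into two segments", 2)
res
```

With a six-word text and n = 2, each segment receives three words, which matches the word counts shown for the first documented example under Examples.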

Value

A data.frame with columns for file names (input), segment number within file (segment), word count for each segment (WC), and the text of each segment (text).
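
Because the result is an ordinary data.frame, it can be summarized with base R. The sketch below builds a small data.frame by hand with the same four columns described above (the values mirror the first documented example; they are typed in here rather than produced by read.segments) and then aggregates by input.

```r
# Hand-built stand-in for a read.segments() result, using the documented
# column names; the values are copied from the first documented example.
segs <- data.frame(
  input = c(1, 1),
  segment = 1:2,
  WC = c(3, 3),
  text = c("split this text", "into two segments")
)

# total word count per input text
total_wc <- tapply(segs$WC, segs$input, sum)

# rejoin the segments of each input into a single string
rejoined <- tapply(as.character(segs$text), segs$input, paste, collapse = " ")
```

The same calls should work on an actual result as long as its columns keep the names listed above.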

Examples

# split preloaded text
read.segments("split this text into two segments", 2)
#>   input segment WC              text
#> 1     1       1  3   split this text
#> 2     1       2  3 into two segments

if (FALSE) { # \dontrun{

# read in all files from the package directory
texts <- read.segments(path.package("lingmatch"), ext = "")
# show everything except the text column
texts[, -4]

# segment .txt files in dir in a few ways:
dir <- "path/to/files"

## into 1 line segments
texts_lines <- read.segments(dir)

## into 5 even segments each
texts_5segs <- read.segments(dir, 5)

## into 50 word segments
texts_50words <- read.segments(dir, segment.size = 50)

## into 1 sentence segments
texts_1sent <- read.segments(dir, segment.size = 1, bysentence = TRUE)
} # }