Split texts by word count or specific characters. Input texts directly, or read them in from files.
Usage
read.segments(path = ".", segment = NULL, ext = ".txt", subdir = FALSE,
segment.size = -1, bysentence = FALSE, end_in_quotes = TRUE,
preclean = FALSE, text = NULL)
Arguments
- path
Path to a folder containing files, or a vector of paths to files. If no folders or files are recognized in path, it is treated as text.
- segment
Specifies how the text of each file should be segmented. If a character, split at that character; '\n' by default. If a number, texts will be broken into that many segments, each with a roughly equal number of words.
- ext
The extension of the files you want to read in. '.txt' by default.
- subdir
Logical; if TRUE, files in folders in path will also be included.
- segment.size
Number; if specified, segment will be ignored, and texts will be broken into segments containing roughly segment.size words each.
- bysentence
Logical; if TRUE, and segment is a number or segment.size is specified, sentences will be kept together, rather than potentially being broken across segments.
- end_in_quotes
Logical; if FALSE, sentence-ending marks (.?!) will not be considered when immediately followed by a quotation mark. For example, '"Word." Word.' would be considered one sentence.
- preclean
Logical; if TRUE, text will be cleaned with lma_dict(special) before segmentation.
- text
A character vector with text to be split, used in place of path. Each entry is treated as a file.
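The difference between a numeric segment and segment.size can be illustrated with a minimal base R sketch. The helper split_words below is hypothetical and is not how read.segments is implemented; it assumes words are separated by whitespace and ignores sentence boundaries, file reading, and cleaning.

```r
# Hypothetical helper, not part of lingmatch: illustrates the two
# word-based splitting modes described in the arguments above.
split_words <- function(text, n_segments = NULL, segment_size = NULL) {
  # assumption: words are whatever is separated by whitespace
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (!is.null(segment_size)) {
    # fixed size: start a new segment every segment_size words
    groups <- ceiling(seq_along(words) / segment_size)
  } else {
    # fixed count: spread words over n_segments roughly equal groups
    groups <- ceiling(seq_along(words) * n_segments / length(words))
  }
  # paste each group of words back into a single string
  unname(vapply(split(words, groups), paste, character(1), collapse = " "))
}

by_count <- split_words("split this text into two segments", n_segments = 2)
by_size <- split_words("split this text into two segments", segment_size = 4)
```

Here by_count holds two three-word segments, while by_size holds one four-word segment followed by the two remaining words.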
Value
A data.frame with columns for file names (input), segment number within file (segment), word count for each segment (WC), and the text of each segment (text).
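As a sketch of how the returned columns might be used, the block below rebuilds by hand the data.frame shown in the first example under Examples (so it runs without the package) and then summarizes it with base R. The variable names segs, total_wc, and rejoined are illustrative only.

```r
# Hand-built copy of the documented output of
# read.segments("split this text into two segments", 2)
segs <- data.frame(
  input = c(1, 1),
  segment = c(1, 2),
  WC = c(3, 3),
  text = c("split this text", "into two segments"),
  stringsAsFactors = FALSE
)

# total word count per input
total_wc <- tapply(segs$WC, segs$input, sum)

# rejoin the segments of each input, in segment order
segs <- segs[order(segs$input, segs$segment), ]
rejoined <- tapply(segs$text, segs$input, paste, collapse = " ")
```

Because input identifies the source file or text entry and segment gives the order within it, the pair works as a key for grouping or reassembling segments.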
Examples
# split preloaded text
read.segments("split this text into two segments", 2)
#>   input segment WC              text
#> 1     1       1  3   split this text
#> 2     1       2  3 into two segments
if (FALSE) { # \dontrun{
# read in all files from the package directory
texts <- read.segments(path.package("lingmatch"), ext = "")
texts[, -4]
# segment .txt files in dir in a few ways:
dir <- "path/to/files"
## into 1 line segments
texts_lines <- read.segments(dir)
## into 5 even segments each
texts_5segs <- read.segments(dir, 5)
## into 50 word segments
texts_50words <- read.segments(dir, segment.size = 50)
## into 1 sentence segments
texts_1sent <- read.segments(dir, segment.size = 1, bysentence = TRUE)
} # }