adicat
Adicat is a JavaScript library for dictionary-based text processing. Its main role is in the dictionary highlighter, but it's made to be modular, so it could be used in other applications, such as this simple, cue-based chat bot.
It's called adicat (as-needed dictionary-based categorization) because it stores different representations of the processed text, and only performs additional processing when called for. See the docs site for structured documentation.
For example,
text = new adicat('Text to process.')
will just set up an object containing the original, unprocessed text. Then,
text.tokenize()
will add cleaned versions of the original text, arrays of words, and a word and character count. Finally,
text.categorize()
will add a vectorized representation, and if a dictionary is loaded, dictionary category scores.
Adicat is loaded in this page, so you can try these examples out in your browser's console (F12).
The data from each processing level are added to the assigned object like this (partially):
{
WC: 3,
string: {
raw: 'Text to process.',
clean: 'Text to process.',
stripped: ' text to process '
},
words: {
print: ['Text', 'to', 'process.'],
token: ['text', 'to', 'process'],
categories: [' none ', ' none ', ' none ']
},
vector: {
text: 1,
to: 1,
process: 1
},
_processLevel: 2
}
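The as-needed pattern behind these processing levels can be sketched in a few lines. This is only an illustration, not adicat's source: the class name LazyText, the processLevel field, and the simplified cleaning rules are all assumptions made for the example.

```javascript
// Hypothetical illustration of as-needed processing; not adicat's actual code.
class LazyText {
  constructor(raw) {
    this.string = { raw };
    this.processLevel = 0; // 0 = only the raw text is stored
  }
  // Level 1: add a stripped string, word arrays, and a word count.
  tokenize() {
    if (this.processLevel < 1) {
      const stripped = this.string.raw.toLowerCase().replace(/[^a-z0-9']+/g, ' ').trim();
      this.string.stripped = ' ' + stripped + ' ';
      this.words = {
        print: this.string.raw.split(/\s+/).filter(Boolean),
        token: stripped ? stripped.split(' ') : [],
      };
      this.WC = this.words.token.length;
      this.processLevel = 1;
    }
    return this; // returning `this` allows chaining
  }
  // Level 2: add word counts, running tokenize() first if it has not run yet.
  categorize() {
    if (this.processLevel < 1) this.tokenize();
    if (this.processLevel < 2) {
      this.vector = Object.create(null); // no inherited keys such as "constructor"
      for (const token of this.words.token) {
        this.vector[token] = (this.vector[token] || 0) + 1;
      }
      this.processLevel = 2;
    }
    return this;
  }
}

const text = new LazyText('Text to process.');
text.categorize(); // tokenization is triggered implicitly
console.log(text.WC, text.words.token, text.vector);
```

Each method checks the stored level before doing any work, so calling a higher level directly, or calling the same level twice, never repeats processing.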
similarity
Adicat's similarity function calculates the cosine similarity or inverse Canberra distance between the target text and a comparison.
The sample means reported in the LIWC 2015 manual are stored in Adicat.liwc_means. These include the standard function word categories that go into Language Style Matching, so texts have to be processed by a function word dictionary (like adicat's default, not loaded here).
If a dictionary with these categories were assigned to Adicat.patterns.dict, this would return the inverse Canberra distance between the entered text and the expressive standard:
new adicat("I'm not feeling very expressive.").similarity('expressive')
The entered text can also be compared to another text, which can be either a preprocessed adicat object or a raw string passed to similarity, which will then process it. Similarity can be calculated between category scores, meta category scores (such as punctuation and number of words), or each vectorized text. For example,
new adicat('compare this bit of text').similarity('with this bit', 'cosine', 'vector')
would return the cosine similarity between the vectorized forms of each entered text.
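To make the two metrics concrete, here is a minimal standalone sketch that works on word-count vectors. The helper names are hypothetical, and the inverse Canberra formula used (1 minus the mean of |x - y| / (|x| + |y|)) is an assumption about the definition; adicat's own calculation may differ.

```javascript
// Hypothetical helpers for illustration; not adicat's actual implementation.

// Count how often each lowercase word occurs in a string.
function wordCounts(text) {
  const counts = new Map();
  for (const word of text.toLowerCase().match(/[a-z0-9']+/g) || []) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Align two count maps on their combined vocabulary, filling gaps with 0.
function align(a, b) {
  const keys = [...new Set([...a.keys(), ...b.keys()])];
  return [keys.map(k => a.get(k) || 0), keys.map(k => b.get(k) || 0)];
}

// Cosine similarity: dot product divided by the product of vector lengths.
function cosine(x, y) {
  let dot = 0, xx = 0, yy = 0;
  for (let i = 0; i < x.length; i++) {
    dot += x[i] * y[i];
    xx += x[i] * x[i];
    yy += y[i] * y[i];
  }
  return xx && yy ? dot / Math.sqrt(xx * yy) : 0;
}

// Inverse Canberra, assumed here to be 1 minus the mean of |x - y| / (|x| + |y|).
function inverseCanberra(x, y) {
  if (!x.length) return 0;
  let sum = 0;
  for (let i = 0; i < x.length; i++) {
    const denom = Math.abs(x[i]) + Math.abs(y[i]);
    if (denom) sum += Math.abs(x[i] - y[i]) / denom; // 0/0 terms count as 0
  }
  return 1 - sum / x.length;
}

const [x, y] = align(wordCounts('compare this bit of text'), wordCounts('with this bit'));
console.log(cosine(x, y), inverseCanberra(x, y));
```

Both functions return 1 for identical vectors. Cosine ignores overall magnitude, while the Canberra-based score compares each dimension proportionally, which is why it suits category percentages.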
To measure similarity, the text has to be processed up to level 2 (categorization), and potentially have meta categories added with text.procmeta(). It isn't necessary to explicitly perform lower-order processing because higher-order processes perform it if necessary. For instance, the most explicit form of that last example would be this:
a = new adicat('compare this bit of text').tokenize().categorize();
b = new adicat('with this bit').tokenize().categorize();
a.similarity(b, 'cosine', 'vector')
The similarity function triggers categorization if needed, and the categorize function triggers tokenization if needed.
highlight
The dictionary highlighter was originally made to look at Linguistic Inquiry and Word Count categories. It now uses its own dictionaries, and allows for new dictionaries to be created, and external dictionaries to be loaded.
With the highlighter, you can...
- See which words are being captured by dictionary categories.
  - Toggle categories in the dictionary menu.
  - Set to display counts or percentages in the settings menu.
- Calculate composite categories for, and similarities between, texts.
  - Composite categories can be added and edited in the dictionary's load/create/edit menu.
  - Set a stored text for comparison in the saved texts menu.
  - Set the comparison categories and metric in the settings menu.
- Create or import, edit, and export dictionaries.
  - Cycle between stored dictionaries in the dictionary's load/create/edit menu.
- Download the results of text files scored by the selected dictionary.
  - Drag and drop a text file anywhere on the page, or navigate to the process file menu.
  - Specify output values, categories, composites, and comparisons in the other menus.
  - Specify formatting and splitting in the process file menu.
Implementation
A dictionary may start out as an object of arrays like this:
{
term:['term*', 'word*', 'token*', '[n\\d-]*gram*'],
category:['categor*', 'list*', 'dictionar*', 'topic*']
}
But it ends up as an object of regular expressions like this, as converted by the Adicat.toRegex() function:
{
term:/^term|^word|^token|^[n\d-]*gram/,
category:/^categor|^list|^dictionar|^topic/
}
The asterisk at the end of each term represents a greedy wildcard, so the entry will match any word starting with the entered string. Adicat translates initial and terminal asterisks to regular expression equivalents. Asterisks that are not at the beginning or end of an entered string are treated as regular expression syntax, and any other valid regular expression syntax is retained. In this case, /[n\d-]*gram/ will match "gram" and variants preceded by any number of ns, digits, or dashes, such as "n-gram".
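The conversion can be approximated with a short function. This is a hypothetical re-implementation for illustration only; Adicat.toRegex() may handle more cases or differ in detail, and the names termsToTokenRegex and countCategories are not part of adicat.

```javascript
// Hypothetical re-implementation for illustration; Adicat.toRegex() may differ in detail.
function termsToTokenRegex(terms) {
  const parts = terms.map(term => {
    const start = term.startsWith('*') ? '' : '^'; // an initial * leaves the start open
    const end = term.endsWith('*') ? '' : '$';     // a terminal * leaves the end open
    return start + term.replace(/^\*|\*$/g, '') + end;
  });
  return new RegExp(parts.join('|'));
}

const dict = {
  term: ['term*', 'word*', 'token*', '[n\\d-]*gram*'],
  category: ['categor*', 'list*', 'dictionar*', 'topic*'],
};
const patterns = {};
for (const name of Object.keys(dict)) patterns[name] = termsToTokenRegex(dict[name]);

// Count how many tokens each category pattern captures.
function countCategories(tokens, patterns) {
  const counts = {};
  for (const name of Object.keys(patterns)) {
    counts[name] = tokens.filter(token => patterns[name].test(token)).length;
  }
  return counts;
}

console.log(patterns.term.source);
console.log(countCategories(['each', 'n-gram', 'is', 'listed', 'by', 'topic'], patterns));
```

In this sketch a term without a terminal asterisk gets a closing $, so an entry such as 'ok' would match only the exact token "ok".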
This is an example of a complete implementation of the highlighter code. You could copy this code and save it as an HTML file for a functional highlighter:
<!DOCTYPE html><html><head><meta charset = 'utf-8'/></head><body>
<!-- (1) set up HTML elements for input, and optionally a reprocess button -->
Type into the input element -- words matching those in the dictionary will be colored by category.
<div id = 'input' contenteditable = 'true'></div>
<button type = 'button' onclick = 'Adicat.hl.display_text()'>reprocess</button>
<!-- (2) load in the core and highlight scripts -->
<script type = 'text/javascript' src = 'https://miserman.github.io/adicat/core.min.js'></script>
<script type = 'text/javascript' src = 'https://miserman.github.io/adicat/highlight.min.js' async></script>
<script type = 'text/javascript'>
window.onload = function(){
// (2.1) assign a dictionary
Adicat.patterns.dict = {
'term':['term*', 'word*', 'token*', '[n\\d-]*gram*'],
'category':['categor*', 'list*', 'dictionar*', 'topic*']
}
// (2.2) assign an input/output element
Adicat.hl.input = document.getElementById('input')
// (2.3) add event listeners if you want text to be processed as it's typed
Adicat.hl.input.addEventListener('keypress', Adicat.hl.spanner)
Adicat.hl.input.addEventListener('keyup', Adicat.hl.process_span)
}
</script>
<!-- (3) set colors for categories -- a class name for each dictionary category -->
<style type = 'text/css'>
#input{padding: 1em; margin: 1em 0; outline: 1px solid}
.term{color: #b454da}
.category{color: #60b346}
</style>
</body></html>
The result will be a text box that highlights keywords as they're matched.
chat
The chat prototype is an example of how adicat might be used to process and react to text as it's received.
Chat makes use of adicat's cue processor. Unlike the categorizer used by the highlighter, the cue processor takes in full text strings (rather than tokens) and says whether any of a set of expressions were found (rather than counting up each expression).
For example, text = new adicat('Hey there :)').detect() would process the input up to level 1 (tokenization), then check the input for each of the default cue categories. The output goes to text.cues, which is an object of logical entries indicating whether any matches were found. In this example, text.cues.greeting and text.cues.happy would both be true.
Cue dictionaries are stored like the other dictionaries. The only difference is in how terms are bounded: terms aimed at tokens are bounded by the beginning and end of the string (^ and $ respectively), whereas terms aimed at full strings are bounded by word boundaries (standardized to single blank spaces). Objects of arrays of terms can be converted to regular expressions with Adicat.toRegex(obj, level), where setting the level argument to true converts to string-targeted patterns.
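As a rough sketch of string-targeted matching, the example below pads the text with spaces and requires a space on either side of each term. The function names are hypothetical, and translating a wildcard to "any run of non-space characters" is an assumption; the patterns produced by Adicat.toRegex(obj, true) may be built differently.

```javascript
// Hypothetical sketch; the boundary handling in Adicat.toRegex(obj, true) may differ.
function termsToStringRegex(terms) {
  const parts = terms.map(term => {
    const start = term.startsWith('*') ? '[^ ]*' : ''; // wildcard = rest of the word
    const end = term.endsWith('*') ? '[^ ]*' : '';
    return start + term.replace(/^\*|\*$/g, '') + end;
  });
  // A single space on each side stands in for a word boundary.
  return new RegExp(' (?:' + parts.join('|') + ') ');
}

// Lowercase, collapse whitespace, and pad with spaces so every word is bounded.
function standardize(text) {
  return ' ' + text.toLowerCase().replace(/\s+/g, ' ').trim() + ' ';
}

// Report, for each cue, whether any of its terms appears in the text.
function detectCues(text, cues) {
  const padded = standardize(text);
  const found = {};
  for (const name of Object.keys(cues)) found[name] = cues[name].test(padded);
  return found;
}

const cues = {
  greeting: termsToStringRegex(['hey', 'hi', 'hello*']),
  followup: termsToStringRegex(['fine', 'good', 'alright*', 'ok', 'noth*']),
};
console.log(detectCues('Hey there', cues));
console.log(detectCues('Nothing much, okay?', cues));
```

Because the result is a set of true/false flags rather than counts, a reply can be chosen as soon as any cue is found.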
This is an example of a complete implementation of the chat code:
<!DOCTYPE html><html><head><meta charset = 'utf-8'/></head><body>
<!-- (1) set up HTML elements for an input and chat log -->
<div id = 'chat_log'></div>
<input id = 'chat_input' placeholder = 'Message @adicat'/>
<!-- (2) load in the core and chat scripts -->
<script type = 'text/javascript' src = 'https://miserman.github.io/adicat/core.min.js'></script>
<script type = 'text/javascript' src = 'https://miserman.github.io/adicat/chat.min.js' async></script>
<script type = 'text/javascript'>
window.onload = function(){
// (2.0) maybe add an additional cue
Adicat.patterns.cues.followup = Adicat.toRegex(['fine', 'good', 'alright*', 'ok', 'noth*'], true)
/*
(2.1) set up a set of replies. Arrays in each entry are sampled from:
- initial for opening message
- general for when no match is found
- additional entries named for each cue
*/
Adicat.chat.replies = {
initial: ['Hello', 'Hi', 'Hey there!'],
general: ['Hmm, what to say', 'You talk now', '<conversation starter>'],
greeting: ['How are you?', "What's up?", "What's on your mind?"],
followup: ['Cool, cool... ', 'Same here... ', 'Alright then... ']
}
// (2.2) assign each chat element
Adicat.chat.parts.log = document.getElementById('chat_log')
Adicat.chat.parts.input = document.getElementById('chat_input')
// (2.3) add an event listener to send on enter
Adicat.chat.parts.input.addEventListener('keypress', function(k){if(k.which === 13) Adicat.chat.send()})
// (2.4) send an initial message
Adicat.chat.receive('', Adicat.chat.replies.initial.sample())
}
</script>
<!-- (3) add styling to differentiate incoming and outgoing messages -->
<style type = 'text/css'>
.outgoing{text-align:right}
</style>
</body></html>
The result will be a chat interface like this (in its simplest form):