adicat

Adicat is a JavaScript library for dictionary-based text processing. Its main role is in the dictionary highlighter, but it's made to be modular, so it could be used in other applications, such as this simple, cue-based chat bot.

It's called adicat (as-needed dictionary-based categorization) because it stores different representations of the processed text, and only performs additional processing when called for. See the docs site for structured documentation.

For example, text = new adicat('Text to process.') will just set up an object containing the original, unprocessed text. Then, text.tokenize() will add cleaned versions of the original text, arrays of words, and a word and character count. Finally, text.categorize() will add a vectorized representation, and if a dictionary is loaded, dictionary category scores. Adicat is loaded in this page, so you can try these examples out in your browser's console (F12).

The data from each processing level are added to the assigned object like this (partially):

    {
      WC: 3,
      string: {
        raw: 'Text to process.',
        clean: 'Text to process.',
        stripped: ' text to process '
      },
      words: {
        print: ['Text', 'to', 'process.'],
        token: ['text', 'to', 'process'],
        categories: [' none ', ' none ', ' none ']
      },
      vector: { text: 1, to: 1, process: 1 },
      _processLevel: 2
    }
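The as-needed idea can be sketched in a few lines of plain JavaScript. This is a minimal illustration, not adicat's source: the LazyText class and its cleaning rule are hypothetical, and only the property names (WC, words.token, vector, _processLevel) follow the object shown above.

```javascript
// Hypothetical sketch of as-needed processing; not adicat's actual implementation.
class LazyText {
  constructor(raw) {
    this.string = { raw };
    this._processLevel = 0; // 0: raw only, 1: tokenized, 2: categorized
  }

  tokenize() {
    if (this._processLevel >= 1) return this; // already done, skip the work
    // Assumed cleaning rule: lowercase, then turn runs of other characters into spaces.
    const tokens = this.string.raw
      .toLowerCase()
      .replace(/[^a-z0-9']+/g, ' ')
      .trim()
      .split(' ')
      .filter(Boolean);
    this.words = { token: tokens };
    this.WC = tokens.length;
    this._processLevel = 1;
    return this;
  }

  categorize() {
    if (this._processLevel >= 2) return this;
    this.tokenize(); // the lower level runs only if it has not run yet
    // A prototype-free object keeps words like "constructor" from colliding
    // with inherited properties.
    this.vector = Object.create(null);
    for (const token of this.words.token) {
      this.vector[token] = (this.vector[token] || 0) + 1;
    }
    this._processLevel = 2;
    return this;
  }
}

const text = new LazyText('Text to process.');
const levelBefore = text._processLevel; // 0: nothing has been processed yet
text.categorize(); // triggers tokenize() internally
console.log(levelBefore, text._processLevel, text.WC, text.words.token);
```

Each method checks the stored level first, so calling categorize() twice, or calling tokenize() after categorize(), does no extra work.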

similarity

Adicat's similarity function calculates the cosine similarity or inverse Canberra distance between the target text and a comparison.

The sample means reported in the LIWC 2015 manual are stored in Adicat.liwc_means. These include the standard function word categories that go into Language Style Matching, so texts have to be processed by a function word dictionary (like adicat's default, not loaded here).

If a dictionary with these categories was assigned to Adicat.patterns.dict, this would return the inverse Canberra distance between the entered text and the expressive standard: new adicat("I'm not feeling very expressive.").similarity('expressive')

The entered text can also be compared to another text, which can either be preprocessed or be entered directly into the call to similarity, where it will be processed. Similarity can be calculated between category scores, meta category scores (such as punctuation and number of words), or each vectorized text. For example, new adicat('compare this bit of text').similarity('with this bit', 'cosine', 'vector') would return the cosine similarity between the vectorized forms of each entered text.
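For reference, the two metrics can be sketched on plain word-count vectors, independent of adicat. The function names toVector, cosine, and inverseCanberra are hypothetical, and the inverse Canberra form used here (1 minus the mean of |x - y| / (x + y) over all terms) is an assumption; adicat's exact formula and scaling may differ.

```javascript
// Count word frequencies in a lowercased, whitespace-split text.
function toVector(text) {
  const vector = Object.create(null); // no inherited keys to collide with words
  for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    vector[word] = (vector[word] || 0) + 1;
  }
  return vector;
}

// Cosine similarity: dot product divided by the product of vector lengths.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (const key of Object.keys(a)) {
    normA += a[key] * a[key];
    if (key in b) dot += a[key] * b[key];
  }
  for (const key of Object.keys(b)) normB += b[key] * b[key];
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}

// Assumed form of inverse Canberra: 1 - mean(|x - y| / (x + y)) over every
// term that appears in either vector.
function inverseCanberra(a, b) {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  if (keys.size === 0) return 0;
  let total = 0;
  for (const key of keys) {
    const x = a[key] || 0;
    const y = b[key] || 0;
    total += Math.abs(x - y) / (x + y); // x + y > 0 because the key occurs somewhere
  }
  return 1 - total / keys.size;
}

const first = toVector('compare this bit of text');
const second = toVector('with this bit');
console.log(cosine(first, second), inverseCanberra(first, second));
```

Both measures are 1 for identical vectors and 0 for vectors that share no terms, which makes them comparable as similarity scores.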

To measure similarity, the text has to be processed up to level 2 (categorization), and potentially have meta categories added with text.procmeta(). It isn't necessary to explicitly perform lower order processing because higher order processes perform those if necessary. For instance, the most explicit form of that last example would be this:

    a = new adicat('compare this bit of text').tokenize().categorize();
    b = new adicat('with this bit').tokenize().categorize();
    a.similarity(b, 'cosine', 'vector')

The similarity function triggers categorization if needed, and the categorize function triggers tokenization if needed.

highlight

The dictionary highlighter was originally made to look at Linguistic Inquiry and Word Count categories. It now uses its own dictionaries, and allows for new dictionaries to be created, and external dictionaries to be loaded.

With the highlighter, you can...

Implementation

A dictionary may start out as an object of arrays like this: { term:['term*', 'word*', 'token*', '[n\\d-]*gram*'], category:['categor*', 'list*', 'dictionar*', 'topic*'] }

But it ends up as an object of regular expressions like this, as converted by the Adicat.toRegex() function: { term:/^term|^word|^token|^[n\d-]*gram/, category:/^categor|^list|^dictionar|^topic/ }

The asterisk at the end of each term represents a greedy wildcard, such that the entry will match any word starting with the entered string. Adicat translates initial and terminal asterisks to their regular expression equivalents. Asterisks that are not at the beginning or end of an entered string are treated as regular expression quantifiers, and any other valid regular expression syntax is retained. In this case, /[n\d-]*gram/ will match "gram" and variants preceded by any number of ns, digits, or dashes, such as "n-gram".
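The translation described above can be approximated with a short helper. This is a sketch under stated assumptions, not the code behind Adicat.toRegex(): the name termsToRegex is hypothetical, and anchoring terms that lack a terminal asterisk with $ is inferred from the description of token-targeted terms rather than taken from adicat's source.

```javascript
// Hypothetical helper; Adicat.toRegex() may differ in detail.
function termsToRegex(terms) {
  const parts = terms.map((term) => {
    const openStart = term.startsWith('*'); // initial wildcard: no start anchor
    const openEnd = term.endsWith('*');     // terminal wildcard: no end anchor
    const core = term.slice(openStart ? 1 : 0, openEnd ? -1 : undefined);
    // Everything between the outer asterisks is passed through as regex syntax.
    return (openStart ? '' : '^') + core + (openEnd ? '' : '$');
  });
  return new RegExp(parts.join('|'));
}

const termPattern = termsToRegex(['term*', 'word*', 'token*', '[n\\d-]*gram*']);
console.log(termPattern.source);
console.log(termPattern.test('n-gram'), termPattern.test('terms'), termPattern.test('program'));
```

Because an open end is expressed by simply omitting the anchor, 'term*' matches "term", "terms", and "terminal" alike, while a term without an asterisk would match only the exact token.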

This is an example of a complete implementation of the highlighter code. You could copy this code and save it as an HTML file for a functional highlighter:

    <!DOCTYPE html>
    <html>
    <head><meta charset = 'utf-8'/></head>
    <body>

    <!-- (1) set up HTML elements for input, and optionally a reprocess button -->
    Type into the input element -- words matching those in the dictionary will be colored by category.
    <div id = 'input' contenteditable = 'true'></div>
    <button type = 'button' onclick = 'Adicat.hl.display_text()'>reprocess</button>

    <!-- (2) load in the core and highlight scripts -->
    <script type = 'text/javascript' src = 'https://miserman.github.io/adicat/core.min.js'></script>
    <script type = 'text/javascript' src = 'https://miserman.github.io/adicat/highlight.min.js' async></script>
    <script type = 'text/javascript'>
    window.onload = function(){
      // (2.1) assign a dictionary
      Adicat.patterns.dict = {
        'term':['term*', 'word*', 'token*', '[n\\d-]*gram*'],
        'category':['categor*', 'list*', 'dictionar*', 'topic*']
      }
      // (2.2) assign an input/output element
      Adicat.hl.input = document.getElementById('input')
      // (2.3) add event listeners if you want text to be processed as it's typed
      Adicat.hl.input.addEventListener('keypress', Adicat.hl.spanner)
      Adicat.hl.input.addEventListener('keyup', Adicat.hl.process_span)
    }
    </script>

    <!-- (3) set colors for categories -- a class name for each dictionary category -->
    <style type = 'text/css'>
      #input{padding: 1em; margin: 1em 0; outline: 1px solid}
      .term{color: #b454da}
      .category{color: #60b346}
    </style>

    </body>
    </html>

The result will be a text box that highlights keywords as they're matched.

chat

The chat prototype is an example of how adicat might be used to process and react to text as it's received.

Chat makes use of adicat's cue processor. Unlike the categorizer used by the highlighter, the cue processor takes in full text strings (rather than tokens) and says whether any of a set of expressions were found (rather than counting up each expression).

For example, text = new adicat('Hey there :)').detect() would process the input up to level 1 (tokenization), then check the input for each of the default cue categories. The output goes to text.cues, which is an object of logical entries indicating whether any matches were found. In this example, text.cues.greeting and text.cues.happy would both be true.

Cue dictionaries are stored like the other dictionaries. The only difference is in the boundary between cue terms—terms aimed at tokens are bounded by the beginning and end of the string (^ and $ respectively), whereas terms aimed at full strings are bounded by word boundaries (standardized to single blank spaces). Objects of arrays of terms can be converted to regular expressions with Adicat.toRegex(obj, level), where setting the level argument to true converts to string-targeted patterns.
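The difference between token-targeted and string-targeted patterns can be illustrated with a small cue detector. This is a sketch, not adicat's cue processor: cuesToRegex, standardize, and detectCues are hypothetical names, the greeting terms are made up for the example, and translating asterisks to "any run of non-space characters" is an assumption about how wildcards behave inside a full string.

```javascript
// Hypothetical string-targeted conversion: terms are bounded by single spaces
// instead of ^ and $, and outer asterisks extend to the rest of the word.
function cuesToRegex(terms) {
  const parts = terms.map((term) => {
    const openStart = term.startsWith('*');
    const openEnd = term.endsWith('*');
    const core = term.slice(openStart ? 1 : 0, openEnd ? -1 : undefined);
    return (openStart ? '[^ ]*' : '') + core + (openEnd ? '[^ ]*' : '');
  });
  return new RegExp(' (?:' + parts.join('|') + ') ');
}

// Lowercase the text, reduce every boundary to one space, and pad both ends
// so the first and last words are also surrounded by spaces.
function standardize(text) {
  return ' ' + text.toLowerCase().replace(/[^a-z0-9']+/g, ' ').trim() + ' ';
}

// Report whether each cue category matched anywhere in the text.
function detectCues(text, cues) {
  const padded = standardize(text);
  const found = {};
  for (const [name, pattern] of Object.entries(cues)) {
    found[name] = pattern.test(padded);
  }
  return found;
}

const cues = {
  greeting: cuesToRegex(['hi', 'hey', 'hello']), // made-up terms for illustration
  followup: cuesToRegex(['fine', 'good', 'alright*', 'ok', 'noth*']),
};
const hit = detectCues('Hey there, nothing much', cues);
const miss = detectCues('Think about it', cues);
console.log(hit, miss);
```

The space boundaries are what keep "hi" from matching inside "think", which is the job ^ and $ do when the pattern is applied to a single token.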

This is an example of a complete implementation of the chat code:

    <!DOCTYPE html>
    <html>
    <head><meta charset = 'utf-8'/></head>
    <body>

    <!-- (1) set up HTML elements for an input and chat log -->
    <div id = 'chat_log'></div>
    <input id = 'chat_input' placeholder = 'Message @adicat'/>

    <!-- (2) load in the core and chat scripts -->
    <script type = 'text/javascript' src = 'https://miserman.github.io/adicat/core.min.js'></script>
    <script type = 'text/javascript' src = 'https://miserman.github.io/adicat/chat.min.js' async></script>
    <script type = 'text/javascript'>
    window.onload = function(){
      // (2.0) maybe add an additional cue
      Adicat.patterns.cues.followup = Adicat.toRegex(['fine', 'good', 'alright*', 'ok', 'noth*'], true)
      /* (2.1) set up a set of replies. Arrays in each entry are sampled from:
           - initial for opening message
           - general for when no match is found
           - additional entries named for each cue
      */
      Adicat.chat.replies = {
        initial: ['Hello', 'Hi', 'Hey there!'],
        general: ['Hmm, what to say', 'You talk now', '<conversation starter>'],
        greeting: ['How are you?', "What's up?", "What's on your mind?"],
        followup: ['Cool, cool... ', 'Same here... ', 'Alright then... ']
      }
      // (2.2) assign each chat element
      Adicat.chat.parts.log = document.getElementById('chat_log')
      Adicat.chat.parts.input = document.getElementById('chat_input')
      // (2.3) add an event listener to send on enter
      Adicat.chat.parts.input.addEventListener('keypress', function(k){if(k.which === 13) Adicat.chat.send()})
      // (2.4) send an initial message
      Adicat.chat.receive('', Adicat.chat.replies.initial.sample())
    }
    </script>

    <!-- (3) add styling to differentiate incoming and outgoing messages -->
    <style type = 'text/css'>
      .outgoing{text-align:right}
    </style>

    </body>
    </html>

The result will be a chat interface in its simplest form.
