|
9 | 9 | </head>
|
10 | 10 | <body>
|
11 | 11 | <h1 id="chatscript-advanced-users-manual">ChatScript Advanced User's Manual</h1>
|
12 |
| -<p>Copyright Bruce Wilcox, gowilcox@gmail.com www.brilligunderstanding.com<br> <br>Revision 11/26/2020 cs10.8</p> |
| 12 | +<p>Copyright Bruce Wilcox, gowilcox@gmail.com www.brilligunderstanding.com<br> <br>Revision 1/1/2021 cs11.0</p> |
13 | 13 | <ul>
|
14 | 14 | <li><a href="ChatScript-Advanced-User-Manual.html#review-overview-of-how-cs-works">Review</a></li>
|
15 | 15 | <li><a href="ChatScript-Advanced-User-Manual.html#advanced-tokenization">Advanced Tokenization</a></li>
|
@@ -149,6 +149,8 @@ <h4 id="call-by-reference">Call by reference</h4>
|
149 | 149 | <p>Of course, had you tried to do <code>^argument2 += 1</code> then that would be the illegal <code>1 += 1</code> and the assignment would fail.</p>
|
150 | 150 | <h1 id="advanced-tokenization">ADVANCED TOKENIZATION</h1>
|
151 | 151 | <p>The CS natural language workflow consists of taking the user's input text, splitting it into tokens and stopping each time at a perceived sentence boundary. It continues with the input after processing that "sentence". That leaves two tricky bits: what is a token and what is a sentence boundary. The <code>$cs_token</code> variable gives you some control over how these work. The naive definition of a token is a sequence of letters terminating in a space or end of input. But there are exceptions: sentence punctuation (comma, period, colon, exclamation) is not part of a bigger token. The sentence punctuation notion itself has exceptions, like the period within a floating point number or as part of an abbreviation or web address. And hyphens with more letters on the other side are generally not punctuation either. And normally we consider bracketing things like parens not part of a word (except in emoticons). So CS will normally break things apart as it believes they should be done. If you need to actually allow a token to have embedded punctuation in it, you can list the token in the LIVEDATA/SUBSTITUTES/abbreviations.txt file and the tokenizer will respect it.</p>
|
| 152 | +<h1 id="continuation-lines">Continuation lines</h1> |
| 153 | +<p>A line of file or live user input that ends in ^ has the ^ erased and is joined with the next line read.</p> |
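The joining rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the ChatScript engine's own code: the function name `join_continuations` is hypothetical, and the sketch assumes the ^ is the very last character of the line (how the engine treats trailing whitespace is not stated here).

```python
def join_continuations(lines):
    """Join each line ending in '^' with the line read after it.

    Illustrative sketch only; assumes '^' is the final character of the line.
    """
    joined = []
    parts = []  # pieces of the logical line still being assembled
    for line in lines:
        if line.endswith("^"):
            parts.append(line[:-1])  # erase the caret and keep reading
        else:
            parts.append(line)
            joined.append("".join(parts))
            parts = []
    if parts:
        # Input ended while a continuation was still pending.
        joined.append("".join(parts))
    return joined
```

For example, the two physical lines `u: (hello) ^` and `Hi there.` would come out as the single logical line `u: (hello) Hi there.`.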
152 | 154 | <h1 id="system-functions">System Functions</h1>
|
153 | 155 | <p>There are many system functions to perform specific tasks. These are enumerated in the <a href="ChatScript-System-Functions-Manual.html">ChatScript System Functions Manual</a> and the <a href="ChatScript-Fact-Manual.html">ChatScript Fact Manual</a>.</p>
|
154 | 156 | <h1 id="out-of-band-communication">Out of band Communication</h1>
|
@@ -314,7 +316,9 @@ <h2 id="dict-files">DICT files</h2>
|
314 | 316 | <p>The <code>facts0.txt</code> file contains hierarchy relationships in wordnet. You are unlikely to edit these.</p>
|
315 | 317 | <p>The <code>dict.bin</code> file is a compressed dictionary which is faster to read. If you edit the actual dictionary word files, then erase this file. It will regenerate anew when you run the system again, revised per your changes. The actual dictionary files themselves… you might add a word or alter the type data of a word. The type information is all in <code>dictionarySystem.h</code></p>
|
316 | 318 | <h2 id="livedata-files">LIVEDATA files</h2>
|
317 |
| -<p>The substitutions files consistof pairs of data per line. The first is what to match. Individual words are separated by underscores, and you can request sentence boundaries <code><</code> and <code>></code> .</p> |
| 319 | +<p>These files are dynamically read per language.</p> |
| 320 | +<h3 id="substitutions">SUBSTITUTIONS</h3> |
| 321 | +<p>The SUBSTITUTES folder files consist of pairs of data per line. The first is what to match. Individual words are separated by underscores, and you can request sentence boundaries <code><</code> and <code>></code> .</p> |
318 | 322 | <p>The output can be missing (delete the found phrase) or words separated by plus signs (substitute these words) or a <code>%word</code> which names a system flag to be set (and the input deleted). The output can also be prefixed with <code>![…]</code> where inside the brackets are a list of words separated by spaces that must not follow this immediately. If one does, the match fails. You can also use <code>></code> as a word, to mean that this is NOT at the end of the sentence. The files include:</p>
|
319 | 323 | <table>
|
320 | 324 | <colgroup>
|
@@ -375,6 +379,14 @@ <h2 id="livedata-files">LIVEDATA files</h2>
|
375 | 379 | </tbody>
|
376 | 380 | </table>
|
377 | 381 | <p>Processing done by various of these files can be suppressed by setting <code>$cs_token</code> differently. See Control over Input.</p>
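As a rough illustration of the pair-per-line format described above, the following Python sketch splits one substitution line into the words to match and the action to take. It is a simplification under stated assumptions, not the engine's parser: the function name `parse_substitution` and the sample entries are hypothetical, fields are assumed to be separated by whitespace, and the `![…]` prefix and the `<` / `>` boundary markers are left as plain text rather than interpreted.

```python
def parse_substitution(line):
    """Split a substitution line into (match_words, (action, values)).

    Illustrative sketch only; ignores the ![...] prefix and treats
    '<' and '>' boundary markers as ordinary words.
    """
    fields = line.split()
    if not fields:
        return None  # blank line
    match_words = fields[0].split("_")  # underscores separate the words to match
    output = fields[1] if len(fields) > 1 else ""
    if not output:
        action = ("delete", [])  # missing output: delete the found phrase
    elif output.startswith("%"):
        action = ("set_flag", [output[1:]])  # %word names a system flag
    else:
        action = ("replace", output.split("+"))  # plus signs separate output words
    return match_words, action
```

With made-up entries, `parse_substitution("gonna going+to")` yields `(["gonna"], ("replace", ["going", "to"]))`, and a line holding only `you_know` yields a delete action for the two-word phrase.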
|
| 382 | +<h3 id="dictionary-augmentation-files">Dictionary Augmentation Files</h3> |
| 383 | +<div style="white-space: pre-line;"><code>plurals.txt</code> | is a list of word pairs, singular and plural form |
| 384 | +<code>canonicals.txt</code> | is a list of word pairs, original and canonical form, that override what CS might have decided. |
| 385 | +<code>currencies.txt</code> | map currency words to currency concepts it defines |
| 386 | +<code>months.txt</code> | lines of month names and abbreviations |
| 387 | +<code>numbers.txt</code> | lines of words that have numeric value (see below) |
| 388 | +<code>systemfacts.txt</code> | lines of system concepts, declaring them as concepts</div> |
| 389 | +<p>Numbers.txt entries will list the word, give its value, and define how to interpret its type. REAL_NUMBER is a word that directly represents a number, like two. WORD_NUMBER is a word that implies a number value, like dozen. FRACTION_NUMBER is a word that implies a fraction value, like half.</p> |
378 | 390 | <h1 id="common-script-idioms">Common Script Idioms</h1>
|
379 | 391 | <h2 id="selecting-specific-cases-refine">Selecting Specific Cases <code>^refine</code></h2>
|
380 | 392 | <p>To be efficient in rule processing, I often catch a lot of things in a rule and then refine it.</p>
|
|