Here I present a simple version of the machine text system described in my 2018-06-06 article, named ``Ellision''. Lately, I've been spending too much time learning the toki pona conlang, and I realized this has practical value as a simple means of demonstrating my Ellision system.
Toki pona is surprisingly amenable to binary numbers; follows are those fourteen characters it uses:
A character thus consumes four bits at most, nicely fitting the typical octet storage. The language has one hundred and twenty-four words, which fit well in seven bits, listed in this following table:
To construct an Ellision dictionary, it's merely necessary to concatenate all words into a character table and to give each word an entry in an ordered table; thus every word is referenced by its code, its index into that ordered table. For toki pona, no information beyond each word's length is necessary; the character table can be optimized so:
It's simple to optimize the character table by removing explicit storage for words entirely subsumed by another, most trivially any of the single-letter words. This optimization is a step removed from the rest of this system, so it can be performed at any time. There is still opportunity to optimize further, by overlapping words where the end of one matches the beginning of another; each such combination can be weighed against every other to determine the optimal configuration. This will be done later.
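The construction and the subsumption optimization can be sketched as follows. This is a minimal illustration rather than the actual Ellision implementation: the short word list stands in for the full dictionary, and the entry layout of (offset, length) pairs and the function names are assumptions of mine.

```python
# A handful of toki pona words standing in for the full word list.
WORDS = ["a", "e", "ala", "kala", "pona", "toki", "o", "pi"]

def build_dictionary(words):
    """Return (character_table, entries), where entries[code] is an
    assumed (offset, length) pair into the character table."""
    stored = []
    # Consider longer words first, so shorter words can be found inside them.
    for word in sorted(dict.fromkeys(words), key=len, reverse=True):
        # Skip explicit storage for a word entirely subsumed by another.
        if not any(word in kept for kept in stored):
            stored.append(word)
    table = "".join(stored)
    # Every word occurs somewhere in the table; record where.
    entries = [(table.index(word), len(word)) for word in words]
    return table, entries

def lookup(table, entries, code):
    """Fetch a word by its code with one table access and one slice."""
    offset, length = entries[code]
    return table[offset:offset + length]

table, entries = build_dictionary(WORDS)
```

For these eight words, naive concatenation needs twenty characters, whereas the optimized table is "kalaponatokipie", which is fifteen. The word codes are unaffected by the optimization, since only the offsets in the ordered table change.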
Continuing, the toki pona dictionary can be organized for one hundred and twenty-four valid indices. In an octet-based system, each character will consume a nibble and each word code will leave one bit available for the most important system to layer next, the auxiliary dictionary; the Ellision system mustn't restrict thought, and enabling use of words outside of the language's dictionary is required for this. I mulled over how best to enable such, and settled on this auxiliary dictionary approach.
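To make the nibble storage concrete, here is a small sketch of packing two characters into each octet. The fourteen letters are those of toki pona, but the alphabetical code assignment, the padding value, and the function names are assumptions for illustration, not necessarily those Ellision uses.

```python
# The fourteen letters of toki pona; this ordering (a=0 ... w=13) is assumed.
ALPHABET = "aeijklmnopstuw"
PAD = 0xF  # assumed filler for an odd trailing nibble; 14 and 15 are unused

def pack(text):
    """Pack characters two to an octet, high nibble first."""
    codes = [ALPHABET.index(c) for c in text]
    if len(codes) % 2:
        codes.append(PAD)
    return bytes((hi << 4) | lo for hi, lo in zip(codes[::2], codes[1::2]))

def unpack(data, length):
    """Recover `length` characters from packed octets."""
    codes = []
    for octet in data:
        codes += [octet >> 4, octet & 0xF]
    return "".join(ALPHABET[c] for c in codes[:length])
```

Under this assumed ordering, pack("toki") yields the two octets 0xB8 and 0x42. The length must be kept separately, which the dictionary's ordered table already does.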
The auxiliary dictionary is merely a structure identical to that language dictionary, but containing different words; it needn't have the same number of indices, but uses that same character set. This greatly simplifies this second layer, by ensuring the overlying mechanism need merely check a single bit to decide which dictionary to use for the particular operation, which is then carried out normally. This maintains the important quality of word access being O(1), and so operations such as word count also have this complexity. My other approach of inlining words not in the dictionary was poor, as this disadvantages such words, makes the text more difficult to manipulate, and otherwise complicates it.
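The single-bit selection can be sketched so. I assume here that the high bit of each octet selects the auxiliary dictionary and that each dictionary is a character table paired with (offset, length) entries; the tiny dictionaries, the codes, and the names are hypothetical, and the word sequence is not meant as idiomatic toki pona.

```python
AUX_FLAG = 0x80  # assumed: the high bit selects the auxiliary dictionary

# Each dictionary: (character table, ordered table of (offset, length)).
LANGUAGE = ("tokiponali", [(0, 4), (4, 4), (8, 2)])  # toki, pona, li
AUXILIARY = ("pelisimilitu", [(0, 12)])              # an outside word

def word(code):
    """Resolve one word code: check one bit, then index as usual."""
    table, entries = AUXILIARY if code & AUX_FLAG else LANGUAGE
    offset, length = entries[code & 0x7F]
    return table[offset:offset + length]

def decode(text):
    """Expand a stream of word codes into readable text."""
    return " ".join(word(code) for code in text)

# Four words stored in four octets; the last comes from the auxiliary dictionary.
text = bytes([0x00, 0x02, 0x01, 0x80])
```

Since every word is exactly one octet regardless of dictionary, the word count is simply len(text), and the nth word is text[n].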
At this level, the system can now express basic sentences with outside words; follows is an example:
In typical character systems, this sentence consumes twenty-two octets; Ellision stores it as a mere four octets, with that character storage for pelisimilitu consuming twelve octets in the dictionary.
It makes no difference to Ellision if the toki pona is to be shown as the sitelen pona graphic form.
That next system to layer over the prior two would be the sentence layer, but this isn't appropriate for toki pona, which doesn't strictly have any punctuation. In any case, that sentence layer stores indices into the stream of word codes, indicating where sentences begin and end, not unlike the character table. Importantly, separating the layers in this way preserves O(1) access, and so on, for all underlying layers.
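Though toki pona doesn't need it, the shape of such a sentence layer can be sketched briefly. The layout of one (start, end) pair per sentence, with an exclusive end, is my assumption for this illustration, and the word codes are arbitrary.

```python
# Word codes for two sentences stored back to back (codes are arbitrary).
WORD_CODES = bytes([3, 17, 42, 5, 80, 9, 17])

# Assumed layout: one (start, end) pair per sentence, end exclusive.
SENTENCES = [(0, 3), (3, 7)]

def sentence(n):
    """Return the word codes of the nth sentence with a single slice."""
    start, end = SENTENCES[n]
    return WORD_CODES[start:end]

def sentence_word_count(n):
    """Count the words of a sentence without touching the word stream."""
    start, end = SENTENCES[n]
    return end - start
```

The word stream itself is left untouched by this layer, so indexing words directly remains as cheap as before.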
I will elaborate on my Ellision system and its applications regarding toki pona specifically, later.