Here I present a simple version of ``Elision'', the machine text system described in my 2018-06-06 article. Lately, I've been spending too much time learning the toki pona conlang; I realized this has practical value, as the language serves as a simple means of demonstrating my Elision system.

Toki pona is surprisingly amenable to binary encoding; these are the fourteen characters it uses:

aeijklmnopstuw

A character thus consumes at most four bits, so two characters fit nicely in the typical octet. The language has one hundred and twenty-four words, so a word code fits well in seven bits; the words are listed in the following table:

a akesi ala alasa ale ali anpa ante anu awen e en esun ijo ike ilo insa jaki jan jelo jo kala kalama kama kasi ken kepeken kili kin kiwen ko kon kule kulupu kute la lape laso lawa len lete li lili linja lipu loje lon luka lukin lupa ma mama mani meli mi mije moku moli monsi mu mun musi mute namako nanpa nasa nasin nena ni nimi noka o oko olin ona open pakala pali palisa pan pana pi pilin pimeja pini pipi poka poki pona pu sama seli selo seme sewi sijelo sike sin sina sinpin sitelen sona soweli suli suno supa suwi tan taso tawa telo tenpo toki tomo tu unpa uta utala walo wan waso wawa weka wile
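The two bit widths above can be checked, and the nibble packing sketched, in a few lines of Python. This is only an illustration under assumptions: the names `ALPHABET`, `pack`, and `unpack` are hypothetical, and the choices of numbering the letters in the order shown and of placing the first letter in the high nibble are not specified by Elision.

```python
ALPHABET = "aeijklmnopstuw"  # the fourteen letters, numbered 0..13 in this order (assumed)
CODE_OF = {ch: i for i, ch in enumerate(ALPHABET)}

# Fourteen letters need four bits; 124 words need seven bits.
LETTER_BITS = (len(ALPHABET) - 1).bit_length()
WORD_BITS = (124 - 1).bit_length()


def pack(word: str) -> bytes:
    """Pack one letter per nibble, high nibble first (an assumed layout)."""
    out = bytearray((len(word) + 1) // 2)
    for i, ch in enumerate(word):
        shift = 4 if i % 2 == 0 else 0
        out[i // 2] |= CODE_OF[ch] << shift
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover `length` letters; the length must be stored elsewhere."""
    letters = []
    for i in range(length):
        octet = data[i // 2]
        nibble = (octet >> 4) if i % 2 == 0 else (octet & 0x0F)
        letters.append(ALPHABET[nibble])
    return "".join(letters)
```

For instance, `unpack(pack("toki"), 4)` gives back `"toki"`, with the four letters stored in two octets. The length is passed separately because an odd-length word leaves a zero nibble of padding, which would otherwise read as the letter a.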

To construct an Elision dictionary, it's only required to concatenate all words into a character table and to give each word an entry in an ordered table; thus every word is referenced by its code. For toki pona, an entry needs no information beyond where the word lies in the character table and its length; the character table can be optimized so:

akesi alasa ale ante anu awen esun ijo ilo insa jaki jan kalama kama kasi kepeken kili kiwen kon kule kulupu kute lape laso lawa lete lili linja lipu loje lon luka lukin lupa mama mani meli mije moku moli monsi mun musi mute namako nanpa nasa nasin nena nimi noka oko olin open pakala palisa pana pilin pimeja pini pipi poka poki pona sama seli selo seme sewi sijelo sike sina sinpin sitelen sona soweli suli suno supa suwi tan taso tawa telo tenpo toki tomo tu unpa utala walo wan waso wawa weka wile

It's simple to optimize the character table by removing explicit storage for words entirely subsumed by another, most trivially any of the single-letter words. This optimization is a step removed from the rest of this system, so it can be performed at any time. There is still opportunity to optimize further, by overlapping words where the end of one matches the beginning of another; each such combination can be weighed against every other to determine the optimal configuration. This will be done later.
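The construction and the subsumption step can be sketched as follows. This is a minimal illustration, not the Elision implementation: the function names are hypothetical, and representing an entry as an (offset, length) pair is an assumption drawn from the description above. The further overlapping optimization is not attempted.

```python
WORDS = """
a akesi ala alasa ale ali anpa ante anu awen e en esun ijo ike ilo insa jaki jan
jelo jo kala kalama kama kasi ken kepeken kili kin kiwen ko kon kule kulupu kute
la lape laso lawa len lete li lili linja lipu loje lon luka lukin lupa ma mama
mani meli mi mije moku moli monsi mu mun musi mute namako nanpa nasa nasin nena
ni nimi noka o oko olin ona open pakala pali palisa pan pana pi pilin pimeja
pini pipi poka poki pona pu sama seli selo seme sewi sijelo sike sin sina sinpin
sitelen sona soweli suli suno supa suwi tan taso tawa telo tenpo toki tomo tu
unpa uta utala walo wan waso wawa weka wile
""".split()


def unsubsumed(words):
    """Keep only the words that are not contained within some other word."""
    return [w for w in words if not any(w != o and w in o for o in words)]


def build_dictionary(words):
    """Return the character table and one (offset, length) entry per word."""
    table = "".join(unsubsumed(words))
    # Every word still gets an entry, in code order; a subsumed word
    # simply points inside the storage of whichever text contains it.
    entries = [(table.index(w), len(w)) for w in words]
    return table, entries


def word_at(table, entries, code):
    """Look a word up by its code: one index, one slice."""
    offset, length = entries[code]
    return table[offset:offset + length]


table, entries = build_dictionary(WORDS)
```

Every word keeps its code regardless of whether its characters are stored explicitly, which is why this optimization can be applied at any time without disturbing text already encoded.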

Continuing, the toki pona dictionary can be organized for one hundred and twenty-four valid indices. In an octet-based system, each character will consume a nibble and each word code will occupy seven bits of an octet, leaving one bit available for the most important system to layer next, the auxiliary dictionary. The Elision system mustn't restrict thought, and enabling use of words outside of the language's dictionary is required for this. I mulled over how best to enable such, and settled on this auxiliary dictionary approach.

The auxiliary dictionary is merely a structure identical to the language dictionary, but containing different words; it needn't have the same number of indices, but it uses the same character set. This greatly simplifies this second layer, by ensuring the overlying mechanism need merely check a single bit to decide which dictionary to use for the particular operation, which is then carried out normally. This maintains the important quality of word access being O(1), so operations such as word count have this complexity as well. My other approach of inlining words not in the dictionary was poor, as it disadvantages such words, makes the text more difficult to manipulate, and otherwise complicates the system.
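The single-bit check might look like the following sketch. It assumes the spare bit is the high bit of the octet, which is not stated above, and it uses a three-word excerpt in place of the full language dictionary; the names `make_code`, `split_code`, and `lookup` are hypothetical.

```python
AUX_FLAG = 0x80    # assumed: the high bit selects the auxiliary dictionary
INDEX_MASK = 0x7F  # the remaining seven bits hold the index


def make_code(index: int, aux: bool = False) -> int:
    """Build one word code octet from an index and a dictionary selector."""
    if not 0 <= index <= INDEX_MASK:
        raise ValueError("index does not fit in seven bits")
    return (AUX_FLAG if aux else 0) | index


def split_code(code: int):
    """Return (is_auxiliary, index) for one word code octet."""
    return bool(code & AUX_FLAG), code & INDEX_MASK


def lookup(code: int, main, aux) -> str:
    """Check one bit to choose the dictionary, then index it normally."""
    is_aux, index = split_code(code)
    return (aux if is_aux else main)[index]


# Excerpt only: the first three words of the language dictionary.
main_excerpt = ["a", "akesi", "ala"]
aux_words = ["pelisimilitu"]
```

With these definitions, `lookup(make_code(1), main_excerpt, aux_words)` yields `"akesi"` and `lookup(make_code(0, aux=True), main_excerpt, aux_words)` yields `"pelisimilitu"`; either way the cost is one bit test and one index.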

At this level, the system can now express basic sentences with outside words; follows is an example:

I exude verisimilitude.
mi pana e pelisimilitu

In typical character systems, the toki pona sentence consumes twenty-two octets, at one octet per character with spaces included. Elision stores it as a mere four octets, one per word code, with the character storage for pelisimilitu consuming twelve octets in the auxiliary dictionary.
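The comparison can be reproduced with a short sketch. It assumes word codes are indices into the alphabetical word table given earlier, that the high bit of the octet selects the auxiliary dictionary, and that pelisimilitu is the auxiliary dictionary's only entry; the names `encode` and `decode` are hypothetical.

```python
MAIN = """
a akesi ala alasa ale ali anpa ante anu awen e en esun ijo ike ilo insa jaki jan
jelo jo kala kalama kama kasi ken kepeken kili kin kiwen ko kon kule kulupu kute
la lape laso lawa len lete li lili linja lipu loje lon luka lukin lupa ma mama
mani meli mi mije moku moli monsi mu mun musi mute namako nanpa nasa nasin nena
ni nimi noka o oko olin ona open pakala pali palisa pan pana pi pilin pimeja
pini pipi poka poki pona pu sama seli selo seme sewi sijelo sike sin sina sinpin
sitelen sona soweli suli suno supa suwi tan taso tawa telo tenpo toki tomo tu
unpa uta utala walo wan waso wawa weka wile
""".split()
AUX = ["pelisimilitu"]  # assumed sole auxiliary entry
AUX_FLAG = 0x80         # assumed position of the selector bit

MAIN_CODE = {w: i for i, w in enumerate(MAIN)}
AUX_CODE = {w: i for i, w in enumerate(AUX)}


def encode(sentence: str) -> bytes:
    """Turn space-separated words into one word code octet each."""
    codes = bytearray()
    for word in sentence.split():
        if word in MAIN_CODE:
            codes.append(MAIN_CODE[word])
        else:
            codes.append(AUX_FLAG | AUX_CODE[word])
    return bytes(codes)


def decode(codes: bytes) -> str:
    """Look each code up in the dictionary its selector bit names."""
    return " ".join(AUX[c & 0x7F] if c & AUX_FLAG else MAIN[c] for c in codes)


sentence = "mi pana e pelisimilitu"
codes = encode(sentence)
print(len(sentence.encode("ascii")), len(codes))  # prints: 22 4
```

Decoding the four octets with `decode(codes)` returns the original sentence; the spaces cost nothing, since word boundaries are implicit in the code stream.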

It makes no difference to Elision whether the toki pona is shown in Latin letters or in the sitelen pona graphic form.

The next system to layer over the prior two would be the sentence layer, but this isn't appropriate for toki pona, which doesn't strictly have any punctuation. In any case, the sentence layer stores indices into the stream of word codes, indicating where sentences begin and end, not unlike how dictionary entries refer into the character table. Importantly, separating the layers in this way preserves O(1) access, and so on, for all underlying layers.
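A sentence layer of this kind might be sketched as below. The code values are arbitrary placeholders, and storing each sentence as a (begin, end) pair with the end exclusive is an assumed convention; the function names are hypothetical.

```python
# Hypothetical stream of word codes holding two sentences; values are arbitrary.
codes = bytes([54, 80, 10, 128, 54, 89])

# One (begin, end) pair per sentence, indexing into `codes`, end exclusive.
sentences = [(0, 4), (4, 6)]


def sentence_codes(codes: bytes, sentences, n: int) -> bytes:
    """Fetch the word codes of sentence n; finding its bounds is one index."""
    begin, end = sentences[n]
    return codes[begin:end]


def sentence_word_count(sentences, n: int) -> int:
    """Count the words of sentence n without touching the code stream."""
    begin, end = sentences[n]
    return end - begin


def sentence_count(sentences) -> int:
    """Count the sentences from the size of the sentence table alone."""
    return len(sentences)
```

Because the sentence table only refers into the code stream and never alters it, word access by position in `codes` works exactly as it did before this layer was added.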

I will elaborate on my Elision system, and its applications regarding toki pona specifically, later.