Index  Comments

As this writing is a collection of my ideas that have not been written down before now, this page will be gradually updated and refined whenever I recall more of my ideas concerning the topic and feel the want. Currently, the whole picture may not be quite all there.

When people discuss text, they often mean the written word; this is something very different from what constitutes machine text, which lacks many important properties of real text. The current state of machine text is marked by ASCII; it was designed for sending messages economically and reliably enough; a quarter of its codes is dedicated to controlling terminals, and nearly another quarter to duplicating letters in a second case. It is a good approximation, but it is not text; it is an optimization with a replacement well overdue. Real text can be underlined, italicised, made bold, or otherwise stylized. Real text can contain handwritten illustrations, nonstandard characters, and other custom symbols. Real text doesn't rigidly conform to a grid in all instances; it is not one-dimensional. This also has a large effect on programming languages. The mathematician can truly create notation that suits the context. The programmer cannot.

Firstly, I will attack the mistaken notion that storing words exclusively as character codes is efficient. On modern machines, and as with many optimizations, the character code-by-character code format negatively impacts storage requirements and has morphed from an optimization into a hindrance; it is an insidious optimization. Consider storing English text; it would be far better to store words, rather than individual characters; it would be simple to have a compact dictionary containing the few millions of English words and their variations and to then simply store indices into this dictionary to represent text; if a twenty-four-bit code were used, this would likely be enough to store every English word and, counting the space that ordinarily follows each word, would also be an advantage for all words with a length greater than two, which is the vast majority. It is important to note that all English words are composed of characters from an alphabet and so storing them as such in a dictionary is appropriate here. The implementation of this dictionary could be as follows: a table with an entry for every word is formed; every entry is composed of an index into a table of necessary characters and a length, perhaps with other information; there is then the aforementioned table of contiguous characters. This table of characters could be compressed in many ways; firstly, it wouldn't be necessary to store all punctuation or most other non-alphabetical symbols; secondly, analysis of the English language could be done to decide the most efficient encoding; it could very well be the case that, say, a nibble-based encoding with character-switching modes to encompass the entirety of the language would compare well against other encodings. Regardless, the storage of the characters themselves is irrelevant, as an abstract interface would be provided.
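
As a rough illustration of this layout, here is a minimal sketch in C; the structure names, the flat tables, and the byte order of the code are my own assumptions, and a real system would keep them behind the abstract interface just mentioned.

    #include <stdint.h>
    #include <stddef.h>

    /* One entry per word: where its characters sit in the pool, and how many. */
    struct word_entry {
        uint32_t pool_offset;   /* index into the shared character pool */
        uint8_t  length;        /* number of characters in the word     */
        uint8_t  flags;         /* spare bits: proper noun and whatnot  */
    };

    /* The dictionary: the word table plus its pool of contiguous characters. */
    struct dictionary {
        const struct word_entry *words;
        const char              *pool;
        uint32_t                 word_count;
    };

    /* A text is a sequence of twenty-four-bit word codes, three bytes apiece. */
    static uint32_t read_code(const uint8_t *text, size_t i)
    {
        const uint8_t *p = text + 3 * i;
        return (uint32_t)p[0] | (uint32_t)p[1] << 8 | (uint32_t)p[2] << 16;
    }

    /* Look a code up, yielding a pointer to its characters and its length. */
    static const char *word_chars(const struct dictionary *d, uint32_t code,
                                  uint8_t *length)
    {
        const struct word_entry *e = &d->words[code];
        *length = e->length;
        return d->pool + e->pool_offset;
    }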

Further, it is trivial to organize the storage of the body of the words in a way where the individual character cost of each word tends towards zero. Consider the words ``sea'', ``lion'', ``lions'', ``sealion'', and ``sealions''; those could all be stored simply as ``sealions'' in the character table, with all appropriate words referencing the same storage. Again, this optimization can be performed at any time, as the precise contents of the character table at this level are irrelevant. It is not necessarily advantageous to store case; it may be better to simply use available bits to denote proper nouns and whatnot, instead, and so provide a higher level of specification; alternatively, it may be the case that words are so often used with a single case that there's no reason not to store that information with the word. Another advantage of this mechanism is ease of, say, ordering words: if the table is ordered alphabetically, which would be suggested, then comparing two words becomes a single integer comparison.
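
To make the sharing concrete, the five words above could be five views into a single run of eight characters; reusing the word_entry structure from the sketch above, with offsets and lengths that are only illustrative:

    /* The pool stores "sealions" once; each related word is a slice of it. */
    static const char shared_pool[] = "sealions";

    static const struct word_entry shared_words[] = {
        { 0, 3, 0 },   /* "sea"      */
        { 3, 4, 0 },   /* "lion"     */
        { 3, 5, 0 },   /* "lions"    */
        { 0, 7, 0 },   /* "sealion"  */
        { 0, 8, 0 },   /* "sealions" */
    };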

Further advantages are: spell checking, as it is impossible to specify a word not in the table (a higher-level mechanism for specifying improper words and other languages is suggested and still being mulled over); word searching, as only integers need be searched for, which needs no complex algorithm for optimal efficiency and, due to the constant storage and organization of these integers, could easily be done in parallel; and lastly, such a system could also be used for obfuscatory purposes, if desired. Such a system could be extended with additional information about words, permitting a machine with a very high-level notion of what text is. To reinforce the claims of efficiency when the length of the word is greater than two, consider that most words with a length of three or less are a subset of longer words; this ensures that the storage for each tends towards zero, when optimized; then consider the storage consumed by the index and length, which should be rather small; it becomes obvious that the storage lost to these words is small and should be overshadowed by the savings for longer words many times over. In the absence of a system dictionary, it should be noted that this becomes a compression scheme and can be suited to a specific text, then needing no higher mechanism for words not in the dictionary, as there would be none.
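
Word searching under these assumptions is nothing more than scanning for a matching code, and, with the table ordered alphabetically, ordering two words is the same single comparison; a sketch, continuing with the structures above:

    /* Count occurrences of a word in a text of codes; each test is a single
       integer comparison, and disjoint ranges could be scanned in parallel. */
    static size_t count_word(const uint8_t *text, size_t words_in_text,
                             uint32_t code)
    {
        size_t i, n = 0;
        for (i = 0; i < words_in_text; i++)
            if (read_code(text, i) == code)
                n++;
        return n;
    }

    /* With the table ordered alphabetically, alphabetical order is code order. */
    static int word_order(uint32_t a, uint32_t b)
    {
        return (a > b) - (a < b);
    }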

Secondly and lastly, I will attack the mistaken notion of solidifying a font set. While English words and those of many other languages are specified with a finite and predetermined set of characters, concepts are not best suited to this. Mathematicians can freely define notation that has a unique appearance and follows special ordering and typesetting rules, among other pleasant properties. While it seems that any notation can be reduced to a one-dimensional form, that is similar to describing human interaction in terms of atoms; it is possible, but the higher level makes correspondingly higher reasoning possible and easier. Real text permits custom symbols and notation. Machine text affords nothing of the sort, and systems that attempt to allow this are grossly complicated. With a future project of mine, I seek to create a higher language not married to prevalent ideas of machine text; I believe such a system, affording programmers true customization of notation, will be an advancement similar to the transition from static to interactive languages. You can only poorly emulate interactivity in a language not built with that in mind; similarly, you can only poorly emulate higher notation in a language not built with that in mind. Such a language can and should allow all notation to be redefined, if any is present.

For a perhaps more tangible example of the issues caused by this, consider the Chinese and Japanese languages, which struggle with this, as new characters are still created; poor computing should not restrict such nice things. I've so far only discussed how English would work with such a specialized system. I'm not, as of writing, fluent in Chinese or Japanese, but I do know fairly well the fundamentals behind the Hanzi and Kanji systems. A system specialized towards Chinese could be quite nice; there are several different methods for entering Hanzi, including pinyin and handwriting recognition, with both generally boosted by predictive methods; the storage for the Hanzi could use a rule-based system that correctly composes a collection of ordered radicals, with the ability to override this system where needed. The tyranny of the eight-bit byte does hinder Chinese, as sixteen bits is sufficient for everyday usage, but not for all characters, yet twenty-four bits is too many; one solution is having two systems or, instead of having a higher system that composes Hanzi as the Hanzi system composes radicals, combining them into the same layer and so simply having the first hundred thousand or so codes be lone Hanzi, with following codes being compositions into more elaborate concepts. As for Japanese, there could be a similar system that also adds its phonetic alphabets, but there are several ways this could be done and it would likely resemble a combination of the English and Chinese approaches, with special considerations for Japanese; an interesting idea for Japanese text is efficient Furigana generation, as there could be an integer associated with each use, where applicable, that denotes which reading is proper.
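
Purely as a sketch of the layering described here, and with no claim that these fields or sizes are the right ones, entries along the following lines could serve; the reading index for Japanese is likewise only an assumed representation.

    /* A Hanzi entry: an ordered run of radical codes composed by rule, with an
       escape hatch for the characters the rules would render incorrectly. */
    struct hanzi_entry {
        uint16_t radicals[8];   /* ordered radical codes                     */
        uint8_t  radical_count;
        uint8_t  override_form; /* nonzero: use a stored form, not the rules */
    };

    /* A Japanese use of a word: its code plus which of its readings applies,
       so that Furigana can be generated without guessing. */
    struct japanese_use {
        uint32_t word_code;     /* twenty-four-bit code, as with English  */
        uint8_t  reading;       /* index into the word's list of readings */
    };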

An issue with adding to the dictionary is the need to invalidate the previous dictionary. An example is adding a new word, but needing to do so in the middle in order to preserve the ordered qualities and also to keep nonsense of historical circumstance from accumulating. A language such as Chinese may be less harmed by adding new words to the end than English would be. I envision that a new dictionary could be released on an annual basis, if needed. This is enough time to collect a sufficient number of new words that have been noticed and to give proper and prior warning. Converting documents from one dictionary to another is a simple mapping affair. As an example of why this would be necessary, it can be foreseen that initial versions of this dictionary that I release would likely be incomplete. It can also be envisioned that a forum, say, could have a specialized dictionary containing certain common or domain-specific words and that these would need to be updated periodically. In any case, this is one important reason why most any implementation of this system should permit circumventing the dictionary, with yet another reason being related to freedom of expression.
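
Converting a document from one dictionary release to the next is then one table lookup per word; a sketch, assuming a mapping table published alongside the new release and reusing read_code from above:

    /* Rewrite every code in a document from an old dictionary to a new one. */
    static void convert_document(uint8_t *text, size_t words_in_text,
                                 const uint32_t *old_to_new)
    {
        size_t i;
        for (i = 0; i < words_in_text; i++) {
            uint32_t code = old_to_new[read_code(text, i)];
            uint8_t *p = text + 3 * i;
            p[0] = (uint8_t)code;
            p[1] = (uint8_t)(code >> 8);
            p[2] = (uint8_t)(code >> 16);
        }
    }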

A particularly omnipresent notation across computer languages is that of decimal numbers. There is no reason better notation can't be defined; such a redefinition would permit, say, superscripting and subscripting, along with allowing notation for any mathematical concept, rather than specializing in addition, subtraction, et al. and leaving other concepts, such as exponentiation and logarithms, to be described only with words or otherwise hindered. I've recently grown fond of the qualities of base thirty and it would be nice to be able to use it, rather than have every base be secondary to decimal, using an alphabet for the digits decimal doesn't cover. Metaprogramming permits most any concept to feel as if it were already present; metanotating can do the same.
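
Under today's machine text, base thirty must still borrow letters for its higher digits, which is exactly the compromise complained about above; still, a small sketch shows how little machinery the base itself needs:

    /* Render a value in base thirty, borrowing letters for the digits that
       decimal does not cover; the buffer is assumed large enough. */
    static char *to_base_thirty(uint32_t value, char *buffer, size_t size)
    {
        static const char digits[] = "0123456789abcdefghijklmnopqrst";
        char *p = buffer + size - 1;
        *p = '\0';
        do {
            *--p = digits[value % 30];
            value /= 30;
        } while (value != 0 && p > buffer);
        return p;   /* to_base_thirty(900, buffer, sizeof buffer) yields "100" */
    }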

These descriptions of better systems should make it clear that the notion of specifying a language component-by-component is foolhardy and only locally optimal. In the presence of complex systems with many millions or billions of bytes of memory, amply sufficient to store such a dictionary, this should clearly be done to benefit the whole. It should also be clear that the current state of machine text, as with many things, is a very poor imitation of the real thing that prevails purely due to historical reasons. This is not to write that such a system is entirely unsuitable, as it's clearly not, but the mechanisms described in the preceding paragraphs are clearly better in the general case, on such capable machines, and I'd argue that even on a constrained system a miniature version of such a mechanism would likely outshine the alternatives, especially if memory is constrained and the amount of text is large. You probably have half a dictionary wasting a great deal of space on your machine right now, but it's likely disorganized, unoptimized, incomplete, mostly unused, and largely useless. As for the notion of customizable notation, I believe that speaks for itself. An established language such as APL is proof of the power of notation, but giving the programmer the same freedom Iverson had would result in something even better. For those of you considering that this would result in poor code, look no further than the source code for your current system and judge that first, if you can even read all of it. If you've never considered any of this before, that's merely more proof that this optimization has become insidious.