
As this writing is a collection of my ideas that have not been written down before now, this page will be gradually updated and refined whenever I recall more of my ideas concerning the topic and feel the want to record them. Currently, the whole picture isn't quite there.

When people discuss text, they often mean the written word; this is something very different from what constitutes machine text, which lacks many important properties of real text. The current state of machine text is marked by ASCII, which was designed for sending messages economically and reliably enough, and a quarter of which is dedicated to controlling terminals. It is a good approximation, but it is not text; it is an optimization for which a replacement is well overdue. Real text can be underlined, italicised, bold, or otherwise stylized. Real text can contain handwritten illustrations or nonstandard characters. Real text doesn't rigidly conform to a grid in all instances; it is not one-dimensional. This also has a large effect on programming languages. The mathematician can truly create notation that suits the context; the programmer cannot.

Firstly, I will attack the mistaken notion that storing words exclusively as character codes is efficient. On modern machines, as with many optimizations, the character-code-by-character-code format wastes storage and has morphed from an optimization into a hindrance; it is an insidious optimization. Consider storing English text; it would be far better to store words, rather than individual characters. It would be simple to have a compact dictionary containing the few million English words and their variations, and to then simply store indices into this dictionary to represent text; a twenty-four-bit code would likely be enough to index every English word, and would occupy no more space than the characters themselves for any word of three or more letters, with a clear saving for longer words, which are the vast majority. It is important to note that all English words are composed of characters from an alphabet, and so storing them as such is appropriate here. The implementation of this dictionary could be as follows: a table with an entry for every word is formed; every entry is composed of an index into a table of the necessary characters and a length; there is then the aforementioned table of contiguous characters. This table of characters could be compressed in many ways: firstly, it wouldn't be necessary to store punctuation or most other non-alphabetical symbols; secondly, an analysis of the English language could be done to decide the most efficient encoding; it could very well be the case that a nibble-based encoding with character-switching modes, encompassing the entirety of the language, would compare well with other encodings. Regardless, the storage of the alphabetical characters themselves is irrelevant, so long as an abstract routine to manipulate the underlying storage format is provided.
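
To make the shape of this concrete, here is a minimal sketch, in Python, of the structure just described; the toy word list, the names, and the three-byte index width are illustrative assumptions, not a real English lexicon:

    # A toy sketch of the word dictionary described above: every word is an
    # (offset, length) entry into one shared table of characters, and text is
    # stored as a sequence of twenty-four-bit indices into the word table.
    WORDS = sorted(["sea", "lion", "lions", "sealion", "sealions", "the"])

    pool = "".join(WORDS)                  # the table of contiguous characters
    entries = []                           # per-word (offset, length) pairs
    offset = 0
    for word in WORDS:
        entries.append((offset, len(word)))
        offset += len(word)

    index_of = {word: i for i, word in enumerate(WORDS)}

    def encode(text):
        """Store a space-separated text as twenty-four-bit word indices."""
        return b"".join(index_of[w].to_bytes(3, "big") for w in text.split())

    def decode(data):
        """Recover the text by following each index into the character table."""
        words = []
        for i in range(0, len(data), 3):
            off, length = entries[int.from_bytes(data[i:i+3], "big")]
            words.append(pool[off:off+length])
        return " ".join(words)

    print(decode(encode("the sea lions")))   # prints: the sea lions

As the paragraph notes, nothing here depends on how the character table itself is encoded; the pool could be repacked or compressed at any time, so long as the abstract routines over it are adjusted to match.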

Further, it is trivial to organize the storage of the bodies of the words in a way where the per-character cost of each word tends towards zero. Consider the words ``sea'', ``lion'', ``lions'', ``sealion'', and ``sealions''; those could all be stored simply as ``sealions'' in the character table, with all of these words referencing the same storage. Again, this optimization can be performed at any time, as the precise contents of the character table at this level are irrelevant. It is not necessarily advantageous to store case; it may be better to use the available bits to denote proper nouns and whatnot instead, and so provide a higher level of specification. Another advantage of this mechanism is the ease of, say, ordering words: if the table is kept alphabetical, which is suggested, comparison becomes a single integer comparison. Further advantages are: spell checking, as it is impossible to specify a word not in the table (a higher-level mechanism for specifying improper words and other languages is suggested); word searching, as only integers need be searched for, which requires no complex algorithm for optimal efficiency and, due to the constant size and organization of these integers, could easily be done in parallel; and lastly, such a system could also be used for obfuscatory purposes, if desired. To reinforce the claims of efficiency, consider the words whose length is three or less: most of these are substrings of longer words, which ensures that their character storage tends towards zero when optimized; then consider the storage consumed by the index and length, which should be rather small; it becomes obvious that the storage lost on these short words is small and should be overshadowed many times over by the savings on longer words.
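
That sharing can also be sketched, again with an illustrative toy word list rather than a real lexicon: building the character table longest-word-first lets every shorter word reference storage that is already present.

    # Sketch of sharing word bodies: longer words are appended first, and any
    # word already present as a substring of the table costs no new characters.
    WORDS = ["sea", "lion", "lions", "sealion", "sealions"]

    pool = ""
    entries = {}
    for word in sorted(WORDS, key=len, reverse=True):
        at = pool.find(word)
        if at == -1:             # not yet present: append it to the table
            at = len(pool)
            pool += word
        entries[word] = (at, len(word))

    print(pool)      # sealions -- eight characters hold all five words
    print(entries)   # every word reduced to an (offset, length) pair

With the word table kept alphabetical, as suggested, ordering and searching then reduce to operations on small integers, regardless of how the character table was packed.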

Secondly and lastly, I will attack the mistaken notion of solidifying a font set. While English words, and those of many other languages, are specified with a finite and predetermined set of characters, concepts are not best suited to this. Mathematicians can freely define notation that has a unique appearance and follows special ordering and typesetting rules, among other pleasant properties. While it seems that any notation can be reduced to a one-dimensional form, that is akin to describing human interaction in terms of atoms; it is possible, but the higher level makes correspondingly higher reasoning possible and easier. Real text permits custom symbols and notation. Machine text affords nothing of the sort, and systems that attempt to allow it are grossly complicated. With a future project of mine, I seek to create a higher language not married to prevalent ideas of machine text; I believe a system that affords programmers true customization of notation will be an advancement similar to the transition from static to interactive languages. You can only poorly emulate interactivity in a language not built with that in mind; similarly, you can only poorly emulate higher notation in a language not built with that in mind. Such a language can and should allow all notation to be redefined, if any is present. A particularly omnipresent notation across computer languages is that of the decimal numbers. There is no reason better notation can't be defined; such a redefinition would permit, say, superscripting and subscripting, along with notation for any mathematical concept, rather than specializing in addition, subtraction, et al. and leaving other concepts, such as exponentiation and logarithms, to be described only with words or otherwise hindered. Metaprogramming permits most any concept to feel as if it were already present; metanotating can do the same.

These descriptions of better systems should make it clear that the notion of specifying a language component-by-component is foolhardy and only locally optimal. On complex systems with many millions or billions of times the memory needed to store such a language table, this should clearly be done to benefit the whole. It should also be clear that the current state of machine text, as with many things, is a very poor imitation of the real thing that prevails purely for historical reasons. This is not to write that such a system is entirely unsuitable, as it clearly isn't, but the mechanisms described in the preceding paragraphs are clearly better in the general case on such capable machines, and I'd argue that even on a constrained system a miniature version of such a mechanism would likely outshine the alternatives, especially if memory is scarce and the amount of text is large. You probably have half of a dictionary wasting a great deal of space on your machine right now, but it's likely disorganized, unoptimized, incomplete, and mostly unused. As for the notion of customizable notation, I believe that speaks for itself. An established language such as APL is proof of the power of notation, but giving the programmer the same freedom Iverson had would result in something even better. For those of you considering that this would result in poor code, look no further than the source code of your current system and judge that first, if you can even read all of it. If you've never considered any of this before, that's merely more proof that this optimization has become insidious.