This article was spurred by ``UTF-8 Everywhere''.

As an aside, that document's domain is one of several I've seen which abuse the DNS in order to translate part of common social media nonsense to the greater Internet; that's to write, it's common to attach a fragment of text to an idea and then proliferate this text. Registering an entire domain name purely to have it point to an HTTP server providing a single document is stupid and wasteful. It crossed my mind to title this ``UTF-8 Nowhere'' but, clearly, reserving the UTF8NOWHERE second-level domain and then having it point to a Gopher hole and website would be stupid and a spectre of social media nonsense with which I want naught to do.

Before reading the remainder of this article, I recommend reading my 2018-06-06 article detailing my ideal machine text representation.

The document being rebutted presents as its purpose the support of the UTF-8 encoding of Unicode and cites performance improvements, a reduction in system complexity, and the prevention of bugs as some benefits; it also advocates UTF-8 be used as the internal representation of text. The first section closes by dismissing the iteration over discrete units of text as unimportant. The purpose of this article is to rebut these ideas and emphasize the deficiencies of Unicode, which are numerous and have me believe it's a complicated and poorly-designed character set which should be discarded.

The background section begins by explaining how Unicode lacked proper foresight, and reading the file linked reveals that its current iteration has utterly failed to meet its expectations. Particularly amusing are the following excerpts:

Rather than struggling to salvage obsolete 8-bit encodings via horrendous `extension' contrivances, we need to recognize that the current absence of a standard international/multilingual encoding is a unique opportunity to rethink and revitalize the design concepts behind text encoding.
Nothing comes for free, and the price of Unicode's fixed-length 16-bit character code design is the twofold expansion of ASCII (or other 8-bit-based) text storage, as seen in the figure on the previous page. This initially repugnant consequence becomes a great deal more attractive once the alternative is considered.
The only alternative to fixed-length encoding is a variable-length scheme using some sort of flags to signal the length and interpretation of subsequent information units. Such schemes require flag-parsing overhead effort to be expended for every basic text operation, such as get next character, get previous character, truncate text, etc. Any number of variable-length encoding schemes are possible (this fact itself being a major drawback); several that have been implemented are described in a later section.

The facts section remarks on the several advantages UTF-8 has over UTF-16, and I don't disagree, yet it's amusing to see massively advantaging English listed as a point in favor of UTF-8. The supposed ideal and universal encoding disadvantages every other language. My ideal machine text system gives every supported language an optimized representation.
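The imbalance can be seen by counting encoded bytes. What follows is merely a small sketch in Python, which is not a language this article otherwise uses; the sample words and the function name bytes_per_character are assumptions chosen for illustration, and the count is per Unicode code point rather than per whatever a reader would call a character.

```python
# Hypothetical sample words; each is a greeting in its language.
samples = {
    "English": "hello",
    "Russian": "привет",
    "Japanese": "こんにちは",
}


def bytes_per_character(text: str, encoding: str) -> float:
    """Return the average number of encoded bytes per code point."""
    return len(text.encode(encoding)) / len(text)


for language, text in samples.items():
    utf8 = bytes_per_character(text, "utf-8")
    # "utf-16-le" is used so no byte-order mark is counted.
    utf16 = bytes_per_character(text, "utf-16-le")
    print(f"{language}: UTF-8 {utf8:.1f}, UTF-16 {utf16:.1f} bytes per code point")
```

With these samples, the English word costs one byte per code point in UTF-8, the Russian two, and the Japanese three, whereas UTF-16 costs two for all of them.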

The fourth section, ``Opaque data argument'', presents the POSIX approach to filenames as somehow ideal through the trivial example of a file-manipulation tool. Firstly, it's important to note that UTF-8 was created by Ken Thompson and Rob Pike; unsurprisingly, this pair designed it to have unnecessary qualities specifically for soothing the C language's delicate sensibilities. A placemat is an appropriate venue for such an encoding to have been designed. The ASCII NUL and ASCII / characters never spontaneously appear within the encoding of any other character in UTF-8, for no other reason than that this would burden C and POSIX, which are accustomed to being catered to and accommodating nothing. This section fails to mention that the file-manipulation tool benefits from conflating characters and integers, which is common with C and POSIX, as almost no POSIX systems demand a filename be proper UTF-8; this damning flaw is inexcusable, as it means there are filenames which can't rightly be accessed by some of the languages which do enforce a real notion of a character. A Common Lisp program names files through a pathname abstraction, the components of which are strings composed of characters. An Ada program can access such malformed filenames, by virtue of Ada supporting several different variations on its Character type, the smallest being Latin-1; since Ada is designed for real work spanning decades, where the solution isn't to demand the world bend around you, Ada supports types of Character, Wide_Character, and also Wide_Wide_Character, while also supporting several different Unicode encodings and types of Strings. It's telling this section is quick to criticize Windows issues, then entirely ignore those of POSIX. Closing on this section, it's as if Ken Thompson thought ``I haven't done enough damage.''
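Both properties can be checked mechanically. The following is a sketch in Python under stated assumptions: the filename raw_name is hypothetical, as are the function names, and the ``surrogateescape'' error handler shown at the end is Python's own workaround for such names, not something the rebutted document describes.

```python
def multibyte_never_contains_ascii() -> bool:
    """Check every non-ASCII code point: no byte of its UTF-8 form is below 0x80.

    This is why 0x00 (NUL) and 0x2F (/) can't appear inside another character.
    """
    for code_point in range(0x80, 0x110000):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # Surrogates can't be encoded in UTF-8 at all.
        if min(chr(code_point).encode("utf-8")) < 0x80:
            return False
    return True


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A hypothetical filename as raw bytes: it contains neither 0x00 nor 0x2F,
# so a typical POSIX system accepts it, yet 0xFF never occurs in valid UTF-8.
raw_name = b"report-\xff.txt"

# Python smuggles the bad byte through a string as a lone surrogate.
as_text = raw_name.decode("utf-8", "surrogateescape")
round_tripped = as_text.encode("utf-8", "surrogateescape")

print(multibyte_never_contains_ascii())
print(is_valid_utf8(raw_name))
print(round_tripped == raw_name)
```

The first check confirms the design accommodation described above; the second shows a filename which is acceptable to the system and yet isn't text under any strict reading of UTF-8. The last lines show the kind of escape hatch a language with a real notion of a character must bolt on to reach such a file.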

The fifth section lists various unnecessary Unicode concepts and sophistry intended to obscure basic concepts of various languages, such as characters, and is expanded upon in section eight.

The sixth section tries to dispel the obvious, great disadvantage of Asian text in UTF-8 by claiming that ASCII text is the most common, by virtue of being used in HTML and other such formats. This is a really good reason to stop using so-called textual formats and instead use numerical formats, which I touch on in my 2019-04-30 article and in that concerning my ideal machine text system. The notion that an inefficient storage format such as UTF-8 can be dealt with through compression is laughable; goes against the historical Unicode document again, in that it recommends storing a large text in a special encoding; is misguided; and merely excuses inefficient formats. The Asian languages aren't the only ones which are disadvantaged, however; UTF-8 disadvantages each and every language that isn't English, by giving English alone the most efficient encoding. My ideal machine text system doesn't suffer this, as I don't agree with the idea that a multi-lingual document should be encoded with a single character set. In my system, every language used is tagged and the encoding thereof can then be optimal. I'm inclined to believe the reason Unicode and UTF-8 are promoted so is the incapability of POSIX to truly support multiple languages; such systems can only support a one true encoding and so the notion that one must be selected and the evil others stomped out arises. A proper system supports multiple ways to store text, including multiple encodings, and then has no such issues; no languages are then made disadvantaged.

The seventh and eighth sections regard operations and so-called myths concerning Unicode and UTF-8. A decent programming language usually represents a string as an array of characters and this brings many advantages, such as more generic handling, orthogonality, etc. Sophistry is used in an attempt to argue that obvious and fundamental operations on strings, several of which are shared with arrays, are actually unnecessary. Those sections should read as insanity to those with good taste.

In my conclusion, Unicode is a very poorly-designed character set, which is fundamentally misguided. I dislike ASCII, in part because of its control character class which behaves differently from every other character, and yet ASCII is tolerable if for no other reason than it is simple. Proper system design takes pain unto itself to eliminate edge cases. A proper text system would localize language edge cases, so that in a language in which the notion of a character makes sense the notion could be used; in a language with the notion of one-to-one upper and lower cases, that could be employed; and in a language where all text follows a certain flow, that flow could be used without issue, as three examples. The Unicode approach seems to be to pour all edge cases into a single container, and then expect that the programmer will handle every single one, but this is unreasonable and doesn't happen in practice, leading to broken systems or those which only accept a subset of Unicode.
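One such edge case is easily shown. The following sketch uses Python's built-in string methods, which follow the Unicode case mappings; the variable names are merely illustrative.

```python
sharp_s = "ß"                 # German sharp s, a single code point
upper = sharp_s.upper()       # Unicode maps it to two letters when uppercased
round_trip = upper.lower()    # Lowercasing doesn't restore the original

print(len(sharp_s), len(upper))
print(round_trip == sharp_s)
```

Uppercasing changes the length of the string and lowercasing the result gives ``ss'' rather than the original, so a program written for a language with one-to-one cases can't rely on that invariant once handed arbitrary Unicode.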

In the example of stream I/O, it seems reasonable to give the terminal, file system, and TCP similar interfaces, yet these have fundamentally different failure cases: reading from a terminal can't fail as with the others, but may wait indefinitely; a file system can attempt a read and yet fail when the file doesn't exist or changes; and a TCP connection can fail at almost any point and offers the fewest methods for correction, in the worst case of a true network failure.

Similarly, Unicode and UTF-8 remove invariants and introduce failure cases. There's real value in using the invariants of a language and for this reason alone Unicode is fundamentally misguided. In UTF-8, the simple act of collecting a character can yield an invalid character, and a text can be split along an improper boundary.
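The failure case of an improper boundary can be sketched in Python as follows; the sample word and the names try_decode and is_boundary are assumptions made for illustration. The boundary test relies on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx.

```python
text = "naïve"                  # Hypothetical sample; the ï takes two bytes
data = text.encode("utf-8")     # Six bytes for five code points


def try_decode(chunk: bytes) -> str:
    """Decode strictly, reporting rather than raising on malformed input."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError as error:
        return f"invalid: {error.reason}"


def is_boundary(buffer: bytes, index: int) -> bool:
    """True if cutting before buffer[index] doesn't split a character."""
    if index == len(buffer):
        return True
    return (buffer[index] & 0xC0) != 0x80  # Not a continuation byte


print(try_decode(data[:2]))     # Cut before the ï: still valid text
print(try_decode(data[:3]))     # Cut inside the ï: no longer valid text
print([i for i in range(len(data) + 1) if is_boundary(data, i)])
```

Every program which truncates, buffers, or otherwise divides UTF-8 must perform some such check or risk producing invalid text, whereas in a fixed-width representation every index is a valid boundary.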

Another damning aspect of Unicode is its more recent undertaking of filling itself with garbage that serves only to complicate and entertain, such as characters representing flags and humans in various acts; one reason Unicode contains so many superfluous graphics is to accommodate those character sets which already featured such, but I believe another reason is to serve as a graphical interface toolkit, purely because the real toolkits for such on modern systems are overly complicated. The lowest part of the system which is reasonable to use then becomes this toolkit.

In closing, I believe my proposed machine text system is better, in that it encourages a far simpler system that is also smaller, lacking in superfluous qualities, and has a mechanism for the rare case of multiple languages in one document in a way that doesn't severely disadvantage most.

As an aside, I find it amusing how the eleventh FAQ answer recommends using the incorrect POSIX line ending rather than the proper carriage return and line feed. I don't consider it beyond possibility that this article is truly naught but sophistry from the cult of C and POSIX, considering UTF-8 also originated from the same place and all positions seem to conveniently align with that view.