As this writing is a collection of my ideas that had not been written down before now, this page will be gradually updated and refined whenever I recall more of my ideas concerning the topic and feel the want. I will likely design more numerical logging formats, and that will likely lead to new ideas I write here.

I prefer numerical data formats, also known as binary formats; these tend to be smaller and easier to manipulate and verify, among other advantages. A disadvantage of these is the need for specialized tooling for creation, manipulation, and display with regard to human use. What pass for textual formats can be thought of as having the advantage of generic tooling, but this falls short, and I believe their proliferation is primarily due to sloth with regard to writing proper tools.

Absent a good excuse, every logging format should accommodate several logging failure strategies. A good logger should preallocate some storage so that it can detect a failure before it happens and warn an operator. I see three primary strategies: a failure to continue logging causes the program to stop serving requests, preventing requests that aren't logged; a failure to continue logging causes the program to use the last available space to log that it is no longer able to properly log, and then continue serving requests; and a failure to log causes the program to begin using older space for logging, creating a modular log.
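
The three strategies can be sketched over a preallocated store with a fixed number of slots. This is only a minimal sketch in Python, assuming records are held in memory; the names Strategy, BoundedLog, and LogFull are hypothetical and belong to no real logger.

```python
from enum import Enum


class Strategy(Enum):
    STOP = 1      # refuse to serve requests that cannot be logged
    FINAL = 2     # spend the last slot noting that logging has ended
    MODULAR = 3   # reuse the oldest space, giving a modular log


class LogFull(Exception):
    """Raised under STOP when no space remains."""


class BoundedLog:
    def __init__(self, capacity, strategy):
        self.slots = [None] * capacity   # preallocated storage
        self.count = 0                   # records written so far
        self.strategy = strategy

    def nearly_full(self):
        # Preallocation lets the failure be seen before it happens,
        # so an operator can be warned.
        return self.count >= len(self.slots) - 1

    def append(self, record):
        """Return True if the record itself was stored."""
        size = len(self.slots)
        if self.strategy is Strategy.MODULAR:
            self.slots[self.count % size] = record
            self.count += 1
            return True
        if self.strategy is Strategy.STOP:
            if self.count == size:
                raise LogFull("stop serving requests")
            self.slots[self.count] = record
            self.count += 1
            return True
        # Strategy.FINAL: the last slot is reserved for the notice.
        if self.count < size - 1:
            self.slots[self.count] = record
            self.count += 1
            return True
        if self.count == size - 1:
            self.slots[self.count] = b"LOGGING ENDED"
            self.count += 1
        return False
```

Under the modular strategy the caller never sees a failure, which is precisely why a format ought to make such a log decodable from any record onward.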

A log for a networked service will generally need to store the IP addresses related to the connections made. It is obvious to store the octets directly, but having both IPv4 and IPv6 presents a complication. A simple solution that provides a fixed length is storing a complete IPv6 address and storing IPv4 addresses as a subset of these; the obvious choice is the range of IPv6 addresses that map to IPv4. Actions of the server itself that are logged could be marked by using an IPv4 or IPv6 address of all zeroes, as both are special, and this is a valuable quality due to it being trivial to test for. A major disadvantage of this is that sixteen octets is one more than the worst case for storing an IPv4 address textually, which is bad for the common case. Another mechanism that could be used for storing IP addresses is four or sixteen octets, as determined by a flag stored elsewhere; this has the disadvantage of a variable-length field, but has the major advantage of consuming far less space in the common case.
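
The fixed-length approach can be sketched with Python's standard ipaddress module. The function names and the SYSTEM constant are hypothetical, chosen only for illustration.

```python
import ipaddress

SYSTEM = bytes(16)  # all zeroes marks an action of the server itself


def pack_address(text):
    """Store any IP address as sixteen octets."""
    address = ipaddress.ip_address(text)
    if address.version == 4:
        # Place it in the IPv4-mapped range of IPv6, ::ffff:0:0/96.
        return bytes(10) + b"\xff\xff" + address.packed
    return address.packed


def unpack_address(octets):
    """Return None for a system action, else the address as text."""
    if octets == SYSTEM:
        return None
    address = ipaddress.IPv6Address(octets)
    mapped = address.ipv4_mapped  # None unless in the mapped range
    return str(mapped if mapped is not None else address)
```

Testing for a system action is a single comparison against sixteen zero octets, which is the quality valued above.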

Time is a particularly troublesome thing to store, relatively, and there are many different methods that can be used to store it. The primary issues of time are its relentless march forward, leading to an unbounded nature, and the accuracy desired, which can exacerbate this; there are two main solutions to these that I see: one can use a sufficiently large unit, or one can have a mechanism for seamlessly using a smaller unit without exhausting it. The former method is so popular that it should need no introduction; the idea is merely to have a measure that will last longer than the service ever will with any likelihood. The latter method can be without end, but does have complications; a system message could set the beginning of an epoch in some way, including simply telling the system to move to the next, providing unbounded measure analogous to moving on an infinite tape; all time is then relative to this epoch. The latter method has some unfortunate disadvantages, including the need to find a system message that sets the epoch before records can be decoded, leading to either periodically restating the epoch, wasting space, or requiring a potentially long period of time to sift through each record looking for such, assuming it can be found; further, this leads to complications involving the modular logging failure recovery strategy presented earlier, as records now depend on previous records in an intimate way.
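
The dependence of records on an earlier epoch message can be seen in a small sketch. The record shapes here, pairs of a kind and a value, are an assumption made purely for illustration and not any real format.

```python
def absolute_times(records):
    """Yield absolute times from ("epoch", t) and ("time", offset) pairs."""
    epoch = None
    for kind, value in records:
        if kind == "epoch":
            epoch = value      # a system message moves the epoch
        elif epoch is None:
            yield None         # undecodable: no epoch has been seen yet
        else:
            yield epoch + value
```

A record met before any epoch message yields nothing useful, which is the fate of the oldest surviving records in a modular log whose epoch message has been overwritten.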

Continuing with time, there's also the question of precision and convention. The unit of the second is generally chosen as the precision for logging, but I'm of the opinion it's often completely unnecessary to have this precision, and a lower precision can lead to great storage savings. Varying by the service, it could be reasonable to have granularities of minutes, hours, or even days. As the precision lowers, it becomes more important for all records within a single unit to be properly ordered relative to each other, but this is already a very important concern for accurate logging. The most popular convention for storing time seems to be storing a count of the unit from an epoch; this approach is the easiest to convert into many different date formats, but is more difficult for any single format than other means; this method is also the easiest to verify, by virtue of having no real invalid states. The other main approach I see is storing the time symbolically; this has the advantage of being trivial to display for human use, but has the corresponding disadvantage of being more intimately tied to that particular way to display the time; this method requires more checking to determine if a date is valid, since there will likely be invalid states in the encoding. A BCD approach to storing dates symbolically is an early thought and can be used, but concern for storage use can quickly have that change to an integer encoding that is more compact. As an aside, a BCD approach to storing dates would need at least seven digits, but octet concerns would have this rounded to eight; four could be used for the year, with the latter four being used in pairs to represent the month and day; alternatively, five digits could be used to represent the year, with the latter three being used to indicate the day within that year; this is a pleasant arrangement.
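
Both eight-digit BCD arrangements fit in four octets, and can be sketched as follows. The function names are hypothetical, and placing the first digit in the high nibble is an assumption.

```python
import datetime


def to_bcd(digits):
    """Pack a string of decimal digits, two per octet."""
    if len(digits) % 2:
        digits = "0" + digits
    return bytes((int(digits[i]) << 4) | int(digits[i + 1])
                 for i in range(0, len(digits), 2))


def bcd_month_day(date):
    # Four digits of year, then two of month and two of day.
    return to_bcd(f"{date.year:04d}{date.month:02d}{date.day:02d}")


def bcd_ordinal(date):
    # Five digits of year, then three of the day within that year.
    day = date.timetuple().tm_yday
    return to_bcd(f"{date.year:05d}{day:03d}")


def valid_bcd(octets):
    # The encoding has invalid states: any nibble above nine.
    # A full check would also reject such dates as month thirteen.
    return all((o >> 4) <= 9 and (o & 15) <= 9 for o in octets)
```

Printed in hexadecimal, the first arrangement of the first of March, 2020 reads 20200301 and the second reads 02020061, which shows how trivially the symbolic form displays.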

I'll now describe thoughts for a numerical format for logging Gopher requests. The format should be simple, consume little storage, and support everything reasonable. The only data it should certainly store are the selectors used; a length-prefixed vector will suffice, with a single octet used for the length. Invalid selectors should also be stored, which this accommodates. It is perhaps valuable to store a flag indicating whether the request completed successfully or not. Gopher isn't a busy protocol, unfortunately, so accuracy of a second is entirely unnecessary; regardless of the precision and convention chosen, the time should consume no more than three octets. The IP address should be stored, but how is an important question. As this format accumulates flags, it may make sense to have an octet solely for flags, as they can no longer be placed elsewhere, and this greatly affects the overall design of the format, as there are far fewer than eight flags needed, permitting flags that may not otherwise be used. An invalid selector may not only reference a resource that doesn't exist, but may also fail to end in a CARRIAGE RETURN followed by a LINE FEED. Two octets per valid request, and valid requests could be figured to be the majority, could be saved by using a flag to indicate that the selector ended properly and then omitting these two octets from the selector vector. This does add complexity to determining the length of a selector, since it is the length indicated if this flag is not set and two more if so, with an extra check to determine that a selector length wouldn't exceed the limit of 255 with this addition, meaning values 254 and 255 become invalid in this case. Further, a flag could be used to determine if an IPv4 or IPv6 address is stored, along with a flag indicating an action of the server, rather than using an address of all zeroes. However, fewer flags could instead be stored in the time data or, more appropriately, the selector length, halving its range each time; sixty-four values, permitting a length up to sixty-three, isn't an inappropriate limit; every selector in my Gopher hole, as of the time of writing, fits in less than half of this limit.
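
The length rule for the variant with a full length octet and a separate flag can be sketched as below. The function names are hypothetical, and how the flag itself is stored is left aside.

```python
CRLF = b"\r\n"


def store_selector(received):
    """Return (ended_properly, length octet followed by stored octets)."""
    if len(received) > 255:
        raise ValueError("selector exceeds 255 octets")
    ended = received.endswith(CRLF)
    stored = received[:-2] if ended else received  # omit the CR LF
    return ended, bytes([len(stored)]) + stored


def received_length(ended, length):
    """Length of the selector as it was received."""
    if ended and length > 253:
        raise ValueError("254 and 255 are invalid with the flag set")
    return length + 2 if ended else length
```

A proper request for /about is then stored in seven octets rather than nine, with the flag carrying the two that were dropped.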

Ultimately, in the pursuit of simplicity, a flag octet would be avoided, as all my thinking would leave half of it unused and that I find poor, and so the final format is as follows: sixteen octets indicate the IPv4 or IPv6 address, with all zeroes indicating a system action; three octets represent the time, likely with fifteen bits representing the year as an integer and nine bits representing the day within that year; an octet has its top two bits determine if the request successfully completed and if the selector ended properly, with the latter six being used for a length of the only variable-length component, the selector itself. The fixed-length header is twenty octets, also being the minimum size, with the maximum size being eighty-three octets.
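
The final format above can be sketched as an encoder and decoder. Several details are assumptions on my part that the description leaves open: the time is taken to be big-endian with the year in the high fifteen bits, the completion flag is taken to be the topmost bit with the proper-ending flag below it, and the names encode, decode, and HEADER are hypothetical.

```python
HEADER = 20  # sixteen of address, three of time, one of flags and length


def encode(address, year, day, completed, selector):
    """Encode one record; address is sixteen octets, all zero for system."""
    ended = selector.endswith(b"\r\n")
    stored = selector[:-2] if ended else selector
    if len(address) != 16 or len(stored) > 63:
        raise ValueError("address or selector does not fit")
    if not (0 <= year < (1 << 15) and 0 <= day < (1 << 9)):
        raise ValueError("time does not fit")
    time = ((year << 9) | day).to_bytes(3, "big")
    last = (completed << 7) | (ended << 6) | len(stored)
    return address + time + bytes([last]) + stored


def decode(record):
    """Return (address, year, day, completed, ended, stored selector)."""
    if len(record) < HEADER:
        raise ValueError("record shorter than the header")
    time = int.from_bytes(record[16:19], "big")
    last = record[19]
    if len(record) != HEADER + (last & 63):
        raise ValueError("length does not match the record")
    return (record[:16], time >> 9, time & 511,
            bool(last & 128), bool(last & 64), record[20:])
```

An empty selector gives the minimum of twenty octets, and a stored selector of sixty-three octets gives the maximum of eighty-three.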

This format is exemplary, in that most numerical logging formats would likely heavily resemble it and this shows the strengths of this approach, as codes, lengths, and other concerns are stored in their most compact general representation. It is clear by description how to process this format programmatically and rather easily.

This then sums my thoughts on numerical logging formats.