The previous packing technique, p4ck, relied on stripping unused or irrelevant bits from each byte: two decimal digits fit in one byte, packed-BCD-style. It was quite efficient (close to entropy) but the result could neither be examined directly in a text editor nor sorted with sort. Given the crazy size of the scanning logs for w32, the most pressing problem to solve is compatibility with sort.
Digging through the manual pages, I found that sort can handle binary data, and I have even verified that it can select the sorting key from characters at arbitrary positions. The line and field delimiters are configurable too (see '--zero-terminated' and '--field-separator=SEPARATOR'). Overall, it seems that sort could handle binary data, but the end-of-line delimiter creates an "in-band signaling" problem: whatever binary encoding is chosen, the delimiter's byte value 0 must be avoided, as previously discussed at the end of log 91. Post-processing: the strategy is coming. The solution seems to be a "MIDI meets UTF-8" coding, where a byte's MSB selects between a normal ASCII character (MSB=0) and binary data disguised as UTF-8 extended codes (MSB=1).
There are two corollary side effects. First, since the sort key can be selected by byte position, there is no need to encode field delimiters, which saves 3 bytes per entry that were previously required to separate the 4 fields. In turn, all entries become fixed-size and must reserve enough space for the largest values. This is particularly worrying for the arc length field, which can grow a LOT at the end of the fusion process. Entropy always wins in the end: every win comes with a loss.
Fixed-size entries are good for parsing speed, though: strtoull() could slow the process down with bloated code from the standard libs. Parsing and output speed could then be much better, while the fusion process itself is not complex (half of the arc fusions are not processed because they go in the backwards direction).
As the name p7k implies, the data are sliced into chunks of 7 bits. The 8th bit is set to 1 to help sort distinguish the line end. I won't even bother encoding in base-254 because the extra entropy gain is negligible compared to the added coding complexity and execution cost. This means we can encode numbers whose sizes are multiples of 7 bits: 7, 14, 21, 28, 35, 42, 49, 56, 63, 70...
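To make the idea concrete, here is a minimal sketch of such an encoder in C. The name p7k_encode and the interface are my own assumptions, not the actual project code: it splits a value into 7-bit chunks, most significant chunk first, and sets the MSB of every payload byte, so no encoded byte can ever collide with 0x00 or the newline delimiter. For a 70-bit field the value would need 128-bit handling; this sketch stops at 64 bits.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of the p7k scheme: emit 'width' bytes, each carrying
 * 7 bits of 'value' (most significant chunk first) with the MSB forced to 1.
 * Every output byte is >= 0x80, so it can never be mistaken for an ASCII
 * delimiter by sort. */
void p7k_encode(uint64_t value, uint8_t *out, size_t width)
{
    for (size_t i = 0; i < width; i++) {
        unsigned shift = 7 * (unsigned)(width - 1 - i);
        out[i] = (uint8_t)(0x80 | ((value >> shift) & 0x7F));
    }
}
```

Because the chunks are emitted most significant first and every byte has the same MSB, a plain byte-wise comparison of two encoded fields orders them like the original numbers, which is exactly what sort needs.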
For w32 the format is then 35, 35, 70, 35 bits (Xorig, Xend, arclen and arccount). 70 bits are necessary because 63 bits are not quite enough to store the full orbit length, which is slightly over 63 bits, sadly! (so the most significant byte will be mostly 0)
This translates to 5, 5, 10 and 5 bytes, plus EOL, for a total of 26 bytes per entry. Multiply that by 2^32 entries and the raw initial scan logs will total 112GB. In the beginning these are easy to compress because 11 bytes (EOL and arc count) will have no entropy. However this file must be sorted in 2 ways, so the size will expand to more than 222GB before the first reduction. That is close to the size of the extra SSD I have dedicated to this project, so I'll have to get a 480GB one then... Unless I find a way to auto-detect the field width, but that would make the fusion program more complex.
Another important detail is that sort expects the first byte of a key to be the most significant, so the bytes must be stored in big-endian order. Because the fields have odd sizes (5 bytes for w32), I will handle each byte individually, without bothering with the bswap intrinsics or network-order macros.
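The matching decoder illustrates why no bswap is needed: each byte is consumed individually, most significant first, and only its low 7 bits are kept. Again, p7k_decode is an assumed name for a sketch, not the actual project code.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical decoder for the big-endian 7-bit layout: each input byte
 * contributes its low 7 bits, most significant byte first. Handling bytes
 * one at a time sidesteps bswap intrinsics and network-order macros. */
uint64_t p7k_decode(const uint8_t *in, size_t width)
{
    uint64_t value = 0;
    for (size_t i = 0; i < width; i++)
        value = (value << 7) | (in[i] & 0x7F);
    return value;
}
```

With fixed-size fields this loop replaces strtoull() entirely: no scanning for delimiters, no overflow checks, just a fixed number of shifts and ORs per field.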
The w26 case could use a smaller entry size, but its overall file (2GB) fits in RAM, so a tighter encoding would not provide significant gains.