New packing: p7k

The previous packing technique p4ck relied on stripping unused or irrelevant bits from each byte, and two decimal digits would fit in one byte, packed-BCD-style. It was quite efficient (close to entropy) but could not be directly examined in a text editor nor sorted with sort. Due to the crazy size of the scanning logs for w32, I must solve the most significant problem, which is the compatibility with sort.

Digging through the manual pages, I found that sort can sort binary data and I have even tested that it can select the sorting key as characters at arbitrary positions. The line and field delimiters can even be configured (see ‘--zero-terminated’ and ‘--field-separator=SEPARATOR’). Overall, it seems that sort could handle binary data but the end-of-line delimiter creates an "in-band signaling" problem. So whatever binary data is encoded, the value 0 should be avoided. As previously discussed at the end of log 91. Post-processing: the strategy is coming, the solution seems to be a "MIDI meets UTF-8" coding, where the byte's MSB selects between a normal ASCII character (MSB=0) and binary data, disguised as UTF-8 extended codes (MSB=1).

There are 2 corollary side effects. First, as we can select bytes for the sort key, there is no need for encoding the key delimiter and we save 3 bytes per entry, which were required previously to separate the 4 fields. In turn, the entries are all fixed-size and must reserve enough space for the largest values. This is particularly worrying for the arc length field which can grow a LOT at the end of the fusion process. Entropy always wins in the end, as one win is also a loss.

Fixed-size entries are good though for parsing speed, where strtoull() could slow down the process with bloated code from the standard libs. Parsing speed and output speed could then be much better, while the fusion process itself is not complex (one half of the arc fusions are not processed because of the backwards directions).

As the name p7k implies, the data are sliced in chunks of 7 bits. The 8th bit is set to 1 to help sort distinguish a line end. I won't even bother encoding in base-254 because the extra entropy gain is negligible, compared to the coding complexity and execution speed. This means that we can encode numbers containing multiples of 7 bits: 7, 14, 21, 28, 35, 42, 49, 56, 63, 70...

For w32 the format is then 35, 35, 70, 35 (Xorig, Xend, arclen and arccount). 70 bits is necessary because 63 is barely enough to store the full orbit length, which is slightly over 63 bits, sadly! (so the MSByte will be mostly 0)

This translates to 5, 5, 10, 5 bytes, plus EOL, for a total of 26 bytes per entry. Multiply this field by 2^32 and the raw initial scan logs will total 112GB. In the beginning, these are easy to compress because 11 bytes (EOL and arc count) will have no entropy. However this file must be sorted in 2 ways and the size will expand to > 222GB before the first reduction. This is close to the size of the extra SSD I have dedicated to this project, I'll have to get a 480GB one then... Unless I find a way to auto-detect the field width, but this would make the fusion program more complex.

Another important detail is that sort requires the first byte of a key to be the most significant, so the bytes must be stored in Big Endian order. Because the fields will have weird sizes (5 bytes for w32) I will manage each byte individually, without bothering with the bswap intrinsics or network order macros.

The w26 case could use a smaller entry size but the overall file size (2GB) fits in RAM and a tighter encoding will not provide significant gains.

Discussions

Ken Yap wrote 12/10/2021 at 03:35

GNU sort that is. Classic sort didn't even afford such options.

Are you sure? yes | no

Yann Guidon / YGDES wrote 12/10/2021 at 03:39

That was implicit :-)

... Still writing the log...

Are you sure? yes | no

Ken Yap wrote 12/10/2021 at 04:01

Actually if you can't tolerate record separators then Unix sort is the wrong tool because it's variable length record oriented. Maybe if your records are fixed length you need an old school punch card style sort. 😀

Are you sure? yes | no

Yann Guidon / YGDES wrote 12/10/2021 at 04:22

@Ken Yap yeah but even at 100 cards per second, how long will it take to even punch the cards for 4 billion records ?...

So as the rest of the log implies, I create a custom fixed size record which can pass as a special case of a variable size record.

Are you sure? yes | no

Ken Yap wrote 12/10/2021 at 04:51

Should be cinch once you get your ECL computer up and running. 😉

Are you sure? yes | no

Yann Guidon / YGDES wrote 12/10/2021 at 15:15

@Ken Yap but it works already ! in my head ;-)

Are you sure? yes | no

Ken Yap wrote 12/10/2021 at 22:18

Same here, everything works in my head. 😉

Are you sure? yes | no

Yann Guidon / YGDES wrote 12/10/2021 at 23:22

@Ken Yap it would be worrying, otherwise.

Are you sure? yes | no

Ken Yap wrote 12/10/2021 at 23:45

At least it works until I check the datasheet. 😉

Are you sure? yes | no

Going SIMD

New packing at work

Discussions

Become a Hackaday.io Member