Packing and unpacking numbers

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

Yann Guidon / YGDES, 12/03/2021 at 17:38

The log "88. Back to the jobs" considered the need and the means to compress the scanning logs, because the files could exceed 50B for w32. I looked at the idea of a base94 encoding but...

I think I have found a better method: packing two decimal digits into a single byte, as a pair of BCD digits. This is better than base94 because the latter would still use one full byte for each separator (space, carriage return, etc.).

The resulting byte stream will still have some potential for compression, with bzip2 for example. The delicate part is how to map the six spare codes of each nibble (0xA to 0xF) to the separators, but ASCII made it rather easy.

$ cat 1234
1.2,3 4
$ ./p4ck 1234 > 1234.p4k
$ od -An -v -w1 -t x1 1234.p4k 
$ ./kc4p 1234.p4k 
1.2,3 4

It seems to work. Mapping is managed with minimal effort, as noted in the source file:

  A simple filter that packs 2 decimal digits into one byte,
   by truncating the MSB.

   * LF=0x0A is directly mapped, equivalent to "*", ":", "J", "Z"...
   * CR=0x0D is directly mapped, equivalent to "]", "-", "=", "M"...
   * Space is 0x20 and collides with "0", remapped to ";"
   * '.' is mapped to '>'
   * ',' is mapped to '<'

  The last char may be mapped to 15/0xF/ShiftIn
    if the input stream's size is odd. The decoder
    will output a single '?' in this case.
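
The mapping described in that comment can be sketched as follows. This is not the code of p4ck.c, only a minimal illustration of the same idea: the names `p4ck_nibble` and `p4ck_buf` are hypothetical, and the nibble order (first character in the high nibble) is an assumption that the comment does not settle.

```c
#include <stddef.h>

/* Map one input character to a 4-bit code: the low nibble of its
   (possibly remapped) ASCII value. Hypothetical helper, not from p4ck.c. */
static unsigned char p4ck_nibble(char c)
{
    switch (c) {
    case ' ': c = ';'; break;  /* 0x20 would collide with '0'; ';' gives 0xB */
    case '.': c = '>'; break;  /* '>' = 0x3E gives 0xE */
    case ',': c = '<'; break;  /* '<' = 0x3C gives 0xC */
    default:  break;           /* digits, LF (0xA) and CR (0xD) map directly */
    }
    return (unsigned char)c & 0x0F;
}

/* Pack n characters from in[] into out[], which must hold (n + 1) / 2 bytes.
   Returns the number of bytes written.
   ASSUMPTION: the first character of each pair goes into the high nibble. */
size_t p4ck_buf(const char *in, size_t n, unsigned char *out)
{
    size_t i, o = 0;

    for (i = 0; i + 1 < n; i += 2)
        out[o++] = (unsigned char)((p4ck_nibble(in[i]) << 4)
                                   | p4ck_nibble(in[i + 1]));

    if (i < n)  /* odd size: pad the last byte with 0xF */
        out[o++] = (unsigned char)((p4ck_nibble(in[i]) << 4) | 0x0F);

    return o;
}
```

Under that nibble-order assumption, the 8 characters of "1.2,3 4" followed by LF would pack into the 4 bytes 1E 2C 3B 4A.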

You can find the source code in p4ck.c and kc4p.c. If you need a different mapping, edit the source code or use tr.
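
For readers who want to change the mapping, the decoding side boils down to a 16-entry lookup table. The sketch below is an illustration, not the content of kc4p.c: `kc4p_table` and `kc4p_buf` are hypothetical names, and it assumes that the first character of each pair sits in the high nibble.

```c
#include <stddef.h>

/* One output character per 4-bit code. Codes 0 to 9 are the digits,
   the others follow the mapping listed in the packer's comment.
   Hypothetical table, edit it to change the mapping. */
static const char kc4p_table[16] = {
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
    '\n',  /* 0xA: LF */
    ' ',   /* 0xB: space (packed as ';') */
    ',',   /* 0xC: comma (packed as '<') */
    '\r',  /* 0xD: CR */
    '.',   /* 0xE: dot (packed as '>') */
    '?'    /* 0xF: padding of an odd-sized stream */
};

/* Unpack n bytes from in[] into out[], which must hold 2 * n characters.
   Returns the number of characters written.
   ASSUMPTION: the high nibble holds the first character of each pair. */
size_t kc4p_buf(const unsigned char *in, size_t n, char *out)
{
    size_t i, o = 0;

    for (i = 0; i < n; i++) {
        out[o++] = kc4p_table[in[i] >> 4];
        out[o++] = kc4p_table[in[i] & 0x0F];
    }
    return o;
}
```

With this table, the two bytes 12 3F decode to "123?", which shows the single '?' emitted for an odd-sized input.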



I had to test it, and the result is quite convincing. I generated a log using w16 that looks like:

25537 27445
60251 8174
52725 16132
56907 1430
16989 55241
45423 47572
48670 58543
8348 15748
18630 18818
14930 9559
5439 8871
34938 2839
20149 58986

Then I applied p4ck, which directly cut the size in half, and compressed both the original and the packed file with gzip and bzip2:

$ l w16.log*
758979  3 déc.  23:41 w16.log
316391  3 déc.  23:41 w16.log.bz2
365463  3 déc.  23:41 w16.log.gz
379490  3 déc.  23:42 w16.log.p4k
312992  3 déc.  23:42 w16.log.p4k.bz2
334504  3 déc.  23:42 w16.log.p4k.gz

The size difference between w16.log.p4k.bz2 and w16.log.bz2 is marginal (about 1%), so bzip2 gets very close to the entropy on its own. gzip works but remains farther from the entropy, as shown by the larger boost it gets from p4ck (about 8%).

The size difference between w16.log.p4k.bz2 and w16.log.p4k is more significant but still small (17%). bzip2 spends a lot of effort to remove the last 1/6th of the size, whereas simply repacking the digits immediately brings us near the entropy.
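
The percentages can be rechecked from the byte counts in the listing. The helper below is only a convenience for that arithmetic; `saving_percent` is a hypothetical name.

```c
/* Relative size reduction, in percent, when a file shrinks
   from `before` bytes to `after` bytes. */
double saving_percent(long before, long after)
{
    return 100.0 * (double)(before - after) / (double)before;
}
```

With the sizes listed above, `saving_percent(316391, 312992)` gives about 1.07 (bzip2 with and without p4ck), `saving_percent(365463, 334504)` gives about 8.47 (gzip with and without p4ck), and `saving_percent(379490, 312992)` gives about 17.5 (the packed file before and after bzip2).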

Conclusion: p4ck is a practical, efficient and lightweight method/filter to compress the arc logs for archival. For transmission of huge archives, where every gigabyte counts, the files could be packed further by tar/bzip2.