The log "88. Back to the jobs" considered the need and the means to compress the scanning logs, because the files could exceed 50B for w32. I looked at the idea of a base94 encoding but...
I think I have found a better method, packing two decimal digits in a single byte, as a pair of BCD digits. This is better than base94 because the latter would still use one full byte for each separator (space, carriage return etc.).
The resulting byte stream will still have some potential for compression, with bzip2 for example. The delicate part is how to map the extra codes but ASCII made it rather easy.
$ cat 1234
0123456789
1.2,3 4
$ ./p4ck 1234 > 1234.p4k
$ od -An -v -w1 -t x1 1234.p4k
 10
 32
 54
 76
 98
 1a
 2e
 3c
 4b
 fa
$ ./kc4p 1234.p4k
0123456789
1.2,3 4
?
It seems to work. Mapping is managed with minimal effort, as noted in the source file:
A simple filter that packs 2 decimal digits into one byte, by truncating the MSB.
notes:
* LF=0x0A is directly mapped, equivalent to "*", ":", "J", "Z"...
* CR=0x0D is directly mapped, equivalent to "]", "-", "=", "M"...
* Space is 0x20 and collides with "0", remapped to ";"
* '.' is mapped to '>'
* ',' is mapped to '<'
The last char may be mapped to 15/0xF/ShiftIn if the input stream's size is odd.
The decoder will output a single '?' in this case.
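The mapping in these notes can be sketched in a few lines of Python. This is not the actual p4ck/kc4p source: the names `pack`, `unpack`, `REMAP` and `DECODE` are made up for illustration, and the nibble order (first character in the low nibble, second in the high nibble) is an assumption inferred from the `od` dump above.

```python
# Hypothetical re-implementation of the packing scheme described in the notes.
# All names here are assumptions, not taken from the real p4ck/kc4p sources.

# Characters whose low nibble would collide are first remapped to a free code.
REMAP = {ord(" "): ord(";"), ord("."): ord(">"), ord(","): ord("<")}

# Decoder table, indexed by nibble value 0x0..0xF:
# 0-9 digits, A=LF, B=space, C=',', D=CR, E='.', F='?' (padding marker).
DECODE = b"0123456789\n ,\r.?"


def pack(data: bytes) -> bytes:
    """Keep only the low nibble of each (remapped) character, two per byte."""
    nibbles = [REMAP.get(c, c) & 0x0F for c in data]
    if len(nibbles) % 2:
        nibbles.append(0x0F)  # odd input size: pad with 0xF
    # Assumed order: first character in the low nibble, second in the high one.
    return bytes(lo | (hi << 4) for lo, hi in zip(nibbles[0::2], nibbles[1::2]))


def unpack(packed: bytes) -> bytes:
    """Expand each byte back into two characters."""
    out = bytearray()
    for b in packed:
        out.append(DECODE[b & 0x0F])
        out.append(DECODE[b >> 4])
    return bytes(out)


sample = b"0123456789\n1.2,3 4\n"
print(pack(sample).hex())    # 10325476981a2e3c4bfa
print(unpack(pack(sample)))  # b'0123456789\n1.2,3 4\n?'
```

The sample input is 19 bytes long, so the last byte gets the 0xF padding and the decoder emits the trailing '?', as in the session above.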
I had to test and it's quite convincing. I generated a log using w16, that looks like:
... 25537 27445 60251 8174 52725 16132 56907 1430 16989 55241 45423 47572 48670 58543 8348 15748 18630 18818 14930 9559 5439 8871 34938 2839 20149 58986 ...
Then applied p4ck, which directly cut the size in half. Then I compressed them with gzip and bzip2:
$ l w16.log*
758979  3 déc. 23:41 w16.log
316391  3 déc. 23:41 w16.log.bz2
365463  3 déc. 23:41 w16.log.gz
379490  3 déc. 23:42 w16.log.p4k
312992  3 déc. 23:42 w16.log.p4k.bz2
334504  3 déc. 23:42 w16.log.p4k.gz
The size difference between w16.log.p4k.bz2 and w16.log.bz2 is marginal (about 1%), so bzip2 gets very close to the entropy right from the beginning. gzip works but remains farther from the entropy: it still gets a boost of about 8% from p4ck.
The size difference between w16.log.p4k.bz2 and w16.log.p4k is more significant but still small (about 17%). bzip2 spends a lot of effort to squeeze out that last 1/6th, when simply repacking the digits immediately brings us near the entropy.
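For reference, the percentages can be recomputed from the file sizes in the listing. A small sketch, where the helper name `saving` is made up for this example and the sizes are copied from the listing:

```python
# File sizes in bytes, copied from the directory listing.
sizes = {
    "w16.log": 758979,
    "w16.log.bz2": 316391,
    "w16.log.gz": 365463,
    "w16.log.p4k": 379490,
    "w16.log.p4k.bz2": 312992,
    "w16.log.p4k.gz": 334504,
}


def saving(before: str, after: str) -> float:
    """Size reduction, in percent, when going from `before` to `after`."""
    return 100.0 * (1.0 - sizes[after] / sizes[before])


print(f"p4ck alone:         {saving('w16.log', 'w16.log.p4k'):.1f}%")          # 50.0%
print(f"bz2 vs p4k.bz2:     {saving('w16.log.bz2', 'w16.log.p4k.bz2'):.1f}%")  # 1.1%
print(f"gz vs p4k.gz:       {saving('w16.log.gz', 'w16.log.p4k.gz'):.1f}%")    # 8.5%
print(f"p4k vs p4k.bz2:     {saving('w16.log.p4k', 'w16.log.p4k.bz2'):.1f}%")  # 17.5%
```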
Conclusion: p4ck is a practical, efficient and lightweight method/filter to compress the arc logs for archival. For transmission of huge archives, where every gigabyte counts, the files could be packed further by tar/bzip2.