Post-processing: the strategy is coming

The last log Post-processing mentions some constraints for the processing required after the parallel scans. I have been thinking more about this lately and there is some progress.

First, the format of the logs: each line now contains 4 decimal values,

The destination/end of the semi-arc
The origin/start of the semi-arc
The length/number of steps of that semi-arc
The count of semi-arcs.

For w4: (slightly edited/formatted for readability)

$ cat w4.log
3   1 18 1
11  2  6 1
7   3  5 1
10  4 14 1
6   5  4 1
14  6  5 1
8   7 10 1
9   8  3 1
13  9 24 1
12 10  4 1
5  11  8 1
2  12 16 1
4  13  7 1
15 14  9 1
0  15  1 1

The 2nd column is just the increasing sequence of integers, starting at 1 because we start with C=0, and X=0 Y=0 will loop back to this "stuck state".

Corollary: the reported length of the orbit (sum of all the semi-arcs) is off-by-one since one semi-arc is missing, this is the special case going from C=1 X=0 Y=0 to C=0 X=1 Y=0.

The last column will be incremented little by little as lines are coalesced. This is is the hard part....

One critical aspect is that coalescence should be easily "streamed" (or "strip-mined") because the logs are way larger than my 16GB of RAM. Operations should be performed on files, streams of linear data. To ease this, I'll use existing POSIX tools where possible (to reduce coding efforts), in particular sort (once again, edited/formatted)

$ sort -g w4.log
 0 15  1 1
 2 12 16 1
 3  1 18 1
 4 13  7 1
 5 11  8 1
 6  5  4 1
 7  3  5 1
 8  7 10 1
 9  8  3 1
10  4 14 1
11  2  6 1
12 10  4 1
13  9 24 1
14  6  5 1
15 14  9 1

(see ? no line at 1 but it's implicit)

But even sort has limits. So the files should have a limited size, 1GB for example for each chunk.

For w32, with a total log size that could reach 100GB, the working dataset should be spread into 256 files (500MB each approx.). It's easy to program : take the 8MSB of the origin index and send to the corresponding file.

256 is also a good value because Linux has a limit on the number of open files (see ulimit -n) that is typically 1024. Working with 256 files per dataset (input and output) gives enough headroom without having to manually tweak the OS/configuration.

Sorting all the lines by their destination is very important because by taking both unsorted files and sorted files together, we can then stream the main operation of fusing consecutive semi-arcs. Having both sorted and unsorted versions will double the required storage capacity but the size of the dataset should shrink quickly through the fusion process... Eventually, an unsorted block could be sorted "in place" through pipes that stream data directly to our program. UNIX has a few tricks available for us but I would like to keep it simple.

The algorithm revolves around sort to do the hard work so we can handle the fusion of semi-arcs with a simple program (at first). The goal is to avoid lookups or seek() through the whole dataset. Here is how it is possible: let's look at the first lines of the unsorted and sorted files.

$ cat w4.log      $ sort -g w4.log
3   1 18 1        0 15  1 1
11  2  6 1        2 12 16 1
7   3  5 1        3  1 18 1

The first line has a little problem because it is missing the tricky transition from C=1 X=0 Y=0 to C=0 X=1 Y=0. We'll keep it as is to mark the "primary orbit" (which is open). So let's look at the 2nd line: X=2 is arrived by coming from X=12. So we can replace those 2 lines with one that reads:

11 12 22 2

X=12 goes to X=11 in 6+16=22 steps and 1+1=2 semi-arcs.

However the fusion means that the unsorted stream will see the entry a second time and it should not be processed again. Our program should contain a scoreboard that marks the value as used so it will be skipped. For w32, it could be held in 512MB (1 bit per value) and leave enough room for the rest. Interestingly : the scoreboard is written out-of-order but read back in-order so the penalty is not enormous.

Fusion can only go forwards because otherwise the stream would have to modify its own history. It's not impossible if the program keeps some of the past values in a buffer before committing them to storage, but that's pretty inconvenient. This should be solved by the sorting phase.

The fusion can only progress forwards but should also alternate with backwards passes, which is possible with sort -r. The dataset size would shrink by about 25% after each fusion pass or 50% after a pair of forward-backwards passes. Local fusions are also possible (and parallel) though I don't expect much gain: if a file contains 1/256 of the dataset, there is only 1 chance in 256 that the origin is also in the same file, so that's much processing for little gain... Overall if one pair of passes shrinks the dataset by half, then 32 passes will be required for w32. The first passes will be the longest, then as soon as the dataset fits in RAM, the RAMdisks will speed up the access and the rest will be faster.

Another issue is that merged entries will be output totally out-of-order by the fusion program. I need a way to tell sort to ignore the first column...

    -k, --key=KEYDEF
          sort via a key; KEYDEF gives location and type

seems to assume that each line has a fixed format, which is not very practical.

OTOH the option

   -z, --zero-terminated
        line delimiter is NUL, not newline

sounds interesting, particularly in light of

    ***  WARNING  *** The locale specified by the environment
    affects sort order.  Set LC_ALL=C to get the  traditional
    sort order that uses native byte values.

So sort could sort binary data, as long as these data don't contain a zero. However, binary doesn't mean "raw" because the bytes must be stored "Big Endian"-style, MSB first. A 32-bit word will require 5 bytes, as the MSB of each byte is set to 1, à la MIDI, and 7 bits remain. Space/separator is 7F and EOL is 00. The Xorig et Xend fields need to be fixed-size but the arc count and length can have variable size. Unlike the previous formats, there is almost no base conversion, though it's not user-readable.

From there, the number of "bins" or "chunks" could be reduced to 128, as this is the range encoded by the first byte of the number (either Xorig or Xend).

Wait...

Example: To sort on the second field, use ‘--key=2,2’ (‘-k 2,2’).

And the doc says that -n should be preferred to -g.

That looks good...

(stay tuned, tbc)

Post-processing

Post-processing: the strategy looks good

Discussions

Become a Hackaday.io Member