I slept on the last log (91). Post-processing: the strategy is taking shape, and I just realised that the join/coalesce/merge/fusion program can work in either direction, so there is only one program to write, yay. Only one command-line option must be added, to select whether fusions follow forward or backward arcs. Otherwise, since the two input streams are synchronised, the fusion program doesn't care whether they are sorted in increasing or decreasing order.
So far, the rest of the post-processing algorithm relies heavily on the features of the stock GNU sort program. It provides all the required options (-n for numeric comparison, -k to select the key field) and a load test just showed that it uses all the available processors without any extra flag or option. The only remaining issue is that it only deals with text data, which more than doubles the required storage space. Later, I'll see if and how I can re-encode the logs with the "MIDI-like" format (with 7 bits of data per byte).
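To make the sort step concrete, here is a minimal sketch of how the sort command line could be built from a script. It assumes one arc per text line with whitespace-separated numeric fields; the file names and the choice of field number are hypothetical and only serve the illustration. The options used (-n, -k, -r, -T, -o) are standard GNU sort options.

```python
import subprocess

def sort_command(src, dst, key_field, descending=False, tmp_dir=None):
    """Build the argv for a numeric GNU sort on a single field.

    key_field is the 1-based column number; which column holds Xorigin
    or Xend is an assumption about the log layout, not a known fact.
    """
    cmd = ["sort", "-n", "-k", f"{key_field},{key_field}"]
    if descending:
        cmd.append("-r")          # decreasing order
    if tmp_dir is not None:
        cmd += ["-T", tmp_dir]    # put sort's temporary files on another drive
    cmd += ["-o", dst, src]
    return cmd

def run_sort(src, dst, key_field, descending=False, tmp_dir=None):
    """Run the sort and raise if it fails."""
    subprocess.run(sort_command(src, dst, key_field, descending, tmp_dir),
                   check=True)
```

For example, `sort_command("arcs.txt", "by_end.txt", 2)` would describe the copy sorted on the second column, if that is where Xend is stored.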
Anyway, the algorithm to process w32 spends maybe 4 or 5 passes on files that exceed the RAM size, each pass halving the size of the dataset. I use an external SSD for development and I could add a second one to double the bandwidth. I should check whether the laptop has enough high-speed ports available.
Each pass contains a pair of merge operations:
- First, sort the data set from the primary SSD in increasing order of Xorigin
- Create a copy that is sorted in increasing order of Xend
- Run the fusion program (to be coded): it "streams" the contents of both files above to create a new stream with half the size on the secondary SSD. The program contains a RAM-resident scoreboard: 1 bit set for every entry that has been processed/removed from the output stream.
- Sort the result again, this time in decreasing order
- Run the fusion again.
- If the output stream still contains arcs, loop again.
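The fusion step is not coded yet, so the following is only a sketch of one possible reading of it, under loud assumptions: an arc is a pair (origin, end), every origin is unique, two arcs fuse when the end of one equals the origin of the other, and the scoreboard is indexed by origin. In-memory lists and `sorted()` stand in for the two sorted files; the name `fuse_pass` is hypothetical. Only the increasing direction is shown.

```python
def fuse_pass(arcs, max_key):
    """One fusion pass over (origin, end) pairs; origins assumed unique.

    Walks the copy sorted by end alongside the copy sorted by origin,
    like a merge-join, and fuses A=(a,b) with B=(b,c) into (a,c).
    """
    by_origin = sorted(arcs)                     # stand-in for the Xorigin-sorted file
    by_end = sorted(arcs, key=lambda a: a[1])    # stand-in for the Xend-sorted file
    scoreboard = bytearray((max_key >> 3) + 1)   # 1 bit per possible origin

    def is_marked(k):
        return scoreboard[k >> 3] & (1 << (k & 7)) != 0

    def mark(k):
        scoreboard[k >> 3] |= 1 << (k & 7)

    out = []
    j = 0
    for left in by_end:
        # Advance the origin-sorted stream until it catches up with left's end.
        while j < len(by_origin) and by_origin[j][0] < left[1]:
            j += 1
        if j == len(by_origin):
            break
        right = by_origin[j]
        if right[0] != left[1] or right[0] == left[0]:
            continue                             # no successor, or a self-loop
        if is_marked(left[0]) or is_marked(right[0]):
            continue                             # one of the arcs is already consumed
        out.append((left[0], right[1]))
        mark(left[0])
        mark(right[0])
    # Arcs that were never consumed go to the output unchanged.
    out.extend(arc for arc in by_origin if not is_marked(arc[0]))
    return out
```

On a chain of three arcs, `fuse_pass([(1, 2), (2, 3), (3, 4)], 4)` fuses the first two and passes the third through; a second pass on the result leaves a single arc. The real program would stream from disk instead of holding lists in RAM, and only the scoreboard would stay resident.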
The development burden is reasonable; I suppose most of the time will be spent sorting the huge log files. Sadly, sort can't work on p4k files directly, so I must develop another file/encoding format... p7ck?
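As an illustration of what "7 bits of data per byte" buys, here is a minimal sketch that packs a 32-bit value into 5 bytes, each below 0x80, most significant group first. The actual p7ck layout is not defined yet, so the function names, the fixed 5-byte width and the byte order are all assumptions made for the example.

```python
def encode_u32_7bit(value):
    """Pack a 32-bit unsigned value into 5 bytes of 7 data bits each."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    # Most significant group first; the first byte only carries 4 bits.
    return bytes((value >> shift) & 0x7F for shift in (28, 21, 14, 7, 0))

def decode_u32_7bit(data):
    """Inverse of encode_u32_7bit: rebuild the value from 5 bytes."""
    if len(data) != 5 or any(b & 0x80 for b in data):
        raise ValueError("expected 5 bytes with the high bit clear")
    value = 0
    for b in data:
        value = (value << 7) | b
    return value
```

A 32-bit number takes up to 10 characters as decimal text but always 5 bytes in this form. Whether such records can still be fed to sort is a separate question that this sketch does not address.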
The post-processing method seems clear enough and I can return to developing the parallel scanners ;-)