The original idea was derived from the standard data organisation with a "main playing field" and an auxiliary dual-purpose stack. See below.
But writing the project description made me realise that it was not really suited to how I would actually use the algo and organise the data to process PEAC data. In short: as a reordering buffer.
So in the ideal case I have a dripping stream of numbers that must be sent to 2 output streams: one ordered by a key with quite a lot of locality (but not perfectly in order, due to wildly varying computation times), the other ordered by a different key that is randomly spread all across the possible range. These are 2 different strategies, though hopefully a single algo will work well enough for both.
The first big departure is to get rid of the traditional data organisation ("you get a whole buffer already filled with data and you shuffle them in place") and simply consider that each buffer has a predetermined size (constrained by available RAM) that is progressively filled by data in random order. So this would be some kind of insertion sort ?
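As a rough sketch of that idea (in Python, all names hypothetical, not the actual implementation), a fixed-capacity buffer could keep itself sorted as each datum drips in:

```python
import bisect

class ReorderBuffer:
    """Fixed-capacity buffer, progressively filled with (key, value)
    pairs arriving in random order, kept sorted by key on insertion."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = []    # parallel list of keys, for bisection
        self.items = []   # (key, value) pairs, sorted by key

    def insert(self, key, value):
        """Insert at the sorted position; returns False when the
        buffer is full and must be flushed to make room."""
        if len(self.items) >= self.capacity:
            return False
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.items.insert(i, (key, value))
        return True
```

The `list.insert` makes this O(n) per datum, which is exactly the cost the rest of this log tries to avoid; it only illustrates the "fill progressively, keep sorted" contract.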
One interesting characteristic of the data generated by PEAC is that all keys are different, but I don't want to rely on this. In a special case, the equality test could be a "canary", a hint that something wrong happened upstream. But this is not the subject here.
So I have a "reorder buffer" for each of the 256 chunks of my data, let's say 16MB per chunk. It gets progressively filled by incoming data in a more-or-less random order.
The first idea is to merge the "playing field" with the stack space, noting that if a looooong run was stacked before being copied to the main field, it would fill one of the two buffers while leaving the other half of the total space unused. So the new hybrid organisation would be "heap-stack"-like, with the incoming/reordering stack at the top of the space, growing down, and the main playing field (the bulk of the data) at the bottom, starting at the first addresses.
Well, my first example of under-use of the total space is not the best, because it would only work for very long downwards/decreasing runs: data are aggregated from the top, and the "top" of the stack (growing down) would reach the top of the highest run. At that point, the whole existing buffer would have to be flushed to make room for the new run... or the run would have to be interrupted. Both have to happen anyway, because the buffer is likely to be smaller than the data it must hold.
At these scales, where the whole data set spans a couple of external SSDs, I'm not much concerned by the fine-tuning of the sorts: I just want to shave anything I can and go straight to the point. And the point is that, now, it looks a lot like... Have you ever watched DEFRAG.EXE running ?
Things like that pop up in my head...
The point is to insert the data as early as possible at the right place, to limit further moves, so how can one determine this supposed place ? A hash table now appears in my mind, directly indexed by the key. The merge idea totally jumps out of the window for now, at least for the low-level reorganisation. But unlike the stack idea, this ensures that there will be only a few small merges for random data, in direct proportion to the size of the buffer. Good bye stack !
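A sketch of that direct-indexing idea, under two stated assumptions of mine: the keys are roughly uniform over a known range (so the key itself interpolates to a "home" slot), and the table is never allowed to fill completely (so the simplistic linear probe terminates):

```python
def home_slot(key, key_range, table_size):
    """Interpolate the key into the table: keys spread over
    [0, key_range) land proportionally, so a datum starts out
    close to its final sorted position."""
    return (key * table_size) // key_range

def place(table, key, key_range):
    """Put the key at or near its home slot, probing linearly
    past occupied slots (table assumed not full)."""
    n = len(table)
    i = home_slot(key, key_range, n)
    while table[i] is not None:
        i = (i + 1) % n   # simplistic: only probes upward
    table[i] = key
    return i
</antml>```

With empty slots kept between entries, most insertions land directly where they belong and nothing else has to move; this is the property that makes the stack redundant.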
The whole range of the keys can then be bucketed to death, with as many sub-hashes as necessary to prevent cascading effects from one bucket to another when a bucket becomes congested. When the time comes to dump them because they are too annoying to optimise (above 50% fill), each bucket gets "coalesced" (all the entries are made consecutive by removing the empty slots, an O(n) operation), then the coalesced block is written to disk, consecutively with the previous sub-hash. If a whole dump can manage runs of at least 1GB each time, it's a pretty effective strategy for PEAC scanning.
The "sort" needs to know how many bits are used by the key, then how many key bits select the sub-hash number, then the remaining key bits that index inside the given hash... I want to keep the sub-hashes, or "buckets", small to ease management and keep the algo simple and fast, because the problem has now shifted to a "collision avoidance and mitigation" strategy. This also helps with input data in a narrow range. The merge itself is now pushed to a later stage, when the data sink opens and reads many files that are locally sorted.
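The bit-slicing of the key is straightforward; a small sketch, assuming (my reading of the above) that the top bits select the bucket and the low bits index inside it:

```python
def split_key(key, key_bits, bucket_bits):
    """Split a key_bits-wide key: the bucket_bits high bits select
    the sub-hash/bucket, the remaining low bits index inside it."""
    intra_bits = key_bits - bucket_bits
    bucket = key >> intra_bits
    intra = key & ((1 << intra_bits) - 1)
    return bucket, intra
```

Because the bucket number comes from the most significant bits, dumping the buckets in numerical order directly yields a globally ordered file.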
Damnit, the more I write, the more the whole original thing falls apart.