Finally getting to the heart of merge sorting

A project log for YAMS Yet Another Merge Sort

There are already so many sorting algorithms... So let's make another!

yann-guidon-ygdesYann Guidon / YGDES 09/04/2022 at 08:050 Comments

The last log has shown the results of the latest development work, which ensures that input data are correctly pre-sorted into ascending runs as they arrive, at the cost of more than doubling the working area (which is not surprising for a merge sort). In return, there is very little data movement, as the reference frame of the data moves with the data (in the dual-ended circular buffer). So now we can finally focus on merging the runs !

The merge sort idea has a huge appeal because it is so conceptually simple yet efficient. Take two streams of data, and only pick the highest of both values to put in the output stream, and update the pointer to the elected stream. That's all. The implementation spends more code handling the pointers than sorting. There is a lot of headroom for enhancement then.

Modern cores handle many "in flight" instructions at a time, and branch prediction becomes a significant performance bottleneck so the criteria that selected the old, historical, classical sorting algorithms don't hold as before when certain operations were not as costly or cheap. So the idea here is to fill the pipelines with useful work, eventually some might be scrapped but the pipeline should not be flushed as much, data-driven branches are to be avoided. Parallelising operations is also a boost, so several independent instruction threads are interleaved in the pipeline to keep it full, on independent data. That's why processing the runs from both ends is such a great idea and almost doubles the throughput with little effort.

Another aspect to consider is that we could merge/fuse several consecutive runs at the same time. I have decided to limit to 4 runs at once because this makes a total of 5 pointers and counters to keep in the register set, which is tight if you have only 16 registers. Keep some room for the stack pointer & frame, PC and a few other housekeeping variables, and your register set is saturated...

If you can do more, you can do less. What is less than a merge operation? A move operation ! And inbetween the 2 and 4 runs version, a 3-runs version is necessary. The whole runsort algorithm contains the 4 versions in decreasing size order because as each run exhausts, it's better to use a smaller version. But all the versions need to remains available for the cases when fewer than 4 runs are available (like sin60 testcase).

The function would look like this:

function merge4(run1, run2, run3, run4, runcount) {
  switch(runcount) {
    case 4:
      // merge 4 runs until one is exhausted
      // move the exhausted run out of the list
      // fold!
    case 3:
      // merge 4 runs until one is exhausted
      // move the exhausted run out of the list
      // fold!
    case 2:
      // merge 2 runs until one is exhausted
      // move the exhausted run out of the list

      // move the last run
      alert("error ! you shouldn't get there");

working with a global array of 4 runs woud probably work best. But wait, there's more because in some cases, the sort can only be work "lowest first" and in other cases, "highest first" (when mixing blocks from both ends of the stack). The case that uses runs from only one side can use the faster version that works with both ends of the runs. So that's 3 versions of the same function, we can start from making a simple one, then the reverse, then both combined : merge_low(), merge_high() and merge_both().

Then 2 big questions arise :

  1. When to merge ? How many runs are to be in store before we decide to merge some ?
  2. Which and where to merge them ? From which end(s) and to where ?

The last question in particular is complicated because for the first one, we could use a heuristic or some arbitrary numbers. But the other choice can't be as easily handwaved away. For the case where 2 runs can be merged, there are 4 combinations (assuming there are at least 3 runs) :

  1. merge 2 tail runs to the head
  2. merge 2 head runs to the tail
  3. merge 1 head and 1 tail to the head
  4. merge 1 head and 1 tail to the tail

Which to choose ?