As the need for distributed computation increases, along with the variety of platforms, I have to make sure that all the systems (dispatchers, calculators and tabulators) can communicate efficiently and reliably.
For w32 I expect the whole collection of arcs to require tens of gigabytes of storage (and transmission), so the exchange file format must be both compact and convenient. I do not expect a particular file size, or computation duration, because a range of coordinates can vary wildly depending on the available time and computational power.
Each file describes a collection of contiguous arcs, from a start to an end index (inclusive), so each arc's origin is implicit (there is no need to include it in the file). The start and end are given in 0-prefix-padded hexadecimal in the file name, separated by a hyphen, for convenience. For security and traceability (when the range is computed by a third party), a non-guessable key/ID can be added. And the file name extension is the width. Like, for w32:
The format is expected to be good up to w32, so the file will contain "records"/structs with one uint32_t for the end coordinate of each arc. No need to overthink this because the permutation of the arcs is really unpredictable.
OTOH, as shown in the log 44. Test jobs, the arc lengths have a characteristic distribution, where 1/3 of the arcs are longer than the total number of arcs, so the lengths will not fit in a uint32_t slot. There is no chance of filling a uint64_t slot either, because that is the total length of all the arcs combined. As for 48 bits: there is no theory yet that can predict the maximum arc length for a given width. A primitive variable-length code would work nicely (1 continuation bit, 7 mantissa bits per byte) but the records would not be easy to tabulate unless there is an encoder and decoder program. Or maybe we could simply select the size "per block". Fine...
So I have to write the translators, most probably as a C library. The original idea used a fixed-width structure supplemented with bzip2 compression, but some binary format is required anyway.
Let's not forget to add a checksum to the block as well ;-)
Anyway the above format is for transmission and storage. During computation, the block is held in RAM with the fully expanded struct.
There could be several levels of dispatchers, with the root-level one
- handling all the storage
- talking with the contributing machines
- allocating the ranges according to the relative speeds of the contributors
and the leaf-level dispatchers that handle the jobs directly, depending on the target hardware (POSIX multi-core or CUDA or FPGA or ...).
The leaf-level dispatcher(s) will receive a job ID with a range, sized according to the available memory, computational performance, expected run time... And because each job has an unknown/random run time, the results will be gathered more or less "out of order". When a given number of consecutive arcs is completed (let's say 16K? or 1 hour? whichever comes first?), these results are compacted and written to a file and/or sent over the network. So it works with a "sliding window" of results.
Now the communication between the leaf-level dispatcher and the individual jobs must be defined as well. Of course we still assume a constant width. This has already been covered in the logs 43. An individual job and all the necessary processing and 45. A job in VHDL.
- Input : Xinit
- Output : Xout, N, and probably also Xinit so the dispatcher knows where to file the result.
The out-of-order results make the leaf-level dispatcher less trivial, though it is nothing to fear for a POSIX version. It creates other kinds of issues in HW, because data (Xinit) might be duplicated and waste resources...