w32 in CUDA

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

Yann Guidon / YGDESYann Guidon / YGDES 01/28/2022 at 00:380 Comments

Le /me is taking an envelope and scribbling things on it while watching YT tutos about CUDA...

Computing w32 in 1 month means that 2³² semi-arcs must be calculated in about 2600K seconds. Each semi-arc is in average 2³¹ iterations. That's approx. 800 semi-arcs per second, or approx. 3.5 tera iterations per second.

Considering that 1 CUDA core can only compute 1 ALU32 value per clock cycle, the RTX3070 is limited to 5888×1G7=10T ALU ops / second. So to fit the whole run into one month, the inner loop of the scanner should contain only 3 instructions... Which is not realistic.

NB: The 2080 is almost as fast as the 3080 in INT32 throughput. The 30x series doubled the number of FP32 units but not INT32 units... and the clock is roughly as fast. That's good to know to increase the price-performance of the compute farm...

I know, the whole run is not strictly required to take only one month, the point is that it's 10K× better than 1K years. And I will enlist the 2nd computer to add 60% capacity, of course. But the inner loop must be as short as possible. For example I will unroll by 2 to replace the moves/assignations with renaming.

I am also not very familiar with the behaviour of the "warps" and their scheduling, but having programmed a LOT on the vintage Pentium processors, it shouldn't be too hard: each instruction takes one time slot... or so.

Unfortunately, I can't cheat and use denormals to emulate integers with FP32 because the mantissa is not large enough, and the FP64 is not well supported on GA104.

The current C code uses int64 for the computations but that may not be practical, or efficient (worse) on CUDA. w32 is on the edge of this. The support of carry with CUDA is unknown and the addition is in fact 2 additions, because of the carry, so there is a problem with 2 carries... The carry register must be 64 bits while X & Y are 32-bit wide.

The addition itself would take 4 ALU cycles if performed in 32-bit chunks. Then the 64-bit iteration counter takes another pair of ALU cyles (because the length will exceed 32 bits often). Then there are 2×32-bit comparisons for a total of 8 cycles, which translates to 2 months of computation at least.

I wanted to save the individual iteration update by running it in parallel as another scanner "lane", as a "global" counter, which would be subtracted from an individual counter at the end (or something like that). If we have a warp of 32 parallel scans, the 32nd thread would simply count the iterations, though I wonder if there would be thread synchronisation issues. Or maybe I can count the iterations in denormal format...

I don't find any information about the instruction set of the Ampere microarchitecture. However I think I have a new perspective on how to organise the computations and test it on the x86.

The previous idea was to run 4, 8 or whatever simultaneous scans in parallel/SIMD, then stop the loop if/when any of the scans would reach the condition for end of loop. This implies quite some work to detect the 2 conditions, gather a single-bit flag and control the loop... And we're short on CPU cycles.

The new idea takes the problem transversally. It takes a warp and computes 10K or 100K iterations at most. This reduces the instruction counter's size/cost but also there is no data-dependent condition to end the loop. Inside the loop, the individual conditions are tested and branchless operations will set the appropriate flags.

At the end of the batch of iterations, the program will examine all the results and flags and act accordingly: either continuing the individual lane, or register the result and reinitialise it with new data from a queue.

The goal is to maximise ALU occupation.

I think that 64Ki is good, as it is √2³² and fits nicely in 16 bits or a denormal. The higher bits of the counter is updated every 64Ki iterations.

Let's see how it would work, starting with the variables:

  uint64_t arclen=0, C=0;
    stop = 0,
    X = Xorigin,
    Y = 0;

The newcomer is stop, which is both a flag and a value. The inner loop would look like:

Beware: untested code!

for (uint32_t i=0; i<65536; i+=2) {  // 1 ALU
  if (stop == 0) {
    // Even 
    C += X + Y;   // 4 ALU
    Y = C & MASK;  // Not AND required in 32 bits
    C = C >> WIDTH;  // could be a simple internal reallocation too
    if ((X==0) || (X==MASK))  // 2 ALU
      stop = i+1;      // rarely taken
    else {

      // Odd
      C += X + Y;    // 4 ALU
      X = C & MASK;
      C = C >> WIDTH;
      if ((Y==MASK) || (Y==0))  // 2 ALU
        stop = i+2;   // rarely taken

The mask and shift operations are one way to reorganise the int32 registers so they could be optimised out if properly managed. Using a casting union could work, possibly.

The carry to the 33rd bit of C occurs only for w32 and larger. Up to w31, it is possible to use only 32 bits for C, and fit everything in a tidy SIMD register. However there is no way to disable a whole SIMD lane...