Speeding up by 20%, profitability almost reached

A project log for Bitcoin mining on a C64

Get rich quick using 30-year-old hardware

Maciej Witkowiak 04/13/2021 at 15:36

Computing SHA256 is a challenge for the 6502 CPU inside the C64. The main problem is that this 8-bit CPU has only three 8-bit registers, while 32-bit arithmetic is required. You need at least 4 instructions (one for every byte) for each 32-bit operation; in practice it will be closer to twelve: 4 times (load, modify, store).
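To make that cost concrete, here is a small sketch (my own illustration, not code from this project) of a 32-bit addition decomposed into four 8-bit steps with an explicit carry, the way the 6502 does it with LDA/ADC/STA; the name add32_bytewise is made up:

```c
#include <stdint.h>

/* 32-bit addition as four 8-bit adds with carry, mirroring the 6502's
   ADC instruction. Each byte costs a load, an add-with-carry and a
   store - roughly three instructions, twelve for the full 32 bits. */
uint8_t add32_bytewise(const uint8_t a[4], const uint8_t b[4], uint8_t out[4]) {
    uint8_t carry = 0;
    for (int i = 0; i < 4; i++) {                      /* unrolled on a real 6502 */
        uint16_t sum = (uint16_t)a[i] + b[i] + carry;  /* lda / adc */
        out[i] = (uint8_t)sum;                         /* sta */
        carry = (uint8_t)(sum >> 8);
    }
    return carry; /* the final carry is discarded in SHA256's modular adds */
}
```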

Even though the C64 has only 64KB of RAM, the primary way of speeding up programs is to use more memory: unrolling loops and heavy use of lookup tables. For example, for drawing pixels on the screen you can sacrifice some RAM on precalculated lookup tables holding the memory address of every screen line.
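As an illustration of that idea (again my own sketch, not code from this project): in the C64's standard 320x200 hires bitmap, pixel row y starts at offset (y/8)*320 + (y%8), so a 400-byte table removes the multiplication from every pixel plot:

```c
#include <stdint.h>

#define SCREEN_ROWS 200

/* Offset of each pixel row inside the C64 hires bitmap. The bitmap is
   laid out in 8x8 cells, so row y starts at (y/8)*320 + (y%8).
   Precomputing this trades 400 bytes of RAM for a multiply-free
   lookup per pixel. */
static uint16_t row_offset[SCREEN_ROWS];

void init_row_offsets(void) {
    for (uint16_t y = 0; y < SCREEN_ROWS; y++)
        row_offset[y] = (uint16_t)((y >> 3) * 320 + (y & 7));
}
```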

If you look at the SHA256 pseudocode on Wikipedia, only a few basic operations are needed: NOT, AND, XOR, addition and circular right bit rotation with varying shift counts.
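For instance, the two "small sigma" message-schedule functions from the SHA256 specification (FIPS 180-4) are built from exactly these primitives, with fixed rotation and shift counts:

```c
#include <stdint.h>

/* Circular right rotation; n must be 1..31 here. */
static uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}

/* The SHA256 message-schedule functions: nothing but XOR, a plain
   shift and circular rotations with fixed counts (FIPS 180-4). */
uint32_t sigma0(uint32_t x) { return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3); }
uint32_t sigma1(uint32_t x) { return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10); }
```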

Using lookup tables won't get us anywhere for anything except bit rotation. There is no benefit in using lookup tables for AND or XOR - it would take more time than the built-in instructions.

The rotation case is different because the cost depends on the shift count. The naïve and memory-conserving approach is to use a loop; this is what the CC65 runtime does in its built-in arithmetic shift right.

But we need a circular shift: bits that fall off the right end have to reappear on the left side. So the actual operation is:

uint32_t right_rot_gen(uint32_t value, uint8_t count) {
    return value >> count | value << (32 - count);
}

There are four 32-bit calculations here: two shifts, a subtraction and an OR.

Can we do better? Yes.


Let's dive into the right_rot function from sha2.c.

The first thing we need to do is get into an 8-bit mood and start thinking in terms of bytes.

A 32-bit unsigned integer is just 4 consecutive bytes. The 6502 is little-endian, so the first byte of this array contains the lowest 8 bits.

uint32_t right_rotbuf;
uint8_t *r = (uint8_t*)&right_rotbuf; // could be uint8_t r[4] here
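A minimal sketch of what this aliasing gives you, testable on any little-endian machine, not just the 6502; byte_of is a made-up helper:

```c
#include <stdint.h>

/* View a 32-bit value as four bytes. On a little-endian CPU such as
   the 6502 (or a typical PC, where this can be tested) index 0 holds
   the least significant byte. uint8_t aliasing of any object is
   permitted by the C standard. */
uint8_t byte_of(uint32_t value, int i) {
    uint8_t *r = (uint8_t *)&value;
    return r[i];
}
```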

Now it should be clear that rotating these 32 bits by 8, 16 or 24 bits means just moving those bytes around, without any bit-level calculations. Note that at most one of those conditions can trigger. This runs almost in constant time - one subtraction and five or six byte moves.

if (c>=24) {
    c = c - 24;
    tmp = r[0]; r[0] = r[3]; r[3] = r[2]; r[2] = r[1]; r[1] = tmp;
}
if (c>=16) {
    c = c - 16;
    tmp = r[0]; r[0] = r[2]; r[2] = tmp;
    tmp = r[1]; r[1] = r[3]; r[3] = tmp;
}
if (c>=8) {
    c = c - 8;
    tmp = r[0]; r[0] = r[1]; r[1] = r[2]; r[2] = r[3]; r[3] = tmp;
}
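On a PC you can sanity-check this byte-move stage against the generic rotation. The sketch below (rot_bytes is my name, and a little-endian host is assumed) restates the three ifs as a standalone function:

```c
#include <stdint.h>
#include <string.h>

static uint32_t right_rot_gen(uint32_t value, uint8_t count) {
    return value >> count | value << (32 - count);
}

/* Byte-move stage only: rotate right by a multiple of 8 by rotating
   the four bytes of the little-endian representation. */
uint32_t rot_bytes(uint32_t value, uint8_t c) {
    uint8_t r[4], tmp;
    memcpy(r, &value, 4);                  /* little-endian host assumed */
    if (c >= 24) { c -= 24; tmp = r[0]; r[0] = r[3]; r[3] = r[2]; r[2] = r[1]; r[1] = tmp; }
    if (c >= 16) { c -= 16; tmp = r[0]; r[0] = r[2]; r[2] = tmp;
                            tmp = r[1]; r[1] = r[3]; r[3] = tmp; }
    if (c >= 8)  { c -= 8;  tmp = r[0]; r[0] = r[1]; r[1] = r[2]; r[2] = r[3]; r[3] = tmp; }
    memcpy(&value, r, 4);
    return value;
}
```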

What's left to do is to handle the remaining rotation by 1..7 bits. What happens there? A naive approach still means a loop (up to 7 iterations), each pass shifting all four bytes of the result.

But observe that the shift is at most 7 bits. So each byte of the result can only be affected by the current byte and the one following it.

This is where a lookup table comes into the picture. We need 2 bytes of result (the new low bits of the current byte and the new top bits of the following byte) for each of the 7 possible shift counts and all 256 possible values of the current byte. That is 7*256*2 = 3584 bytes of tables.

These values are precalculated in the generate_rot_tables() function, using the standard C library for simplicity:

for (c=1;c<8;c++) { /* c==0 means no rotation; skipping it also avoids an undefined 32-bit shift in right_rot_gen */
    rot_lobytes[c] = malloc(256);
    rot_hibytes[c] = malloc(256);
    for (i=0;i<256;i++) {
        j = right_rot_gen(i << 8, c);
        rot_hibytes[c][i] = (j & 0xff00) >> 8; /* high byte of j: the value's own bits, shifted down */
        rot_lobytes[c][i] = (j & 0x00ff);      /* low byte of j: bits that spill into the byte below */
    }
}

Going back to the right_rot function, we can use this data directly just by indexing into rot_lobytes and rot_hibytes. For every byte of the result we take the byte's own shifted bits (rot_hi) OR the bits spilling in from the byte to the left (rot_lo).

    if (c>0) {
        rot_lo = rot_lobytes[c];
        rot_hi = rot_hibytes[c];
        tmp = rot_lo[r[0]];
        r[0] = rot_hi[r[0]] | rot_lo[r[1]];
        r[1] = rot_hi[r[1]] | rot_lo[r[2]];
        r[2] = rot_hi[r[2]] | rot_lo[r[3]];
        r[3] = rot_hi[r[3]] | tmp;
    }

If c>0, then this part also runs in constant time, independent of the actual value of c.
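For anyone wanting to verify the whole scheme off the C64, here is a self-contained sketch that puts the byte moves and the lookup tables together and compares the result against right_rot_gen. It follows the post's names, but the malloc-based tables are replaced by static arrays and a little-endian host is assumed:

```c
#include <stdint.h>
#include <string.h>

static uint32_t right_rot_gen(uint32_t value, uint8_t count) {
    if (count == 0) return value; /* avoid an undefined 32-bit shift */
    return value >> count | value << (32 - count);
}

static uint8_t rot_lobytes[8][256], rot_hibytes[8][256];

static void generate_rot_tables(void) {
    for (int c = 1; c < 8; c++)           /* c==0 is never looked up */
        for (int i = 0; i < 256; i++) {
            uint32_t j = right_rot_gen((uint32_t)i << 8, (uint8_t)c);
            rot_hibytes[c][i] = (j & 0xff00) >> 8; /* the byte's own bits */
            rot_lobytes[c][i] = j & 0x00ff;        /* spill into the byte below */
        }
}

static uint32_t right_rot(uint32_t value, uint8_t c) {
    uint8_t r[4], tmp;
    memcpy(r, &value, 4);                 /* little-endian host assumed */
    if (c >= 24) { c -= 24; tmp = r[0]; r[0] = r[3]; r[3] = r[2]; r[2] = r[1]; r[1] = tmp; }
    if (c >= 16) { c -= 16; tmp = r[0]; r[0] = r[2]; r[2] = tmp;
                            tmp = r[1]; r[1] = r[3]; r[3] = tmp; }
    if (c >= 8)  { c -= 8;  tmp = r[0]; r[0] = r[1]; r[1] = r[2]; r[2] = r[3]; r[3] = tmp; }
    if (c > 0) {
        const uint8_t *rot_lo = rot_lobytes[c], *rot_hi = rot_hibytes[c];
        tmp  = rot_lo[r[0]];
        r[0] = rot_hi[r[0]] | rot_lo[r[1]];
        r[1] = rot_hi[r[1]] | rot_lo[r[2]];
        r[2] = rot_hi[r[2]] | rot_lo[r[3]];
        r[3] = rot_hi[r[3]] | tmp;
    }
    memcpy(&value, r, 4);
    return value;
}
```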

Taken together these changes improved the whole round calculation (two hashes) by 20%, going down from 4.6s to 3.6s.

This whole thing can now be rewritten directly in assembly, removing the C overhead.