
Game of Life bit-parallel algorithm

A different kind of computation for Conway's most famous puzzle, borrowing from my computer design experiences

I have used bit-parallel, or "lateral", algorithms to compute Lattice Gas Automata such as the FHP3 model (see my Master's thesis at http://ygdes.com/memoire/).
Computing GoL with this method is probably 100x easier and faster, so why didn't I think of it before, and why haven't I encountered it elsewhere?
It took at most a day to create the boolean equations and code the corresponding operations in a pseudo-assembly language that I could test with JavaScript...

All the dirty details and the preliminary design are described in this log:

https://hackaday.io/project/14628-ambap-a-modest-bitslice-architecture-proposal/log/49599-what-about-a-game-of-life

I have working equations and now I'll try to enhance the code, apply some graph analysis and optimise the register allocation to fit inside the 4-register relay computer (#AMBAP: A Modest Bitslice Architecture Proposal). This code can update 16 cells at once per iteration, getting the most out of this bare-metal computer (literally).

The computation uses 25 boolean operations plus some shifts (which can be factored in various ways). However, that is only one aspect of the program: the data must be managed as well.

For now I only care about running a world that is as wide as one register: that's 32 bits in JavaScript and 16 bits for AMBAP. The length can be arbitrary, though, and I provide a C version with 64 bits.


Logs:

1. The parallel boolean computation
2. Handling data
3. So it works.

GoL_03.OK.c

The code that generates the animation shown in the youtube video

x-csrc - 3.33 kB - 11/29/2016 at 05:00


GoL.js

The computation (enough to show that it should work but not really functional yet)

javascript - 3.44 kB - 11/29/2016 at 04:02


GoL.html

The page that loads GoL.js

HyperText Markup Language (HTML) - 548.00 bytes - 11/29/2016 at 04:02


GoL_02_adjust_range.c

skeleton file to show how data are managed

x-csrc - 2.36 kB - 11/29/2016 at 03:39



  • So it works.

    Yann Guidon / YGDES, 11/29/2016 at 05:07

    I just uploaded another version of the C file, this time with a block copied and pasted from the JS version and adjusted until the warnings went away, and it worked right away.

    For the record, you have to press ENTER to compute each new iteration. Add to this the cost of displaying the data through the xterm, and you understand that this is far from the true speed the algorithm can reach. But I'll optimise the speed for #AMBAP: A Modest Bitslice Architecture Proposal later.

    For now, I rejoice that it worked just like I expected :-)

                           #
                         ####
                ##     #### ##         ##
              #  #   #   ## ###        ##
     ##      #       #   ## ##
     ##      #      #   #####
             #       ###   #
              #  #
                ##
    
                             #
                              ##
                             ##
    
                  ####
                 # ## #
                 ##  ##
                 ##  ##
                 # ## #             # #
                  ####               ##
                                     #

  • Handling data

    Yann Guidon / YGDES, 11/29/2016 at 03:59

    I've just uploaded a C file that works with 64 bits (though it can be anything) and provides a skeleton for the data shuffling that reduces memory usage.

    There is no double buffering: the cells are computed "in place". To prevent time distortions (a result being used as input within the same iteration), the current result is written to the WriteBuffer variable. This variable is only written back to memory after the past value has been read (again).

        L1=world[i-1];          // make sure we read
        world[i-1]=WriteBuffer; // the value of t-1, not the
                                // result computed in the last iteration
        L2=world[i  ];   // I know, a rotating register barrel
        L3=world[i+1];   // would work better but not for AMBAP
    
    To emulate "some sort of expanding computation", I have used the following simple formula:
    WriteBuffer = L1+L2+L3;
    

    This helps me check that the ranges of indices can be dynamically adjusted. For example, with the following initial configuration:

    start: 13
    
                             #
                           # ##
                 ##      ## ## #       ##
                # ###    # #  ##       #  #
     ##        #  # # #   # #  #       ####
     #  #      #   # ### # ####        ####
     ####       ## #### # # #  #       #  #
     ####      ## ### #### #  ##       ##
     #  #       ## #### ### ## #
     ##        #   # ###   # ##
               #  # # #      #
                # ###
                 ##
                   ##
                  # # #
                 #  # ##
                #    # ##
                #  # # # #
                #### ## ##
                #### ## ##
                #  # # # #
                #    # ##
                 #  # ##
                  # # #
                   ##
    max=40
    

    After a few cycles, the range has grown:

    start: 10
    
                             #
                           #  ##
                 ##      ### # # #     ##
                #### #    ####   ##     #  #
     ##        #####  ## ###  ## # #     ####
      #  #       ##### # ##### # ###   ###    #
       ####    # ####   ##   # #    #  #    ###
     ###    #   #   # ## #    ##    #       #  #
     #    ###   #  # # # ##    #    #       #  #
          #  #  # # ### # # # ##    #  #    ###
          #  # # ### ###      ##    #  ###    #
     #    ###   # # ### # ### ## ###     ####
     ###    #   #  # # # ##  ### # #    #  #
       ####     #  # ### # #  #  ##    ##
      #  #     # # ### ## # ## # #
     ##           ###   # ##  ##
               # ##    ####  #
                ##     #####
                  # ##  # ###
                #  ##   #### #
                ###### #      #
                 # # ##### #  #
                 # # ##### #  #
                ###### #      #
                #  ##   #### #
                 ### #  # ###
                  # # # ####
                ### ## # ##
                 #### #  #
                  ###  #
                   ##
    max=43
    It is important to keep an empty line before the first line and after the last one, so new cells can grow there.

    The adjustment of the last line is done with the variable run: it is first cleared to 0, then updated with the number of every line that is not empty. This value is then assigned at the end of the scan:

      MaxLine=run;
    

    MaxLine will be the upper limit for the next time step, but it should preserve the previously mentioned "margin" for the growth of new cells beyond the last line. That's why run is set with a little offset, inside the loop:

        if (WriteBuffer)
          run=i+2;  // push the end of the work range
    

    Dealing with the start index is a bit more tricky, but run provides us with a way to handle it nicely: it remains stuck at 0 while empty lines are found, so we can increase MinLine as long as run is cleared. The complete code is simply:

        if (WriteBuffer)
          run=i+2;  // push the end of the work range
        else
          if (run==0) // no life found yet
            MinLine=i-1; // push the start of the work range

    Overall, it's a bit delicate but not totally arcane, since it's only a 1D array. I keep it simple because a 2D array has some very tricky side effects. The AMBAP is limited to 16-bit-wide data anyway, for computation AND display.

  • The parallel boolean computation

    Yann Guidon / YGDES, 11/28/2016 at 20:36

    I uploaded GoL.js and GoL.html. They don't "work" yet, but they allowed me to test the following functional code:

        L2=world[i];
        L1=world[i-1];
        L3=world[i+1];
    
        // half adder:
        S20 = L1 ^ L3;
        S21 = L1 & L3;
        // full adder:
        S31 = L1 | L3;
        S30 = S20 ^ L2;
        S31 = S31 & L2;
        S31 = S31 | S21;
    
        // Sum LSB
        S0 = (S30>>>1) ^ S20;
        S0 = (S30<< 1) ^ S0;
    
        // LSB Carry
        S0C = (S30<< 1) | S20;
        S20_= (S30<< 1) & S20;
        S0C = (S30>>>1) & S0C;
        S0C = S0C | S20_;
    
        // Sum middle bit
        S1 = S31>>>1 ^ S0C;
        S1 = S31<<1  ^ S1;
        S1 = S21     ^ S1;
    
        // Inhibit: INH=1 where the neighbour count reaches 4 or more
        INH = S31<<1  & S21;
        S1_ = S31<<1  ^ S21;
        C2  = S31>>>1 & S0C;
        INH = INH | C2;
        S2_ = S31>>>1 ^ S0C;
        S2_ = S2_ & S1_;
        INH = INH | S2_;
    
        X2 = S0 | L2;
        X2 = X2 & S1;
        X2 = X2 & ~INH; // The result is now in X2

    This is the contents of the inner computational loop: a series of half-adder and full-adder "blocks" that perform the 2D neighbour counting (a bit-parallel popcount). It will evolve, as will the surrounding code, to keep the number of needed registers as low as possible (4 for AMBAP, plus 2 mapped to memory).



Discussions

Yann Guidon / YGDES wrote 03/21/2017 at 03:03

Oh my, just thinking about this discussion makes me realize there are much more optimisations left :-D


Frank Buss wrote 03/21/2017 at 01:23

How is the performance compared to https://en.wikipedia.org/wiki/Hashlife ?


Yann Guidon / YGDES wrote 03/21/2017 at 02:57

Hi Frank,

it can't be compared because they operate at very different scales. I'm trying to run the bit-parallel version on a very, very constrained computer that couldn't run Hashlife. The absence of a lookup table or other memory structure (for example, those described in Michael Abrash's book) makes it suitable for my purpose, where the only memory I have is the few bytes to be displayed.

On a slightly larger scale, the bit-parallel method can process one time step of tens or hundreds of cells in under 50 CPU cycles: just use the ultrawide SIMD registers of your favourite CPU.

But as far as I understand, Hashlife stores results of previous computations to advance by many steps.

I'd say that they are complementary...

Furthermore, the bit-parallel version should only be applied to zones that actually change: I suspect only 2 to 10% of the considered bits undergo a transition. Some intelligence (or heuristics) must be applied to select which zones to update. And if you only have a few cells, you process all 16/32/64/128/256 bits anyway...

A further improvement of my scheme would be to store temporary results of the computations (the sums) to help select the computation sites. But I have no room for this in the #YGREC16 - YG's 16bits Relay Electric Computer :-D

