
Game of Life bit-parallel algorithm

A different kind of computation for Conway's most famous puzzle, borrowing from my computer design experiences

I have used bit-parallel, or "lateral", algorithms to compute Lattice Gas Automata such as the FHP3 model (see my Master's thesis at http://ygdes.com/memoire/).
Computing GoL with this method is probably 100x easier and faster, so why didn't I think of it before, and why haven't I encountered it elsewhere?
It took at most a day to create the boolean equations and code the corresponding operations in a pseudo-assembly language that I could test with JavaScript...

All the dirty details and the preliminary design are described in this log:

https://hackaday.io/project/14628-ambap-a-modest-bitslice-architecture-proposal/log/49599-what-about-a-game-of-life

I have working equations and now I'll try to enhance the code, apply some graph analysis and optimise the register allocation to fit inside the 4-register relay computer (#AMBAP: A Modest Bitslice Architecture Proposal). This code can update 16 cells at once per iteration, getting the most out of this bare-metal computer (literally).

The computation uses 25 boolean operations plus some shifts (which can be factored in various ways). However, that is only one aspect of the program: the data must be managed as well.

For now I only care about running a world that is as wide as one register: that's 32 bits in JavaScript and 16 bits for AMBAP. The length can be arbitrary, though, and I provide a C version with 64 bits.


Logs:

1. The parallel boolean computation
2. Handling data
3. So it works.

GoL_03.OK.c

The code that generates the animation shown in the youtube video

x-csrc - 3.33 kB - 11/29/2016 at 05:00


GoL.js

The computation (enough to show that it should work but not really functional yet)

javascript - 3.44 kB - 11/29/2016 at 04:02


GoL.html

The page that loads GoL.js

HyperText Markup Language (HTML) - 548.00 bytes - 11/29/2016 at 04:02


GoL_02_adjust_range.c

skeleton file to show how data are managed

x-csrc - 2.36 kB - 11/29/2016 at 03:39



  • So it works.

    Yann Guidon / YGDES, 11/29/2016 at 05:07

    I just uploaded another version of the C file, this time with a block copied and pasted from the JS version and adjusted until the warnings went away, and it worked right away.

    For the record, you have to press ENTER to compute each new iteration. Add to this the cost of displaying the data through the xterm, and you understand that this is far from the true speed the algorithm can reach. But I'll optimise the speed for #AMBAP: A Modest Bitslice Architecture Proposal later.

    For now, I rejoice that it worked just like I expected :-)

                           #
                         ####
                ##     #### ##         ##
              #  #   #   ## ###        ##
     ##      #       #   ## ##
     ##      #      #   #####
             #       ###   #
              #  #
                ##
    
                             #
                              ##
                             ##
    
                  ####
                 # ## #
                 ##  ##
                 ##  ##
                 # ## #             # #
                  ####               ##
                                     #

  • Handling data

    Yann Guidon / YGDES, 11/29/2016 at 03:59

    I've just uploaded a C file that works with 64 bits (though it can be anything) and provides a skeleton for the data shuffling that reduces memory usage.

    There is no double buffering: the cells are computed "in place". To prevent time distortions (a result being used as input within the same iteration), the current result is written to the WriteBuffer variable. This variable is only written back to memory after the past value has been read (again).

        L1=world[i-1];          // make sure we read
        world[i-1]=WriteBuffer; // the value of t-1, not the
                                // result computed in the last iteration
        L2=world[i  ];   // I know, a rotating register barrel
        L3=world[i+1];   // would work better but not for AMBAP
    
    To emulate "some sort of expanding computation", I have used the following simple formula:
    WriteBuffer = L1+L2+L3;
    

    This helps me check that the ranges of indices can be dynamically adjusted. For example, with the following initial configuration:

    start: 13
    
                             #
                           # ##
                 ##      ## ## #       ##
                # ###    # #  ##       #  #
     ##        #  # # #   # #  #       ####
     #  #      #   # ### # ####        ####
     ####       ## #### # # #  #       #  #
     ####      ## ### #### #  ##       ##
     #  #       ## #### ### ## #
     ##        #   # ###   # ##
               #  # # #      #
                # ###
                 ##
                   ##
                  # # #
                 #  # ##
                #    # ##
                #  # # # #
                #### ## ##
                #### ## ##
                #  # # # #
                #    # ##
                 #  # ##
                  # # #
                   ##
    max=40
    

    After a few cycles, the range has grown:

    start: 10
    
                             #
                           #  ##
                 ##      ### # # #     ##
                #### #    ####   ##     #  #
     ##        #####  ## ###  ## # #     ####
      #  #       ##### # ##### # ###   ###    #
       ####    # ####   ##   # #    #  #    ###
     ###    #   #   # ## #    ##    #       #  #
     #    ###   #  # # # ##    #    #       #  #
          #  #  # # ### # # # ##    #  #    ###
          #  # # ### ###      ##    #  ###    #
     #    ###   # # ### # ### ## ###     ####
     ###    #   #  # # # ##  ### # #    #  #
       ####     #  # ### # #  #  ##    ##
      #  #     # # ### ## # ## # #
     ##           ###   # ##  ##
               # ##    ####  #
                ##     #####
                  # ##  # ###
                #  ##   #### #
                ###### #      #
                 # # ##### #  #
                 # # ##### #  #
                ###### #      #
                #  ##   #### #
                 ### #  # ###
                  # # # ####
                ### ## # ##
                 #### #  #
                  ###  #
                   ##
    max=43
    It is important to keep an empty line before the first line and after the last one, so new cells can grow there.

    The adjustment of the last line is done with the variable run: it is first cleared to 0, then updated with the number of every line that is not empty. This value is then assigned at the end of the scan:

      MaxLine=run;
    

    MaxLine will be the upper limit for the next time step, but it should preserve the previously mentioned "margin" for the growth of new cells beyond the last line. That's why run is set with a little offset, inside the loop:

        if (WriteBuffer)
          run=i+2;  // push the end of the work range
    

    Dealing with the start index is a bit more tricky, but run provides us with a way to handle it nicely: it remains stuck at 0 while empty lines are found, so we can increase MinLine as long as run is cleared. The complete code is simply:

        if (WriteBuffer)
          run=i+2;  // push the end of the work range
        else
          if (run==0) // no life found yet
            MinLine=i-1; // push the start of the work range

    Overall, it's a bit delicate but not totally arcane, since it's only a 1D array. I keep it simple because a 2D array has some very tricky side effects. The AMBAP is limited to 16-bit-wide data anyway, for computation AND display.

  • The parallel boolean computation

    Yann Guidon / YGDES, 11/28/2016 at 20:36

    I uploaded GoL.js and GoL.html. They don't "work" yet, but they allowed me to test the following functional code:

        L2=world[i];
        L1=world[i-1];
        L3=world[i+1];
    
        // half adder:
        S20 = L1 ^ L3;
        S21 = L1 & L3;
        // full adder:
        S31 = L1 | L3;
        S30 = S20 ^ L2;
        S31 = S31 & L2;
        S31 = S31 | S21;
    
        // Sum LSB
        S0 = (S30>>>1) ^ S20;
        S0 = (S30<< 1) ^ S0;
    
        // LSB Carry
        S0C = (S30<< 1) | S20;
        S20_= (S30<< 1) & S20;
        S0C = (S30>>>1) & S0C;
        S0C = S0C | S20_;
    
        // Sum middle bit
        S1 = S31>>>1 ^ S0C;
        S1 = S31<<1  ^ S1;
        S1 = S21     ^ S1;
    
        // Inhibit: INH=1 where the neighbour count reaches 4 or more
        INH = S31<<1  & S21;
        S1_ = S31<<1  ^ S21;
        C2  = S31>>>1 & S0C;
        INH = INH | C2;
        S2_ = S31>>>1 ^ S0C;
        S2_ = S2_ & S1_;
        INH = INH | S2_;
    
        X2 = S0 | L2;
        X2 = X2 & S1;
        X2 = X2 & ~INH; // The result is now in X2

    This is the contents of the inner computational loop: a series of half-adder and full-adder "blocks" that perform the 2D neighbour counting (a bit-parallel popcount). It will evolve, as will the surrounding code, to keep the number of needed registers as low as possible (4 for AMBAP, plus 2 mapped to memory).



Discussions

Yann Guidon / YGDES wrote 03/21/2017 at 03:03

Oh my, just thinking about this discussion makes me realize there are much more optimisations left :-D


Frank Buss wrote 03/21/2017 at 01:23

How is the performance compared to https://en.wikipedia.org/wiki/Hashlife ?


Yann Guidon / YGDES wrote 03/21/2017 at 02:57

Hi Frank,

it can't be compared because they operate at very different scales. I'm trying to run the bit-parallel version on a very, very constrained computer that couldn't run Hashlife. The absence of a lookup table or other memory structure (for example, those described in Michael Abrash's book) makes it suitable for my purpose, where the only memory I have is the few bytes to be displayed.

On a slightly larger scale, the bit-parallel method can process one time step of tens or hundreds of cells in under 50 CPU cycles: just use the ultrawide SIMD registers of your favourite CPU.

But as far as I understand, Hashlife stores results of previous computations to advance by many steps.

I'd say that they are complementary...

Furthermore, the bit-parallel version should only be applied to zones that actually change: I suspect only 2 to 10% of the considered bits undergo a transition. Some intelligence (or heuristics) must be applied to select which zones to update. And if you only have a few cells, you process all 16/32/64/128/256 bits anyway...

A further improvement of my scheme would be to store temporary results of the computations (the sums) to help select the computation sites. But I have no room for this in the #YGREC16 - YG's 16bits Relay Electric Computer :-D

