Multiple ways, and rings

A project log for PEAC Pisano with End-Around Carry algorithm

Add X to Y and Y to X, says the song. And carry on.

Yann Guidon / YGDES 07/21/2021 at 14:53

As I try to re-validate w26 and look at the unreachable w32, I have identified ways to get more computations done per second, including a cluster of RPi boards, which recently led to the creation of a new project/fork: #Clunky McCluster...

I also explore the possibilities offered by the PolarFire FPGA I recently got. I am stuck at the internal network level and am considering a weird hierarchical token ring...


So far, I favour 2 approaches: the RPi cluster and the FPGA acceleration. CUDA running on GPGPUs rented in the cloud from AWS or Google could have great potential for huge calculations, but at an unknown price: the cost increases with the time used, which in turn increases with the development effort on a platform I can't really "get my greasy hands on". And the clock is ticking while credits evaporate...

OTOH, I already own a great PolarFire board and a collection of RPi boards of various mileages. I can develop at my own pace and reuse the hardware, as well as the skills, for other future projects, when I want, how I want.

I don't know yet which one I will choose: each has its own strengths, but I have a tendency to spread myself thin and dissipate my focus and efforts.

The FPGA/PolarFire route has a direct speedup factor of around 1000, as I think I can fit 1000 to eventually 2000 calculating circuits on this large chip. The constraints are pretty clear and there are a few major, well-understood challenges to overcome. But once they are solved, that's it, it will stream results. The only real issue I see is scalability: it's a fixed-size solution and I'd have to buy more boards to get faster results.

The RPi side can evolve more easily: you can extend a cluster, swap in faster boards with more cores, progressively tune the software and eventually exploit the GPU. This is more flexible, but in the beginning the payoff is low. The speedup would start at 5 or 10, with the GPU eventually bringing another 10x speedup or so. Maybe if I succeed in using the Pi's GPU, I can then port the system to CUDA and get that sweet 10K speedup one day...

All these ways share a common framework and algorithms for managing and reconstructing the orbits, so this is another branch to work on, on POSIX for now.
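For readers who have not followed the earlier logs, here is a rough illustration of the kind of work every one of these platforms has to speed up. It is only a minimal sketch, not the actual scanning code: it assumes that one PEAC step adds X, Y and the previous carry, keeps the low w bits and feeds the carry back into the next addition. The names `peac_step` and `peac_orbit_length` and the start state (0, 1, 0) are made up for this example.

```python
def peac_step(x, y, c, w):
    """One assumed PEAC step: add X, Y and the previous carry,
    keep the low w bits, and pass the carry on to the next step."""
    t = x + y + c
    return y, t & ((1 << w) - 1), t >> w


def peac_orbit_length(w, start=(0, 1, 0)):
    """Count the steps until the state (x, y, carry) returns to `start`."""
    # There are at most 2 * 4**w distinct states, so a longer walk
    # means `start` is not on a closed orbit.
    max_steps = 2 * 4 ** w
    state = start
    for steps in range(1, max_steps + 1):
        state = peac_step(*state, w)
        if state == start:
            return steps
    raise ValueError("start state is not on a closed orbit")


if __name__ == "__main__":
    for w in (2, 3):
        print(w, peac_orbit_length(w))  # prints "2 9" then "3 35"
```

In this sketch the number of possible states is 2 × 4^w, so each extra bit of width roughly quadruples the worst-case walk. A plain loop like this is fine for tiny widths but hopeless for w26, let alone w32, which is where the speedup factors discussed above come in.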

But for now I have not chosen. On one hand, I have started the spin-off project #Clunky McCluster; OTOH, I keep searching for an appropriate intra-chip network topology that uses the fewest gates.