RAIN is an open-source project to design and build open, efficient, and accessible supercomputers. RAIN's mission is to make supercomputing accessible to a wider and more diverse audience and encourage the development of new, innovating and compelling high-performance applications.This Hackaday.io project is focused on the RAIN Mark II Supercomputer Trainer. The goal of this phase of the RAIN project is to create a small, inexpensive computer with the same architecture as a large scale cluster to facilitate learning and the development of new high-performance applications while making owning and operating the system accessible to a wide-range of programmers, designers and users.Completion of the Mark II SCT will provide a platform for research & development on the next phase of the project to make the design even more open (switch from ARM to RISC-V) and more powerful.
Up until now I've been documenting my work on RAIN (formerly known as *Raiden*) on my blog at https://jjg.2soc.net/category/rain/. This includes information about the previous phase of the project (Mark I) and longer discussions of the philosophy behind the work (and why I think it matters).
Here I'm going to focus on documenting ongoing work on a specific machine, the RAIN Mark II Personal Supercomputer.
Mark II is an 8-node ARM computing cluster using a Gigabit Ethernet interconnect. It utilizes a combination of PINE64 SOPINE modules and PINE A64 single-board computer to create a distributed-memory supercomputer with up to 32 ARM cores, 16 vector/GPU cores and 16 GB of memory in a small, clean (and I think, beautiful), easy-to-use package.
In addition to the hardware, the RAIN project involves making the creation of new high-performance applications approachable by people with little or no experience with supercomputers. My goal is to do for supercomputers what the personal computer did for mainframes. This means more than putting the box on someones desk, it also means giving them tools to use it the way they want to.
The design has gone through many revisions and I've recently abandoned some of my own work in favor of using the PINE64 Clusterboard. I didn't take this change lightly as my designs for assembling arrays of PINE A64 boards is more flexible and scalable than the Clusterboard, but the board is just so perfectly suited for the Mark II's design and meets the needs of the goals I have for Mark II with almost no compromise, so it's the obvious choice.
This means I can get the machine operational faster, and design a machine that is much easier for others to reproduce (sticking to off-the-shelf components is one of the design goals for Mark II). All of this means I can turn my attention to the software side of the system sooner and I think that's where I have the most value to provide anyway.
The most egregious was in my interpretation of the high-performance Linpack (hpl) results. I was mystified by the fact that as I added nodes to the cluster performance would improve to a point and then suddenly drop-off. I attributed this to some misconfiguration of the test or perhaps a bottleneck in the interconnect, etc. In the back of my mind I had this nagging idea that there was some sort of “order-of-magnitude” error but I couldn’t see it. Then when I was reading a paper by computer scientist who was writing about running hpl it jumped out at me. The author said they measured 41.77 Tflops, but when I looked at their hpl output it looked like 4.1 Tflops…
Then I realized my mistake.
There’s a few extra characters tacked-on to the Gflops number at the end of hpl’s output:
When I first started using hpl, that last bit was always e+00 so I didn’t pay any attention to it. However at some point that number went from e+00 to e+01 without me noticing, and it turns out that change was significant.
Since I didn’t notice this, I was mis-reading my results and doing a lot of troubleshooting to try and figure out why my performance dropped through the floor after I added a fifth node to the cluster. Once I realized the mistake I went back and reviewed the Mark I logs and sure enough, that mistake sent me on a wild goose chase and resulted in a seriously erroneous assessment of Mark I’s performance.
The good news is that this means Mark II is considerably faster than I thought it was when I wrote about it earlier. The bad news is that this means Mark I is also faster, and my conclusion that Mark II was exceeding Mark I’s performance was incorrect. Knowing what I know now, Mark I was roughly 4x faster than Mark II in the 4×4 configuration.
Performance parity between the two systems seemed too good to be true, so in a way this is a relief. It’s a little disappointing, but this doesn’t undermine the value of Mark II because in terms of physical size, cost and power consumption Mark II easily out-paces Mark I. I don’t have detailed power consumption measurements for either system yet, but if we compare their theoretical maximum power consumption, it’s pretty clear that in terms of flop-per-watt, Mark II is an improvement:
Mark I: 40 Gflops, 4,000 watts (800 watts * 5 chassis) = 100 watts per Gflop
Mark II: 18 Gflops, 75 watts (15 amps @ 5 volts * 1 chassis) = 4 watts per Gflop
The second mistake I made related to the clock speed of Mark II’s compute nodes. The SOPINE module’s maximum clock speed is 1.2Ghz and this is what I used when determining the theoretical peak (Rpeak) and efficiency values for the machine. I knew that this speed might be reduced during the test due to inadequate cooling (cpu scaling) but I thought it was a reasonable starting point.
Based on this I was disappointed to find that no matter how hard I tweaked the hpl configuration, I couldn’t break 25% efficiency. I even added an external fan to see if added cooling could move the needle but it seemed to have no effect. I took a closer look at one of the compute nodes during a test run and this is when the real problem began to become apparent.
I wasn’t able to find the CPU temps in the usual places, and I couldn’t find the CPU clock speed either. I had read something about issues like this being related to kernel versions so I looked into what kernel the compute notes were running. As it turns out they are running the “mainline” kernel, and the kernel notes on the Armbian website state that cpu scaling isn’t supported in this kernel. This means the OS can’t slow-down the CPU if it gets too hot, so the clock is locked at a very low speed for safety’s sake. From what I can tell that speed is 408Mhz, around 1/3 of the speed I expected the CPU’s...
Thanks to a gracious gift of gadgetry for Fathers Day, I was able to spend last weekend increasing Mark II’s computing power to maximum capacity!
Not only did I have the parts, but I also had the time (again thanks to Jamie & Berty) and had very ambitious plans for pushing Mark II across the finish line.
Things didn’t go quite as planned (I spent one of the two days fighting O/S problems of all things) but I was able to get all 8 nodes online by the end of the weekend.
Since then I’ve been working on running high-performance Linpack (hpl) to see how Mark II’s performance stacks-up against Mark I. After a few runs, I was not only able match the best results I achieved with Mark I, but slightly exceed them.
The best hpl performance I could get out of Mark I was 9.403 Gflops. Using a similar hpl configuration (adjusted for reduced memory size per node), I was able to get 9.525 out of Mark II. On both machines this result was achieved with a 4×4 configuration (4 nodes, 16 cores) so even though it wasn’t the maximum capacity of either machine, it’s a pretty good apples-to-apples comparison between the two architectures.
That said, there’s more performance to be had. I’ve learned a lot about running hpl since I recored Mark I’s results and I’m confident that I can figure out what was stopping me from improving the numbers by adding nodes back then. In addition to better hpl tuning skills, I’m able to run all nodes of Mark II without throwing a circuit breaker, which was not the case with Mark I. This makes iterating on hpl configs a lot easier.
The theoretical maximum hpl performance (Rpeak) for Mark II is 76 Gflops. Achieving this is impossible due to memory constraints, transport overhead, etc. but I think 75% of Rpeak is not unreasonable, which would yield a measured peak performance (Rmax) of 57 Gflops. This would rank Mark II right around the middle of the Top 500 Supercomputer Sites… in 1999.
Not too shabby for a $500.00 machine you can hold in your hand.
Still, there’s a long ways to go from my current results of not-quite 10 Gflops to 57. I think theres a lot to improve in how I’m configuring hpl and I also think there is work to do at the O/S and hardware level. Minimally I’m going to need to increase the machine’s cooling capacity (right now I don’t even have heatsinks on the SOC’s so I know the CPU throttle is kicking in). So if I’m able to find an hpl config that doesn’t loose performance as I add nodes to the test, and I can keep the system cool enough to make sure cpu throttling doesn’t kick-in, I should be able to get much closer to that Rmax value.
But even with no improvement this test confirms the validity of the Mark II design. The original goal for this phase of the project was to establish the difference in performance between the Mark I machine and a similar cluster built from ARM single-board computers. The difference, surprisingly is that the ARM machine has a slight node-for-node edge over the “traditional” Mark I.
Perhaps just as important, Mark II achieves this with significantly less cost, physical size, thermal output and electricity consumption. Once I complete the power supply electronics for Mark II I’ll be able to get more precise measurements of power consumption but even if Mark II operates at it’s maximum power consumption that’s still 1/10 of the power consumed by Mark I.
This is a significant milestone in the RAIN project and marks the “official” end of the Mark II phase. I plan to finish Mark II’s implementation (wiring front-panel controls for all nodes, designing & fabricating the front-end interface board, etc.) and generate a “reference design” for others who would like to recreate my results (perhaps even offer a kit?), but with these results in hand I can confidently enter into the Mark III...
Looks like work on RAIN Mark II will be slowing-down a bit for a couple of reasons:
First, the snow has receded which means work that can only be done during the month we call “summer” takes priority over anything that can be done inside (plus there’s just lots of fun things to do outside…).
Second, it looks like I misunderstood how the 2018 Hackaday Prize works. I assumed that having your project in the top 20 positions of the leaderboard when a challenge concluded was what they meant by “The top twenty projects from each challenge will be awarded $1000 and will move on to the finals…”. Instead, they selected 20 projects using some other process, and RAIN Mark II didn’t make that cut.
This is a bummer, because I invested time and effort promoting the competition with the misguided idea that doing so could result in injecting some resources into the project (which could have dramatically accelerated its progress). I should have spent this time working on the project instead.
But it’s a good reminder to me of the pitfalls of competition, and that any project whose success relies on it is vulnerable to competition’s inherent inefficiencies. I’m glad to have had an opportunity to have these inclinations put in-check while the stakes are lower than they would be later on in the project.
The whole experience has caused me to reflect on the purpose of the project itself and I have a renewed focus as a result. The next step for Mark II will be to complete the assembly of an ARM-based cluster with the same node count as the Intel-based Mark I machine. Once this is complete, I can duplicate Mark I’s run of the hpl benchmark on Mark II and have an apples-to-apples comparison of the performance difference between the two architectures. This was the original purpose of building Mark II, and once this is known it will be possible to describe an ARM-based system with equivalent power to an Intel-based system and determine at what scale ARM outperforms Intel in terms of processing power vs. total system efficiency (cost, power consumption, cooling, physical space, etc.).
This will complete the work on the hardware side of Mark II. I can then move-on to both the software aspect of Mark II, as well as using the Mark II hardware as a platform for the development of Mark III hardware components.
I’ll need around $250.00 worth of hardware to get the system to a point where these tests can be run. I’m selling-off the Mark I hardware in an effort to cover it, but it’s indeterminate how long this will take and as such progress will stall until this is complete.
Thank-you to everyone who took the time to support the project on Hackaday.io.
Based on feedback, conversations and thoughts about the short and long-term goals of the RAIN project I’ve decided to make some changes to the naming convention of the series. I’m renaming RAIN Mark II from Personal Supercomputer to Supercomputer Trainer.
The term “trainer” refers to education-oriented machines and systems (typical of the 70’s and 80’s) which I think suits both the form and function of the current iteration of the Mark II machine well. While the overall goal of the RAIN project is to produce an open-source supercomputer, there is a lot of dispute as to exactly what defines a supercomputer and using the term in an unqualified way to refer to Mark II machines appears to be controversial.
So, instead of wasting time arguing semantics I think renaming the machine puts an end to that discussion while at the same time making the objectives of the machine more clear. This also helps me focus on the objective of this stage of the project and might help the machine connect with the most appropriate audience as well.
In addition to renaming the current project, I’m also going to begin using the term “Type” to refer to a specific incarnation of each machine in the series. For example, the Supercomputer Trainer will now be referred to as “Type 1”. I if another machine is designed as part of the Mark II series it will e refereed to as “Type 2”. This makes it more clear where each machine belongs in the overall evolution of the RAIN project (now that I’m producing custom hardware and electronics I need a simpler way to relate parts to the machines they belong to).
I’ve begun this renaming process which has resulted in some changes in the source repository. Updates to specific components (the front panel, for example) will be applied as new versions of the component designs are produced.
In the last log I left-off with a 2-D printed version of a printed circuit board I designed in KiCad which was enough to confirm that the board I was designing would fit into the board I’m designing it for. This felt like a lot of progress!
However there was still a lot to do, primarily consisting of making the actual electrical connections between the components on the board. For some dumb reason I thought this would be easy, but I ran into problems getting the traces to connect, or prevent them from overlapping, etc.
Part of the problem is that I have no real experience in the more abstract (i.e., not related to learning the software) aspects of PCB design. Luckily Bob was willing to help me out and gave me a list of steps to use as a starting point when laying out a board:
Lay out the connectors where they have to go
Logically place groups of components where they need to go (power section should be in one area, microcontroller + decoupling caps in another area, etc.)
Refine component placement to have as few overlapping airwires as possible to ease routing
Route length sensitive traces first
Route other traces
Route power traces, but do a ground pour to make it easier
Lay out silkscreen and be as descriptive as possible
In addition to this, Bob said traces could be 6 mil wide minimum, but to aim for 10 mil in general and 12 mil for power.
This advice helped a lot and between this an finding a “mode” for laying traces that worked for me, I was able to connect all the parts and get the board to pass the Design Rules Check (DRC). Bob eyeballed my layout, gave me some tips on improving it and said it looked like it would work.
It was around this time that I noticed an error in the pin assignment where the panel driver board connects to the headers on the Clusterboard. Since the driver board will connect using a right-angle connector, but I designed it with a straight connector in mind, pin 1 would end up in the wrong place. This was easily fixed by rotating the connector on the board, but since this connector has traces leading to almost every other component, the layout became a complete mess. Instead of fighting with this, I took the opportunity to redo the board from scratch and apply everything I’ve learned so far.
The second time around went much faster and I think it looks nicer as well.
After reviewing the design (for what felt like the millionth time), I was reasonably sure I didn’t make another mistake like the connector and decided it was ready to upload to OSH Park for production.
This process went very smooth. I don’t have any personal experience to compare it to (being the first board I’ve designed to be produced this way) but the website was easy to use, the visualizations and “checklist” guides were very helpful and I felt like I had a clear idea of what the final product would look like when they were done.
Now it’s just a matter of waiting.
The boards showed up a few days early and I was lucky enough to have time to try one out (the minimum order was 3 boards). I had only enough parts on-hand to complete one, but this was intentional because I assumed that I would have made some mistakes and that I could order more parts after I fixed the design and re-ordered a second batch of boards.
After assembly, I tried to quell my excitement and properly check-out the board in steps (inspired by Bob’s article) to reduce the chances of burning-up the new panel driver board or worse, the precious Clusterboard.
First, I tested the driver board alone to make sure power was flowing to the right places. Then, I attached it to the Clusterboard and checked the i2c bus to see if it was showing-up correctly. Finally. I wired-up the LED’s and toggle switch and ran the python script.
Much to my surprise, the LED’s lit up and the switch correctly toggled between display modes!
I’ve been putting-off learning to use an EDA for quite awhile. You see, I am old, and I learned how to read and draw schematics when I was young, so I learned how to do it on paper, and it’s very hard to give-up doing something you can do fast and well for something that is slow, frustrating and you suck at.
In the previous RAIN PSC design, I could get away with using the plentiful GPIO pins available through the PINE A64‘s 40-pin connector to drive the front-panel LED’s and switches. However, adopting the Clusterboard demanded a different approach and I settled on developing an i2c-based interface. This has worked well on the breadboard, but it requires building an interface board for each node and fabricating something like that out of perfboard (or some other “homebrew” technique) repeatedly is not very practical.
Regardless, I thought I could at least cobble-together a prototype of the driver board for a single node so I could at least pack everything inside the case for awhile. A couple of weeks have passed as I’ve tried various options and I’ve come to accept that designing a custom board is the only way to go, and the best way to go about that is getting my design into an EDA.
I’ve put this off for a long time because I’ve made several attempts at learning several EDA’s and it’s always been frustrating. There’s a lot of up-side to learning to use these tools, but paper and pencil is so fast and so natural for me that it’s really had to give that up. Nonetheless, if I’m going to get boards made from my design the options are to either learn this or have someone else do it, and doing yourself makes you smarter so that’s the way I decided to go.
I could have learned a number of different packages but I choose KiCad because it’s open-source and one of the goals for RAIN is to be a completely open-source computer. There’s a number of other open-source EDA’s as well, but I got a lot of recommendations to use KiCad and even though it’s considered more difficult to learn, I’ve been told it’s worth it. I also knew that KiCad ran fairly well on my A64-based laptop, and this would allow me to design RAIN’s hardware on the PSC itself.
I started with Getting Started in KiCad (seems like a logical place to start, right?) and slowly made my way through the tutorial, stopping whenever I became the least bit tired or frustrated. I’ve found this to be a good way to learn something I’m not looking forward to learning, because these forced stops cultivate some excitement and curiosity about returning to the task.
I was able to maintain this discipline over the course of a weekend and while I wasn’t able to finish the design, I made a lot of progress and learned a lot more than I expected about the tool. Based on what I’ve learned, I feel pretty confident that I will be able to design this board successfully, and continue to use an EDA for all of my future electronics projects.
There’s still work to do before I finalize the design and send it out to have a prototype board made, but what remains is squarely within my comfort zone.
I need to determine whether or not the i2c pins on the Clusterboard need to be pulled-up to 3.3v like they do on the A64 (which would be a drag because the Clusterboard’s pins don’t supply 3.3v) and I need to sort-out some software problems on the SOPINE module just to confirm that the driver circuit will work the same as it did when I had it connected to the A64 for testing earlier. Once these two things are sorted I can finalize the design of the board and order a copy.
With any luck it will work and I’ll be able to pack everything back in the case and focus on the software side of things until I scrape-up enough cash to order more panel drivers...
Just a quick note to say that it will probably be a week or so before I post another update on the project. I’m currently on a road trip across the western U.S. and won’t be back in the lab until around 04/01.
When I do get back, I’ll probably work on moving the panel driver from the breadboard to something that can be installed in the chassis.
Also I see some of you have left comments and even expressed interest in joining the project. I will get back to all of you as soon as I’m back from the trip.
Technically, the Clusterboard fits inside the case I’ve been designing around, but it doesn’t fit inside the “endcaps” so it can’t be mounted directly to the steel of the case using the mounting holes on the board.
To address this, I sketched-up some adapters to “relocate” the mount points somewhere more appropriate. Since I’m still not 100% sure where everything will belong in the final configuration of the chassis, I came up with a more flexible way to mount the board: magnets!
I also need to mount a single PINE A64 board to serve as the “front-end node” so I whipped-up a couple of magnetic mounts for this board as well.
I wasn’t able to find appropriate magnets locally so I had to wait for some to arrive from The Internet. In the meantime I switched-gears and worked on writing a little software to drive the panel’s display.
When the magnets arrived I was stunned to see they fit perfectly on the first try. However I didn’t have any glue on-hand that was right for the job. Since I was tired of waiting I thought about how I might modify the mounts to eliminate the need for glue. This turned-out to be easier than expected and after two iterations I had working, glueless mounting brackets.
All-in-all they work pretty well. There is some alignment problem keeping all four feet on the Clusterboard from engaging the inside of the case completely, but I think this will be strong enough to safely move on to the next step: stuffing everything inside the box.
The idea of modifying a part just because you don’t want to run out and buy some glue would seem ridiculous before I had a 3d printer but now it’s easier and faster to just “run-off a new part”. The result is not only faster, but it’s also a better part. This is one of the things I love about 3d printing, the ability to iterate at a pace similar to writing software and letting the robots do the work.
It's very exciting to see other people interested in this project. Thank-you for your feedback and encouragement!
In addition to releasing my work as open-source (so others can reproduce the machine), I'm considering developing kits (and perhaps a small number of assembled systems) once I have a design that is stable, reliable and repeatable.
I haven't gone too far down that road yet because there's still a lot to do, and I wasn't sure how many other people might be interested in owning a machine like this, but if there's interest (and I can wrangle the resources) I'll seriously consider it.
Starting a computer company is something I've dreamed about since I was a kid banging-out BASIC on my VIC-20. It would be kind of poetic if in doing so, I could help put other kids on the same path.
The front panel of RAIN-PSC serves three essential purposes:
Show the status of each node
Show the load on each node
Look really cool
The panel is actually eight individual control panels (one for each node in the cluster). Each panel consists of five LEDs and a toggle switch. The switch selects between two display modes: status and load.
When the switch is in the status position, each LED indicates the following:
boot (on when the os has booted successfully)
network (on when the node has successfully connected to the network)
temp (on when the node temperature is too high to run at full-speed)
user 1 & user 2 (used to indicate custom status selected by the programmer)
When the switch is in the load position, the LEDs behave as a bar-graph displaying the unix load of the node.
To provide this display the software needs to be able to:
Poll the status of each monitored subsystem (os, network, etc.)
Read the system load
Read the toggle switch position
Turn the LEDs on and off
When I started putting the electronics together, I used some command-line tools to interact with the LEDs and switches (I think it’s possible to write do all of the above in a shell script). In the long-run, I’ll probably write this in something faster/more efficient (Rust?) but for now, I’m going to use Python to get the hang of talking to the new hardware.
I’m installing a few things on top of the base Armbian to make this happen:
The script can be broken-down into three primary components:
Functions to gather system information
Functions to read and update the front panel components
A loop to periodically update the display
Gathering system information using Python is a pretty well-worn path, so I won’t discuss that in detail here.
Reading the position of the toggle switch and turning the LED’s on and off is done using the smbus Python package. This package interacts with the bus in much the same way as the command-line i2c tools.
The hardest part of this for me is coming up with the best way to translate between the binary representation (the pins themselves), the boolean/decimal values I’m used to working with and the hexadecimal values that glue the two together. What I settled on was using hexadecimal internally to the functions which generate the display (display_status(), display_load()) and boolean/decimal values everywhere else. At some point I’ll abstract all this away into a library or a module, but since I don’t plan on using Python for this long-term I’ll probably hold-off on that for now.
Finally, the main loop simply loops forever, calling toggle_on() to determine the position of the status/load switch and then calling display_load() or display_status() accordingly. Once the display is updated the loop sleep()s for one second and then starts over. In its final form this will need to update the panel much faster than once-per-second (potentially leveraging interrupts as well), but for this version this is probably fast enough.