By whygee on Thursday 7 August 2008, 08:06 - VHDL
For a while, I was happy with the idea that YASEP would work with a standard SDRAM chip running at 133 MHz. So the core would run at 66MHz and a 16-bit datapath would provide 32 bits in 2 cycles. Good fit. The first synthesis attempts for the ADD/SUB unit (with a standard grade A3P250) gave something like 60 to 70MHz with a plain dumb 33-bit add/sub function (32 bits of result and one carry output). But I was not satisfied.
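To make the "plain dumb 33-bit add/sub function" concrete, here is a minimal behavioural sketch in Python (not the VHDL of the post). The name `add_sub` and the carry polarity for subtraction (carry = 1 means "no borrow") are assumptions made for illustration; the text does not say which convention YASEP uses.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def add_sub(a, b, sub):
    """Return (result, carry_out) for a 32-bit add or subtract.

    Subtraction is done the usual two's complement way: a + (not b) + 1.
    Assumption: the carry output is the raw bit 32 of that sum, so for a
    subtraction it reads 1 when no borrow occurred.
    """
    a &= MASK
    b &= MASK
    if sub:
        b ^= MASK                      # invert every bit of b
    total = a + b + (1 if sub else 0)  # 33-bit intermediate sum
    return total & MASK, total >> WIDTH


print(add_sub(0xFFFFFFFF, 1, False))  # addition that overflows into the carry
print(add_sub(5, 3, True))            # subtraction without borrow
```

A model like this is only a reference point: it says nothing about gate count or logic depth, which is what the rest of the post is fighting with.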
I have recently found Synchronous SRAM chips that run at 100 and 200MHz, with 18-bit and 36-bit datapaths, in capacities from 128KB to 2MB. That made me think a lot: 100MHz would be better than 66MHz or 50, obviously. However, achieving a 50% speed increase is FAR from easy. I have been busy with this since the end of June. More than one month of dumb, repetitive, error-prone work!
First, I needed an Add/Sub unit that I could control completely. I have not found anything that came close, and the "default" add/sub created by Synplify was always faster by a significant margin. So... I have analysed this add/sub unit, gate by gate. More than 300 gates were transcribed by hand from the schematic output of the Actel software!
Another constraint is that the adder MUST be portable and easily modified (*sigh*). So using the VHDL output of Synplify was not possible because it uses Actel-specific mapped instances. I decided that "plain text" was better, so that 1) I could modify the netlist more easily and 2) another FPGA with a proper synthesizer would not be stuck with 3-input gates (in a world where most FPGAs use 4-input LUTs). As a result, after about one month of effort, I finally got a big file full of "NetA <= NetB xor (NetC or NetD)"-like lines.
Well, nobody is perfect and in the 300+ gates written by hand, more than 10 errors were found. The first ones were spotted by a full re-check of the whole schematic. Painful once again, and some naughty errors were still there. I finally found a method to narrow down the probable location of an error, using the synthesizer as a "formal verification tool" (comparison against a working add/sub) or as an "oracle", along with clever "bit tickling" techniques similar to what a cryptographer would do with an unknown "black box". After about a week, I finally got my dear, long-awaited optimised add/sub netlist with a depth of 9 logic layers.
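The idea of comparing a suspect netlist against a working add/sub and "tickling" individual bits can be sketched in software. This is only an illustration in Python, not the synthesizer-based flow described above: `netlist_add` is a hypothetical stand-in for the hand-transcribed netlist (a simple ripple-carry chain, not the real 300-gate structure), `bad_bit` injects one made-up transcription error, and `tickle` is an invented helper name.

```python
WIDTH = 32
MASK = (1 << (WIDTH + 1)) - 1  # 32 result bits plus the carry output


def reference_add(a, b):
    """Trusted model: plain integer addition truncated to 33 bits."""
    return (a + b) & MASK


def netlist_add(a, b, bad_bit=None):
    """Bit-by-bit ripple-carry adder standing in for the hand-written netlist.

    When bad_bit is given, the carry of that bit uses OR where AND belongs,
    mimicking a single transcription error.
    """
    carry = 0
    result = 0
    for i in range(WIDTH):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        generate = (x | y) if i == bad_bit else (x & y)
        carry = generate | (carry & (x ^ y))
    return result | (carry << WIDTH)


def tickle(dut):
    """Drive a walking one on operand a (b = 0) and list every mismatch.

    Returns (input_bit, lowest_wrong_output_bit) pairs.
    """
    suspects = []
    for i in range(WIDTH):
        a = 1 << i
        diff = dut(a, 0) ^ reference_add(a, 0)
        if diff:
            lowest = (diff & -diff).bit_length() - 1
            suspects.append((i, lowest))
    return suspects


print(tickle(netlist_add))                                  # prints []
print(tickle(lambda a, b: netlist_add(a, b, bad_bit=13)))   # prints [(13, 14)]
```

The walking-one pattern points straight at the damaged carry here, but it would miss errors that only show up when a carry is already propagating, so a real hunt would need several pattern families (walking zeros, all-ones plus one, random vectors).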
But it was not over! A lot of cleanup and preparation was necessary to get ready for the next step. A lot of logic simplifications were found, possibly breaking some clever synthesis techniques, but also curing some of Synplicity's weaknesses. "Bubble-pushing" allowed a clear (mental) view of the netlist and some critical datapaths were solved through gate duplications and other adaptations. The maximum logic depth was reduced to 8 layers and the propagation delays were homogenized. After adding a first pipeline gate (just after the first logic layer), P&R said that I could run the add/sub unit at around 77MHz.
Now, that's promising but insufficient. Reaching 100MHz requires a pipeline gate after the 6th logic layer. So there are 2 layers of gates after the pipeline barrier, and some room for muxes and Setup&Hold for writing the result to the registers. After some more editing effort, Synplicity announced an estimated 101MHz, and P&R said 106MHz! I had set the bar at 110MHz but 6MHz is enough margin for me. I mean: I can safely run at 100MHz.
Conclusion: YASEP can be "superpipelined" when needed (it is not always necessary, since the added pipeline gates take room and draw power) and using a decent SiO2 process will give higher frequencies! (The A3P250 is in 130nm, and even at 0.35µm a pure ASIC will be faster.) A 5-layer-deep pipeline stage with 3-input gates and a mean fanout of 3 is not extraordinary today, but it's challenging (the Cray-3 had 4 logic layers). Yet, I'm still confident that the pipeline depth will not explode: 4 stages is still possible. What bothers me is that clock gating was messed up by Synplicity, so the power draw is going to be a concern.
Another important thing is that the YASEP architecture is not changed. I realised that I could safely bite into the margin of the last pipeline stage ("Write back to the register") without affecting the execution sequence. A bypass network could be possible but is not necessary (too much control logic would be needed).
Anyway, with a 100MHz rating, one can use 12, 24, 24.576, 25 or 48MHz crystals/oscillators to feed the PLL, and run from 96 to 100MHz internally. With a 4-stage pipeline, 4 threads can execute simultaneously (at 24-25MHz each), providing a peak performance of 100MOPS. Imagine what this would yield on the latest 40nm FPGAs or ASICs!
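The clock figures above are easy to check with a few lines of Python. This sketch assumes the PLL is used as a plain integer multiplier (a real PLL may offer other ratios) and that the 4 threads share the core clock equally; `plan` and the constant names are made up for the example. Frequencies are in kHz to keep the arithmetic exact.

```python
CORE_MAX_KHZ = 100_000  # the 100MHz rating from the text
THREADS = 4             # one thread per pipeline stage


def plan(f_in_khz):
    """Pick the largest integer multiplier that stays within the rating.

    Returns (multiplier, core clock in kHz, per-thread clock in kHz).
    """
    mult = CORE_MAX_KHZ // f_in_khz
    core = f_in_khz * mult
    return mult, core, core / THREADS


for f_in in (12_000, 24_000, 24_576, 25_000, 48_000):
    mult, core, per_thread = plan(f_in)
    print(f"{f_in / 1000:7.3f} MHz x{mult} -> core {core / 1000:7.3f} MHz, "
          f"{per_thread / 1000:6.3f} MHz per thread")
```

Under these assumptions every listed oscillator lands between 96 and 100MHz, and each thread gets between 24 and 25MHz.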
However, memory bandwidth and latencies are going to be the main bottlenecks again. But I think that I have found a solution...
In the end, this endeavour was doomed... But still very informative.
Today I understand the structures of adders better so I can design one that I can pipeline at the desired levels.
In 2008 I had to reverse-engineer the adder's netlist, after full optimisation and bubble-pushing... it was really dumb but I only had this method at hand. Later experiments on other FPGA families showed that the netlist was far too specific to be portable: the synthesisers couldn't figure out that this was an adder. Still, this helped me learn the structure of the ProASIC3 and its tools in more depth.
Today I start from GHDL in VHDL and I try to avoid being too specialised, while at the same time I still work on the #VHDL library for gate-level verification.
So I start from the "top" and work towards the "bottom", with a much better understanding, and now I target ASICs and not just the obsolete A3P family.
Later, I chose a lower frequency, 3.6864MHz, for many reasons... See a later log :-)
Concerning memory bandwidth: this is still a concern but it's too platform-dependent...