That Was Fast!

A project log for CAT Board

The CAT Board is part of a Raspberry Pi-based hand-held FPGA programming system.

Dave Vandenbout, 09/01/2016 at 16:38, 3 Comments

I've been away from this project for a few months (OK, four months) building things like a new tool for designing electronics. One of the things I haven't discussed here is the time it takes to download a bitstream to the FPGA on the CAT Board.

As shown in previous logs, the FPGA is configured through one of the hardware SPI ports of the RPi. I've never considered SPI a very fast way of transferring data, so I initially set the port bit rate at 1 Mbps. That was good enough to get the FPGA going within a couple of seconds and there was no reason to push it and possibly cause errors while I debugged the board.

But once the board was working reliably, I revisited the SPI bit-rate setting. I figured there was no harm in upping it to 5 Mbps just to see what happens. I went into the litterbox.py script and changed it to:

    self.spi.speed = 5000000

Then I ran the command to load the FPGA with the bitstream for the LED blinker:

    sudo litterbox -c blinky.bin

The download to the FPGA completed more quickly than before and the LED started blinking. Success!

Then I started pushing for more: 10 Mbps, 20 Mbps, 50 Mbps, no problem; 100 Mbps, 150 Mbps, still five-by-five; 200 Mbps, complete and utter failure.

OK, I hadn't expected to get even close to 200 Mbps. With a little trial and error, I finally found the maximum speed I could use was 199,999,999 bps. The reason for that becomes clear later.

Now, was I actually transferring bits at 200 Mbps, or was the software making a promise that the hardware couldn't keep? To test that, I wrote some code to time the transmission of a 10 MByte payload and compute the effective bit-rate while I also observed the maximum SPI clock frequency and duty cycle with an oscilloscope:

spi.speed (Mbps)   Actual Speed (Mbps)   Fmax (MHz)   Duty Cycle (%)
       4                   1.8               2.5           ---
      10                   3.3               6.25          54.0
      25                   8.8              25             36.1
      50                  10.7              50             21.7
      67                  11.3              67             17.9
     100                  12.0             100             12.1
     150                  12.0             100             12.1
     200                  12.8             200              6.4
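The timing code itself isn't shown in this log. A minimal sketch of that kind of measurement might look like the following. The names `measure_bit_rate` and `fake_send` are hypothetical, and the sender is a stand-in that just sleeps so the sketch runs without any SPI hardware; on a real setup you would pass in whatever function writes the payload to the SPI port.

```python
import time


def measure_bit_rate(send, payload):
    """Time one call to send(payload) and return the effective rate in bits/s.

    `send` is any callable that transmits the bytes it is given. It is a
    placeholder here, not a function from the python-spi module.
    """
    start = time.perf_counter()
    send(payload)
    elapsed = time.perf_counter() - start
    return len(payload) * 8 / elapsed


if __name__ == "__main__":
    def fake_send(data):
        # Hypothetical stand-in for the real SPI write: pretend it takes 10 ms.
        time.sleep(0.01)

    payload = bytes(10 * 1024 * 1024)  # 10 MByte of zeros, as in the test above
    rate = measure_bit_rate(fake_send, payload)
    print(f"effective rate: {rate / 1e6:.1f} Mbps")
```

Because the timer wraps the whole call, the result includes any software overhead between packets, which is exactly what the "Actual Speed" column captures.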

As can be seen, the actual transmission speeds are quite a bit lower than the speed setting. The reason for that is the overhead in the python-spi module that copies and converts the individual 4096-byte packets of the payload before sending them to the SPI driver. Even though each packet gets transmitted at a high clock speed, there's a significant "dead time" (2.3 ms) while the software readies the next packet. As the raw speed increases, the packet transmission time decreases and the dead time (which stays constant) consumes a larger percentage of the time to send the full payload. That's why the duty cycle decreases as the speed setting increases.
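The dead-time argument can be turned into a rough model. This is only a sketch built from the two numbers quoted above (4096-byte packets and 2.3 ms of dead time per packet); it assumes the dead time really is constant and that nothing else slows the transfer. The function name `effective_rate` is made up for illustration.

```python
PACKET_BYTES = 4096    # packet size quoted in the log
DEAD_TIME_S = 2.3e-3   # dead time between packets quoted in the log


def effective_rate(clock_hz, packet_bytes=PACKET_BYTES, dead_time_s=DEAD_TIME_S):
    """Return (effective bits/s, duty cycle) for a given SPI clock.

    Each packet costs its transmission time plus a fixed dead time.
    """
    bits = packet_bytes * 8
    tx_time = bits / clock_hz        # time the clock is actually toggling
    total = tx_time + dead_time_s    # time per packet including the gap
    return bits / total, tx_time / total


if __name__ == "__main__":
    for clock_mhz in (25, 50, 100, 200):
        rate, duty = effective_rate(clock_mhz * 1e6)
        print(f"{clock_mhz:4d} MHz: {rate / 1e6:5.1f} Mbps, duty {duty * 100:4.1f}%")
```

At a 100 MHz clock this model gives roughly 12.5 Mbps and a 12.5% duty cycle, close to the measured 12.0 Mbps and 12.1%. It also shows the ceiling: however fast the clock, the rate can't exceed 32768 bits per 2.3 ms, or about 14 Mbps.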

To decrease the overhead, I modified the python-spi code as follows:

After these two changes, setting spi.speed to 100 Mbps resulted in an actual transmission speed of 65 Mbps (about 5.4 times the 12 Mbps measured before the changes).

There's no reason to set the spi.speed to a value greater than 100 Mbps. The table indicates the RPi is generating the SPI clock by dividing a master 200 MHz clock by an integer. Any setting between 100 and 199 Mbps will result in an SPI clock of 100 MHz, and going to 200 Mbps has already proven too fast for sending an FPGA configuration bitstream. (The iCE40HX datasheet also shows the SPI clock in slave mode should not exceed 25 MHz, so getting to 100 MHz is really pushing it already.)
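Here's a small sketch of that integer-divider idea. It assumes a 200 MHz master clock (inferred from the table, not taken from any datasheet) and assumes the driver picks the smallest integer divider that keeps the SPI clock at or below the requested speed. The function name `spi_clock_hz` is hypothetical.

```python
MASTER_CLOCK_HZ = 200_000_000  # assumed master clock, inferred from the table


def spi_clock_hz(requested_bps, master_hz=MASTER_CLOCK_HZ):
    """Return the SPI clock produced by the smallest integer divider that
    does not exceed the requested speed (a simple model, not driver code)."""
    divider = max(1, -(-master_hz // requested_bps))  # ceiling division
    return master_hz / divider


if __name__ == "__main__":
    for setting in (100_000_000, 150_000_000, 199_999_999, 200_000_000):
        print(f"{setting:>11,d} bps -> {spi_clock_hz(setting) / 1e6:.1f} MHz")
```

Under these assumptions, 199,999,999 bps still needs a divider of 2 and so lands on 100 MHz, while 200,000,000 bps is the first setting that allows a divider of 1. That matches the failure showing up exactly at 200 Mbps. The model does not reproduce the 2.5 MHz and 6.25 MHz readings at the two lowest settings, so treat it as a rough picture rather than the driver's actual calculation.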

A transfer rate of 65 Mbps opens up some interesting possibilities. That means there is an 8 MByte/second channel between the CAT Board FPGA and the RPi that uses only a few pins of the GPIO connector. I have some Xilinx-centric VHDL modules and a Python library that provide a printf-like debug interface for FPGA designs through the JTAG port. I can modify these to use the SPI port so the CAT Board + RPi will have the same capabilities. I'll be working on that next. I think. Maybe.

Discussions

Dave Vandenbout wrote 09/07/2016 at 19:48

Reading the hardware spec is good, but the BCM2835 doc has statements like "CDIV must be a power of two" when what they really mean is that CDIV must be even. And there's nothing about what the core clock frequency is because that's system dependent for RPi A, B, B+, 2 and 3 systems. And then there are numerous blogs by people who have tried SPI with various settings on various RPi versions with varying levels of success and conflicting ideas about what's best. Then add the software layer for the SPI driver in whatever OS is being used plus the Python module that interfaces to that, each with its own restrictions and calculations on the SPI settings. The hardware spec wouldn't tell me about the 4096-byte limit on SPI packet size imposed by the driver, or the overhead imposed by the unneeded copying of data in the Python module.

In the end, it's sometimes simpler to get a definitive answer by actually looking at it.

Yann Guidon / YGDES wrote 09/07/2016 at 20:05

I attacked the problem directly with root-level C programming to access the SPI FIFO directly and provide true real-time responsiveness :-) You have like 8 or 16 bytes in the FIFO, a few status and control bytes, and you can compute things while you send them to the FIFO for overlapped function...

#SPI Flasher uses some of my C routines, but they are stuck on early B models :-/

Yann Guidon / YGDES wrote 09/07/2016 at 15:27

Hi Dave !

You should have looked at the RPi's hardware spec to understand what you "discovered" :-)

Maximum theoretical speed is 125 MHz, IIRC, for the RPi v1, but I found that reliable transmission was at 125/2 = 62.5 MHz (that's what I used for #Rosace). Explanation: the IO block is clocked at 250 MHz (unless you configure it differently, but then you can crash the system). It takes 2 clock cycles to generate one clock pulse (one cycle high, one cycle low). And the system was never meant to be driven so fast (no differential pairs, bad routing of the IO pad...)

If you hit the hardware directly, you can get impressive bandwidths but you reach new limits, not just HW, because the I/O block works at a different frequency than the CPU core. The datapath is only 16 bits and you get turnaround delays all over your code. It takes a while to get this "right" but I've used this for a few years now :-)

Now it also depends on the version of your Pi board. The latest chips have changed a few critical details...
