
Core FFT Performance : 4096 Radix4 CFFT

A project log for The Human Connection : 1st Impression

What do the oceans under the ice of Europa sound like? Can we communicate with whatever is down there?

ehughes 07/09/2014 at 14:12

A core part of the algorithms we will use is a complex-input FFT (a+jb). Before going too far, I wanted to evaluate the FFT performance of the LPC4370 M4 core. Now, an FPGA would rule the roost in raw processing horsepower, BUT I am trying to keep this as low cost as possible. The 4370 on the LPC-Link2 is a place to start. FPGAs are great once you have everything worked out, but HDL can be unforgiving... (and FPGAs are high cost!)

So, here are some assumptions:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

LPC4370 - Code running on the M4 core. Clock rate at 204 MHz. Execution from RAMLoc128 (0x10000000 - 0x10020000)

ARM CMSIS DSP libraries V 4.0.1.   In particular I am looking at the function arm_cfft_radix4_q15

I am using fixed point processing.

Input data is a 4096 q15_t array in RAM.   (Note all processing is done in place... source data must be in RAM)

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
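As a rough illustration of the input layout: the CMSIS complex FFT routines operate in place on interleaved data (real, imaginary, real, imaginary, ...), so N complex points occupy 2*N q15_t values. The sketch below assumes the 4096 above refers to complex points and that the ADC delivers real samples, so the imaginary parts start at zero. The helper name `pack_real_to_complex_q15` is made up for this example and is not taken from the hc-1 repository.

```c
#include <stddef.h>
#include <stdint.h>

typedef int16_t q15_t;    /* same underlying type CMSIS DSP uses for q15_t */

#define FFT_POINTS 4096u  /* complex points per block (assumed from the post) */

/* Hypothetical helper: copy n real samples into an interleaved complex
 * buffer laid out as re0, im0, re1, im1, ... with every imaginary part
 * set to zero. dst must have room for 2 * n values. */
static void pack_real_to_complex_q15(const q15_t *src, q15_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[2 * i]     = src[i]; /* real part */
        dst[2 * i + 1] = 0;      /* imaginary part */
    }
}
```

With this layout, the working buffer for one block would be `q15_t buf[2 * FFT_POINTS]`, which is what gets transformed in place.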

Now, I am targeting a 200 kHz system sample rate with a 4096 block size. (This matches the max radix-4 block size allowed by CMSIS DSP.) This means I have a window of 20.48 ms (4096 samples / 200 kHz) to get all my processing done. In the background, new ADC data will be DMA'd into a buffer and data will be DMA'd from an output buffer to a DAC.
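The processing window is just block size divided by sample rate. A minimal sketch of that arithmetic, using a hypothetical helper name (`block_period_us`) and integer microseconds to stay away from floating point:

```c
#include <stdint.h>

/* Time available per block, in microseconds: block_size / sample_rate.
 * The 64-bit intermediate avoids overflow in block_size * 1e6. */
static uint32_t block_period_us(uint32_t block_size, uint32_t sample_rate_hz)
{
    return (uint32_t)(((uint64_t)block_size * 1000000u) / sample_rate_hz);
}
```

For the numbers in this post, `block_period_us(4096, 200000)` gives 20480 us, i.e. 20.48 ms.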

So.... drum roll. arm_cfft_radix4_q15 takes 2.4 ms. So, I have a margin of roughly 8.5x (20.48 ms / 2.4 ms). Now, this will quickly get eaten up: I have to do a minimum of 2 FFTs (forward and reverse transform), plus the magical scaling algorithms. Either way, this gives me a good amount of headroom. I always have 2 other cores ready to go :-)

I also profiled arm_cfft_radix2_q15. It is a bit slower at 2.9 ms.
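To put the two measurements in perspective, here is a small sketch that converts them to core clock cycles at 204 MHz and estimates how much of the 20.48 ms window a forward plus inverse transform would use. The helper names are made up for illustration, and the load figure assumes the inverse transform costs about the same as the forward one, which I have not measured.

```c
#include <stdint.h>

#define CORE_CLOCK_HZ 204000000u  /* M4 core clock from the assumptions above */

/* Convert a measured duration in microseconds to core clock cycles. */
static uint32_t us_to_cycles(uint32_t us)
{
    return (uint32_t)(((uint64_t)us * CORE_CLOCK_HZ) / 1000000u);
}

/* Share of the block window used by `count` FFTs, in tenths of a percent
 * (234 means 23.4 %). Integer math only; the result is truncated. */
static uint32_t load_tenths_percent(uint32_t fft_us, uint32_t count,
                                    uint32_t window_us)
{
    return (uint32_t)(((uint64_t)fft_us * count * 1000u) / window_us);
}
```

Plugging in the numbers: 2.4 ms is 489,600 cycles and 2.9 ms is 591,600 cycles. Two radix-4 transforms would use about 23.4 % of the window, and two radix-2 transforms about 28.3 %.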

Code is in the hc-1 Github repository.

Last notes:

The board support library sometimes crashes in Board_SystemInit() at bootup when running from RAM. I think a delay is needed when setting up the clock dividers or the crystal. If I single-step through the code, it works... Also, using the internal oscillator and PLLing up to 204 MHz is fine.

These numbers would certainly get awful if running from SPIFI flash. (The LPC4370 has no internal flash. You have to bootload from SPIFI flash into RAM or execute directly from SPIFI...) Maybe I can do that some other day.
