Required processing power:
My first concern was how long the FFT calculation was going to take. I knew that this algorithm requires floating point calculations, and also some Sine, Cosine and Square root calls. So whatever processor I chose to use would definitely need a hardware Floating Point Unit (FPU).
My initial choice was the ESP32, since I was familiar with this MCU from other projects, and it has an FPU. The ESP32 also supports both I2S and PDM inputs (for the microphone) and the FastLED library. I was pleasantly surprised to also find an FFT library that I could use with the ESP32.
I’ve used both the Arduino IDE and ESP-IDF programming interfaces on other projects, but since this was likely going to be an open source project, I chose to use the Arduino IDE to make it a bit more “mainstream”. I found some projects using an FFT to process audio input data, so I started there.
My first test was to determine how long an FFT took to run. I started with a 1024 sample input buffer, and discovered that it took about 37 ms to run one conversion cycle. I doubled the size of the input buffer to 2048 samples, and now it clocked in at 74 ms. So I was pleased to see that processing time grew roughly linearly with input buffer size, and not exponentially. Since the FFT input buffer has to be a power of 2 in size, to get more samples you need to at least double the buffer size. I was a bit concerned that there might not be “enough” processing power, as I was already eating into my 50 ms latency goal.
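As a rough cross-check on that scaling, the textbook operation count for a radix-2 FFT grows as N x log2(N), which is only slightly steeper than linear. The sketch below scales the 37 ms measurement to a 2048 sample buffer under that assumption. The function names are mine, for illustration only; they do not come from any ESP32 or FFT library.

```python
import math

def fft_cost(n):
    """Relative operation count of a radix-2 FFT of size n (assumed model)."""
    return n * math.log2(n)

def predicted_ms(measured_ms, measured_n, new_n):
    """Scale a measured FFT time to a new buffer size using the N*log2(N) model."""
    return measured_ms * fft_cost(new_n) / fft_cost(measured_n)

# 37 ms at 1024 samples is the measurement quoted in the text.
print(round(predicted_ms(37.0, 1024, 2048), 1))  # about 81.4 ms
```

Under this model the prediction lands a little above the 74 ms that was measured, so treating the cost as roughly linear is a fair approximation for buffers of this size.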
Bins and Bands.
Next I started looking at how the results of the FFT (frequency Bins) get turned into an LED display (frequency Bands). The first thing I discovered here was that there is NOT a 1:1 ratio between Bins and Bands. It turns out that humans sense audio frequencies logarithmically. For example, in the Do, Re, Mi, Fa, Sol, La, Ti, Do scale, the second Do has twice the frequency of the first Do, and this frequency doubling is repeated for each Octave. So after 8 octaves, the frequency is 256 times what you started out with. But with an FFT, the frequency Bins that are generated have equal frequency spacing (they increase linearly, not exponentially). To map Bins into Bands you need to figure out which Bins fit into each Band. I found a great spreadsheet for doing this. I utilized a tweaked version of this spreadsheet extensively in my progression to more and more Bands.
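To make the Bin-to-Band idea concrete, here is a minimal sketch of the kind of calculation such a spreadsheet performs. It is not the spreadsheet itself. The assumptions are mine: Band edges sit half a step either side of each Band's centre frequency, the sample rate is 32 kHz, the FFT has 2048 points, and there is one Band per octave starting at 55 Hz. The function names are hypothetical.

```python
def band_edges(f_low, bands_per_octave, num_bands):
    """Return num_bands + 1 edge frequencies with a constant frequency ratio.

    Band i is centred on f_low * ratio**i; its edges sit half a step either side.
    """
    ratio = 2.0 ** (1.0 / bands_per_octave)
    return [f_low * ratio ** (i - 0.5) for i in range(num_bands + 1)]

def bins_per_band(sample_rate, fft_size, edges):
    """Count the FFT Bins whose centre frequency falls inside each Band."""
    bin_width = sample_rate / fft_size      # Bins are equally spaced in Hz
    counts = [0] * (len(edges) - 1)
    for k in range(1, fft_size // 2):       # skip Bin 0 (DC)
        f = k * bin_width
        for b in range(len(counts)):
            if edges[b] <= f < edges[b + 1]:
                counts[b] += 1
                break
    return counts

# Assumed example: 8 Bands, one per octave, starting at 55 Hz.
edges = band_edges(55.0, 1, 8)
print(bins_per_band(32000, 2048, edges))
```

Because the Bins are evenly spaced while the Bands widen by a constant ratio, the low Bands receive only a handful of Bins and each higher Band receives roughly twice as many as the one before it.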
Rules of thumb.
As I slowly increased the number of Bands in my Visual Ear, I developed some basic rules of thumb which provided the best allocation of Bins to Bands. Remember: Bins are what get generated by the FFT and Bands are what get displayed using LEDs.
- More Bins give you better Band clarity. When allocating Bins into Bands, if you don’t have enough Bins, then at the lower Band frequencies you find that there is LESS than one Bin per Band. This really doesn’t work well, because it means that several Bands will look exactly the same. At a minimum you need one Bin per Band.
- Don’t sample your audio too fast or too slow. For a given audio sample rate, the resulting Bins will span from DC to ½ the sampling frequency. So you want to choose a sample rate that is just higher than 2 times your highest Band frequency. For example, if your top Band is for 16 kHz, then you should sample just a bit faster than 32 kHz. If you choose a rate that is too high, there will be a bunch of unused Bins at the top of the frequency spectrum. If you choose a rate that is too low, then there won’t be any Bins that go high enough to be included in your upper Bands.
- Don’t start your bottom Band too low. The frequency of your first Band will define how close your first few Bands are to each other. This affects how close together your first few Bins will need to be to get good Band clarity. If your Bins need to be very narrow (frequency wise), it means you will need many more of them to reach the higher frequencies. Currently my bottom Band is set at 55 Hz. I chose this value since it is the second “A” key on the piano, which is a reasonable lower limit for human hearing. If I want 7 LEDs per octave (like Piano Keys) then the second Band will be just 5.7 Hz away from the first. Choosing 55 Hz also means that the frequency of each successive octave is easily calculated (in my head), for the purpose of testing (e.g. 110, 220, 440, 880 etc.). I may go lower in the future if I advance the capabilities of the microphone input and processing power.
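The numbers behind these rules of thumb can be checked with a short calculation. The sketch below is only an illustration. It assumes a 32 kHz sample rate (borrowed from the example above) and uses a rough criterion of my own: the Bin width should be no larger than the gap between the first two Band frequencies. All function names are hypothetical.

```python
def nyquist_hz(sample_rate):
    """Highest frequency the Bins can reach: half the sample rate."""
    return sample_rate / 2

def band_frequencies(f_low, bands_per_octave, num_bands):
    """Band frequencies spaced by a constant ratio, doubling every octave."""
    return [f_low * 2.0 ** (i / bands_per_octave) for i in range(num_bands)]

def min_fft_size(sample_rate, f_low, bands_per_octave):
    """Smallest power-of-2 FFT size whose Bin width (sample_rate / size)
    is no larger than the gap between the first two Bands (rough criterion)."""
    gap = f_low * (2.0 ** (1.0 / bands_per_octave) - 1.0)
    size = 2
    while sample_rate / size > gap:
        size *= 2
    return size

freqs = band_frequencies(55.0, 7, 8)      # 7 Bands per octave from 55 Hz
print(round(freqs[1] - freqs[0], 1))      # gap to the second Band, about 5.7 Hz
print(freqs[7])                           # one octave up: 110.0 Hz
print(nyquist_hz(32000))                  # 16000.0 Hz
print(min_fft_size(32000, 55.0, 7))       # 8192 under these assumptions
```

With these assumed values the sketch asks for an 8192 point FFT, which shows how quickly a low bottom Band combined with a high sample rate drives up the buffer size.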