09/05/2017 at 18:17 •
I did some quick tests on a Raspberry Pi v1 Model B that I had lying around. With a modern kernel built with FPU support, it should work.
Despondent over the recent findings with the Netduino Plus 2 board (that I would likely have to do major reimplementation in libvorbis to get it to work with memory constraints -- a daunting task), I awoke to remember that I have another board on-hand (coincidentally about the same age) that has never been given any love: a Raspberry Pi v1 Model B.
My initial thoughts were that perhaps the raw speed increase (700 MHz) might help, and certainly the large memory (512 MB) would. I was pleasantly surprised to find out that the CPU used has the FPU option -- news to me. It turns out that the original kernel builds were not built with the FPU support, though, out of some fear of compatibility, but later an 'armhf' build was created that did.
I found a kernel image for it here:
which is a minimal (118 MB) image that is functional on even a 1 GB SD card. For development I will use a bigger one, and I'll have to install all the build tools, etc.
For a quick initial test, though, I installed 'vorbis tools', which has a pre-built encoder and decoder from the command line. I compressed a sample file, and it took 2m 11.4s to compress a 5m 22.0s clip, which is a bit under half the duration, so I naively interpret that to mean that I will see no more than about 50% CPU load for the compression process, leaving the other 50% for everything else. I can work with that!
Out of curiosity, I ran the same test on another ARM device I have -- a Sheeva Plug -- which runs at 1.2 GHz, but does not have an FPU. It took 7m 12.0s to compress the 5m 22.0s clip (even though the processor is clocked about 70% faster), which would not be able to keep up in real time, so the FPU is definitely needed. Normalizing for clock speed, this means the FPU gives about 5x-6x performance improvement to this particular codebase.
OK, I still feel like I'm cheating a bit using the R-Pi, but I forgive myself because I still will have plenty of work to do on implementing the radio (and making sure it works), interfacing to the R-Pi over I2S (audio) and I2C (control), and also the server software. Anyway, I consider this an initial prototype. If it turns out to be cool, and other folks are interested, then maybe I'll design a bespoke CPU board.
So, now that I have a notionally suitable board, I need to research how to use the I2S and I2C drivers on the R-Pi. Then I need to make a radio board. Oh, those parts all came in. They are so tiny! Yikes! Oh, well. The future is a tiny place....
Next, in no particular order:
- research I2S master input capability on the R-Pi. This is used for receiving the 32 ksps stereo audio stream from the radio.
- research I2C capability on the R-Pi. This is used for all control functions of the radio.
- design, fabricate, and build a circuit for the radio. Hope it works, it's an expensive part.
09/04/2017 at 15:08 •
I did an initial analysis of memory usage in libvorbis, and it isn't pretty: upwards of half-a-megabyte.
I sat down to do some RAM usage analysis. At first I was going to use Visual Studio's profiler, which I've used for many, many years, but it seems they may have removed it, and replaced it with some other thing which I think is useless. Oh well, I think I have a VM with 2010 in it somewhere, I'll look for that later. This is even more disconcerting to me, because I like to do instrumented profiling so that I can see what's really going on and focus my efforts on measurable improvements.
Moving on, I saw that the codebase actually has an ad-hoc allocation profiler module, so I got that working (it won't compile on Windows as-is without several mods). The data output was quite an unreadable mess, but I'm sure it was interesting to someone at some time in the past. On the plus side, it generates logs both globally and per-module. I modded the record emission logic to output what I thought was more usable data, and emitted a run.
The results are not promising. As it is, with my test clip, it uses upwards of half-a-meg of RAM! That's just the total of requested memory -- not overhead or consideration for fragmentation, etc. This is a little bit more than the 128k+64k on the chip. So I'm going to need to do a bunch of analysis to see what, if anything, can be done.
Another little treat I found was some memory allocation that happens outside of what is redirected through this debugging module. There's a bunch of memory allocated via alloca(). If you're not familiar with alloca(), it stands for 'allocate automatic', and that means 'off the stack'. alloca() is technically non-portable but virtually every environment has an implementation. It's non-portable because it allocates memory such that it is automatically freed when the function returns, which almost always means 'on the stack'. The upside is that C has long had something vaguely akin to the C++ idiom of Resource Acquisition Is Initialization (RAII), which in this case particularly means obviating the need to explicitly free(). Another advantage is that the allocation is 'cheap', because it typically just involves adjusting the stack pointer register, rather than fiddling with heap allocation structures. A downside for embedded is that stack is often quite limited (hundreds of bytes), so all this code needs to be analyzed.
Now I need to do a bunch of work to see if this runtime memory can be reduced. This could be a while.
08/27/2017 at 19:00 •
I spent much of two days performing surgery on libvorbis to optionally operate in a single-precision-only fashion.
With dummy calls to force linkage of that code into the project, the result is about 640 kiB, which fits.
I needed to verifiably expunge all double precision math from the vorbis library to make it have any hope of running fast enough, since the STM32F405 chip does have an FPU, but single-precision only -- double would still be done in software.
This was an arduous exercise, touching nearly every module, but it was helped by a compiler option -Wdouble-promotion which would point out all the hidden places that the compiler was upconverting floats to doubles in an otherwise unseen manner. There is also another option -fsingle-precision-constant that will cause things like 0.1 to be float instead of double (against the standard, but useful since the standard does not have an explicit suffix for double). This is more of a safety net feature -- I prefer to be explicit, so I left that option out. I might put it in at the end of the project as a catchall for booboos that might creep in. Maybe.
After about a day, I had modded the code. I went back and forth between Visual Studio and gcc since those compilers have different notions about what should be warned -- both useful -- and I wanted both to issue a clean bill of health. There were many, many instances in the code base where folks were a bit sloppy about precision, and some other coding issues (like out-scoping named variables), but this codebase is more of a reference implementation than production code. Anyway, with the speed of desktop machines, and the ubiquity of a full FPU, these things don't get noticed practically. But I am not running this on a desktop machine, so it definitely gets noticed in this case.
Ultimately, I ran it, and it still compressed the audio file. The size was very slightly different, but I'm sure that is a consequence of the loss of precision. Subjectively it still sounded right. Well, that's not entirely true -- I did notice some artifacts now and then that sounded like a little tone burst here and there. I originally thought it was in my source material, but I did a before-and-after and it was due to the code changes. It was only in one of my few test files, so at least I had the opportunity to find it now. It was very slight, but it annoyed me that it was there, so I had some hunting to do.
Not really knowing the code base, or the DSP behind it, but knowing that it was something I introduced, I decided to try doing a binary search on the changes to narrow down which change(s) caused the artifact. This is easier said than done, because there are interdependencies in the changes, so you can't strictly do a binary search, but you get the idea of swapping bulk changes in-and-out and narrowing the size of those changes.
I decided to modify the code yet again, this time using a macro that would switch between the single and double precision implementation. I wish I had thought of that at the outset, rather than wildly changing constants, variable declarations, and function names, but I did at least have the sense to mark every single line I had changed with an XXX comment. Indeed, my XXX comment had the original line, and I added the changed line, so it was a matter of grepping the source to make a to-do list, and then retooling the code again. This took a long time, but I used that effort to do the search at the same time. I consider myself very lucky that the change that caused the audible artifacts was located in a focused area: there is some code that computes "Frequency to octave. We arbitrarily declare 63.5 Hz to be octave 0.0". It consists of two macros: toOC and fromOC that, if changed to single precision, will cause the artifacts to appear. *sigh*. Well, that will take some examination to understand, so I am putting that off for now.
In the end, I can now provably do a build which uses single-precision only (save the one spot I am deferring) and the result is about 640 kiB total for the flash image. This is about 5/8 of the total flash available, so I think I should have plenty of room left for other stuff I need. Now I have to analyze memory usage. If the runtime RAM usage is too high, then it's a bust irrespective of FPU performance. This chip only has 128+64 kiB RAM, split into two regions: there is a 64 kiB 'core-coupled RAM' which is special in that the peripherals can't touch it for DMA and whatnot. I don't know why this is done from a chip designer's perspective, but it's an inconvenience to the programmer, so I'll have to see how I can effectively make use of it. Stack and OS-related things could use it, I suppose.
I must analyze runtime memory usage to see if it is acceptable. I suspect out-of-the-box it will be a disaster, since desktop machines have a comparative glut of memory, but I need to see how bad it is, and if it's salvageable.
08/26/2017 at 17:02 •
Today I managed to get the basic build system working for the STM32F405 board that I am going to (try to) use for the first prototype.
I reviewed some audio compression technologies, and am either going to do MP3 or Vorbis. I found a fixed-point implementation of MP3 (a library named 'shine'), but I am going to try my hand at Vorbis first, since it's legally unencumbered. Another candidate at some point might be AAC.
My first step in any project is to bring up the tool chain. Since I am currently going to try to use an STM32F405 (in this particular case, re-purposing an old Netduino Plus 2 board), I already have a build system with the 'System Workbench' and STM32CubeMX tools installed (in turn, gcc, libnano, gdb, and openocd). I had a bit of trouble for a while until it occurred to me that my debugging pod is an 'ST-Link v2' (and specifically NOT a v2.1). The salient difference has to do with reset, and apparently the v2 does not do hardware reset, so you need to select 'software reset'. Failing to do this causes all sorts of complaints about the board not being halted, and whatnot. For the curious, the 'software reset' is a command sent over the SWD connector, whereas 'hardware reset' is -- you guessed it -- pulling the /RESET line low. You only need 'hardware reset' in the cases where the firmware changes the SWD pins to some other IO function, which is not applicable in this case because those lines are simply brought out to the debugging connector.
Having successfully made a 'blinky' and being able to debug and step, it was time to move on. Actually, there was one little detour I should mention for posterity -- along the way I found out about a thing called 'semi hosting'. SWD has an optional trace feature that can be used to send log-like information to your debugging terminal. As you might imagine, things like 'sprintf' to format such data can take a toll on a firmware image, so someone came up with 'semi hosting' which sends the parameter data, and relies upon the host to do the formatting. Nifty! The way it works is you link to a library and make a special call, and then your 'printf' gets redirected to the semi hosting host. I went through all these motions, but all I got was hard faults as soon as I tried to do a 'printf', and I didn't see the source to the 'librdimon' so I decided to punt on that. I've got interactive stepping, and that's enough for now. The topic is interesting, though, so I'll come back to it at a later time.
Further things I need to do for this board are support the SD card and the Ethernet socket. This means writing some board-specific drivers for the FatFS and lwip libraries, but this should be straightforward. More pressing was seeing if I can even get the libvorbis in the flash itself. If I can't get that working, who cares about Ethernet and SD.
The libvorbis code is a reference implementation. First, I tried compiling it in Visual Studio Community 2017. This was more-or-less straightforward (some project file tweaks, since there wasn't a configuration for 2017, and for include paths, etc.). The sample encoder was also trivially modified since it had a naive WAV reader that further assumed the sampling rate of 44.1 ksps, and I'm probably going to be using 32 ksps (the radio chip can output at that sample rate, and it's the lowest it supports). I found a song that is in FLAC format for a source, decompressed it to WAV, and converted the sample rate to 32 ksps. I compressed it with the sample encoder and it sounds fine. I'm still very concerned about processing speed and also RAM footprint. In an optimised release build, it took about 11 sec to compress a 5:22 song, but that's on my 2.8 GHz desktop computer. How will the little 168 MHz STM32F processor fare? (*gulp*) Also, this reference code uses double all over the place, and the STM32F only has a single-precision FPU. There is going to have to be plenty of surgery for that. And heaven only knows about RAM usage -- there's only 128+64 kiB on the chip.
So, next steps are to move the source code into the STM32F project, and see if I can even compile it, and if so, see how big the firmware image gets. Then a lot of surgery and profiling, I'm sure.
For my next amazing feat, I will attempt to integrate the libvorbis code into the System Workbench project I have set up, and see if there is early bad news. Fail fast!
Also, I need to figure out a clever project picture. I don't have anything interesting yet to photograph, it's all just code and a hopeful dev board. Maybe a system diagram will work.
08/23/2017 at 19:55 •
My current thinking is that I will attach an AM/FM radio chip to a microprocessor, and that to an Ethernet controller to present the Internet stream.
I did a little research on radio chips; there are many, but I would like the circuit to be simple, have all digital tuning controls, and if possible have digital audio output. I found a Silicon Labs offering, the Si4737, which I like a lot. Unfortunately, it is quite expensive (USD$ 18 in unit quantities from Mouser). However, I have resolved to just go for it for at least this first prototype. If there is interest in the project from others, and a sensitivity to price, I'll re-visit the radio choice. (There are several cheaper ones that involve more external components, do not have digital controls, and have analog outputs.)
My current thinking is to see if I can get away with using an STM32F4 microcontroller. I have had some recent experience with various STM parts, have a build system set up, and have a bunch of dev boards on-hand.
My main concern with this controller, though, is if it will have enough computational power. I am expecting that I will have to do some real-time compression to get sufficient audio quality in a low enough bandwidth. The STM32F405's do have a single precision FPU, so there is an outside chance that this might be possible with the 168 MHz CPU clock. My first tests will be to see if I can compress audio at a sufficiently high rate on that CPU.
If the STM32F4 is a bust, my next choice is a board I stumbled across recently, the C.H.I.P. This is a 1 GHz Cortex A8, with NEON SIMD, and I think FPU. If it does indeed have FPU then it really should work. At USD$ 9 each, what a bargain. Unfortunately, they are out-of-stock or I would have ordered a few already just to have some on-hand. Maybe they'll be back next month.
Anyway, I happen to have a heap of STM32F4 boards, and for this initial test/prototype, I'm going to use one board, the 'Netduino Plus 2'. To wit, this is a discontinued board, but I like it for this application because:
- I have three of them otherwise collecting dust
- they have an Ethernet adapter on board (via an ENC28J60 -- a really handy chip, but heed the errata)
- they have an SD card socket on board, which I intend to use for the TiVo-esque features
- they have a USB socket for.... for 'why not'? probably at a minimum to realize a debug monitor
- the rest of the IO is brought out via an Arduino Uno R3 style header
- there is one I2S capable SPI port brought out, and also an I2C port. I'll be needing I2S to receive the digital audio from the radio, and the I2C to control the radio chip (tuning, etc). Some sundry GPIO will be used for other stuff.
So, from a physical design, the Netduino Plus 2 will work and save me some time soldering all those adapter things. I'm pretty sure I'm going to have to make a 'shield' of some sort to house the radio board, but that's what OSHPark is for, no?
Anyway, my first two orders of business are:
- order radio parts from Mouser. I don't know what I'm going to do about the AM antenna (loop ferrite sticks?), but I will concentrate on FM for now
- write some test code for the STM32F4 to see if it has any hope of being able to do the compression and also have overhead for disk IO, networking, etc. It very well may not, and this will significantly change the direction of the project
For those that don't know, the 'Netduino' was a line of products that have an Arduino form factor, but ran Microsoft's .NET Micro Framework ("dotNetMF"). It was an interesting idea, and there was a community around it, but the founder made some business errors and the company and product line are, to-wit, no more. (There is another company that does dotNetMF products, GHI.) Anyway, dotNetMF was interesting, but it's exquisitely slow (it's interpreted). Since the board has an unpopulated 10-pin ARM JTAG header, I should be able to completely re-purpose the hardware. It wasn't the worst board in the world, but it sure was expensive -- about USD$ 60, which is kind of a hard sell relative to the also-then-popular Raspberry Pi at about USD$ 20 and way way more capable.