I spent much of two days performing surgery on libvorbis to optionally operate in a single-precision-only fashion.
With dummy calls to force linkage of that code into the project, the result is about 640 kiB, which fits.
I needed to verifiably expunge all double precision math from the vorbis library to make it have any hope of running fast enough, since the STM32F405 chip does have an FPU, but single-precision only -- double would still be done in software.
This was an arduous exercise, touching nearly every module, but it was helped by a compiler option -Wdouble-promotion which would point out all the hidden places that the compiler was upconverting floats to doubles in an otherwise unseen manner. There is also another option -fsingle-precision-constant that will cause things like 0.1 to be float instead of double (against the standard, but useful since the standard does not have an explicit suffix for double). This is more of a safety net feature -- I prefer to be explicit, so I left that option out. I might put it in at the end of the project as a catchall for booboos that might creep in. Maybe.
After about a day, I had modded the code. I went back and forth between Visual Studio and gcc since those compilers have different notions about what should be warned -- both useful -- and I wanted both to give it a clean bill of health. There were many, many instances in the code base where folks were a bit sloppy about precision, and some other coding issues (like out-scoping named variables), but this codebase is more of a reference implementation than production code. Anyway, with the speed of desktop machines, and the ubiquity of a full FPU, these things don't get noticed practically. But I am not running this on a desktop machine, so it definitely gets noticed in this case.
Ultimately, I ran it, and it still compressed an audio file. The size was very slightly different, but I'm sure that is a consequence of the loss of precision. Subjectively it still sounded right. Well, that's not entirely true -- I did notice some artifacts now and then that sounded like a little tone burst here and there. I originally thought it was in my source material, but I did a before-and-after and it was due to the code changes. It was only in one of my few test files, so at least I had the opportunity to find it now. It was very slight, but it annoyed me that it was there, so I had some hunting to do.
Not really knowing the code base, or the DSP behind it, but knowing that it was something I introduced, I decided to try doing a binary search on the changes to narrow down which change(s) caused the artifact. This is easier said than done, because there are interdependencies in the changes, so you can't strictly do a binary search, but you get the idea of swapping bulk changes in-and-out and narrowing the size of those changes.
I decided to modify the code yet again, this time using a macro that would switch between the single and double precision implementation. I wish I had thought of that at the outset, rather than wildly changing constants, variable declarations, and function names, but I did at least have the sense to mark every single line I had changed with an XXX comment. Indeed, my XXX comment had the original line, and I added the changed line, so it was a matter of grepping the source to make a to-do list, and then retooling the code again. This took a long time, but I used that effort to do the search at the same time. I consider myself very lucky that the change that caused the audible artifacts was located in a focused area: there is some code that computes "Frequency to octave. We arbitrarily declare 63.5 Hz to be octave 0.0". It consists of two macros, toOC and fromOC, that, if changed to single precision, will cause the artifacts to appear. *sigh*. Well, that will take some examination to understand, so I am putting that off for now.
In the end, I can now provably do a build which uses single-precision only (save the one spot I am deferring) and the result is about 640 kiB total for the flash image. This is about 5/8 of the total flash available, so I think I should have plenty of room left for other stuff I need. Now I have to analyze memory usage. If the runtime RAM usage is too high, then it's a bust irrespective of FPU performance. This chip only has 128+64 kiB RAM, split into two regions -- there is a 64 kiB 'core-coupled RAM' which is special in that the peripherals can't touch it for DMA and whatnot. I don't know why this is done from a chip designer's perspective, but it's an inconvenience to the programmer, so I'll have to see how I can effectively make use of it. Stack and OS-related things could use it, I suppose.
I must analyze runtime memory usage to see if it is acceptable. I suspect out-of-the-box it will be a disaster, since desktop machines have a comparative glut of memory, but I need to see how bad it is, and if it's salvageable.