We finally got to the bottom of some audio issues that had us running in circles for a while. This was a tricky one because there were actually 2 separate issues and we added a codec and changed our audio library at the same time we started noticing them, so we wasted a while thinking it was our stuff, or the new audio library (spoiler: the one that had us spinning our wheels the longest was an ESP32 issue).
Issue #1 was the clock source for the audio codec. Originally it was coming from a separate oscillator, but apparently the right way to handle it is to drive it from the same source as the I2S.
Solution: delete the external oscillator and add a jumper the existing boards to use a pin from the ESP32 as the clock signal.
Issue #2 we originally called the drunken codec because we noticed it around the same time we implemented G.711 and it seemed like that was the source of the weirdness. It make you sound drunk. Funny, but hard to debug. Andriy spent quite a while trying to see if we had some sort of endianness problem or a math mistake in the codec, but everything looked right. Shortly after that we added an audio library so that we could play MP3s, and that sounded great. Yay! We fixed it. Except, no, actually it only sounded good for audio playback. Calls still were still coming from someone who had been at a drinking establishment for quite some time. So we started looking at the network stack and audio buffers used for calling. A lot of debugging later and the audio pipeline for calling is simplified considerably, but the issue remains.
So eventually we tried recording audio in various situations and noticed something weird:
For a 16kHz sample, it looked like high frequency (around 8 kHz) noise was overlaid on the signal. Except there was no noise when the input level was low. And we also noticed that what sounded "drunk" at 8kHz sounded like sort of a buzzy robotic voice at 16kHz. Strange.
Now that we had some real observations to play with, the issue quickly unraveled itself as the following:
- The ESP32 was doing it.
- "It" being chronological swapping of every other audio sample.
- But only for mono. Stereo came through as expected.
One of the things that made it so hard to debug (apart from all the other changes we made clouding the issue near the time we noticed it), was it only happened in calls since that was the only time we used mono. And it only happened when the call was between a WiPhone and something else (not between WiPhones) since the error self corrected if the audio got sent to and processed another ESP32.
Anyway, now you will only sound normal instead of drunk or robotic during calls.