Embiggen the Smallest Mp3

A project log for Positivity Pusher | infinite affirmation generator

Hear a new encouraging phrase with every push

Steph, 05/04/2023 at 03:39

The Positivity Pusher uses Microsoft's neural text-to-speech API. Any other TTS service could be substituted, but the Microsoft offering has some compelling features:

  1. Cost - It's free for up to 500,000 characters of generated text per month. That's about a novel's worth of audio.
  2. Emotion - The API generates speech in convincing, selectable styles like 'friendly', 'sad', and even 'empathetic'.
  3. Flexibility - A wide selection of voices, languages, and output formats is available.

While the service can stream broadcast-quality audio (48 kHz, 16-bit, uncompressed), we are more interested in the low-res end of its offerings. We need the smallest files we can get, for a few reasons:

  1. We are working in an extremely memory-constrained environment. While our program is running, we have access to about 140 KB of RAM. The onboard flash memory has less than one megabyte available to store everything.
  2. Smaller files download faster, and faster downloads extend battery life.
  3. Audio quality is inherently limited by the inexpensive little speaker in the button, and also by our use of 10-bit PWM to generate the audio (the Pico has no DAC), so our file doesn't need a lot of resolution.

The smallest file we can request from the API (and play back in CircuitPython) is an MP3 with a 16 kHz sample rate and a 32 kbps bitrate. For the duration of speech that the Positivity Pusher generates, those files end up being 40-80 KB in size. That's pretty great, especially considering that the same audio as a WAV would take up around 500 KB, which wouldn't even fit in our storage. So 50 KB or so is small, but we can get it even smaller.
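
As a rough sanity check on those numbers, here is a minimal sketch of the arithmetic. It assumes constant-bitrate MP3, 16-bit mono PCM for the WAV, and clips of roughly 10 to 20 seconds; the clip lengths are inferred from the file sizes above rather than measured, and the function names are made up for illustration.

```python
# Rough size estimates for a mono speech clip (illustrative numbers only).

def mp3_size_bytes(duration_s, bitrate_bps=32000):
    """Constant-bitrate MP3: bits per second times seconds, divided by 8."""
    return duration_s * bitrate_bps // 8


def wav_size_bytes(duration_s, sample_rate_hz=16000, bits_per_sample=16, channels=1):
    """Uncompressed PCM payload, ignoring the small WAV header.

    16-bit mono is an assumption; the log does not state the WAV format.
    """
    return duration_s * sample_rate_hz * (bits_per_sample // 8) * channels


# Assumed clip lengths, chosen to bracket the 40-80 KB range quoted above.
for seconds in (10, 15, 20):
    print(seconds, "s:", mp3_size_bytes(seconds), "bytes as MP3,",
          wav_size_bytes(seconds), "bytes as WAV")
```

Under these assumptions, a 32 kbps MP3 costs about 4 KB per second of speech, while the uncompressed equivalent costs about 32 KB per second.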

Here's the Trick

Part 1

The text-to-speech API allows us to specify the rate and pitch of the speech. We can request a 100% increase in rate, making the model talk twice as fast as it usually would, and also a 100% increase in pitch. The result is effectively the same as a clip that's played at 2x speed. 
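
A request like this is typically expressed in SSML with a `prosody` element. The sketch below only builds the request body as a string; it is not the project's actual code. The voice name, the helper names, and the exact attribute values are assumptions for illustration, and the service may clamp or reinterpret large rate and pitch changes, so check the result by ear.

```python
def escape_xml(text):
    """Escape the characters that would otherwise break the SSML markup."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def build_ssml(text, voice="en-US-JennyNeural", rate="+100%", pitch="+100%"):
    """Wrap `text` in SSML asking for faster, higher-pitched speech.

    The default voice is a placeholder; the log does not say which voice is used.
    """
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        '<voice name="{voice}">'
        '<prosody rate="{rate}" pitch="{pitch}">{text}</prosody>'
        '</voice></speak>'
    ).format(voice=voice, rate=rate, pitch=pitch, text=escape_xml(text))


# Assumed to be sent in the X-Microsoft-OutputFormat request header to get
# the 16 kHz, 32 kbps MP3 described above.
OUTPUT_FORMAT = "audio-16khz-32kbitrate-mono-mp3"

print(build_ssml("You are doing great & then some!"))
```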

Part 2

CircuitPython allows us to specify the sample rate used in playback. If we play our 16 kHz file back at half the sample rate (8 kHz), our 2x speed clip will be slowed down by 50%, back to the normal speed. Additionally, the pitch will be lowered by the same amount, bringing our voice back to normal pitch as well.
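
The playback side might look something like the sketch below. The helper is written against any decoder object with a writable `sample_rate` and any output object with a `play()` method, so it can be tried off-device. The commented wiring assumes `audiomp3.MP3Decoder` and `audiopwmio.PWMAudioOut` on a Pico; the pin and filename are hypothetical, and that part is untested here.

```python
def play_slowed(audio_out, decoder, slowdown=2):
    """Start playing `decoder` with its sample rate divided by `slowdown`.

    Halving the rate doubles the duration and drops the pitch by an octave,
    undoing a clip that was generated at 2x speed and pitch.
    Returns the sample rate actually used.
    """
    original_rate = decoder.sample_rate          # e.g. 16000 from the MP3 header
    decoder.sample_rate = original_rate // slowdown
    audio_out.play(decoder)                      # returns immediately; audio plays in background
    return decoder.sample_rate


# On the device (hypothetical pin and filename):
# import board, audiomp3, audiopwmio
# decoder = audiomp3.MP3Decoder(open("phrase.mp3", "rb"))
# audio_out = audiopwmio.PWMAudioOut(board.GP0)
# play_slowed(audio_out, decoder)
# while audio_out.playing:
#     pass
```

Keeping the rate change in one small function makes it easy to experiment with other ratios, as long as the speech is generated with the matching speed-up.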

All the Fun, Half the Heft

By employing this tactic, we cut the size of the MP3 in half. The clips download faster, which saves battery, and we save a ton of storage space. The resulting drop in audio quality is barely perceptible (if at all), given the limitations of the amp and speaker.