It speaks...

I decided to document how I got this little processor to talk while it's still kinda fresh in my mind.

First off, how do you get sound out of an ATmega328? The short answer is you use the PWM peripheral to output a signal (wave) that drives a speaker. In fact, the outputs on the AVRs are stout enough to drive a tiny speaker directly. I think SparkFun and/or Adafruit sell the little speakers. The sound is a little quiet, but it's pretty good. In this latest build of the dice tower, I added a simple LM386 audio amplifier circuit to drive a slightly bigger 8 ohm speaker for better sound. I also inserted a simple RC low-pass filter between the timer compare output pin on the AVR and the amplifier. I used a .01 uF capacitor with a 200 ohm resistor, aiming to filter out frequencies roughly above 8 kHz. (Going by the usual 1/(2*pi*R*C) formula, those exact values put the cutoff closer to 80 kHz; 8 kHz with 200 ohms would take about .1 uF, so double-check the part values before copying them.)
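
For anyone picking their own filter parts, the cutoff of a first-order RC low-pass filter is 1/(2*pi*R*C). Here is a minimal sketch of that arithmetic in C; the function name `rc_cutoff_hz` is made up for illustration and is not from the project source.

```c
/* Cutoff frequency of a first-order RC low-pass filter:
 * f_c = 1 / (2 * pi * R * C), with R in ohms and C in farads. */
double rc_cutoff_hz(double r_ohms, double c_farads)
{
    const double pi = 3.14159265358979323846;
    return 1.0 / (2.0 * pi * r_ohms * c_farads);
}
```

With a 200 ohm resistor, `rc_cutoff_hz(200.0, 0.1e-6)` comes out near 7.96 kHz, while `rc_cutoff_hz(200.0, 0.01e-6)` comes out near 79.6 kHz.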

There are a lot more details involved in reproducing sound here, but I'm not going to try to explain all of that. I assume there are several sources on the inter-webs and in textbooks that explain pulse code modulation, or why I added a low-pass filter, much better than I can.

I wrote a simple AVR PWM "driver" that sets up the ATmega's Timer/Counter 2 in fast PWM mode, sets the sample rate to about 7812 Hz at the 8 MHz internal oscillator clock rate, and loads 8 bit pulse code modulated (PCM) data from the program space (flash) into the comparator each cycle. This code actually came from a previous project of mine. If you dig through the source code, you'll see it can also set up the PWM to play square wave tones at a given audible frequency. I didn't use any of that in this project.
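
The 7812 Hz figure falls out of the timer arithmetic. On an 8 bit AVR timer in fast PWM mode, the PWM frequency is the clock divided by the prescaler and by the number of counts per period (TOP + 1). The sketch below is only that arithmetic, runnable on a PC; the function names are hypothetical, and the prescaler/TOP combinations in the usage note are assumptions, since the exact timer settings the driver uses aren't spelled out here.

```c
#include <stdint.h>

/* Fast PWM frequency for an AVR timer:
 * f = f_cpu / (prescaler * (top + 1)). */
double fast_pwm_hz(uint32_t f_cpu_hz, uint16_t prescaler, uint8_t top)
{
    return (double)f_cpu_hz / ((double)prescaler * ((double)top + 1.0));
}

/* Effective sample rate if a new sample is loaded into the compare
 * register only once every `periods_per_sample` PWM periods. */
double sample_rate_hz(double pwm_hz, unsigned periods_per_sample)
{
    return pwm_hz / (double)periods_per_sample;
}
```

Two possible ways to land on 7812.5 Hz from an 8 MHz clock: `fast_pwm_hz(8000000UL, 8, 127)` gives it directly, and `sample_rate_hz(fast_pwm_hz(8000000UL, 1, 255), 4)` gets there by running the PWM at 31250 Hz and loading a new sample every fourth period.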

At this point, I had a dice tower that could play PCM sound data. So how do I make it talk?

This isn't a full text to speech implementation. It can only say the numbers 1 through 99 and pronounce "Dee", as in the letter "D" in "roll six D six for damage on that fireball." But, that's still pretty good for a little 8 bit microcontroller with just 32K of flash, 2K of RAM, and a simple built-in PWM peripheral. For perspective, that 32K of flash will store just 4 seconds of 8 bit PCM encoded mono sound sampled at ~8 kHz, and that's before any program code goes in. Try to count from 1 to 99 out loud in less than 4 seconds and make it easy to understand. It ain't gonna happen.
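
The storage math is simple because 8 bit mono PCM costs exactly one byte per sample. A small sketch of it in C, with hypothetical helper names:

```c
#include <stdint.h>

/* Seconds of unsigned 8 bit mono PCM that fit in `bytes` of storage
 * at `rate_hz` samples per second (one byte per sample). */
double pcm_seconds(uint32_t bytes, double rate_hz)
{
    return (double)bytes / rate_hz;
}

/* Bytes needed to store `seconds` of 8 bit mono PCM at `rate_hz`,
 * rounded to the nearest byte. */
uint32_t pcm_bytes(double seconds, double rate_hz)
{
    return (uint32_t)(seconds * rate_hz + 0.5);
}
```

At 7812.5 samples per second, `pcm_seconds(32768, 7812.5)` is about 4.19 seconds, and `pcm_bytes(8.0, 7812.5)` is 62500 bytes, so anything over 8 seconds of raw audio is roughly twice what the chip can hold.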

So, I figured I could just sample the unique words that are used to say the numbers one to ninety nine. I started by snagging an older version of the Microsoft text to speech demonstration application TTSApp. I liked the dated quality of the speech generated by the app and it was very easy to generate wave source files of the words I wanted. I could even control how fast the words were spoken. I generated .wav files for all the words that represent the numbers 1 to 20 and Thirty, Forty, ..., and Ninety. I then used the Audacity application on my little Linux notebook to max out the sound levels on the samples.

Next, I needed to get those .wav files into the ATmega328's flash memory. The easiest way to do that with the minimum amount of additional program memory overhead is to convert the data the sound files represent into C arrays and link that code with my application. There are likely some existing programs out there to do that, but I decided to write one myself. I figured there was a good chance I might want to do some additional processing on the data.

I wrote a simple Linux command line application in C to convert the .wav files into text files that represent the data as C arrays of type uint8_t. To save the headache of figuring out how to decode a .wav file, I piped the output from FFmpeg into my app and had the app write out the text file with all the required C stuff. That also got me some extra features for free, like converting MPEG and other sound/video files and resampling from any source sample rate to a rate I provide. Now my source file could be just about any type of media file, I could convert it to mono 8 bit PCM data at any sample rate I chose, and my little command line app would spit out some statistics like the total size of the data.
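
To give a feel for the conversion step, here is a rough sketch of the part that turns raw sample bytes into a C array. It is not the actual tool: `emit_c_array`, the array name, and the `wav2c` program name in the comment are all hypothetical, and tagging the array with avr-libc's `PROGMEM` is an assumption based on the data living in flash.

```c
#include <stdint.h>
#include <stdio.h>

/* Write `len` samples to `out` as a C array definition named `name`,
 * twelve values per line. Returns the number of samples written.
 *
 * Assumed pipeline (program name is made up):
 *   ffmpeg -i one.wav -f u8 -ac 1 -ar 7812 - | ./wav2c snd_one
 * where FFmpeg emits headerless unsigned 8 bit mono PCM on stdout. */
size_t emit_c_array(FILE *out, const char *name,
                    const uint8_t *data, size_t len)
{
    fprintf(out, "const uint8_t %s[%zu] PROGMEM = {", name, len);
    for (size_t i = 0; i < len; i++) {
        /* Start a new indented line every 12 values. */
        fprintf(out, "%s0x%02x,", (i % 12 == 0) ? "\n    " : " ", data[i]);
    }
    fprintf(out, "\n};\n");
    return len;
}
```

The return value doubles as the "total size of the data" statistic, which is the number that matters when everything has to fit in 32K.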

I used the new app to build a single .h file with the C arrays from the .wav files I mentioned above. Once it completed, it reported that the total length of the sampled sound was over 8 seconds and the memory required to store it would be more than 64K. At that point, I just about gave up on the whole speech idea. But, I decided to see what I could do with a little compression on the data.

First, I added simple run length encoding. This lossless form of compression simply substitutes a reserved value followed by a count whenever it comes across more than one instance of the same value sequentially in the source data. So, if the source .wav file had the value 0x80 ten times in a row, it would replace that with two bytes. The first byte would be the reserved token 0x00 and the second byte would be 0x0a (the value ten). In this manner, 10 bytes in the source data are stored as 2 bytes.

8 bit PCM sound data represents a wave form. Values greater than 0x80 represent a "positive" pulse and values less than 0x80 represent a negative pulse. If all is quiet in the original source data, the value 0x80 is repeated at the sample rate. Looking at the source data for the sound samples, I could see this 0x80 value repeated from time to time. Very few other values seemed to be repeated more than twice. I decided that I would reserve the token 0x00 followed by another byte holding a count (0-255) of repeated 0x80 values. I coded this up and re-ran my little command line application. I got about 20% - 30% compression on my original sources.
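
The scheme described above can be sketched in a few lines of C. This is an illustration, not the code from the tool: the names are made up, and two details are assumptions. Runs shorter than three stay as literals because the token itself costs two bytes, and any literal 0x00 sample is nudged to 0x01 so it can't be mistaken for the token.

```c
#include <stddef.h>
#include <stdint.h>

#define RLE_TOKEN   0x00u  /* reserved marker byte */
#define RLE_SILENCE 0x80u  /* PCM midpoint, i.e. "quiet" */
#define RLE_MIN_RUN 3u     /* assumption: shorter runs stay literal */

/* Encode `in` into `out`. Runs of 0x80 become {0x00, count}.
 * Returns bytes written, or 0 if `out` is too small (or `len` is 0). */
size_t rle_encode(const uint8_t *in, size_t len, uint8_t *out, size_t cap)
{
    size_t o = 0;
    size_t i = 0;
    while (i < len) {
        /* Measure the run of silence starting here (max 255 per token). */
        size_t run = 0;
        while (i + run < len && in[i + run] == RLE_SILENCE && run < 255) {
            run++;
        }
        if (run >= RLE_MIN_RUN) {
            if (o + 2 > cap) return 0;
            out[o++] = RLE_TOKEN;
            out[o++] = (uint8_t)run;
            i += run;
        } else {
            if (o + 1 > cap) return 0;
            /* assumption: keep literals from colliding with the token */
            out[o++] = (in[i] == RLE_TOKEN) ? 0x01u : in[i];
            i++;
        }
    }
    return o;
}
```

Fed ten 0x80 bytes in a row, this produces the two bytes 0x00 0x0a, matching the example above.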

This wasn't enough, but it was a pretty good start. I looked at the raw source data again and saw that the sound waves didn't instantly return to 0x80 as the source data got quieter. They tended to bounce around the value 0x80 plus or minus a small number. These low energy pulses would barely be audible and might not even move the diaphragm on my cheap little speaker. So, I changed the app again. Now I could provide a value that would be used as a +/- deadband around the value 0x80. I rebuilt the .h file from the original .wav sources using a deadband value of +/- 5. This gave even better compression. I then modified my AVR application to recognize the 0x00 token and expand it appropriately as it loaded bytes into the counter comparator for the PWM. I played one of the sampled words and it sounded great. I then tried to link all the words/numbers. Unfortunately, my data was still just a little too big.
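
Both halves of that step can be sketched in C: the deadband pass that runs on the PC before encoding, and the expansion that runs on the AVR once per sample. This is a sketch under assumptions, not the project's code. The names are hypothetical, and the decoder reads from a plain array so it can run anywhere; on the AVR the bytes would come out of flash instead.

```c
#include <stddef.h>
#include <stdint.h>

#define PCM_MID   0x80u  /* PCM midpoint, i.e. "quiet" */
#define RLE_TOKEN 0x00u  /* reserved marker byte */

/* Snap every sample within +/- band of 0x80 to exactly 0x80,
 * which lengthens the runs the run length encoder can squeeze. */
void apply_deadband(uint8_t *pcm, size_t len, uint8_t band)
{
    for (size_t i = 0; i < len; i++) {
        int delta = (int)pcm[i] - (int)PCM_MID;
        if (delta >= -(int)band && delta <= (int)band) {
            pcm[i] = PCM_MID;
        }
    }
}

/* Playback state for one compressed clip. */
typedef struct {
    const uint8_t *data;   /* compressed bytes */
    size_t len;            /* number of compressed bytes */
    size_t pos;            /* next byte to read */
    uint8_t silence_left;  /* 0x80 samples still owed from a token */
} rle_player;

/* Fetch the next sample for the PWM compare register.
 * Returns 1 and stores it in *out, or 0 at the end of the clip. */
int rle_next(rle_player *p, uint8_t *out)
{
    while (p->silence_left == 0) {
        if (p->pos >= p->len) return 0;
        uint8_t b = p->data[p->pos++];
        if (b != RLE_TOKEN) { *out = b; return 1; }   /* plain sample */
        if (p->pos >= p->len) return 0;               /* truncated token */
        p->silence_left = p->data[p->pos++];          /* may be 0: keep reading */
    }
    p->silence_left--;
    *out = PCM_MID;
    return 1;
}
```

Keeping the decoder as a tiny state machine means each call does a bounded amount of work, which suits being called once per sample period.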

So, I rebuilt the .h file one more time using a deadband value of +/- 7. This just barely fit when I linked all the numbers and the sound for "D". I had like 100-200 bytes to spare. But how would it sound? Well if you watched the video in the previous log, you know it sounded pretty darn good.

The rest was just a matter of selecting and playing the sound samples sequentially to say the numbers selected by the user and the results from the pseudo-random number generator.
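
Picking the samples for a number is mostly a divide by ten. A minimal sketch, assuming the clips are indexed 1 to 20 for "one" through "twenty" and 21 to 27 for "thirty" through "ninety"; that numbering and the function name are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical clip numbering: 1..20 are "one".."twenty",
 * 21..27 are "thirty".."ninety". */
enum { WORD_TWENTY = 20, WORD_THIRTY = 21 };

/* Fill `words` with the clips that say n (1..99), in playback order.
 * Returns how many clips were stored: 1 or 2, or 0 if n is out of range. */
size_t number_to_words(uint8_t n, uint8_t words[2])
{
    size_t count = 0;
    if (n < 1 || n > 99) return 0;
    if (n <= 20) {
        words[count++] = n;          /* "one".."twenty" have their own clip */
        return count;
    }
    uint8_t tens = (uint8_t)(n / 10);  /* 2..9 */
    uint8_t ones = (uint8_t)(n % 10);
    words[count++] = (tens == 2) ? (uint8_t)WORD_TWENTY
                                 : (uint8_t)(WORD_THIRTY + (tens - 3));
    if (ones != 0) {
        words[count++] = ones;       /* e.g. 42 -> "forty", "two" */
    }
    return count;
}
```

So 42 becomes the "forty" clip followed by the "two" clip, and 90 is just the single "ninety" clip.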

