Speech synthesis on a microcontroller. Talking projects, cheap as chips!
Not all is forgotten here...
The process of coding pronunciation has a long history. Obviously you can't just start with "HELLO" and know how the mouth sounds should be formed to speak it. Instead, you break it down into phonemes, then speak them separately. So how to notate phonemes?
Dictionary printers each have their own particular system, typically borrowing letters from the language and adding some extras where needed to disambiguate (like the schwa - ə - to sound like the "a" in "about"). Merriam-Webster pronunciation for HELLO is
hə-ˈlō or he-ˈlō
which retains most English characters (h,e,l,o) and is easy for English speakers to understand.
If you've ever used S.A.M. for C64 / A800 / etc, you may be familiar with ARPAbet. This describes phonemes with one- or two-letter codes, space separated, using uppercase ASCII letters. Numbers denote stress (accent... aka "loud"). Invented in the 1970s, it's a bit primitive but still functional, and the CMU Pronouncing Dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) uses it. For example, HELLO in ARPAbet is:
HH AH0 L OW1
These systems have a commonality, though, which is that they are tied to a language. If you want to go truly universal (well, human universal anyway) you need to use the International Phonetic Alphabet. This system covers every sound humans can or do use in a language. To do so requires lots of funny symbols. Hello becomes something like:
hɛˈloʊ̯ or həˈloʊ̯
which is why English dictionaries don't use IPA: it's hard to read and way overkill for English - would be better to have "e" instead of "ɛ" right? Well, if µTTS is to be useful everywhere, it should speak IPA. Only... that's hard, because most IPA characters are Unicode and who wants Unicode in their microcontroller?
Enter X-SAMPA (https://en.wikipedia.org/wiki/X-SAMPA). This is a system which encodes IPA to 7-bit ASCII strings. Some characters are the same, others are substitutions, and more complicated IPA symbols are "decomposed" into multiple X-SAMPA characters by adding modifier characters (typically, "\" or "_" etc). Best of all, it's a quasi-standard, so there are tools already to convert between IPA and X-SAMPA. So the IPA HELLO above, in X-SAMPA, is:
hE"loU_^ or h@"loU_^
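Parsing X-SAMPA turns out to be the first practical problem: a symbol may be one character, a character plus "\", or carry trailing "_X" diacritics. Here's a rough sketch of how a tokenizer might split a string into per-symbol tokens - the rules are simplified and the function name is mine, not part of any existing µTTS API:

```c
#include <string.h>

/* Hypothetical tokenizer sketch: splits an X-SAMPA string into
 * per-symbol tokens. Simplified rules assumed here:
 *   - '"' and '%' (stress marks) come out as their own tokens
 *   - a base character may be followed by '\' (alternate symbol)
 *   - '_' introduces a diacritic that attaches to the previous symbol
 * Real X-SAMPA has more cases; this only shows the shape of the parser. */
int xsampa_tokenize(const char *in, char tokens[][8], int max_tokens)
{
    int n = 0;
    while (*in && n < max_tokens) {
        int len = 0;
        char *t = tokens[n];
        t[len++] = *in++;                     /* base character */
        if (*in == '\\')                      /* e.g. "r\" is its own symbol */
            t[len++] = *in++;
        while (*in == '_' && in[1] && len < 6) {
            t[len++] = *in++;                 /* attach diacritics, e.g. "_^" */
            t[len++] = *in++;
        }
        t[len] = '\0';
        n++;
    }
    return n;
}
```

Feeding it h@"loU_^ would yield the tokens h, @, ", l, o, U_^ - six symbols, with the non-syllabic diacritic staying glued to its vowel.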
All this is informing design decisions for µTTS, which is to say that it should take X-SAMPA as input and produce audio as output. Since X-SAMPA fits in 7-bit ASCII, I can use the top bit for "control" sequences (pitch, speed, etc). The hardware interface would be something like 9600-8-N-1 serial. I sort of envision it all fitting together like this:
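As a sketch of that top-bit idea: bytes with bit 7 set become control commands, everything else passes through as X-SAMPA text. The specific command codes and struct below are invented placeholders, not a finished protocol:

```c
#include <stdint.h>

/* Hypothetical framing sketch: a byte with the top bit set is a control
 * command; everything else is 7-bit ASCII X-SAMPA text. The command
 * codes below are invented placeholders, each followed by one data byte. */
enum { CMD_PITCH = 0x80, CMD_SPEED = 0x81 };

typedef struct {
    uint8_t pitch;    /* 0..127, interpretation left to the synth */
    uint8_t speed;
    uint8_t pending;  /* command awaiting its data byte, or 0 */
} tts_stream;

/* Returns the character to synthesize, or -1 if the byte was
 * consumed as part of a control sequence. */
int tts_feed(tts_stream *s, uint8_t byte)
{
    if (s->pending) {                       /* data byte for a command */
        if (s->pending == CMD_PITCH) s->pitch = byte & 0x7F;
        else if (s->pending == CMD_SPEED) s->speed = byte & 0x7F;
        s->pending = 0;
        return -1;
    }
    if (byte & 0x80) {                      /* top bit set: control byte */
        s->pending = byte;
        return -1;
    }
    return byte;                            /* plain X-SAMPA text */
}
```

The nice property is that the text path and the control path never collide, so the host can mix phoneme strings and settings freely on one serial line.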
Well, that's enough for now. Time to get coding!
Starting from the beginning: Human speech is complicated.
There are a lot of noisemakers in the body. Air from the lungs, forced through the glottis, creates a harmonically rich buzz which is then modified and shaped by the larynx, tongue, lips, jaw, and others. We can also create various plosives (popping, clicking, other one-shot noises) as well as a good approximation of white noise by blowing air. Languages choose from these sounds to create letters (or groups of letters), and from the letters we make words.
Fortunately for me, the groundwork has been laid already: the International Phonetic Association has categorized the various possible human sounds (or at least, those used in known languages).
Using these charts, we arrive at a set of phones which would be needed to implement any given language. (Languages group and select phones into phonemes, which are the relevant "atoms" that make up words. Most languages don't use all the phones, and further group the others together in certain ways.)
There are quite a few of these. For recognizable speech, it may not be necessary to implement all of them (say, the difference between "m" made by closing the lips, vs "m" made by touching lips to teeth).
Vowel production is done by synthesizing formants - "the spectral peaks of the sound of the human voice". The interaction of the vocal cords and internal structures creates resonances (some tones louder than others). Humans produce four to six such peaks, but speech-synthesis research has found that just two formants are enough for listeners to distinguish one vowel from another.
Noise production - Making an "s" sound is pretty straightforward: white noise generation. "sh" is also noisy, but with the high frequencies rolled off. Several phones can be produced by noise generation.
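A sketch of that noise path: a 16-bit LFSR gives cheap white noise for "s", and running it through a one-pole low-pass rolls off the highs for something "sh"-like. The filter coefficient here is a guess to be tuned by ear:

```c
#include <stdint.h>

/* Galois-style 16-bit LFSR: cheap, deterministic "white" noise,
 * well suited to a microcontroller with no hardware RNG. */
static uint16_t lfsr = 0xACE1u;

static double noise(void)                  /* roughly white, in (-1, 1) */
{
    uint16_t bit = ((lfsr >> 0) ^ (lfsr >> 2) ^ (lfsr >> 3) ^ (lfsr >> 5)) & 1u;
    lfsr = (uint16_t)((lfsr >> 1) | (bit << 15));
    return (double)lfsr / 32768.0 - 1.0;
}

/* "s" would be the raw noise; "sh" is the same noise through a
 * one-pole low-pass, y += k*(x - y). Smaller k = duller hiss. */
void synth_sh(double *buf, int n, double k)
{
    double y = 0.0;
    for (int i = 0; i < n; i++) {
        y += k * (noise() - y);
        buf[i] = y;
    }
}
```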
Other consonant production - to be determined : )
Putting it all together, what I intend to build is more of a "phone synthesizer", capable of producing the phones necessary to build speech. It has no knowledge of letters or words and must be hand-fed the proper pronunciation strings from the host.
To make this easier, I've built a cross-platform C application which combines the synth engine with a loadable per-language dictionary. The dictionary has IPA pronunciation keys for lists of words, plus phoneme groupings for the language so it can guess at the pronunciation of unknown words. (As an example - there is some debate about the number of phonemes used in English, but a reasonable estimate is about 42.)
The C app allows users to preview and tweak the sound, dump a .wav of the sample, or retrieve the phone string - and then compile this into their own application, "baking in" the phrases.
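For the .wav dump part, a 16-bit mono PCM file is just a 44-byte RIFF header plus raw samples. This is my own sketch of such a writer, not the actual app's code:

```c
#include <stdio.h>
#include <stdint.h>

/* Little-endian field writers - WAV headers are little-endian regardless
 * of host byte order, so bytes are emitted explicitly. */
static void put_u32(FILE *f, uint32_t v) { for (int i = 0; i < 4; i++) fputc((v >> (8 * i)) & 0xFF, f); }
static void put_u16(FILE *f, uint16_t v) { fputc(v & 0xFF, f); fputc((v >> 8) & 0xFF, f); }

/* Write n 16-bit mono samples as a canonical PCM WAV file. */
int wav_dump(const char *path, const int16_t *samples, uint32_t n, uint32_t rate)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    uint32_t data_bytes = n * 2;
    fwrite("RIFF", 1, 4, f); put_u32(f, 36 + data_bytes);
    fwrite("WAVE", 1, 4, f);
    fwrite("fmt ", 1, 4, f); put_u32(f, 16);  /* fmt chunk size */
    put_u16(f, 1);                            /* PCM */
    put_u16(f, 1);                            /* mono */
    put_u32(f, rate);                         /* sample rate */
    put_u32(f, rate * 2);                     /* byte rate */
    put_u16(f, 2);                            /* block align */
    put_u16(f, 16);                           /* bits per sample */
    fwrite("data", 1, 4, f); put_u32(f, data_bytes);
    for (uint32_t i = 0; i < n; i++) put_u16(f, (uint16_t)samples[i]);
    fclose(f);
    return 0;
}
```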
I may further put the phoneme translation into a standalone C module, for use in e.g. Arduino and friends, as a way to speak arbitrary phrases.