
Stage 1: Speech to Text to Speech

A project log for Computer Head Costume

Computer Head Costume with 16x16 LED Matrix, Raspberry Pi, and Espeak

Skye Rutan-bedard, 10/02/2021 at 05:48

One of the most basic shortcomings of the original costume, as outlined in stage 0, is the user interface. While I can thematically get away with a keyboard as an input method, it leaves my interactions a little stilted. To have a "conversation" required that I sit down with the keyboard in my lap or on a desk. To respond, I would have to look down at the keyboard (my touch typing isn't that good without feedback), carefully type out something, and then look back at the person I am talking to. My solution is to use speech-to-text to recognize what I say so that espeak can repeat it.


Software

The basis of this new interface method is Mozilla's DeepSpeech (https://github.com/touchgadget/DeepSpeech), which can run on a Raspberry Pi. Apart from a momentary issue with ALSA, it was easy to get running and modify for my purposes. So far, my work in this area lives in the speechRec branch of this project's repo (https://github.com/cogFrog/computerHead/tree/speechRec). I used the mic_vad_streaming.py example as the basis for my speechToTextToSpeech.py.
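For reference, the core of that example boils down to a handful of DeepSpeech calls: load a model, open a stream, feed it 16-bit audio, and finish the stream to get text. Here is a minimal sketch of those calls, with a WAV file standing in for the VAD-segmented microphone audio (the file names are placeholders, not my actual setup):

import wave
import numpy as np
import deepspeech

# Placeholder model paths; the Raspberry Pi builds use the .tflite model.
model = deepspeech.Model("deepspeech-0.9.3-models.tflite")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# DeepSpeech expects 16 kHz, 16-bit mono PCM. mic_vad_streaming.py feeds it
# live VAD-segmented chunks; a WAV file stands in for the microphone here.
stream_context = model.createStream()
with wave.open("utterance.wav", "rb") as wav:
    while True:
        chunk = wav.readframes(320)  # 20 ms of audio at 16 kHz
        if not chunk:
            break
        stream_context.feedAudioContent(np.frombuffer(chunk, np.int16))

print("Recognized: %s" % stream_context.finishStream())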

At first, I thought this would be a pretty simple adjustment. My original plan was to use pyttsx3's runAndWait() function to have espeak say the recognized speech. I expected this to pause the collection of new audio samples, preventing the system from hearing itself and "echoing". There were two problems with this. First, the audio collection is done on a separate thread, so the blocking behavior of runAndWait() didn't prevent echoing. Second, pyttsx3 crashes when it is fed an empty string. The solution came in two parts. First, I added pause and unpause functions to the Audio class, shown below.

class Audio(object):
    ...
    def pause(self):
        # Stop pulling samples from the microphone so the system
        # can't hear (and re-speak) its own output.
        self.stream.stop_stream()

    def unpause(self):
        # Resume capturing audio once espeak has finished talking.
        self.stream.start_stream()

Second, I used the new pause/unpause functions while double-checking that the recognized text is not an empty string. This actually works!

text = stream_context.finishStream()
print("Recognized: %s" % text)

if len(text) != 0:
    # Mute the mic while espeak talks so we don't transcribe ourselves,
    # and skip empty strings, which crash pyttsx3.
    vad_audio.pause()
    engine.say(text)
    engine.runAndWait()
    vad_audio.unpause()
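For completeness, engine in the snippet above is just a standard pyttsx3 instance. The setup is presumably something like the following; the rate and volume values here are illustrative, not my final settings:

import pyttsx3

# pyttsx3 picks the espeak driver on Linux by default; naming it is optional.
engine = pyttsx3.init(driverName="espeak")
engine.setProperty("rate", 150)    # speaking rate in words per minute
engine.setProperty("volume", 1.0)  # 0.0 to 1.0

engine.say("Hello, I am a computer.")
engine.runAndWait()  # blocks until espeak finishes speaking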


Hardware

Only two hardware changes were needed. First, I upgraded the Raspberry Pi 3 B+ to a Raspberry Pi 4 with 4 GB of RAM. The 3 worked, but the 4 noticeably reduced the delay between an utterance and its recognition.


The second change was to replace the keyboard with a microphone. The challenge was finding one of decent quality that could pick up quiet speech: the costume effect is diminished if you can hear the human inside talking as well as the computer! I just went to the store, bought a couple of microphones, and found that the Samson Go Mic worked well enough. A little expensive at $50, but not horrendous. A picture of the current setup is below. Cable management is going to be non-existent until I get more of the functions working, so things are going to be pretty ugly for now.



Next Steps

Now that the speech-to-text-to-speech system is working, it is time to redo the LED matrix control. Adding new icons and animations won't be too much work. In my previous implementation, two separate scripts handled the speech and display controls, since the two functions were independent. However, speech recognition offers a good opportunity to display more complex content, so this probably means figuring out some type of threading.
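One likely shape for that threading is a queue feeding a dedicated display thread, so the recognition loop never blocks on animations. A rough sketch, where show_icon and the icon name are hypothetical stand-ins for whatever the matrix code ends up exposing:

import queue
import threading

def show_icon(name):
    # Hypothetical stand-in for the real 16x16 matrix drawing code.
    print("displaying icon:", name)

display_queue = queue.Queue()

def display_worker():
    # Runs on its own thread so LED updates never stall audio capture.
    while True:
        show_icon(display_queue.get())  # block until there is work
        display_queue.task_done()

threading.Thread(target=display_worker, daemon=True).start()

# The speech loop just drops requests on the queue and moves on:
display_queue.put("talking")
display_queue.join()  # wait for the icon to be drawn before exiting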
