• Stage 1: Speech to Text to Speech

    Skye Rutan-bedard, 10/02/2021 at 05:48

    One of the simplest shortcomings of the original costume as outlined in stage 0 is the user interface. While I can thematically get away with a keyboard as an input method, it left my interactions a little stilted. To have a "conversation" required that I sit down with the keyboard in my lap or on a desk. To respond, I would have to look down at the keyboard (my touch typing isn't that good without feedback), carefully type out something, and then look back at the person I am talking to. My solution to this is to use speech to text to recognize what I say so espeak can repeat it.


    Software

    The basis of this new interface method is Mozilla's DeepSpeech (https://github.com/touchgadget/DeepSpeech), which was designed to run on Raspberry Pis. Apart from a momentary issue with ALSA, this was easy to get running and modify for my purposes. As of now, my work in this area has been done in the speechRec branch of this project's repo (https://github.com/cogFrog/computerHead/tree/speechRec). I used the mic_vad_streaming.py example as the basis for my speechToTextToSpeech.py.

    At first, I thought it would be a pretty simple adjustment. My original plan was to use pyttsx3's runAndWait() function to have espeak say the recognized speech. I expected that this would pause the collection of new audio samples, preventing the system from hearing itself and "echoing". There were two problems with this. First, the audio collection was done on a separate thread, so the blocking call to runAndWait() didn't stop the microphone from capturing the synthesized speech. Second, pyttsx3 crashes when it is fed an empty string. The solution is in two parts. First, I added pause and unpause functions to the audio class, shown below.

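    class Audio(object):
        ...
        def pause(self):
            # Stop the input stream so no new samples are captured.
            self.stream.stop_stream()

        def unpause(self):
            # Restart the input stream to resume capturing audio.
            self.stream.start_stream()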

    Second, I used the new pause/unpause functions while double-checking that the recognized text is not an empty string. This actually works!

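    text = stream_context.finishStream()
    print("Recognized: %s" % text)

    # Skip empty results, since pyttsx3 crashes on an empty string.
    if len(text) != 0:
        # Pause capture while speaking so the system does not hear itself.
        vad_audio.pause()
        engine.say(text)
        engine.runAndWait()
        vad_audio.unpause()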
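
    One possible refinement, not part of the repo, is to wrap the pause/unpause pair in a context manager so the microphone is resumed even if the speech engine raises an error. The sketch below assumes only what the snippets above show: an audio object with pause() and unpause(), and an engine with say() and runAndWait(). The names mic_paused and speak_without_echo are hypothetical, and stripping whitespace before the empty check is my own assumption rather than something the original code does.

```python
from contextlib import contextmanager


@contextmanager
def mic_paused(audio):
    """Pause audio capture for the duration of the block, then resume it."""
    audio.pause()
    try:
        yield
    finally:
        # Resume capture even if the speech engine raises.
        audio.unpause()


def speak_without_echo(audio, engine, text):
    """Speak text with the microphone paused; return True if anything was spoken."""
    # Assumption: whitespace-only results are treated like empty ones.
    text = text.strip()
    if not text:
        # Never hand an empty string to the engine.
        return False
    with mic_paused(audio):
        engine.say(text)
        engine.runAndWait()
    return True
```

    With this in place, the block above would reduce to a single call such as speak_without_echo(vad_audio, engine, text).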

     

    Hardware

    For this, only two changes were needed. First, the Raspberry Pi 3 B+ has been upgraded to a Raspberry Pi 4 with 4 GB of RAM. The 3 worked, but the 4 noticeably reduced the delay between an utterance and its recognition.


    The second modification was to replace the keyboard with a decent microphone. The challenge here was to find a decent quality microphone that could work at low volumes. The costume effect is diminished if you can hear the human inside talking as well as the computer! I just went to the store, bought a couple of microphones, and found that the Samson Go Mic worked well enough. A little expensive at $50, but not horrendous. The picture of the current setup is below. Cable management is going to be non-existent until I get more of the functions working, so things are going to be pretty ugly for now.



    Next Steps

    Now that the speech-to-text-to-speech system is working, it is time to redo the LED matrix control. Adding new icons and animations won't be too much work. In my previous implementation, two separate scripts were used for the speech and display controls, as the two functions were independent. However, speech recognition offers a good opportunity to display more complex content, which probably means figuring out some type of threading.
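
    As a rough idea of what that threading could look like, here is a minimal sketch in which the speech side pushes recognized phrases onto a queue and a separate display thread consumes them. Nothing here comes from the repo: display_worker, start_display, and the show callback are hypothetical names, and show stands in for whatever ends up driving the LED matrix.

```python
import queue
import threading


def display_worker(messages, show):
    """Take recognized phrases off the queue and pass each one to the display."""
    while True:
        text = messages.get()
        if text is None:
            # A None sentinel tells the worker to shut down.
            break
        show(text)


def start_display(show):
    """Start the display thread; return the queue to feed it and the thread."""
    messages = queue.Queue()
    thread = threading.Thread(
        target=display_worker, args=(messages, show), daemon=True
    )
    thread.start()
    return messages, thread
```

    The recognition loop would then call messages.put(text) after each utterance and messages.put(None) on exit, so the display never blocks the speech side.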

  • Stage 0: What I already have, and what I want to change

    Skye Rutan-bedard, 09/27/2021 at 14:56

    Last year, I made a computer head helmet. Rather than building an entirely new costume this year, I am instead improving this preexisting costume. Before I get ahead of myself, I should start by documenting this preexisting design a bit.


    Physical Construction

    The frame of the costume is an old CRT display that has been gutted and cleaned. From there, I made three key modifications. First, a hole was sawed in the bottom of the CRT case, with pipe insulation around the edge for comfort. The second modification is a hard hat, which allows the costume to be worn as a helmet. I was lazy, so the hard hat was literally epoxied to the CRT case. Not clean, but it works.

    The third modification is a screen. For this, an acrylic one-way mirror was cut to size and glued into place.

    Electrical Construction
    The electronics for this project were fairly simple, as shown below:

    The Raspberry Pi had two functions. First, it controlled the Neopixel matrix to display an animated eye in green. Second, it spoke. Phrases were typed on the keyboard, and Espeak was used to say them aloud through the audio amplifier and speaker.


    Software
    The software used was similarly simple. While the code can be seen in the first commit to main in my git repo (https://github.com/cogFrog/computerHead), I will briefly explain it here. There are two separate scripts that are run at once. The first script, testNeopixel.py, drives the Neopixel matrix. It uses four .gifs as four frames of animation (eye looking far left, eye looking left, eye looking left while blinking, eye closed looking left). These four frames of animation are flipped to get a total of eight frames of animation. A state machine with random transitions from each state to the connected states brings it all to life.
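
    To illustrate the idea (this is not the code from testNeopixel.py), the sketch below mirrors four left-facing states to get eight and picks the next state at random from the connected ones. The state names and the transition table are hypothetical, since the actual connections are not listed here, and a frame is assumed to be a simple list of pixel rows.

```python
import random


def mirror(frame):
    """Flip a frame horizontally; a frame is a list of pixel rows."""
    return [list(reversed(row)) for row in frame]


# Hypothetical connections between the four left-facing states.
# "left" can also cross over to "right", linking the two halves.
LEFT = {
    "far_left": ["far_left", "left"],
    "left": ["far_left", "left", "blink_left", "right"],
    "blink_left": ["left", "closed_left"],
    "closed_left": ["blink_left"],
}


def swap_side(name):
    """Turn a left-facing state name into its right-facing twin and back."""
    if name.endswith("left"):
        return name[: -len("left")] + "right"
    if name.endswith("right"):
        return name[: -len("right")] + "left"
    return name


# Mirror the table so four states become eight.
TRANSITIONS = dict(LEFT)
for state, targets in LEFT.items():
    TRANSITIONS[swap_side(state)] = [swap_side(t) for t in targets]


def next_state(state, rng=random):
    """Pick the next state at random from the states connected to this one."""
    return rng.choice(TRANSITIONS[state])
```

    Each pass through the animation loop would call next_state() and draw the matching frame, using mirror() on the left-facing frame whenever the state is a right-facing one.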

    The other script, testSpeech.py, simply has Espeak say whatever the user types into the terminal. Unfortunately, there is no screen to allow the user to see what they are typing. The result is that I have to type carefully and hope I don't make any mistakes.
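
    A type-to-speak loop of this kind can be sketched in a few lines. This is an illustration rather than the contents of testSpeech.py: it assumes the espeak command-line tool is installed and invokes it through subprocess, and the function names are hypothetical.

```python
import subprocess


def espeak(text):
    """Speak text by running the espeak command-line tool (assumed installed)."""
    subprocess.run(["espeak", text], check=True)


def speak_lines(lines, speak=espeak):
    """Speak each non-empty line; return how many lines were spoken."""
    count = 0
    for line in lines:
        phrase = line.strip()
        if phrase:
            # Blank lines are skipped so the speech tool never gets empty input.
            speak(phrase)
            count += 1
    return count
```

    Calling speak_lines(sys.stdin) would speak each line as soon as Enter is pressed. The speak parameter is there so the loop can be tried out without audio hardware.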

    Conclusion
    That's it for now while I work on getting Mozilla's DeepSpeech working on my newly acquired Raspberry Pi 4. With any luck, updates will come soon and I can get this done before Halloween hits!