
Choosing a Speech Recognition Model

A project log for Helping H.A.N.D.S.

We are Electrical Engineering students who built and programmed a robotic hand to translate spoken language into ASL fingerspelling.

Nick Allison — 05/25/2023 at 17:54

During the development of our natural language processing (NLP) block, we researched and tested several different speech recognition libraries. We evaluated the libraries based on the following criteria: recognition accuracy, model size and runtime on embedded hardware, offline capability, quality of documentation and community support, and ease of integration.

Below, we list the advantages and disadvantages of each library we researched, based on the criteria above.

Kaldi

Kaldi is an open-source speech recognition toolkit written in C++ that works on Windows, macOS, and Linux. Kaldi's main advantage over other speech recognition software is that it is extensible and modular: the community provides many third-party modules. Kaldi also supports deep neural networks and offers excellent documentation on its website. While the code is mainly written in C++, it is "wrapped" in Bash and Python scripts. Kaldi also provides a pre-built Python engine with English-trained models.

However, only a few open-source models are available for Kaldi, and most of them require a lot of disk space. Kaldi also has a long installation and build process; part of that process requires creating configuration files for each transcription, which would drastically increase the complexity of our NLP block.

DeepSpeech

DeepSpeech is an open-source speech-to-text engine that uses a model trained with machine learning techniques and built on the TensorFlow framework.

DeepSpeech takes a stream of audio as input and converts it into a sequence of characters in the designated alphabet. Two basic steps make this conversion possible: first, the audio is converted into a sequence of probabilities over the characters in the alphabet; second, that sequence of probabilities is converted into a sequence of characters. DeepSpeech provides an "end-to-end" model, simplifying the speech recognition pipeline into a single model.
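To give a sense of how that end-to-end model is used in practice, here is a minimal sketch of transcribing a WAV file with DeepSpeech's Python bindings. The model and scorer filenames follow the published 0.9.3 release, and the audio file path is a placeholder; DeepSpeech expects 16 kHz, 16-bit mono PCM input.

```python
# Minimal DeepSpeech transcription sketch (pip install deepspeech).
import wave

import numpy as np
import deepspeech

model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")  # optional language model

with wave.open("speech.wav", "rb") as wf:  # placeholder path; 16 kHz 16-bit mono
    frames = wf.readframes(wf.getnframes())

audio = np.frombuffer(frames, dtype=np.int16)
print(model.stt(audio))  # runs the full audio -> characters pipeline
```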

Unfortunately, DeepSpeech lags behind the other competitors on this list in both documentation and features.

Whisper

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Using such a large and diverse dataset improves robustness to accents, background noise, and technical language. Moreover, it enables transcription in multiple languages and translation from those languages into English.

Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
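All of that chunking, log-Mel conversion, and decoding happens inside a single call in the open-source whisper Python package, as this minimal sketch shows (the model name and audio file path are placeholders):

```python
# Minimal Whisper transcription sketch (pip install openai-whisper).
import whisper

model = whisper.load_model("base")       # larger models trade runtime for accuracy
result = model.transcribe("speech.wav")  # splits, encodes, and decodes internally
print(result["text"])
```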

Unfortunately, Whisper models have long runtimes on embedded systems.

Athena

Athena is an end-to-end automatic speech recognition (ASR) engine written in Python and licensed under the Apache 2.0 license. It is built on top of TensorFlow and supports unsupervised pre-training as well as multi-GPU training, either on a single machine or across multiple machines. Large models are available for both English and Chinese.

However, Athena does not support offline speech recognition, which is a requirement for our project.

Vosk

Vosk is a speech recognition toolkit that supports 20+ languages and dialects. Vosk works offline, even on lightweight devices such as the Raspberry Pi, Android, and iOS. Its portable per-language models are only about 50 MB each, and it provides a streaming API for responsive recognition as well as bindings for several programming languages. Vosk allows quick reconfiguration of the vocabulary for better accuracy and supports speaker identification in addition to simple speech recognition.

Vosk also offers extensive documentation and technical support through GitHub, along with a large set of features.
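Here is a minimal sketch of the streaming API, transcribing a WAV file chunk by chunk with Vosk's Python bindings; the model directory and audio file path are placeholders, and the audio should be 16-bit mono PCM:

```python
# Minimal Vosk streaming recognition sketch (pip install vosk).
import json
import wave

from vosk import Model, KaldiRecognizer

model = Model("model")  # placeholder: path to an unpacked Vosk model directory
with wave.open("speech.wav", "rb") as wf:
    rec = KaldiRecognizer(model, wf.getframerate())
    while True:
        data = wf.readframes(4000)       # feed audio in small chunks
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):     # True at each detected utterance boundary
            print(json.loads(rec.Result())["text"])
    print(json.loads(rec.FinalResult())["text"])  # flush the final partial utterance
```

The vocabulary reconfiguration mentioned above is exposed through an optional third argument to KaldiRecognizer, a JSON-encoded list of allowed phrases, which restricts recognition to a fixed vocabulary for better accuracy.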

We decided to use the Vosk speech recognition library due to its lightweight models and high accuracy in speech recognition. 
