
Vision

AI Voice Activated Home Automation System

Rob
I wanted a voice assistant that was private, hackable, and extendable—none of the cloud-based stuff that spies on you. Plus, I wanted to really learn the guts of how AI assistants work. So, I built my own.

This is Vision, my DIY voice assistant inspired by Iron Man’s AI. Built on a ThinkPad, powered by Python, and running fully offline, this project brings together wake word detection, speech-to-text, command handling, and text-to-speech. My goal is to eventually expand it into a modular smart home interface that runs on low-cost hardware, but with built-in AI to give it a personality and memory so it's more humanoid than a boring Alexa.

YouTube: https://www.youtube.com/@RobEEStuff

GitHub: https://github.com/RobertF816/Vision

  • 1 × Blue Yeti Nano
  • 1 × ThinkPad T480s

  • Log 9: TTS? + Improved Commands

    Rob · 3 days ago · 0 comments

    Summary:

    Finally, I've added TTS using ElevenLabs (for now), which was a very easy integration. I've also improved my 3 simple commands, most notably the setTimer command.
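
    As a reference for what "simple integration" means here: a minimal sketch that calls the ElevenLabs v1 text-to-speech REST endpoint (the official SDK wraps the same call). The voice ID and environment variable are placeholders, not Vision's actual configuration.

    # tts_elevenlabs.py - minimal sketch, not the exact script used in Vision
    import os
    import requests

    VOICE_ID = "YOUR_VOICE_ID"                  # placeholder: any voice from your ElevenLabs account
    API_KEY = os.environ["ELEVENLABS_API_KEY"]  # placeholder env var

    def speak(text: str, out_path: str = "reply.mp3") -> str:
        """Send text to ElevenLabs and save the returned MP3."""
        resp = requests.post(
            f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
            headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
            json={"text": text},
            timeout=30,
        )
        resp.raise_for_status()
        with open(out_path, "wb") as f:
            f.write(resp.content)
        return out_path  # play with any audio player (aplay, mpv, ffplay, ...)

    if __name__ == "__main__":
        print(speak("Timer set for forty-nine seconds."))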

    Current Changes:

    • Added TTS using a simple ElevenLabs integration
    • Improved Command Efficiency
      • setTimer can now accept multiple different time formats for any number (see the parsing sketch after this list)
        • e.g. 49 seconds, 150 minutes
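
    A rough idea of how that more flexible duration parsing can work, using only the standard library. The regex and unit table here are my own illustration, not necessarily what Vision's parser does:

    import re

    # Seconds per unit; extend with more spellings ("hr", "mins", ...) as needed.
    UNITS = {"second": 1, "seconds": 1, "sec": 1,
             "minute": 60, "minutes": 60, "min": 60,
             "hour": 3600, "hours": 3600}

    def parse_duration(text: str) -> int | None:
        """'49 seconds' -> 49, '150 minutes' -> 9000 (result in seconds)."""
        total = 0
        for amount, unit in re.findall(r"(\d+)\s*([a-zA-Z]+)", text.lower()):
            total += int(amount) * UNITS.get(unit, 0)
        return total or None

    assert parse_duration("set a timer for 49 seconds") == 49
    assert parse_duration("150 minutes") == 9000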

    What's next?:

    • Add real-world home automation integration
    • Add an AI fallback into the mix for personality and for handling mishaps

    Future Visions:

    • Context Memory and Long-term memory
    • Very large-scale integrated commands
    • A different form of TTS so everything can run fully offline
    • I would love to upgrade right now as it's starting to slow down, but that's ok! When the time is right I will strike...

  • Log 8: Intent Parsing

    Rob · 5 days ago · 0 comments

    Summary:

    I did it! I added intent parsing that actually feels good (for now, at least). What this means is that when I fire a quick command like getting the time, setting a timer, or changing lights around the house, it recognizes diverse phrasings, finds the intent behind them, and also extracts the slots embedded in the text, such as location and duration.

    Current Changes:

    • Added intent_handler.py, which currently supports 3 simple commands with arguments (sketched after this list)
      • getTime 
      • setTimer - 1 argument - duration
      • controlLight - 2 arguments - location, on/off
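
    To make that concrete, here is a rough sketch of the shape such a handler can take. The regex patterns and the return format are illustrative assumptions; the real intent_handler.py may differ:

    import re

    def parse_intent(text: str) -> dict:
        """Map a transcript to an intent plus its slots (duration, location, on/off)."""
        t = text.lower()

        if re.search(r"\b(what time|time is it|current time)\b", t):
            return {"intent": "getTime", "slots": {}}

        m = re.search(r"\btimer (?:for )?(\d+\s*(?:seconds?|minutes?|hours?))", t)
        if m:
            return {"intent": "setTimer", "slots": {"duration": m.group(1)}}

        m = re.search(r"\bturn (on|off) the (\w+) light", t)
        if m:
            return {"intent": "controlLight",
                    "slots": {"state": m.group(1), "location": m.group(2)}}

        return {"intent": None, "slots": {}}  # unknown -> later handed to the LLM fallback

    print(parse_intent("turn off the kitchen light"))
    # {'intent': 'controlLight', 'slots': {'state': 'off', 'location': 'kitchen'}}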

    What's next?:

    • Command handler to execute commands based on intent
    • An LLM fallback for when the user just wants to chat or the intent handler fails

    Future Visions:

    • Efficiency Upgrades to make Vision faster
    • Context Memory and Long-Term Memory
    • TTS
    • Real World Automation!

    I really like how nothing has been too hard since the VAD thing, so I'm hoping it stays that way with no problems (jinx).

  • Log 7: Revision?

    Rob · 6 days ago · 0 comments

    Summary:

    After a long break from this project because of finals, I am back, but this time I decided to start over from the ground up with efficiency in mind. So far, I have enhanced the voice recording and transcription using faster models, and added a VAD system that stops recording when the user stops speaking.

    Restarting was the best option at this stage, as I've also upgraded my setup with a high-quality microphone and new tools. Also, it just felt right, so why not?

    Current Changes:

    • Restarted!
    • Added Wake Word Detection - Porcupine
    • Added VAD-based voice recording that ends when the user goes silent (sketched below)
    • Faster Transcription - Faster Whisper
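
    The logs don't name the VAD library, so here is a hedged sketch of the record-until-silence idea using webrtcvad plus faster-whisper; the frame size, aggressiveness, silence threshold, and tiny model are my guesses rather than Vision's actual settings:

    import numpy as np
    import pyaudio
    import webrtcvad
    from faster_whisper import WhisperModel

    RATE, FRAME_MS = 16000, 30                    # webrtcvad accepts 10/20/30 ms frames
    SAMPLES_PER_FRAME = RATE * FRAME_MS // 1000   # 480 samples of 16-bit mono PCM

    def record_until_silence(max_silent_frames: int = 25) -> bytes:
        """Record until roughly 0.75 s of continuous non-speech is detected."""
        vad = webrtcvad.Vad(2)                    # aggressiveness 0-3
        pa = pyaudio.PyAudio()
        stream = pa.open(rate=RATE, channels=1, format=pyaudio.paInt16,
                         input=True, frames_per_buffer=SAMPLES_PER_FRAME)
        audio, silent = [], 0
        while silent < max_silent_frames:
            frame = stream.read(SAMPLES_PER_FRAME, exception_on_overflow=False)
            audio.append(frame)
            silent = 0 if vad.is_speech(frame, RATE) else silent + 1
        stream.close(); pa.terminate()
        return b"".join(audio)

    def transcribe(pcm: bytes) -> str:
        """Faster-Whisper accepts a float32 array sampled at 16 kHz."""
        model = WhisperModel("tiny", device="cpu", compute_type="int8")
        samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
        segments, _ = model.transcribe(samples, language="en")
        return " ".join(seg.text.strip() for seg in segments)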

    Next Steps:

    • Recreate command handler but reimagined for efficiency
    • Add in LLM once command handling is set

    Future Visions:

    • Context Memory
    • TTS Model
    • Intent Recognition
    • Real-World Automation

    Would love to upgrade my system in the future, I just don't have the funds for that RN!!!! Don't worry, the summer job will come in clutch.

  • Log 6: Local LLM and Smart Command Handling!

    Rob · 04/26/2025 at 22:15 · 0 comments

    Summary:

    Vision has officially gone AI with some major enhancements. Not only does it now run any input through an LLM so every request gets an adequate response, it also recognizes when the LLM is NOT needed and uses only the command-handling script for faster function calls/actions!

    (Also it Streams word for word!)

    Current Changes:

    • Installed Ollama and Mistral
    • Built the LLM_Fallback script to intercept any unknown input and run it through the prompted LLM
      • Streams the LLM response live
    • Modified Command_Handler: uses fuzzy and substring matching as a priority (see the sketch after this list)
      • Long/complex/unknown phrases are sent to the LLM
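
    Roughly, the priority order looks like the sketch below: substring match first, then fuzzy match, then stream a reply from the local model. The tiny command table, the difflib cutoff, and the use of the ollama Python client are my assumptions for illustration; the real Command_Handler and LLM_Fallback scripts may be wired differently:

    import difflib
    from datetime import datetime
    import ollama  # assumes `ollama serve` is running and `ollama pull mistral` was done

    COMMANDS = {  # same idea as the phrase dictionary from Log 5 below
        "say hello": lambda: print("Hello!"),
        "what time is it": lambda: print(datetime.now().strftime("%H:%M")),
    }

    def handle(text: str) -> None:
        t = text.lower().strip()
        for phrase, action in COMMANDS.items():      # 1) substring match
            if phrase in t:
                return action()
        close = difflib.get_close_matches(t, COMMANDS.keys(), n=1, cutoff=0.75)
        if close:                                    # 2) fuzzy match
            return COMMANDS[close[0]]()
        stream = ollama.chat(model="mistral",        # 3) LLM fallback, streamed live
                             messages=[{"role": "user", "content": text}],
                             stream=True)
        for chunk in stream:
            print(chunk["message"]["content"], end="", flush=True)
        print()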

    Next Steps:

    • Upgrade the LLM prompt to not only respond verbally, but also run functions based on its response
    • Begin Vision's Memory System (Hybrid)
      • Save user Data
      • Remember past conversations

    Future Visions:

    • Dynamic Command Learning - Learn and Store new commands
    • TTS MODEL (I forgot let me get on that soon)
    • Intent Recognition
    • Start Physical Home Automation Integrations

    Honestly, I'm really looking to upgrade my system right now from my t480s. That will just be something that happens in the future when I stumble upon an opportunity.

  • Log 5: Simple Command Handler

    Rob · 04/26/2025 at 17:49 · 0 comments

    Summary:

    Vision can now understand simple voice commands and react to them! I have created my simple command_handler script to parse the text and recognize any key phrases that could indicate a specific function being called.

    Current Changes:

    • Created command_handler script
    • Set up a 'dictionary' of recognized phrases
      • 3 simple commands: "Say Hello", "What time is it?", "turn off" (a minimal sketch follows this list)
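
    A minimal sketch of that dictionary approach (the function bodies are placeholders, not the real actions):

    from datetime import datetime

    def say_hello(): print("Hello! Vision online.")
    def get_time(): print(datetime.now().strftime("It is %H:%M"))
    def shut_down(): print("Shutting down.")

    # Phrase 'dictionary': if a known phrase appears in the transcript, run its function.
    COMMANDS = {
        "say hello": say_hello,
        "what time is it": get_time,
        "turn off": shut_down,
    }

    def handle(transcript: str) -> bool:
        t = transcript.lower()
        for phrase, action in COMMANDS.items():
            if phrase in t:
                action()
                return True
        return False  # not recognized; later logs hand this case to the LLM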

    Next Steps:

    • Incorporate an LLM to use side by side with the command handler for more advanced dialogue
    • Integrate a more advanced recognition system for the command handler, so it can recognize more and doesn't have to send everything to the LLM.

    Future Visions:

    • A Memory for Context recognition or personality - Local Memory Management
    • Dynamic Commands: "learn" new commands
    • Bring Automation into the physical home

  • Log 4: Speech to Text and Enhanced Speed

    Rob · 04/26/2025 at 02:41 · 0 comments

    Summary

    After successfully integrating OpenAI's Python Whisper library and running it simultaneously with the wake word function, transcription became very slow. So, I switched to the C++ version of Whisper (whisper.cpp), which is CPU-optimized and transcribes in under 5 seconds.

    Current Changes

    • Installed whisper.cpp and integrated it into the current function
    • Changed up speech_to_text.py to run whisper.cpp using subprocess (see the sketch after this list)
    • Cleaned up Warnings
    • Fully tested wake word -> record -> Whisper transcribe -> print to terminal
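
    For context, calling whisper.cpp from Python can be as small as the sketch below. The binary path, model path, and flags are assumptions based on the stock whisper.cpp CLI (they change between versions; newer builds name the binary whisper-cli), not necessarily what speech_to_text.py does:

    import subprocess
    from pathlib import Path

    WHISPER_BIN = Path("whisper.cpp/main")               # assumed build output
    MODEL = Path("whisper.cpp/models/ggml-tiny.en.bin")  # assumed tiny English model

    def transcribe(wav_path: str) -> str:
        """Run whisper.cpp on a 16 kHz mono WAV and return the plain-text transcript."""
        result = subprocess.run(
            [str(WHISPER_BIN), "-m", str(MODEL), "-f", wav_path, "-nt"],  # -nt: no timestamps
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    if __name__ == "__main__":
        print(transcribe("recording.wav"))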

    Next Steps

    • Create a command handler to parse text and run commands
    • Begin Designing a Text to Speech system for Vision to respond verbally

    Future Visions

    • Replace Command Handler with Local LLM to not only respond with a personality but execute commands based on unique input
    • Optimize Whisper Recording to Record until end of sentence
    • Add Basic - Advanced Memory for the LLM to remember context and past conversations
    • Package Vision into a simple service to launch on startup

  • Log 3: Mid-Game Reset

    Rob · 04/26/2025 at 00:47 · 0 comments

    Summary:

    After changing my OS to Linux Mint, I began rebuilding Vision in Linux from the start. I felt better about restarting as there were some changes I wanted to make at this point that would be harder to make later on. (I also did not want to download anything I didn't have to)

    Current Changes:

    Fresh OS Setup

    • Full Installation of Linux Mint
    • Minimum Package Installs: python3, pip, git
      • Additional Audio Libs: portaudio19-dev, alsa-utils, pulseaudio
    • Installed Whisper for offline speech recognition
    • Installed Porcupine for offline wake word detection

    *Kept Python at version 3.12.3*

    Setup Filesystem Architecture for Future

    Jarvis/
    ├── main.py
    ├── config.py
    ├── README.md
    ├── requirements.txt
    ├── audio/
    │   ├── wakeword_detector.py
    │   ├── speech_to_text.py (coming next)
    ├── ai/
    │   ├── command_handler.py (coming next)
    │   └── memory_manager.py (Future Vision)
    ├── tts/
    │   └── text_to_speech.py (coming next)
    ├── utils/
    │   └── logger.py (Future Vision)
    ├── data/
    │   └── memory.json (Future Vision)
    ├── models/
    │   └── custom_wakeword.ppn
    

    Wake word Detection (Working!)

    • Using Porcupine by Picovoice, I was able to get a simple wake word detection script running
    • Uses a custom wake word ("Hey Jinx" right now) for activation; this is the model placed in models/
    • The function currently (sketched after this list):
      • Initializes Porcupine
      • Waits and listens until the wake word is detected
      • Prints to the console when the wake word is detected
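
    A minimal version of that loop looks roughly like this, assuming pvporcupine and pyaudio with a Picovoice access key in an environment variable; the real wakeword_detector.py may be structured differently:

    import os
    import struct
    import pyaudio
    import pvporcupine

    porcupine = pvporcupine.create(
        access_key=os.environ["PICOVOICE_ACCESS_KEY"],
        keyword_paths=["models/custom_wakeword.ppn"],   # the "Hey Jinx" model
    )
    pa = pyaudio.PyAudio()
    stream = pa.open(rate=porcupine.sample_rate, channels=1, format=pyaudio.paInt16,
                     input=True, frames_per_buffer=porcupine.frame_length)

    try:
        print("Listening for wake word...")
        while True:
            raw = stream.read(porcupine.frame_length, exception_on_overflow=False)
            pcm = struct.unpack_from("h" * porcupine.frame_length, raw)
            if porcupine.process(pcm) >= 0:             # returns keyword index on detection
                print("Wake word detected!")
    finally:
        stream.close(); pa.terminate(); porcupine.delete()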

    Next Steps:

    • Start working on Speech-to-Text using OpenAI's Whisper Tiny (very fast)
    • Start wiring command handler to interpret user commands
    • Add TTS to reply verbally

    Future Visions:

    • Incorporate Local LLM to interpret Complex Dialogue
    • Speed Optimizations (Parallel actions: Microphone Listening, STT, TTS)
    • Smarter Command Parsing (Not having to rely on LLM every time)
    • Adding a Memory so that the LLM remembers past interactions

  • Log 2: AI and LLMS

    Rob · 04/23/2025 at 18:48 · 0 comments

    I decided that I wanted to host the AI on my laptop as well, so I dove into LLMs. I learned about Ollama and about current LLMs that are lightweight but powerful. I also decided to ditch my basic command handler and turn the LLM itself into the command handler: given human input, it outputs a file that can be parsed for response text and functions to run.
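
    That "parseable output" idea is essentially structured prompting: tell the model to answer only in JSON with a reply string plus an optional function to run. A hedged sketch of the pattern; the prompt, schema, and use of the ollama client are my own illustration rather than the original script:

    import json
    import ollama  # assumes a local Ollama server with the Mistral model pulled

    SYSTEM = ('Answer ONLY with JSON of the form '
              '{"response": "<what to say>", "function": "<name or null>", "args": {}}.')

    def ask(user_text: str) -> dict:
        raw = ollama.chat(model="mistral", messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user_text},
        ])["message"]["content"]
        try:
            return json.loads(raw)       # response text + function to run
        except json.JSONDecodeError:
            return {"response": raw, "function": None, "args": {}}

    print(ask("turn off the living room lights"))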

    New Features

    • Local LLM with command Handling - Ollama with Mistral Model

    Current Problems:

    • It seems as if the time it takes for the LLM to process the input is very long, and honestly all of it is pretty slow.
    • Wake word detection and STT are still a little inaccurate

    What's Next:

    • Switch OS to a lighter one - Linux Mint
    • Switch STT with Whisper

  • Log 1:

    Rob · 04/23/2025 at 18:40 · 0 comments

    This project started with me deciding to just jump straight into the coding. I started with a basic hierarchy in Python with functions for wake word recognition, TTS, STT, and command handling.

    Current Features

    • Clean, extensible code split into separate modules for each component
    • Python-based command handler
    • Voice synthesis using ElevenLabs
    • Offline speech-to-text with Vosk (sketched below)
    • Wake word detection with Porcupine
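
    For completeness, the Vosk STT step in that first iteration looks roughly like this sketch; the model directory name and the 16 kHz mono WAV input are my assumptions:

    import json
    import wave
    from vosk import Model, KaldiRecognizer

    def transcribe(wav_path: str, model_dir: str = "vosk-model-small-en-us-0.15") -> str:
        """Transcribe a recorded WAV file offline with Vosk."""
        wf = wave.open(wav_path, "rb")
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            rec.AcceptWaveform(data)
        return json.loads(rec.FinalResult())["text"]

    print(transcribe("command.wav"))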

    What's Next:

    • Integrating the AI personality into the program
    • Differentiate between function and conversation

    Current Problems

    • Very slow, as it's on an old laptop, I believe
    • Sometimes can't hear the wake word
    • Sometimes STT is inaccurate
