
tl;dr - self-hosted AI Yoda Speaks, Listens

A project log for Hacking Seasonal Yoda

From the clearance bin to our hearts! Modding Yoda to be a companion bot

savant42 • 10/03/2025 at 19:02 • 0 Comments

My Mom said to go make a friend. So I made a friend.

LOTS of updates. I'll just get to the link first: https://github.com/thesavant42/chainloot-Yoda-Bot-Interface

That's the chat interface for Yoda's AI stack.

He can now hold conversations on his own, using speech recognition, text-to-speech, and voice cloning.


And did I say locally, self-hosted? All of the parts are OpenAI API compatible, but open source / open models, running on my desktop PC with the graphics card it came with. (OK, so the graphics card isn't terrible yet.) Let's do the rundown.
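To make "OpenAI API compatible" concrete, here's a rough sketch of one conversational turn against the local endpoints. The ports, model names, and voice name are placeholders from my setup and assumptions about each server's OpenAI-compatible routes, so treat them accordingly:

```python
# Sketch of one turn: hear -> think -> speak, all via OpenAI-compatible
# local servers. Ports and model/voice names are placeholders.
from openai import OpenAI

stt = OpenAI(base_url="http://localhost:7778/v1", api_key="not-needed")  # Whisper STT (assumed port)
llm = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")  # LM Studio's default server port
tts = OpenAI(base_url="http://localhost:7778/v1", api_key="not-needed")  # Chatterbox TTS (assumed port)

# 1. Speech in: transcribe the mic capture
with open("mic_capture.wav", "rb") as f:
    heard = stt.audio.transcriptions.create(model="whisper-1", file=f).text

# 2. Think: ask the local model for Yoda's reply
reply = llm.chat.completions.create(
    model="qwen3",  # whatever model name your server exposes
    messages=[
        {"role": "system", "content": "You are Yoda. Speak as Yoda speaks."},
        {"role": "user", "content": heard},
    ],
).choices[0].message.content

# 3. Speech out: synthesize the reply in the cloned voice
audio = tts.audio.speech.create(model="chatterbox", voice="yoda", input=reply)
with open("yoda_reply.wav", "wb") as out:
    out.write(audio.content)
```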


Hardware

- RTX 4070 (12 GB of VRAM)
- 64 GB of system RAM
- An i7 something-or-other

Software

- Docker, via WSL 2 on Windows 11, for hosting the containers
- TTS-WebUI with Chatterbox TTS for text-to-speech, plus Whisper STT services
- LM Studio for GGUF model hosting. I'll likely swap it out for llama.cpp once I settle on a base model, so I can tune for performance. I have an ollama Docker container to test as well.

Model(s)

- Using Qwen3 as the baseline: it offers a decent mix of tool-usage capability, is reasonably good at instruction following, and its Q4 quants are pretty OK at under 10 gigabytes.
- Phi-4 performs very well at chatting and role playing but is less reliable at tool usage in my testing, though I concede I haven't tuned it whatsoever.
- SmolLM3 from Hugging Face is shockingly capable for how smol it is. The Unsloth quantization clocks in at ~1.8 gigabytes with 128k context. Not a typo: 128k context, under 2 gigs. WAT. If I can get it to use tools reliably (see the tool-call sketch after this list), this is the one.
- Voice cloning uses the Chatterbox model plus some optimizations, but best of all is how it leverages compiled flash_attn for speed boosts.
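For the tool-usage testing mentioned above, the check is roughly this: hand the model an OpenAI-style tool schema and see whether it answers with a structured call instead of prose. A hypothetical sketch; the get_weather tool and model name are made up for illustration:

```python
# Hypothetical tool-calling smoke test against a local OpenAI-compatible
# server. A reliable model emits a structured tool call, not prose.
from openai import OpenAI

llm = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, just for the test
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = llm.chat.completions.create(
    model="smollm3-3b",  # whatever name the server exposes
    messages=[{"role": "user", "content": "What's the weather on Dagobah?"}],
    tools=tools,
)
# None here means the model answered in prose instead of calling the tool
print(resp.choices[0].message.tool_calls)
```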

Lesson learned on TTS latency: don't stream the text, stream the audio. Even if the model supports streaming characters in "real time," don't make the mistake of thinking that will help text-to-speech times. The TTS engine needs complete sentences to avoid audio artifacts, or even worse, waiting forever for the stream to end.
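In practice that means buffering the token stream into whole sentences before anything touches the TTS engine. A minimal sketch of the idea; stream_speech and speak are placeholder names, not the project's actual API:

```python
# Buffer streamed LLM tokens into complete sentences, then hand each
# finished sentence to TTS. The splitter is naive (it will trip on
# abbreviations like "Dr."), but it shows the shape of the fix.
import re

SENTENCE_END = re.compile(r"(?<=[.!?])\s")  # split after ., !, or ? + space

def stream_speech(token_stream, speak):
    """Accumulate tokens; call speak() once per complete sentence."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:          # every finished sentence
            if sentence.strip():
                speak(sentence.strip())      # TTS gets whole sentences only
        buffer = parts[-1]                   # keep the unfinished remainder
    if buffer.strip():                       # flush leftovers at end of stream
        speak(buffer.strip())
```

Here speak() would be whatever posts text to your TTS service; the point is that audio generation starts as soon as the first sentence closes, instead of waiting on the full response or choking on half-words.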

https://github.com/rsxdalv/TTS-WebUI (the real hero of the show)

https://huggingface.co/unsloth/SmolLM3-3B-128K-GGUF (how'd they do that?)
