2026-05-17

Robie V1.2 — The Moment the Project Starts Feeling “Alive”
Over the past few weeks, I’ve been working on a lot of very different things:
- documentation pipelines,
- automation,
- AI applied to documentation,
- workflows,
- integrations…
But in the background, another project kept slowly evolving: Robie.
And this time, something important changed. Robie is finally starting to become a real coherent system.
A system with a clear architecture, separated responsibilities, real components, a voice, and above all: a sense of existence.
The Classic Trap of Ambitious Personal Projects
When building a local voice assistant on a Raspberry Pi with:
- STT,
- fuzzy matching,
- LLMs,
- conversational orchestration,
- real-time audio,
- playback memory,
- continuous listening,
- intelligent playback…
… the temptation to connect everything immediately is enormous.
And very often, the result is a huge amount of complexity with nothing actually functional.
This time, I tried a different approach:
small independent building blocks
+
clean architecture
+
fast visible wins
And honestly? It completely changes the energy of the project.
The Architecture Is Starting to Stabilize
V1.2 now relies on a fairly clear separation of responsibilities.
The general idea:
Microphone
↓
Vosk (STT)
↓
Intent Router
├─ simple intents → direct execution
└─ complex intents → Qwen
↓
Structured Intent
↓
Voice confirmation
↓
mpv
This pipeline relies on:
- Vosk for local transcription,
- RapidFuzz to recover books despite voice recognition errors,
- Qwen for semantic understanding,
- LangGraph for conversational state management,
- mpv for actual audiobook playback.
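As a rough illustration, the routing step in the diagram could look something like the sketch below. The command table, the intent names other than play_audiobook, and the `ask_llm` callback are hypothetical and only there to make the idea concrete. This is not Robie's actual code.

```python
from typing import Callable

# Hypothetical table of simple commands that never need the LLM.
SIMPLE_COMMANDS = {
    "pause": "pause_playback",
    "stop": "stop_playback",
    "resume": "resume_playback",
}


def route(transcript: str, ask_llm: Callable[[str], str]) -> str:
    """Return an intent name, calling the LLM only for complex requests."""
    key = transcript.strip().lower()
    if key in SIMPLE_COMMANDS:
        # Simple intent: direct execution path, no model call.
        return SIMPLE_COMMANDS[key]
    # Complex intent: delegate understanding to the LLM (Qwen in the post).
    return ask_llm(transcript)
```

Passing the model call in as a plain function keeps the router testable on a laptop without any model loaded.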
The major architectural principle that got confirmed:
The LLM must NOT be the primary search engine.
So:
transcription
→ local matching
→ shortlist
→ LLM understanding
and definitely not:
full library
→ LLM
It sounds almost obvious when written like that. But a huge number of AI tutorials skip this separation: they hand everything to the model, waste tokens on every request, and simply don't carry over to lightweight hardware like a Pi.
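To make the "shortlist first, LLM second" idea concrete, here is a minimal sketch. It uses the standard library's difflib as a stand-in for RapidFuzz so it runs without extra dependencies. The library contents, the prompt wording, and the function names are assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical library: book_id -> normalized spoken title.
LIBRARY = {
    "rdf_04": "wings of fire book four",
    "rdf_05": "wings of fire book five",
    "hp_01": "harry potter book one",
    "lp_01": "the little prince",
}


def shortlist(transcript: str, library: dict[str, str], limit: int = 2) -> list[str]:
    """Rank books by string similarity and keep only the best few."""
    scored = [
        (SequenceMatcher(None, transcript, title).ratio(), book_id)
        for book_id, title in library.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [book_id for _, book_id in scored[:limit]]


def build_prompt(transcript: str, candidates: list[str], library: dict[str, str]) -> str:
    """Build an LLM prompt that contains the shortlist, never the full library."""
    lines = [f"- {book_id}: {library[book_id]}" for book_id in candidates]
    return (
        f'The child said: "{transcript}"\n'
        "Pick the matching book among these candidates only:\n"
        + "\n".join(lines)
    )
```

The prompt size now depends on the shortlist length rather than on the size of the library, which is the whole point on small hardware.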
First Real Functional Pipeline
The first truly validated pipeline is tiny… but extremely satisfying:
Structured Intent
↓
library lookup
↓
natural sentence generation
↓
TTS
↓
real audio
Robie now genuinely speaks.
For example:
Intent(
    intent="play_audiobook",
    book_id="rdf_04",
    start_mode=StartMode.BEGINNING,
)
↓
Do you want to listen to Wings of Fire, book 4, The Hidden Kingdom, from the beginning?
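The lookup and sentence generation steps can be sketched in a few lines. The `Book` fields, the `PHRASES` table, and the wording for the resume case are assumptions for illustration. Plain dataclasses are used here only to keep the example dependency-free.

```python
from dataclasses import dataclass
from enum import Enum


class StartMode(Enum):
    BEGINNING = "beginning"
    RESUME = "resume"


@dataclass(frozen=True)
class Book:
    series: str
    number: int
    title: str


@dataclass(frozen=True)
class Intent:
    intent: str
    book_id: str
    start_mode: StartMode


# Hypothetical one-entry library keyed by book_id.
LIBRARY = {"rdf_04": Book("Wings of Fire", 4, "The Hidden Kingdom")}

# Spoken fragment for each start mode (the resume wording is an assumption).
PHRASES = {
    StartMode.BEGINNING: "from the beginning",
    StartMode.RESUME: "from where you left off",
}


def confirmation_sentence(intent: Intent, library: dict[str, Book]) -> str:
    """Turn a structured intent into the sentence that is sent to TTS."""
    book = library[intent.book_id]
    return (
        f"Do you want to listen to {book.series}, book {book.number}, "
        f"{book.title}, {PHRASES[intent.start_mode]}?"
    )
```

The returned string is what would then be passed to the TTS engine.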
And what’s funny is that this extremely simple behavior already completely changes the perception of the project.
Pydantic Everywhere: Mental Peace
Another important decision was to define internal structures very early, with Pydantic validation:
- Book,
- Library,
- Intent,
- ConfirmationResponse,
- StartMode.
Sometimes this can feel “over-engineered” for a personal project.
But in practice, it very quickly provides:
- a real source of truth,
- stable contracts,
- a massive reduction in chaos,
- and above all, the ability to evolve the system without breaking everything.
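Here is a minimal sketch of what such a contract might look like, assuming Pydantic is installed. The field names follow the Intent example earlier in the post. The `parse_intent` helper and the `RESUME` value are assumptions added for the example.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, ValidationError


class StartMode(str, Enum):
    BEGINNING = "beginning"
    RESUME = "resume"


class Intent(BaseModel):
    intent: str
    book_id: str
    start_mode: StartMode


def parse_intent(payload: dict) -> Optional[Intent]:
    """Validate raw data (for example LLM output) against the contract."""
    try:
        return Intent(**payload)
    except ValidationError:
        # Malformed payloads are rejected here instead of leaking downstream.
        return None
```

Anything that passes `parse_intent` can be trusted by the rest of the pipeline, which is where the "reduction in chaos" comes from.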
The Detail I Really Like: Non-Binary Confirmations
One thing I particularly like in the current architecture:
confirmations are not designed as simple “yes/no” interactions.
Example:
Robie:
“Do you want to listen to The Hidden Kingdom from the beginning?”
Child:
“No, resume where it stopped.”
Here:
- the book was correct,
- only the playback mode was wrong.
So the system keeps a modifiable pending_intent.
It’s a small architectural detail… but it brings the behavior much closer to a genuinely natural interaction.
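A small sketch of the idea follows. The keyword checks are a deliberately naive, hypothetical stand-in for whatever actually interprets the child's reply. The point is only that the pending intent is amended rather than thrown away.

```python
from dataclasses import dataclass, replace
from enum import Enum


class StartMode(Enum):
    BEGINNING = "beginning"
    RESUME = "resume"


@dataclass(frozen=True)
class Intent:
    intent: str
    book_id: str
    start_mode: StartMode


def update_pending(pending: Intent, reply: str) -> Intent:
    """Amend only the part of the pending intent that the reply corrects."""
    text = reply.lower()
    if "resume" in text:
        return replace(pending, start_mode=StartMode.RESUME)
    if "beginning" in text:
        return replace(pending, start_mode=StartMode.BEGINNING)
    # Nothing to correct: keep the pending intent unchanged.
    return pending
```

With the reply from the example above, the book stays the same and only the start mode switches to resume.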
RapidFuzz Comes Next
The next major step will be local fuzzy matching.
Goal:
"play wings of fire book saint"
↓
rdf_05
despite imperfect Vosk transcription.
And once again, the approach is intentionally pragmatic, with light normalization only:
- lowercase,
- accent removal,
- punctuation removal,
but NOT number conversion:
four → 4
so that number words stay spelled out and realistic phonetic STT errors are preserved for the matcher.
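The normalization rules listed above fit in a few lines of standard-library Python. This is a sketch under the stated rules, not the final implementation, and the function name `normalize` is an assumption.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and drop punctuation. Number words are left alone."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces, then collapse repeated whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())
```

Both the transcript and the library titles would go through the same function before matching, so they are compared in the same form.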
What This Project Is Really Teaching Me
I think the most interesting thing here isn’t even technical.
The real recent lesson is probably this:
small visible wins completely change the mental dynamics of a long project
For a long time, Robie was mostly:
- a vision,
- a theoretical architecture,
- notes,
- ideas.
Today:
- it has a real library,
- a stable structure,
- a first voice,
- a coherent pipeline,
- conversational states starting to emerge,
- and an identifiable technical personality.
The project is genuinely starting to “exist”.