2026-04-03

Rebuilding a Clean Virtual Environment
And… crash.
Everything broke, especially the Adafruit Voice Bonnet handling.
I had to start over to make audio input and output work again.
Step 1 — The Real Wall: Low-Level Audio
Before even talking about AI, the first challenge was… the microphone.
Problems encountered:
- RPi.GPIO errors → conflict between the Python environment and system libraries
- sounddevice unable to open the audio stream
- PulseAudio / PipeWire locking the device
- ALSA detects the card… but rejects every format
Typical symptoms:
- PortAudioError: Invalid number of channels
- device or resource busy
- Unable to install hw params
Important lessons:
- On Raspberry Pi, avoid high-level audio layers
- Go directly through ALSA (arecord)
- Disable PipeWire/PulseAudio if needed
- Check codec configuration with alsamixer
Once this step is solved, everything becomes much easier.
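As a minimal sketch of the "go directly through ALSA" approach, the snippet below shells out to arecord from Python instead of opening the stream through PortAudio. The device name plughw:1,0, the 16 kHz mono S16_LE format, and the helper names are assumptions for illustration; the real card index on a given Pi can be checked with arecord -l.

```python
import subprocess

def build_arecord_cmd(path, seconds=3, device="plughw:1,0", rate=16000):
    """Build an arecord command line for a short mono 16-bit recording.

    The device string is an assumption; list the real cards with `arecord -l`.
    """
    return [
        "arecord",
        "-D", device,        # ALSA device to capture from
        "-f", "S16_LE",      # signed 16-bit little-endian samples
        "-r", str(rate),     # sample rate in Hz
        "-c", "1",           # mono
        "-d", str(seconds),  # stop after this many seconds
        "-t", "wav",         # write a WAV container
        path,
    ]

def record(path, seconds=3):
    """Record to `path`; raises CalledProcessError if arecord exits with an error."""
    subprocess.run(build_arecord_cmd(path, seconds), check=True)
```

Calling record("command.wav") then either produces a file or fails loudly with the ALSA error, which is easier to debug than a stream that silently refuses to open.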
Step 2 — Working Audio Pipeline
After stabilization, we finally get:
microphone → ALSA → recording → processing → playback
And on the UX side:
- LED off → standby
- red LED → listening
- yellow LED → processing
- sound → response
At this stage, the robot already feels “alive”.
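The LED feedback above is effectively a small state machine. A minimal sketch, assuming a hypothetical set_led callback that accepts an (R, G, B) tuple; the actual LED driver used on the Voice Bonnet is not shown in this post, and the RGB values are illustrative.

```python
from enum import Enum

class State(Enum):
    STANDBY = "standby"
    LISTENING = "listening"
    PROCESSING = "processing"

# Colors follow the list above: off, red, yellow.
LED_COLORS = {
    State.STANDBY: (0, 0, 0),
    State.LISTENING: (255, 0, 0),
    State.PROCESSING: (255, 255, 0),
}

def enter_state(state, set_led):
    """Switch the LED to match the new state and return that state."""
    set_led(LED_COLORS[state])
    return state
```

The response itself is audio rather than light, so it does not need an entry in the table.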
Step 3 — Whisper Attempt (and Failure)
The next logical step was transcription with faster-whisper.
Result:
- huge latency (several seconds, sometimes tens of seconds)
- poor quality with the tiny model
- impossible to improve quality without exploding compute time
Why it fails:
- Raspberry Pi 4 is too limited for large transformer-based STT models
- Whisper is optimized for GPUs or powerful CPUs
- impossible to maintain a good quality/speed tradeoff
Conclusion: Whisper is excellent… but not for this use case on Pi.
Step 4 — Pivot to Vosk
Strategy shift: test Vosk.
Immediate result:
- much better latency
- almost correct transcription
- stable pipeline
Big improvement.
But…
New problem:
- ~10 seconds to process 4 seconds of audio
- still too slow for natural interaction
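One way to put a number on "too slow" is the real-time factor (RTF): processing time divided by audio duration. Anything above 1.0 means the recognizer falls behind the speaker. A small sketch using the figures above; the function names are mine, and timing a hypothetical transcribe callable is just one way to measure it.

```python
import time

def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds

def measure_rtf(transcribe, audio, audio_seconds):
    """Time one call to `transcribe(audio)` and return (result, RTF)."""
    start = time.perf_counter()
    result = transcribe(audio)
    elapsed = time.perf_counter() - start
    return result, real_time_factor(elapsed, audio_seconds)

print(real_time_factor(10, 4))  # 2.5: ten seconds of work for four seconds of speech
```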
Key Insight: Wrong Problem
The issue was not the engine.
The issue was the task.
We were asking:
“Freely transcribe everything I say”
When the real need was:
“Recognize a few simple commands”
Step 5 — Paradigm Shift
Instead of voice dictation, move to voice command recognition.
Example:
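```python
# `text` is the transcription; `play` is whatever plays an audio file
if "hello" in text:
    play("hello.mp3")
```
Or even better: restrict the vocabulary directly in Vosk:
```python
from vosk import KaldiRecognizer

# `model` is an already loaded vosk.Model; 16000 is the sample rate in Hz
rec = KaldiRecognizer(
    model,
    16000,
    '["hello", "story", "music", "stop"]'
)
```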
Result:
- faster
- more reliable
- much more robust
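To keep the word list in one place, the grammar string can be generated rather than hand-written. The sketch below only covers the JSON side, so it runs without Vosk installed: build_grammar produces the string passed as the third argument to KaldiRecognizer, and extract_command reads the JSON string that the recognizer's Result() method returns. Adding "[unk]" as a catch-all for out-of-vocabulary speech follows the Vosk examples; the helper names are hypothetical.

```python
import json

COMMANDS = ["hello", "story", "music", "stop"]

def build_grammar(words, allow_unknown=True):
    """Return the JSON list Vosk accepts as a restricted grammar."""
    phrases = list(words)
    if allow_unknown:
        phrases.append("[unk]")  # lets Vosk report out-of-vocabulary audio
    return json.dumps(phrases)

def extract_command(result_json, words=COMMANDS):
    """Return the first known command in a Vosk result, or None."""
    text = json.loads(result_json).get("text", "")
    for token in text.split():
        if token in words:
            return token
    return None
```

With the catch-all in place, background noise tends to come back as "[unk]" instead of being forced onto the nearest command, and extract_command simply returns None for it.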
Final Architecture (V1)
Wake word
↓
Red LED (listening)
↓
Short recording (2–3s)
↓
Vosk (limited vocabulary)
↓
Simple intent
↓
Audio response
↓
Back to standby
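The flow above fits in one function if each hardware-facing step is passed in as a callable. This is a sketch under assumptions, not the project's actual code: wait_for_wake_word, record, recognize, set_led and play are hypothetical hooks, and the file names in INTENTS are placeholders.

```python
# Maps a recognized word to the audio file to play. File names are placeholders.
INTENTS = {
    "hello": "hello.mp3",
    "story": "story.mp3",
    "music": "music.mp3",
}

def run_once(wait_for_wake_word, record, recognize, set_led, play, seconds=3):
    """Run one wake-to-standby cycle and return the matched intent (or None)."""
    wait_for_wake_word()              # blocks until the wake word is heard
    set_led("red")                    # listening
    audio = record(seconds)           # short, fixed-length recording
    set_led("yellow")                 # processing
    text = recognize(audio)           # e.g. Vosk with a limited vocabulary
    intent = next((w for w in text.split() if w in INTENTS), None)
    if intent is not None:
        play(INTENTS[intent])         # audio response
    set_led("off")                    # back to standby
    return intent
```

Passing the steps in keeps the loop testable on a laptop with fake callables, which is handy when the audio stack on the Pi is the fragile part.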
What Really Made the Difference
What does not work well
- Whisper on Raspberry Pi
- abstracted audio layers (sounddevice, PulseAudio)
- free transcription on a weak CPU
What works
- direct ALSA (arecord)
- simple and deterministic pipeline
- Vosk with restricted vocabulary
- intent logic rather than full NLP
Result
We move from:
a slow and frustrating prototype
to:
a fast, responsive voice assistant usable by children
What Comes Next?
Once this solid base is ready:
- add end-of-speech detection (VAD)
- improve responses (TTS or sounds)
- add simple memory
- possibly connect an LLM (later)
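For the first item, a very simple form of end-of-speech detection is an energy threshold: stop recording once the signal has stayed quiet for a set number of frames. This is a simplification rather than a full VAD, and the threshold and frame counts below are illustrative values that would need tuning on the real microphone.

```python
import math
import struct

def frame_rms(frame):
    """RMS amplitude of one frame of 16-bit little-endian mono PCM bytes."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, frame[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)

def end_of_speech_index(frames, threshold=500.0, max_silent_frames=10):
    """Return the index of the frame where speech is considered finished.

    Speech ends after `max_silent_frames` consecutive quiet frames that
    follow at least one loud frame. Returns None if that never happens.
    """
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= max_silent_frames:
                return i
    return None
```

Requiring at least one loud frame first avoids cutting the recording during the pause between the wake word and the command.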
Conclusion
The key issue in this project was not an AI problem.
It was an architecture choice problem.
On limited hardware:
- you must simplify the problem
- not just optimize the solution