Project Kickoff

2026-03-28

Goal

Build a system able to:

listen continuously
detect a wake word
record a voice command
understand the intent
trigger an action
respond with sound or voice

All of it locally, with no cloud dependency.

Overall Architecture

The system is split into several blocks:

Microphone
→ Wake word
→ Recording
→ Speech-to-Text (Whisper)
→ Interpretation (LLM)
→ Action
→ Response (sounds / TTS / LEDs)

Each block will be explored and validated separately.

1. Audio and Hardware

The project relies on:

Raspberry Pi (Debian Bookworm)
Adafruit Voice Bonnet (microphones + LEDs + speakers)
audio output tested with pink noise

Important points:

correctly identify the right audio device
properly handle simultaneous input/output
implement clean LED management (cleanup)

2. Wake Word: the First Challenge

❌ Picovoice (Porcupine)

Initially considered, but dropped:

now requires a pro account
external dependency
less suitable for a long-term personal project

✅ openWakeWord

Chosen solution:

open source
runs locally
based on TFLite models

Issues encountered:

missing models → download_models()
NumPy / SciPy conflicts → downgrade and version alignment
false positives → filtering required

Solutions implemented:

high threshold (~0.95)
several consecutive frames
refractory period (10s)
stop audio stream during actions

👉 Key takeaway: a wake word is not reliable “raw” — it needs control logic

3. Classic Problem: the Robot Triggers Itself

Robie was detecting its own audio output (feedback loop).

Solution:

pause listening during:
- recording
- sound response
add a grace delay

4. STT ≠ Understanding

Testing faster-whisper

5. Introducing a Local LLM

Test with Ollama + Qwen2.5 (1.5B)

Result:

~2 second latency
stable behavior
viable for embedded usage

👉 Conclusion: a small local LLM on Raspberry Pi is usable

The Key Metric: Latency

What matters is not total speed, but the time before the response starts.

< 3 seconds : good
3–6 : acceptable
> 6 : frustrating

UX trick:

yellow LED = “thinking”
intermediate sound cue

👉 turns lag into natural behavior

6. Role of the LLM in Robie

The LLM should not be used for open-ended chatting.

It will be used to:

transform a sentence into an intent
structure the command

Example:

Input:

“Robie, note that Thomas needs to bring his coat tomorrow”

Output:

{
  "intent": "take_note",
  "content": "Thomas needs to bring his coat tomorrow",
  "answer": "Noted."
}

The Python code then executes the action.

7. French Language Support

Important constraint: French-speaking children.

Solutions:

multilingual Whisper (not .en)
language forced to fr
multilingual LLM (Qwen works well)
prompts in French
intents in French

⚠️ Note: children’s voices are harder to recognize → tolerance will be needed.

8. What About My Coral TPU?

Not usable for LLMs.

Why:

Coral = quantized TFLite models
LLM = incompatible architecture

Relevant future uses:

vision (camera)
object detection
environmental perception

Conclusion

This first session shows that building a local embedded assistant is probably achievable. Next step: finish testing each block, then start designing the overall architecture for version 2.

V1

Component testing and performance stability

V2

Pipeline construction and first real-world tests