Architecture of v2

2026-04-04

While trying to think through the behavior I really want, I realized my naïve approach was incomplete. Robie will actually need to listen while it is reading. Because we’ll want to interrupt it:

to change the story
to switch to another action
to adjust the volume, since right now there is nothing in place to control Robie’s volume

To kick off the thinking process, nothing beats a small diagram.

Flow

flowchart TB Start([Start]) --> Idle[Idle: waiting for wake word] Idle --> Wake[/Wake word said/] Wake --> Listen[/Record or listen live/] Listen --> DetectIntent[Process intent] DetectIntent --> ConfirmIntent[/Confirm intent/] ConfirmIntent --> IsConfirmIntent{Intent confirmed?} IsConfirmIntent -- No --> Listen IsConfirmIntent -- Yes --> IsIntent{Which intent?} IsIntent -- Read a story --> ReadingInit[Enter reading mode] IsIntent -- Take a note --> NotePrompt[/Please record the note/] IsIntent -- Other request --> HandleOther[Handle other intent] NotePrompt --> NoteListen[/Record note/] NoteListen --> NoteSave[Save note] NoteSave --> Idle HandleOther --> Idle subgraph ReadingMode [Reading mode] ReadingInit --> StartPlayback[Start audio playback] StartPlayback --> ReadingLoop[Reading active] ReadingLoop --> CheckCommand{Command detected?} ReadingLoop --> CheckTime{Is it midnight?} ReadingLoop --> EndOfStory{Story finished?} CheckCommand -- No --> ReadingLoop CheckCommand -- Stop --> StopPlayback[Stop playback] CheckCommand -- Volume up --> VolumeUp[Increase volume] CheckCommand -- Volume down --> VolumeDown[Decrease volume] VolumeUp --> ReadingLoop VolumeDown --> ReadingLoop CheckTime -- No --> ReadingLoop CheckTime -- Yes --> Shutdown[Shutdown device] EndOfStory -- No --> ReadingLoop EndOfStory -- Yes --> ExitReading[Exit reading mode] end StopPlayback --> Idle ExitReading --> Idle

Consequences

The central point is that Reading is not a one-shot action.
It is a long-running active mode, during which several things must exist at the same time:

continuous audio playback
listening for control commands
monitoring the time
the ability to interrupt playback cleanly

In other words, my system can no longer be designed as a simple linear chain such as:

wake → listen → STT → action → end

It must become a system with persistent activity + concurrent events.

First Constraint: Concurrency

Since I do not want to split playback into tiny chunks, the reading must continue while something else is happening.

That implies some form of concurrency, typically:

multithreading
separate processes
or a more advanced event loop

In all cases, we move beyond the logic of “one loop doing everything in order”.

Concretely, I will probably need at least:

one component managing playback
one component listening to the microphone
one component processing commands
one component monitoring the clock
one orchestrator deciding what to do

Second Constraint: Clean Inter-Component Communication

As soon as multiple activities run in parallel, I need to define how they communicate.

For example:

the microphone module detects “stop”
it must notify the playback module
the clock module detects midnight
it must trigger a global shutdown
the playback module reaches the end of the file
it must notify the system to return to Idle

So I can no longer rely on simple functions calling one another directly. I need logic such as:

events
message queues
state flags
synchronization objects

Otherwise, I’ll quickly end up with spaghetti code.

Third Constraint: A Real State Model

My diagram implicitly says that we are no longer only in “do an action”, but in “be in a state”.

For example:

Idle
Listening
Reading
Note recording
maybe later Thinking
maybe Shutting down

And while in Reading, some commands are allowed:

stop
volume up
volume down

while others may not be allowed, or not handled the same way.

So I need to explicitly model:

the current state
allowed transitions
what happens when an event arrives in a given state

Otherwise I’ll get fuzzy behaviors like:

“What does Robie do if someone talks while it is reading?” “What happens if midnight occurs during a volume command?”

Fourth Constraint: Clean Interruption

Continuous audio playback means I must be able to:

stop immediately
possibly pause
change volume on the fly
exit without leaving the audio system in a broken state

So the audio player cannot be a simple blocking command launched without control. It must be a controllable component, with commands such as:

start
stop
pause
set_volume

And those commands must remain safe no matter when they arrive.

Fifth Constraint: Speech Recognition Can No Longer Be Designed the Same Way

In a classic conversational loop, we do:

record
transcribe
act

But here, during reading, I need to detect very short commands continuously.

So I am no longer doing only “classic” STT. Instead, I need continuous control listening, probably with:

reduced vocabulary
limited command logic
fast and robust detection

So the problem is no longer:

“transcribe an open request”

but rather:

“quickly and reliably detect a few critical commands”

That is a different kind of need.

Sixth Constraint: Risk of Robie Hearing Itself

This is probably one of the hidden big challenges of reading mode.

If Robie reads aloud while listening, the microphone may capture:

its own playback
reverberation
children’s voices
ambient noise

So I’ll need safeguards such as:

short and highly specific commands
adapted thresholds / detection logic
maybe a secondary wake word in reading mode
or microphone / volume / physical placement adjustments

The diagram does not mention it, but formalizing Reading as an interactive mode directly creates this problem.

Seventh Constraint: Priority Logic

Not all events have the same weight.

For example:

Shutdown at midnight is probably highest priority
Stop playback is very high priority
Volume up is less critical
Story finished is a normal event

So I will need to define a policy:

what interrupts what
who wins in case of collision
in what order events are processed

Without that, strange behaviors are likely.

Eighth Constraint: Separate Behavior from Implementation

The diagram is excellent because it formalizes the expected behavior. But it also forces an important distinction:

functional level: what Robie must do
technical level: how it is implemented

In this case, the formalization already says I will probably need:

a controllable audio player
parallel listening
autonomous time monitoring
an event system
explicit state management

Even if I have not yet chosen between:

thread
callback
queue
process

Summary

This formalization leads to one clear conclusion:

Robie can no longer be developed as a simple sequential pipeline. Reading mode requires a concurrent, event-driven architecture with explicit state management.

More concretely, that means:

several activities must run at the same time
they must communicate cleanly
the system must know its current state
playback must be interruptible at any moment
voice detection during playback becomes a specific problem
priorities and transitions must be handled properly