
Sanju Sivalingam eae221b904 docs: add voice overlay design document

Design for voice-activated overlay feature — tap floating pill to activate
voice mode, stream audio to server for Groq Whisper STT, show live
transcription on screen with glowing gradient border, send as goal.

2026-02-20 01:46:49 +05:30


Voice Overlay — Design Document

Date: 2026-02-20
Status: Approved
Approach: Stream audio over existing WebSocket (Approach A)

Overview

Add a voice-activated overlay to DroidClaw's Android app. User taps the floating pill → full-screen glowing gradient border appears → speech is streamed to the server for real-time transcription → live text appears on screen → tap Send to execute as a goal.

User Flow

[IDLE] → tap pill → [LISTENING] → tap send → [EXECUTING] → done → [IDLE]
                          ↓
                     tap cancel
                          ↓
                       [IDLE]

States

IDLE — Existing floating pill: ● Ready, draggable, tappable.

LISTENING — Pill disappears. Full-screen overlay:

Animated gradient border around all 4 screen edges (purple → blue → cyan → green cycle, ~3s)
Large transcribed text in center, updating live word-by-word
Bottom: Send (primary) + Cancel (secondary) buttons
Audio recording starts immediately on transition

EXECUTING — Overlay collapses back to pill. Pill shows agent progress as today.

IDLE (post-completion) — Pill shows ● Done for 3s, then ● Ready.
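
The transitions above can be sketched as a small lookup table. This is only an illustration of the state machine's shape: the real implementation would live in Kotlin inside AgentOverlay, and the event names (`TAP_PILL`, `TAP_SEND`, `TAP_CANCEL`, `GOAL_DONE`) are hypothetical labels chosen for this sketch.

```typescript
// Overlay states taken from the User Flow diagram.
type OverlayState = "IDLE" | "LISTENING" | "EXECUTING";

// Hypothetical event names; the design only describes them informally.
type OverlayEvent = "TAP_PILL" | "TAP_SEND" | "TAP_CANCEL" | "GOAL_DONE";

// Allowed transitions. Any (state, event) pair not listed is ignored.
const transitions: Record<OverlayState, Partial<Record<OverlayEvent, OverlayState>>> = {
  IDLE: { TAP_PILL: "LISTENING" },
  LISTENING: { TAP_SEND: "EXECUTING", TAP_CANCEL: "IDLE" },
  EXECUTING: { GOAL_DONE: "IDLE" },
};

// Returns the next state, or the current state if the event does not apply.
function nextState(state: OverlayState, event: OverlayEvent): OverlayState {
  return transitions[state][event] ?? state;
}
```

Ignoring unlisted pairs means a stray tap (for example, a second Send while already executing) leaves the overlay where it is rather than raising an error.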

Audio Streaming Protocol

Android → Server

| Message | Description |
|---|---|
| `{type: "voice_start"}` | Recording begun |
| `{type: "voice_chunk", data: "<base64>"}` | ~100ms PCM chunks, 16kHz mono 16-bit |
| `{type: "voice_stop", action: "send"}` | User tapped Send — finalize & execute goal |
| `{type: "voice_stop", action: "cancel"}` | User tapped Cancel — discard |

Server → Android

| Message | Description |
|---|---|
| `{type: "transcript_partial", text: "..."}` | Live streaming partial transcript |
| `{type: "transcript_final", text: "..."}` | Final complete transcript |
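
As a sketch of how the server side might type and validate these messages, assuming they travel as JSON text frames (so keys are quoted on the wire, unlike the shorthand in the tables). The type and function names here are hypothetical and are not taken from the existing codebase.

```typescript
// Android → Server messages, mirroring the first table.
type ClientVoiceMessage =
  | { type: "voice_start" }
  | { type: "voice_chunk"; data: string } // base64-encoded PCM
  | { type: "voice_stop"; action: "send" | "cancel" };

// Server → Android messages, mirroring the second table.
type ServerVoiceMessage =
  | { type: "transcript_partial"; text: string }
  | { type: "transcript_final"; text: string };

// Parses one text frame; returns null for anything that is not a
// well-formed voice message instead of throwing.
function parseClientVoiceMessage(raw: string): ClientVoiceMessage | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const fields = parsed as Record<string, unknown>;
  const data = fields.data;
  const action = fields.action;
  switch (fields.type) {
    case "voice_start":
      return { type: "voice_start" };
    case "voice_chunk":
      return typeof data === "string" ? { type: "voice_chunk", data } : null;
    case "voice_stop":
      return action === "send" || action === "cancel"
        ? { type: "voice_stop", action }
        : null;
    default:
      return null;
  }
}

// Serializes a server message for the WebSocket.
function encodeServerVoiceMessage(msg: ServerVoiceMessage): string {
  return JSON.stringify(msg);
}
```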

Flow

Android sends voice_start → server opens streaming connection to Groq Whisper
Android streams voice_chunk every ~100ms → server pipes PCM to Groq
Groq sends partial transcriptions → server relays as transcript_partial
User taps Send → Android sends voice_stop with action: "send"
Server flushes the remaining audio → receives the final transcript from Groq → sends it to Android as transcript_final → fires the goal into the agent loop
Cancel: voice_stop with action: "cancel" → server discards Groq session, no goal
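
To make the client half of this flow concrete, the sketch below turns a PCM buffer into the message sequence described above. It is an illustration only: the real sender would be Kotlin in VoiceRecorder.kt and would emit chunks live as AudioRecord produces them rather than from a complete buffer. `buildVoiceMessages` and `toBase64` are hypothetical helper names.

```typescript
const SAMPLE_RATE_HZ = 16000;
const BYTES_PER_SAMPLE = 2; // 16-bit mono
const CHUNK_MS = 100;
// 16000 samples/s * 2 bytes * 0.1 s = 3200 bytes per chunk.
const CHUNK_BYTES = (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHUNK_MS) / 1000;

// Base64-encodes raw bytes using the global btoa (available in Bun, Node, browsers).
function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Produces voice_start, one voice_chunk per ~100 ms of audio, then voice_stop.
function buildVoiceMessages(pcm: Uint8Array, action: "send" | "cancel"): string[] {
  const messages: string[] = [JSON.stringify({ type: "voice_start" })];
  for (let offset = 0; offset < pcm.length; offset += CHUNK_BYTES) {
    const chunk = pcm.subarray(offset, offset + CHUNK_BYTES);
    messages.push(JSON.stringify({ type: "voice_chunk", data: toBase64(chunk) }));
  }
  messages.push(JSON.stringify({ type: "voice_stop", action }));
  return messages;
}
```

The last chunk may be shorter than 3200 bytes when the recording does not end on a 100 ms boundary, so the receiver should not assume a fixed chunk size.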

Audio Format

Sample rate: 16kHz
Channels: mono
Bit depth: 16-bit PCM (linear16)
Bandwidth: ~32KB/sec of raw PCM (16,000 samples/s × 2 bytes); about 43KB/sec once base64-encoded, before JSON framing
Encoding for WebSocket: base64 text frames
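
The figures above follow directly from the format parameters. A small worked calculation, with `audioBudget` as a hypothetical helper name, shows the raw rate and the overhead added by base64 (4 output characters for every 3 input bytes, rounded up per chunk):

```typescript
// Computes raw and base64-encoded sizes for a PCM stream split into fixed chunks.
function audioBudget(
  sampleRateHz: number,
  channels: number,
  bitsPerSample: number,
  chunkMs: number,
) {
  const bytesPerSecond = sampleRateHz * channels * (bitsPerSample / 8);
  const bytesPerChunk = (bytesPerSecond * chunkMs) / 1000;
  // Base64 emits 4 characters per 3-byte group, padding the last group.
  const base64CharsPerChunk = 4 * Math.ceil(bytesPerChunk / 3);
  const chunksPerSecond = 1000 / chunkMs;
  return {
    bytesPerSecond,
    bytesPerChunk,
    base64CharsPerChunk,
    base64CharsPerSecond: base64CharsPerChunk * chunksPerSecond,
  };
}

// 16 kHz, mono, 16-bit, 100 ms chunks:
// bytesPerSecond = 32000, bytesPerChunk = 3200,
// base64CharsPerChunk = 4268, base64CharsPerSecond = 42680
const budget = audioBudget(16000, 1, 16, 100);
```

The JSON envelope around each chunk adds a few dozen more bytes per message, which is small next to the payload at this chunk size.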

Full-Screen Gradient Overlay

Two separate overlay layers managed by AgentOverlay:

Layer 1 — Gradient Border (non-interactive)

TYPE_APPLICATION_OVERLAY with FLAG_NOT_TOUCHABLE | FLAG_NOT_FOCUSABLE
MATCH_PARENT — covers entire screen
Compose renders animated gradient strips (~6dp) along all 4 edges
Colors: purple → blue → cyan → green → purple, infinite rotation ~3s cycle
Implementation: drawBehind modifier with 4 LinearGradient brushes, animated offset via rememberInfiniteTransition
Center is fully transparent — pass-through to apps behind

Layer 2 — Text + Buttons (interactive)

TYPE_APPLICATION_OVERLAY with FLAG_NOT_FOCUSABLE (tappable, no keyboard steal)
Positioned at bottom ~40% of screen
Semi-transparent dark background Color(0xCC000000)
Contents:
- Transcribed text: 24-28sp, white, center-aligned, auto-scrolls
- Subtle pulse/waveform animation while listening
- Bottom row: Send button (accent) + Cancel button (muted)

Why Two Layers

Window flags apply to an entire overlay window, so a single overlay cannot be touchable in one region and pass-through in another. The gradient border must be FLAG_NOT_TOUCHABLE (pass-through) while the text/button area must be tappable. Separate WindowManager views with different flags solve this.

Server-Side STT Handler

New file: src/voice.ts

Responsibilities

On voice_start: open Groq Whisper streaming connection
On voice_chunk: pipe decoded PCM to Groq stream
On voice_stop (send): flush stream, get final transcript, trigger runAgent() with transcript as goal
On voice_stop (cancel): close Groq stream, discard
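
One possible shape for this handler is sketched below. Everything behind the `SttSession` and `VoiceDeps` interfaces is an assumption made for illustration: `SttSession` stands in for whatever streaming client is used to talk to Groq and is not a real Groq API, and `runGoal` stands in for the existing runAgent() call.

```typescript
// Hypothetical wrapper around the speech-to-text backend (not a real Groq API).
interface SttSession {
  write(pcm: Uint8Array): void; // feed decoded PCM
  finish(): Promise<string>;    // flush and resolve with the final transcript
  abort(): void;                // discard the session
}

// Hypothetical dependencies injected by the caller.
interface VoiceDeps {
  openSession(onPartial: (text: string) => void): SttSession;
  send(msg: { type: "transcript_partial" | "transcript_final"; text: string }): void;
  runGoal(goal: string): Promise<void>; // stands in for runAgent()
}

type VoiceMessage =
  | { type: "voice_start" }
  | { type: "voice_chunk"; data: string }
  | { type: "voice_stop"; action: "send" | "cancel" };

// Decodes a base64 string into raw bytes using the global atob.
function fromBase64(data: string): Uint8Array {
  const binary = atob(data);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// Returns a per-connection handler that tracks at most one active session.
function createVoiceHandler(deps: VoiceDeps) {
  let session: SttSession | null = null;
  return async function handle(msg: VoiceMessage): Promise<void> {
    switch (msg.type) {
      case "voice_start":
        session?.abort(); // drop any stale session before starting a new one
        session = deps.openSession((text) =>
          deps.send({ type: "transcript_partial", text }),
        );
        return;
      case "voice_chunk":
        session?.write(fromBase64(msg.data));
        return;
      case "voice_stop": {
        const active = session;
        session = null;
        if (!active) return;
        if (msg.action === "cancel") {
          active.abort();
          return;
        }
        const text = await active.finish();
        deps.send({ type: "transcript_final", text });
        await deps.runGoal(text);
        return;
      }
    }
  };
}
```

Injecting the session factory keeps the handler independent of the transport to Groq, so the streaming path and the buffered fallback could both sit behind the same interface.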

Fallback

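If Groq streaming is unavailable, buffer all chunks server-side. On voice_stop, send the complete audio in a single Whisper API call. There are no live words in this mode; the final text appears all at once. This path does not depend on streaming support, so it remains available whenever the Whisper API itself is reachable.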
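
For the buffered path, the chunks are raw PCM with no header, so they would likely need a container before being uploaded as a file. The sketch below assumes the batch endpoint accepts WAV (an assumption, not something this document specifies) and wraps the concatenated PCM in a standard 44-byte RIFF/WAVE header. `concatChunks` and `pcmToWav` are hypothetical helper names; the upload call itself is deliberately left out.

```typescript
// Joins the buffered PCM chunks into one contiguous byte array.
function concatChunks(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((sum, c) => sum + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}

// Prepends a canonical 44-byte WAV header for uncompressed PCM.
function pcmToWav(
  pcm: Uint8Array,
  sampleRate = 16000,
  channels = 1,
  bitsPerSample = 16,
): Uint8Array {
  const header = new DataView(new ArrayBuffer(44));
  const writeAscii = (offset: number, text: string) => {
    for (let i = 0; i < text.length; i++) {
      header.setUint8(offset + i, text.charCodeAt(i));
    }
  };
  const blockAlign = (channels * bitsPerSample) / 8;
  writeAscii(0, "RIFF");
  header.setUint32(4, 36 + pcm.length, true);          // file size minus 8
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  header.setUint32(16, 16, true);                      // fmt chunk size
  header.setUint16(20, 1, true);                       // format 1 = PCM
  header.setUint16(22, channels, true);
  header.setUint32(24, sampleRate, true);
  header.setUint32(28, sampleRate * blockAlign, true); // byte rate
  header.setUint16(32, blockAlign, true);
  header.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  header.setUint32(40, pcm.length, true);              // data chunk size

  const wav = new Uint8Array(44 + pcm.length);
  wav.set(new Uint8Array(header.buffer), 0);
  wav.set(pcm, 44);
  return wav;
}
```

At ~32KB/sec, a 30-second utterance buffers to roughly 1 MB, so holding the whole recording in memory until voice_stop should be acceptable for typical goal-length speech.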

Goal Execution

After transcript_final, call existing runAgent() from kernel.ts — identical to web dashboard goals. No changes to agent loop.

Files Changed

| File | Change | Scope |
|---|---|---|
| `android/.../AndroidManifest.xml` | Add `RECORD_AUDIO` permission | Minor |
| `android/.../overlay/AgentOverlay.kt` | State machine: idle/listening/executing, manage 2 overlay layers | Major |
| `android/.../overlay/OverlayContent.kt` | New composables: `GradientBorder`, `VoiceOverlayContent`, `LiveTranscriptText` | Major |
| `android/.../overlay/VoiceRecorder.kt` | New file. `AudioRecord` capture + chunked base64 streaming | New |
| `android/.../connection/ConnectionService.kt` | Handle voice messages, route transcript events to overlay | Medium |
| `android/.../model/Protocol.kt` | New message data classes for voice protocol | Minor |
| `src/voice.ts` | New file. Groq Whisper streaming STT handler | New |
| `src/kernel.ts` | Route voice WebSocket messages to `voice.ts` | Minor |

Untouched

actions.ts, skills.ts, workflow.ts, sanitizer.ts, llm-providers.ts, config.ts, constants.ts

Permissions

RECORD_AUDIO — new runtime permission, requested on first voice activation
SYSTEM_ALERT_WINDOW — already granted (existing overlay)
INTERNET — already granted

Difficulty Assessment

Overall: Medium. Estimated 3-4 days.

Android AudioRecord → WebSocket streaming: well-documented, straightforward
Full-screen gradient overlay animation: standard Compose Canvas + rememberInfiniteTransition
Groq Whisper streaming API: documented, Bun handles WebSocket/HTTP streaming natively
Two-layer overlay management: minor complexity in AgentOverlay state machine
No risky unknowns — all components have clear precedents
