droidclaw/docs/plans/2026-02-20-voice-overlay-design.md
Sanju Sivalingam eae221b904 docs: add voice overlay design document
Design for voice-activated overlay feature — tap floating pill to activate
voice mode, stream audio to server for Groq Whisper STT, show live
transcription on screen with glowing gradient border, send as goal.
2026-02-20 01:46:49 +05:30

Voice Overlay — Design Document

Date: 2026-02-20
Status: Approved
Approach: Stream audio over existing WebSocket (Approach A)

Overview

Add a voice-activated overlay to DroidClaw's Android app. User taps the floating pill → full-screen glowing gradient border appears → speech is streamed to the server for real-time transcription → live text appears on screen → tap Send to execute as a goal.

User Flow

[IDLE] → tap pill → [LISTENING] → tap send → [EXECUTING] → done → [IDLE]
                          ↓
                     tap cancel
                          ↓
                       [IDLE]

States

IDLE — Existing floating pill: ● Ready, draggable, tappable.

LISTENING — Pill disappears. Full-screen overlay:

  • Animated gradient border around all 4 screen edges (purple → blue → cyan → green cycle, ~3s)
  • Large transcribed text in center, updating live word-by-word
  • Bottom: Send (primary) + Cancel (secondary) buttons
  • Audio recording starts immediately on transition

EXECUTING — Overlay collapses back to pill. Pill shows agent progress as today.

IDLE (post-completion) — Pill shows ● Done for 3s, then ● Ready.
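
The state flow above can be sketched as a small transition table. This is only an illustration: the real state machine would live in AgentOverlay.kt (Kotlin), TypeScript is used here purely as neutral notation, and the event names (TAP_PILL, TAP_SEND, TAP_CANCEL, AGENT_DONE) are hypothetical labels for the arrows in the diagram.

```typescript
// Overlay states from the design, plus hypothetical event names for each arrow.
type OverlayState = "IDLE" | "LISTENING" | "EXECUTING";
type OverlayEvent = "TAP_PILL" | "TAP_SEND" | "TAP_CANCEL" | "AGENT_DONE";

// Transition table: any (state, event) pair not listed leaves the state unchanged.
const transitions: Record<OverlayState, Partial<Record<OverlayEvent, OverlayState>>> = {
  IDLE: { TAP_PILL: "LISTENING" },
  LISTENING: { TAP_SEND: "EXECUTING", TAP_CANCEL: "IDLE" },
  EXECUTING: { AGENT_DONE: "IDLE" },
};

function nextState(state: OverlayState, event: OverlayEvent): OverlayState {
  return transitions[state][event] ?? state;
}
```

Keeping the transitions in one table makes it easy to see that taps on Send or Cancel are ignored outside LISTENING, which is the behavior the diagram implies.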

Audio Streaming Protocol

Android → Server

| Message | Description |
|---|---|
| `{type: "voice_start"}` | Recording begun |
| `{type: "voice_chunk", data: "<base64>"}` | ~100ms PCM chunks, 16kHz mono 16-bit |
| `{type: "voice_stop", action: "send"}` | User tapped Send; finalize and execute goal |
| `{type: "voice_stop", action: "cancel"}` | User tapped Cancel; discard |

Server → Android

| Message | Description |
|---|---|
| `{type: "transcript_partial", text: "..."}` | Live streaming partial transcript |
| `{type: "transcript_final", text: "..."}` | Final complete transcript |
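
The two message tables map naturally onto discriminated unions on the server side. The sketch below is one possible typing, not the contents of Protocol.kt or voice.ts; the type names and the parseClientVoiceMessage helper are hypothetical.

```typescript
// Android → Server messages.
type ClientVoiceMessage =
  | { type: "voice_start" }
  | { type: "voice_chunk"; data: string } // base64-encoded PCM
  | { type: "voice_stop"; action: "send" | "cancel" };

// Server → Android messages.
type ServerVoiceMessage =
  | { type: "transcript_partial"; text: string }
  | { type: "transcript_final"; text: string };

// Parses a raw WebSocket text frame.
// Returns null for anything that is not a well-formed voice message.
function parseClientVoiceMessage(raw: string): ClientVoiceMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const msg = value as Record<string, unknown>;
  const data = msg["data"];
  const action = msg["action"];
  switch (msg["type"]) {
    case "voice_start":
      return { type: "voice_start" };
    case "voice_chunk":
      return typeof data === "string" ? { type: "voice_chunk", data } : null;
    case "voice_stop":
      return action === "send" || action === "cancel"
        ? { type: "voice_stop", action }
        : null;
    default:
      return null;
  }
}
```

Rejecting malformed frames at the boundary means the rest of the handler can switch on `type` without further checks.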

Flow

  1. Android sends voice_start → server opens streaming connection to Groq Whisper
  2. Android streams voice_chunk every ~100ms → server pipes PCM to Groq
  3. Groq sends partial transcriptions → server relays as transcript_partial
  4. User taps Send → Android sends voice_stop with action: "send"
  5. Server flushes final audio → receives the final transcript from Groq → sends it to Android as transcript_final → fires the transcript as a goal into the agent loop
  6. Cancel: voice_stop with action: "cancel" → server discards Groq session, no goal

Audio Format

  • Sample rate: 16kHz
  • Channels: mono
  • Bit depth: 16-bit PCM (linear16)
  • Bandwidth: ~32KB/sec raw PCM (16,000 samples/sec × 2 bytes); ~43KB/sec on the wire once base64-encoded
  • Encoding for WebSocket: base64 text frames
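
As a quick check of these figures, the arithmetic for one ~100ms chunk can be worked through directly. The constant names are illustrative only; the values come from the format listed above, and JSON framing overhead is not counted.

```typescript
const SAMPLE_RATE_HZ = 16000;
const BYTES_PER_SAMPLE = 2; // 16-bit samples
const CHANNELS = 1; // mono
const CHUNK_MS = 100;

const bytesPerSecond = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS; // 32000
const bytesPerChunk = (bytesPerSecond * CHUNK_MS) / 1000; // 3200

// Base64 emits 4 characters for every 3 input bytes, padded up to a multiple of 4.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

const base64CharsPerChunk = base64Length(bytesPerChunk); // 4268
const base64CharsPerSecond = base64CharsPerChunk * (1000 / CHUNK_MS); // 42680
```

Base64 adds roughly a third on top of the raw PCM rate, which is the cost of using text frames instead of binary frames.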

Full-Screen Gradient Overlay

Two separate overlay layers managed by AgentOverlay:

Layer 1 — Gradient Border (non-interactive)

  • TYPE_APPLICATION_OVERLAY with FLAG_NOT_TOUCHABLE | FLAG_NOT_FOCUSABLE
  • MATCH_PARENT — covers entire screen
  • Compose renders animated gradient strips (~6dp) along all 4 edges
  • Colors: purple → blue → cyan → green → purple, infinite rotation ~3s cycle
  • Implementation: drawBehind modifier with 4 LinearGradient brushes, animated offset via rememberInfiniteTransition
  • Center is fully transparent — pass-through to apps behind

Layer 2 — Text + Buttons (interactive)

  • TYPE_APPLICATION_OVERLAY with FLAG_NOT_FOCUSABLE (tappable, no keyboard steal)
  • Positioned at bottom ~40% of screen
  • Semi-transparent dark background Color(0xCC000000)
  • Contents:
    • Transcribed text: 24-28sp, white, center-aligned, auto-scrolls
    • Subtle pulse/waveform animation while listening
    • Bottom row: Send button (accent) + Cancel button (muted)

Why Two Layers

Window flags apply to an entire overlay window, so a single Android overlay cannot be touchable in one region and pass-through in another. The gradient border must be FLAG_NOT_TOUCHABLE (pass-through) while the text/button area must be tappable. Two separate WindowManager views with different flags solve this.

Server-Side STT Handler

New file: src/voice.ts

Responsibilities

  • On voice_start: open Groq Whisper streaming connection
  • On voice_chunk: pipe decoded PCM to Groq stream
  • On voice_stop (send): flush stream, get final transcript, trigger runAgent() with transcript as goal
  • On voice_stop (cancel): close Groq stream, discard
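
The four responsibilities above could be organized around a per-connection session object. The following is a minimal sketch under stated assumptions: SttStream and VoiceDeps are hypothetical interfaces invented for illustration (they are not a Groq SDK API), runGoal stands in for runAgent(), and chunks are assumed to have been base64-decoded before they reach the session.

```typescript
// Hypothetical abstraction over the STT backend; not a real Groq SDK interface.
interface SttStream {
  write(pcm: Uint8Array): void;
  finish(): Promise<string>; // resolves with the final transcript
  abort(): void;
}

type VoiceEvent =
  | { type: "voice_start" }
  | { type: "voice_chunk"; pcm: Uint8Array } // already base64-decoded
  | { type: "voice_stop"; action: "send" | "cancel" };

// Hypothetical dependencies injected by the caller.
interface VoiceDeps {
  openStream(onPartial: (text: string) => void): SttStream;
  send(msg: { type: "transcript_partial" | "transcript_final"; text: string }): void;
  runGoal(goal: string): void; // stands in for runAgent()
}

class VoiceSession {
  private stream: SttStream | null = null;
  private readonly deps: VoiceDeps;

  constructor(deps: VoiceDeps) {
    this.deps = deps;
  }

  async handle(event: VoiceEvent): Promise<void> {
    switch (event.type) {
      case "voice_start":
        this.stream?.abort(); // drop any stale session first
        this.stream = this.deps.openStream((text) =>
          this.deps.send({ type: "transcript_partial", text }),
        );
        return;
      case "voice_chunk":
        this.stream?.write(event.pcm); // chunks outside a session are ignored
        return;
      case "voice_stop": {
        const stream = this.stream;
        this.stream = null;
        if (!stream) return;
        if (event.action === "cancel") {
          stream.abort(); // discard: no transcript, no goal
          return;
        }
        const text = await stream.finish();
        this.deps.send({ type: "transcript_final", text });
        this.deps.runGoal(text);
        return;
      }
    }
  }
}
```

Injecting the STT stream behind an interface keeps the message handling identical whether the backend streams partial results or only returns a final transcript.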

Fallback

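If Groq streaming is unavailable, buffer all chunks server-side. On voice_stop with action: "send", send the complete audio as a single Whisper API call. There are no live words in this mode; the final text appears all at once. This path does not depend on streaming support, so it works whenever the standard transcription endpoint is reachable.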
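
For the buffered path, the raw PCM chunks need to be combined into a single file before upload. The sketch below assumes the batch transcription endpoint expects a container format such as WAV instead of headerless PCM; pcmChunksToWav is a hypothetical helper, and the upload call itself is deliberately left out.

```typescript
// Concatenates buffered PCM chunks and prepends a 44-byte RIFF/WAVE header.
// Defaults match the audio format in this document: 16kHz, mono, 16-bit.
function pcmChunksToWav(
  chunks: Uint8Array[],
  sampleRate = 16000,
  channels = 1,
  bitsPerSample = 16,
): Uint8Array {
  const dataLength = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const wav = new Uint8Array(44 + dataLength);
  const view = new DataView(wav.buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) wav[offset + i] = text.charCodeAt(i);
  };
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataLength, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true); // byte rate
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, dataLength, true);

  // Copy the buffered chunks in arrival order after the header.
  let offset = 44;
  for (const chunk of chunks) {
    wav.set(chunk, offset);
    offset += chunk.length;
  }
  return wav;
}
```

On cancel, the buffered chunks would simply be dropped without building a file.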

Goal Execution

After transcript_final, call existing runAgent() from kernel.ts — identical to web dashboard goals. No changes to agent loop.

Files Changed

| File | Change | Scope |
|---|---|---|
| android/.../AndroidManifest.xml | Add RECORD_AUDIO permission | Minor |
| android/.../overlay/AgentOverlay.kt | State machine: idle/listening/executing, manage 2 overlay layers | Major |
| android/.../overlay/OverlayContent.kt | New composables: GradientBorder, VoiceOverlayContent, LiveTranscriptText | Major |
| android/.../overlay/VoiceRecorder.kt | New file. AudioRecord capture + chunked base64 streaming | New |
| android/.../connection/ConnectionService.kt | Handle voice messages, route transcript events to overlay | Medium |
| android/.../model/Protocol.kt | New message data classes for voice protocol | Minor |
| src/voice.ts | New file. Groq Whisper streaming STT handler | New |
| src/kernel.ts | Route voice WebSocket messages to voice.ts | Minor |

Untouched

actions.ts, skills.ts, workflow.ts, sanitizer.ts, llm-providers.ts, config.ts, constants.ts

Permissions

  • RECORD_AUDIO — new runtime permission, requested on first voice activation
  • SYSTEM_ALERT_WINDOW — already granted (existing overlay)
  • INTERNET — already granted

Difficulty Assessment

Overall: Medium. Estimated 3-4 days.

  • Android AudioRecord → WebSocket streaming: well-documented, straightforward
  • Full-screen gradient overlay animation: standard Compose Canvas + rememberInfiniteTransition
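  • Groq Whisper streaming API: streaming support should be confirmed against Groq's current API reference (the Fallback covers the non-streaming case); Bun handles WebSocket/HTTP streaming natively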
  • Two-layer overlay management: minor complexity in AgentOverlay state machine
  • No risky unknowns — all components have clear precedents