Sanju Sivalingam 389ac81c98 Add Maestro-style YAML flow runner for deterministic automation
New --flow mode executes scripted YAML steps without LLM, mapping 17
commands (tap, type, swipe, scroll, etc.) to existing actions. Element
finding uses accessibility tree text/hint/id matching.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 19:34:18 +05:30

DroidClaw

Give it a goal in plain English. It figures out what to tap, type, and swipe on your Android phone to get it done.

It reads the screen (accessibility tree + optional screenshot), sends it to an LLM, gets back a JSON action like {"action": "tap", "coordinates": [540, 1200]}, executes it via ADB, and repeats. Perception → reasoning → action, in a loop.
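The "execute it via ADB" step can be sketched as a small mapping from a decided action to `adb` arguments. This is illustrative only: the real `actions.ts` covers 22 actions plus retry logic, and the type names here (`Action`, `toAdbArgs`) are invented for the example.

```typescript
// Hedged sketch: map a decided action to `adb` CLI arguments.
// Only three of the 22 actions are shown; names are illustrative.
type Action =
  | { action: "tap"; coordinates: [number, number] }
  | { action: "type"; text: string }
  | { action: "swipe"; from: [number, number]; to: [number, number] };

function toAdbArgs(a: Action): string[] {
  switch (a.action) {
    case "tap":
      return ["shell", "input", "tap", String(a.coordinates[0]), String(a.coordinates[1])];
    case "type":
      // `adb shell input text` requires spaces to be escaped as %s
      return ["shell", "input", "text", a.text.replace(/ /g, "%s")];
    case "swipe":
      return ["shell", "input", "swipe", ...[...a.from, ...a.to].map(String)];
  }
}
```

For example, `{"action": "tap", "coordinates": [540, 1200]}` becomes `adb shell input tap 540 1200`.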

See it work

$ bun run src/kernel.ts
Enter your goal: Open YouTube and search for "lofi hip hop"

--- Step 1/30 ---
Think: I'm on the home screen. I should launch YouTube directly.
Decision: launch — Open YouTube app (842ms)

--- Step 2/30 ---
Think: YouTube is open. I need to tap the search icon.
Decision: tap — Tap search icon at top right (623ms)

--- Step 3/30 ---
Think: Search field is focused and ready.
Decision: type — Type "lofi hip hop" (501ms)

--- Step 4/30 ---
Decision: enter — Submit the search (389ms)

--- Step 5/30 ---
Think: Search results showing lofi hip hop videos. Done.
Decision: done (412ms)

Task completed successfully.

Quick start

You need: Bun, ADB, and an API key for one of the supported LLM providers (Groq, OpenRouter, OpenAI, or AWS Bedrock).

# Install Bun
curl -fsSL https://bun.sh/install | bash

# Install ADB (macOS)
brew install android-platform-tools

# Clone and setup
bun install
cp .env.example .env

Edit .env — fastest way to start is with Groq (free tier):

LLM_PROVIDER=groq
GROQ_API_KEY=gsk_your_key_here

Get your key at console.groq.com.

Connect your phone

Enable USB Debugging: Settings → About Phone → tap "Build Number" 7 times → Developer Options → USB Debugging.

adb devices   # should show your device

Run it

bun run src/kernel.ts

Type a goal and watch your phone do it.

Workflows

Workflows chain multiple goals across apps. Way more powerful than single goals.

bun run src/kernel.ts --workflow examples/weather-to-whatsapp.json

34 ready-to-use workflows included

Messaging — whatsapp-reply, whatsapp-broadcast, whatsapp-to-email, telegram-channel-digest, telegram-send-message, slack-standup, slack-check-messages, email-digest, email-reply, translate-and-reply

Social Media — social-media-post (Twitter + LinkedIn), social-media-engage, instagram-post-check

Productivity — morning-briefing, calendar-create-event, notes-capture, notification-cleanup, do-not-disturb, github-check-prs, screenshot-share-slack

Research — google-search-report, news-roundup, multi-app-research, price-comparison

Lifestyle — food-order, uber-ride, maps-commute, check-flight-status, spotify-playlist, youtube-watch-later, fitness-log, expense-tracker, wifi-password-share, weather-to-whatsapp

Each workflow is a simple JSON file:

{
  "name": "Slack Daily Standup",
  "steps": [
    {
      "app": "com.Slack",
      "goal": "Open #standup channel, type the standup message and send it.",
      "formData": {
        "Message": "Yesterday: Finished API integration\nToday: Writing tests\nBlockers: None"
      }
    }
  ]
}
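The workflow shape above can be described with a small TypeScript interface and a defensive validator. This is a sketch inferred from the example JSON; the actual schema enforced by `workflow.ts` may differ.

```typescript
// Hypothetical types for the workflow JSON shown above (field names taken
// from the example; the real workflow.ts schema may be stricter).
interface WorkflowStep {
  app: string;                        // Android package name, e.g. "com.Slack"
  goal: string;                       // plain-English goal for this step
  formData?: Record<string, string>;  // values the agent can type verbatim
}

interface Workflow {
  name: string;
  steps: WorkflowStep[];
}

function validateWorkflow(json: unknown): Workflow {
  const w = json as Workflow;
  if (typeof w?.name !== "string" || !Array.isArray(w?.steps) || w.steps.length === 0) {
    throw new Error("workflow needs a name and at least one step");
  }
  for (const s of w.steps) {
    if (typeof s.app !== "string" || typeof s.goal !== "string") {
      throw new Error("each step needs an app and a goal");
    }
  }
  return w;
}
```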

What it can do

22 actions + 6 multi-step skills. Some example goals:

Open WhatsApp and send "I'm running late" to Mom
Turn on WiFi
Search Google for "best restaurants near me"
Open YouTube and play the first trending video
Copy tracking number from Amazon and search it on Google

LLM providers

Pick one. They all work.

| Provider    | Cost          | Vision | Best for                           |
|-------------|---------------|--------|------------------------------------|
| Groq        | Free tier     | No     | Getting started fast               |
| OpenRouter  | Pay per token | Yes    | 200+ models (Claude, Gemini, etc.) |
| OpenAI      | Pay per token | Yes    | Best accuracy with GPT-4o          |
| AWS Bedrock | Pay per token | Yes    | Enterprise / Claude on AWS         |

# Groq (recommended to start)
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_your_key_here
GROQ_MODEL=llama-3.3-70b-versatile

# OpenRouter
LLM_PROVIDER=openrouter
OPENROUTER_API_KEY=sk-or-v1-your_key_here
OPENROUTER_MODEL=anthropic/claude-3.5-sonnet

# OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-your_key_here
OPENAI_MODEL=gpt-4o

# AWS Bedrock (uses aws configure credentials)
LLM_PROVIDER=bedrock
AWS_REGION=us-east-1
BEDROCK_MODEL=anthropic.claude-3-sonnet-20240229-v1:0

Config

All in .env. Here's what matters:

| Setting           | Default  | What it does                                                          |
|-------------------|----------|-----------------------------------------------------------------------|
| MAX_STEPS         | 30       | Steps before giving up                                                |
| STEP_DELAY        | 2        | Seconds between actions (UI settle time)                              |
| STUCK_THRESHOLD   | 3        | Steps before stuck-loop recovery kicks in                             |
| VISION_MODE       | fallback | off / fallback (screenshot when accessibility tree is empty) / always |
| MAX_ELEMENTS      | 40       | UI elements sent to the LLM (scored & ranked)                         |
| MAX_HISTORY_STEPS | 10       | Past steps kept in conversation context                               |
| STREAMING_ENABLED | true     | Stream LLM responses token by token                                   |
| LOG_DIR           | logs     | Session logs directory                                                |

How it works

Each step: dump accessibility tree → score & filter elements → optionally screenshot → send to LLM → execute action → log → repeat.
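The "score & filter elements" step keeps the element list under MAX_ELEMENTS. The weights below are invented for illustration; the real scoring in `sanitizer.ts` is not shown here.

```typescript
// Hedged sketch of element scoring: prefer actionable, labeled elements.
// Weights are illustrative, not the real sanitizer.ts logic.
interface UiElement {
  text?: string;
  contentDesc?: string;
  resourceId?: string;
  clickable: boolean;
}

function scoreElement(e: UiElement): number {
  let score = 0;
  if (e.clickable) score += 3;    // actionable elements matter most
  if (e.text) score += 2;         // visible label helps the LLM
  if (e.contentDesc) score += 1;  // accessibility hint
  if (e.resourceId) score += 1;   // stable identifier
  return score;
}

function topElements(all: UiElement[], max = 40): UiElement[] {
  return [...all].sort((a, b) => scoreElement(b) - scoreElement(a)).slice(0, max);
}
```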

The LLM thinks before acting:

{
  "think": "Search field is focused. I should type the query.",
  "plan": ["Launch YouTube", "Tap search", "Type query", "Submit"],
  "planProgress": "Step 3: typing query",
  "action": "type",
  "text": "lofi hip hop"
}
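A reply like the one above has to be parsed defensively, since models occasionally wrap JSON in markdown fences. A tolerant parser might look like this (a sketch, not the real kernel code):

```typescript
// Hedged sketch: parse the model's structured reply, tolerating markdown fences.
interface Decision {
  think?: string;
  plan?: string[];
  planProgress?: string;
  action: string;
  text?: string;
}

function parseDecision(raw: string): Decision {
  // Strip a leading ```json fence and a trailing ``` if present.
  const cleaned = raw.replace(/^```(?:json)?\s*/, "").replace(/\s*```$/, "").trim();
  const d = JSON.parse(cleaned) as Decision;
  if (typeof d.action !== "string") throw new Error("LLM reply missing 'action' field");
  return d;
}
```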

Stuck detection — if the screen doesn't change for 3 steps, the kernel tells the LLM to try a different approach.
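One plausible way to implement this (a sketch; the kernel's actual mechanism is not shown here) is to hash each screen dump and count consecutive repeats:

```typescript
// Hedged sketch of stuck detection: hash the screen dump, count repeats.
function makeStuckDetector(threshold = 3) {
  let lastHash = "";
  let repeats = 0;
  return (screenDump: string): boolean => {
    // Cheap 32-bit string hash; a real implementation might normalize the XML first.
    let h = 0;
    for (let i = 0; i < screenDump.length; i++) {
      h = (h * 31 + screenDump.charCodeAt(i)) | 0;
    }
    const hash = String(h);
    repeats = hash === lastHash ? repeats + 1 : 0;
    lastHash = hash;
    return repeats >= threshold; // time to tell the LLM to change approach
  };
}
```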

Vision fallback — when the accessibility tree is empty (games, WebViews, Flutter), it falls back to sending a screenshot.

Conversation memory — the LLM sees its full history of observations and decisions, so it won't repeat itself.
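Keeping full history under MAX_HISTORY_STEPS can be as simple as trimming to the most recent entries (a minimal sketch; the entry shape is assumed):

```typescript
// Hedged sketch: bound conversation memory to the last N steps.
interface HistoryEntry {
  observation: string; // what the agent saw
  decision: string;    // what it decided to do
}

function trimHistory(history: HistoryEntry[], maxSteps = 10): HistoryEntry[] {
  // Most recent steps are the most relevant; older ones are dropped to bound tokens.
  return history.slice(-maxSteps);
}
```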

Architecture

src/
  kernel.ts          — Main agent loop
  actions.ts         — 22 actions + ADB retry logic
  skills.ts          — 6 multi-step skills (read_screen, submit_message, etc.)
  workflow.ts        — Workflow orchestration engine
  llm-providers.ts   — 4 LLM providers + system prompt
  sanitizer.ts       — Accessibility XML parser + smart filtering
  config.ts          — Env config
  constants.ts       — Keycodes, coordinates, defaults
  logger.ts          — Session logging

Commands

bun install              # Install dependencies
bun run src/kernel.ts    # Start the agent
bun run build            # Compile to dist/
bun run typecheck        # Type-check (tsc --noEmit)

Troubleshooting

"adb: command not found" — Install ADB or set ADB_PATH=/full/path/to/adb in .env.

"no devices found" — Run adb devices. Check USB debugging is enabled and you tapped "Allow" on the phone.

Agent keeps repeating the same action — Stuck loop detection handles this automatically. If it persists, try a more capable model (GPT-4o, Claude).

High token usage — Set VISION_MODE=off, lower MAX_ELEMENTS to 20, lower MAX_HISTORY_STEPS to 5, or use a cheaper model.
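Those three settings together make a reasonable low-cost `.env` starting point (values from this section; tune to taste):

```
VISION_MODE=off
MAX_ELEMENTS=20
MAX_HISTORY_STEPS=5
```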

Docs

License

MIT
