Clean up for public release: remove logs, debug artifacts, and future plans from tracking; rewrite readme minimal and lowercase

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sanju Sivalingam
2026-02-14 19:58:47 +05:30
parent 389ac81c98
commit 8b9f0a4e6e
39 changed files with 75 additions and 8763 deletions

README.md

@@ -1,237 +1,142 @@
# DroidClaw
# droidclaw
Give it a goal in plain English. It figures out what to tap, type, and swipe on your Android phone to get it done.
ai agent that controls your android phone. give it a goal in plain english — it figures out what to tap, type, and swipe.
It reads the screen (accessibility tree + optional screenshot), sends it to an LLM, gets back a JSON action like `{"action": "tap", "coordinates": [540, 1200]}`, executes it via ADB, and repeats. Perception → reasoning → action, in a loop.
## See it work
reads the screen (accessibility tree + optional screenshot), asks an llm what to do, executes via adb, repeats.
```
$ bun run src/kernel.ts
Enter your goal: Open YouTube and search for "lofi hip hop"
enter your goal: open youtube and search for "lofi hip hop"
--- Step 1/30 ---
Think: I'm on the home screen. I should launch YouTube directly.
Decision: launch — Open YouTube app (842ms)
--- step 1/30 ---
think: i'm on the home screen. launching youtube.
action: launch (842ms)
--- Step 2/30 ---
Think: YouTube is open. I need to tap the search icon.
Decision: tap — Tap search icon at top right (623ms)
--- step 2/30 ---
think: youtube is open. tapping search icon.
action: tap (623ms)
--- Step 3/30 ---
Think: Search field is focused and ready.
Decision: type — Type "lofi hip hop" (501ms)
--- step 3/30 ---
think: search field focused.
action: type "lofi hip hop" (501ms)
--- Step 4/30 ---
Decision: enter — Submit the search (389ms)
--- step 4/30 ---
action: enter (389ms)
--- Step 5/30 ---
Think: Search results showing lofi hip hop videos. Done.
Decision: done (412ms)
Task completed successfully.
--- step 5/30 ---
think: search results showing. done.
action: done (412ms)
```
## Quick start
## setup
You need: **Bun**, **ADB**, and an **API key** for any LLM provider.
you need **bun**, **adb**, and an api key for any llm provider.
```bash
# Install Bun
curl -fsSL https://bun.sh/install | bash
# Install ADB (macOS)
brew install android-platform-tools
# Clone and setup
bun install
cp .env.example .env
```
Edit `.env` — fastest way to start is with Groq (free tier):
edit `.env` — fastest way to start is with groq (free tier):
```bash
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_your_key_here
```
Get your key at [console.groq.com](https://console.groq.com).
### Connect your phone
Enable USB Debugging: Settings → About Phone → tap "Build Number" 7 times → Developer Options → USB Debugging.
connect your phone (usb debugging on):
```bash
adb devices # should show your device
```
### Run it
```bash
bun run src/kernel.ts
```
Type a goal and watch your phone do it.
## workflows
## Workflows
Workflows chain multiple goals across apps. Way more powerful than single goals.
chain goals across apps:
```bash
bun run src/kernel.ts --workflow examples/weather-to-whatsapp.json
```
### 34 ready-to-use workflows included
**Messaging** — whatsapp-reply, whatsapp-broadcast, whatsapp-to-email, telegram-channel-digest, telegram-send-message, slack-standup, slack-check-messages, email-digest, email-reply, translate-and-reply
**Social Media** — social-media-post (Twitter + LinkedIn), social-media-engage, instagram-post-check
**Productivity** — morning-briefing, calendar-create-event, notes-capture, notification-cleanup, do-not-disturb, github-check-prs, screenshot-share-slack
**Research** — google-search-report, news-roundup, multi-app-research, price-comparison
**Lifestyle** — food-order, uber-ride, maps-commute, check-flight-status, spotify-playlist, youtube-watch-later, fitness-log, expense-tracker, wifi-password-share, weather-to-whatsapp
Each workflow is a simple JSON file:
each workflow is a simple json file:
```json
{
"name": "Slack Daily Standup",
"name": "slack standup",
"steps": [
{
"app": "com.Slack",
"goal": "Open #standup channel, type the standup message and send it.",
"formData": {
"Message": "Yesterday: Finished API integration\nToday: Writing tests\nBlockers: None"
}
"goal": "open #standup channel, type the message and send it",
"formData": { "Message": "yesterday: api integration\ntoday: tests\nblockers: none" }
}
]
}
```
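Steps run in order, so a workflow can span several apps. As a purely illustrative sketch (this is not the shipped `examples/weather-to-whatsapp.json`; the goals are made up, and the package names are just the stock ones for the Google app and WhatsApp), a two-step cross-app workflow could look like:
```json
{
  "name": "weather to whatsapp",
  "steps": [
    {
      "app": "com.google.android.googlequicksearchbox",
      "goal": "search for today's weather and note the forecast"
    },
    {
      "app": "com.whatsapp",
      "goal": "open the chat with Mom and send a short summary of today's weather"
    }
  ]
}
```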
## What it can do
35 ready-to-use workflows in `examples/` — messaging, social media, productivity, research, lifestyle.
22 actions + 6 multi-step skills. Some example goals:
## deterministic flows
```
Open WhatsApp and send "I'm running late" to Mom
Turn on WiFi
Search Google for "best restaurants near me"
Open YouTube and play the first trending video
Copy tracking number from Amazon and search it on Google
```
## LLM providers
Pick one. They all work.
| Provider | Cost | Vision | Best for |
|---|---|---|---|
| **Groq** | Free tier | No | Getting started fast |
| **OpenRouter** | Pay per token | Yes | 200+ models (Claude, Gemini, etc.) |
| **OpenAI** | Pay per token | Yes | Best accuracy with GPT-4o |
| **AWS Bedrock** | Pay per token | Yes | Enterprise / Claude on AWS |
for repeatable tasks that don't need ai, use yaml flows:
```bash
# Groq (recommended to start)
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_your_key_here
GROQ_MODEL=llama-3.3-70b-versatile
# OpenRouter
LLM_PROVIDER=openrouter
OPENROUTER_API_KEY=sk-or-v1-your_key_here
OPENROUTER_MODEL=anthropic/claude-3.5-sonnet
# OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-your_key_here
OPENAI_MODEL=gpt-4o
# AWS Bedrock (uses aws configure credentials)
LLM_PROVIDER=bedrock
AWS_REGION=us-east-1
BEDROCK_MODEL=anthropic.claude-3-sonnet-20240229-v1:0
bun run src/kernel.ts --flow examples/flows/send-whatsapp.yaml
```
## Config
no llm calls, just step-by-step adb commands.
All in `.env`. Here's what matters:
## providers
| Setting | Default | What it does |
| provider | cost | vision | notes |
|---|---|---|---|
| groq | free tier | no | fastest to start |
| openrouter | per token | yes | 200+ models |
| openai | per token | yes | gpt-4o |
| bedrock | per token | yes | claude on aws |
## config
all in `.env`:
| key | default | what |
|---|---|---|
| `MAX_STEPS` | 30 | Steps before giving up |
| `STEP_DELAY` | 2 | Seconds between actions (UI settle time) |
| `STUCK_THRESHOLD` | 3 | Steps before stuck-loop recovery kicks in |
| `VISION_MODE` | fallback | `off` / `fallback` (screenshot when accessibility tree is empty) / `always` |
| `MAX_ELEMENTS` | 40 | UI elements sent to LLM (scored & ranked) |
| `MAX_HISTORY_STEPS` | 10 | Past steps kept in conversation context |
| `STREAMING_ENABLED` | true | Stream LLM responses token-by-token |
| `LOG_DIR` | logs | Session logs directory |
| `MAX_STEPS` | 30 | steps before giving up |
| `STEP_DELAY` | 2 | seconds between actions |
| `STUCK_THRESHOLD` | 3 | steps before stuck recovery |
| `VISION_MODE` | fallback | `off` / `fallback` / `always` |
| `MAX_ELEMENTS` | 40 | ui elements sent to llm |
## How it works
## how it works
Each step: dump accessibility tree → score & filter elements → optionally screenshot → send to LLM → execute action → log → repeat.
each step: dump accessibility tree → filter elements → send to llm → execute action → repeat.
The LLM thinks before acting:
the llm thinks before acting — returns `{ think, plan, action }`. if the screen doesn't change for 3 steps, stuck recovery kicks in. when the accessibility tree is empty (webviews, flutter), it falls back to screenshots.
```json
{
"think": "Search field is focused. I should type the query.",
"plan": ["Launch YouTube", "Tap search", "Type query", "Submit"],
"planProgress": "Step 3: typing query",
"action": "type",
"text": "lofi hip hop"
}
```
**Stuck detection** — if the screen doesn't change for 3 steps, the kernel tells the LLM to try a different approach.
**Vision fallback** — when the accessibility tree is empty (games, WebViews, Flutter), it falls back to sending a screenshot.
**Conversation memory** — the LLM sees its full history of observations and decisions, so it won't repeat itself.
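For orientation, here is a minimal sketch of that loop in TypeScript. The helper names, the `Decision` shape, and the control-flow details are hypothetical simplifications; the real loop lives in `src/kernel.ts`.
```typescript
// Hypothetical sketch of the perception → reasoning → action loop.
// Helper names and the Decision shape are illustrative, not the real API.
type Decision = { think: string; action: string; text?: string; coordinates?: [number, number] };

declare function dumpAccessibilityTree(): Promise<string>;        // adb shell uiautomator dump
declare function filterAndRank(xml: string, max: number): string; // score & keep the top elements
declare function takeScreenshot(): Promise<string>;               // vision fallback
declare function askLLM(input: { goal: string; elements: string; screenshot?: string; stuck: boolean }): Promise<Decision>;
declare function executeAction(d: Decision): Promise<void>;       // tap / type / swipe via adb

async function runGoal(goal: string, maxSteps = 30): Promise<void> {
  let lastScreen = "";
  let unchangedSteps = 0;

  for (let step = 1; step <= maxSteps; step++) {
    const xml = await dumpAccessibilityTree();                    // perception
    const elements = filterAndRank(xml, 40);                      // MAX_ELEMENTS
    const screenshot = xml.trim() ? undefined : await takeScreenshot();

    unchangedSteps = xml === lastScreen ? unchangedSteps + 1 : 0; // stuck detection
    lastScreen = xml;

    const decision = await askLLM({ goal, elements, screenshot, stuck: unchangedSteps >= 3 });
    if (decision.action === "done") return;                       // goal reached

    await executeAction(decision);                                // action
    await Bun.sleep(2000);                                        // STEP_DELAY: let the UI settle
  }
}
```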
## Architecture
## source
```
src/
kernel.ts — Main agent loop
actions.ts — 22 actions + ADB retry logic
skills.ts — 6 multi-step skills (read_screen, submit_message, etc.)
workflow.ts — Workflow orchestration engine
llm-providers.ts — 4 LLM providers + system prompt
sanitizer.ts — Accessibility XML parser + smart filtering
config.ts — Env config
constants.ts — Keycodes, coordinates, defaults
logger.ts — Session logging
kernel.ts main loop
actions.ts 22 actions + adb retry
skills.ts 6 multi-step skills
workflow.ts workflow orchestration
flow.ts yaml flow runner
llm-providers.ts 4 providers + system prompt
sanitizer.ts accessibility xml parser
config.ts env config
constants.ts keycodes, coordinates
logger.ts session logging
```
## Commands
## troubleshooting
```bash
bun install # Install dependencies
bun run src/kernel.ts # Start the agent
bun run build # Compile to dist/
bun run typecheck # Type-check (tsc --noEmit)
```
**"adb: command not found"** — install adb or set `ADB_PATH` in `.env`
## Troubleshooting
**"no devices found"** — check usb debugging is on, tap "allow" on the phone
**"adb: command not found"** — Install ADB or set `ADB_PATH=/full/path/to/adb` in `.env`.
**agent repeating** — stuck detection handles this. if it persists, use a better model
**"no devices found"** — Run `adb devices`. Check USB debugging is enabled and you tapped "Allow" on the phone.
## license
**Agent keeps repeating the same action** — Stuck loop detection handles this automatically. If it persists, try a more capable model (GPT-4o, Claude).
**High token usage** — Set `VISION_MODE=off`, lower `MAX_ELEMENTS` to 20, lower `MAX_HISTORY_STEPS` to 5, or use a cheaper model.
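For example, a lower-token `.env` profile using those settings:
```bash
# reduce token usage (defaults: VISION_MODE=fallback, MAX_ELEMENTS=40, MAX_HISTORY_STEPS=10)
VISION_MODE=off
MAX_ELEMENTS=20
MAX_HISTORY_STEPS=5
```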
## Docs
- [Use Cases](docs/use-cases.md) — 50+ examples across 15 categories
- [ADB Commands](docs/adb-commands.md) — 750+ shell commands reference
- [Capabilities & Limitations](docs/capabilities-and-limitations.md)
## License
MIT
mit