diff --git a/README.md b/README.md
index 8f0e25e..4ccae81 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,12 @@
 # droidclaw

-> **experimental.** i wanted to build something to turn my old android devices into ai agents. after a few hours reverse engineering accessibility trees and the kernel and playing with tailscale.. it worked.
+> an ai agent that controls your android phone. give it a goal in plain english — it figures out what to tap, type, and swipe.

-ai agent that controls your android phone. give it a goal in plain english - it figures out what to tap, type, and swipe. it reads the screen, asks an llm what to do, executes via adb, and repeats until the job is done.
+i wanted to turn my old android devices into ai agents. after a few hours reverse engineering accessibility trees and playing with tailscale... it worked.

-one of the biggest things it can do right now is delegate incoming requests to chatgpt, gemini, or google search on the device... and give us the result back. few years back we could run this kind of automation with predefined flows. now think of this as automation with ai intelligence... it can do stuff. you don't need to worry about messy api's. just install your fav apps, write workflows or tell them on the fly. it will get it done.
+think of it this way — a few years back, we could automate android with predefined flows. now imagine that automation layer has an llm brain. it can read any screen, understand what's happening, decide what to do, and execute. you don't need apis. you don't need to build integrations. just install your favourite apps and tell the agent what you want done.
+
+one of the coolest things it can do right now is delegate incoming requests to chatgpt, gemini, or google search on the device... and bring the result back. no api keys for those services needed — it just uses the apps like a human would.

 ```
 $ bun run src/kernel.ts
@@ -30,74 +32,207 @@ think: search results showing. done.
 action: done (412ms)
 ```

+---
+
+## how it works
+
+the core idea is dead simple — a **perception → reasoning → action** loop that repeats until the goal is done (or it runs out of steps).
+
+```
+  ┌─────────────────────────────────────────┐
+  │  your goal                              │
+  │  "send good morning to mom on whatsapp" │
+  └────────────────┬────────────────────────┘
+                   │
+                   ▼
+  ┌─────────────────────────────────────────────────┐
+  │                                                 │
+  │   ┌──────────────┐                              │
+  │   │ 1. perceive  │                              │
+  │   └──────┬───────┘                              │
+  │          │                                      │
+  │   dump accessibility tree via adb               │
+  │   parse xml → interactive ui elements           │
+  │   diff with previous screen (detect changes)    │
+  │   optionally capture screenshot                 │
+  │          │                                      │
+  │          ▼                                      │
+  │   ┌──────────────┐                              │
+  │   │ 2. reason    │                              │
+  │   └──────┬───────┘                              │
+  │          │                                      │
+  │   send screen state + goal + history to llm     │
+  │   llm returns { think, plan, action }           │
+  │   "i see the search icon at (890, 156).         │
+  │    i should tap it."                            │
+  │          │                                      │
+  │          ▼                                      │
+  │   ┌──────────────┐                              │
+  │   │ 3. act       │                              │
+  │   └──────┬───────┘                              │
+  │          │                                      │
+  │   execute via adb: tap, type, swipe, etc.       │
+  │   feed result back to llm on next step          │
+  │   check if goal is done                         │
+  │          │                                      │
+  │          ▼                                      │
+  │   done? ─────── yes ──→ exit                    │
+  │    │                                            │
+  │    no                                           │
+  │    │                                            │
+  │    └─────── loop back to perceive               │
+  │                                                 │
+  └─────────────────────────────────────────────────┘
+```
+
+### what makes it not fall apart
+
+llms controlling uis sounds fragile. and it is, if you don't handle the failure modes. here's what droidclaw does:
+
+- **stuck loop detection** — if the screen doesn't change for 3 steps, recovery hints get injected into the prompt. 
context-aware hints based on what type of action is failing (tap vs swipe vs wait). +- **repetition tracking** — a sliding window of recent actions catches retry loops even across screen changes. if the agent taps the same coordinates 3+ times, it gets told to stop and try something else. +- **drift detection** — if the agent spams navigation actions (swipe, back, wait) without interacting with anything, it gets nudged to take direct action. +- **vision fallback** — when the accessibility tree is empty (webviews, flutter apps, games), a screenshot gets sent to the llm instead, with coordinate-based tap suggestions. +- **action feedback** — every action result (success/failure + message) gets fed back to the llm on the next step. the agent knows whether its last move worked. +- **multi-turn memory** — conversation history is maintained across steps so the llm has context about what it already tried. + +--- + ## setup +### quick install + ```bash curl -fsSL https://droidclaw.ai/install.sh | sh ``` -installs bun and adb if missing, clones the repo, sets up `.env`. or do it manually: +this installs bun and adb if missing, clones the repo, and sets up `.env`. + +### manual install + +**prerequisites:** + +- [bun](https://bun.sh) (required — node/npm won't work. droidclaw uses bun-specific apis like `Bun.spawnSync` and native `.env` loading) +- [adb](https://developer.android.com/tools/adb) (android debug bridge — comes with android sdk platform tools) +- an android phone with usb debugging enabled +- an llm provider api key (or ollama for fully local) ```bash # install adb +# macos: brew install android-platform-tools +# linux: +sudo apt install android-tools-adb +# windows: +# download from https://developer.android.com/tools/releases/platform-tools -# install bun (required — npm/node won't work) +# install bun curl -fsSL https://bun.sh/install | bash # clone and setup git clone https://github.com/unitedbyai/droidclaw.git -cd droidclaw && bun install +cd droidclaw +bun install cp .env.example .env ``` -> **note:** droidclaw requires [bun](https://bun.sh), not node/npm. it uses bun-specific apis (`Bun.spawnSync`, native `.env` loading) that don't exist in node. +### configure your llm -edit `.env` - fastest way to start is with groq (free tier): +edit `.env` and pick a provider. fastest way to start is groq (free tier): ```bash LLM_PROVIDER=groq GROQ_API_KEY=gsk_your_key_here ``` -or run fully local with [ollama](https://ollama.com) (no api key needed): +or run fully local with [ollama](https://ollama.com) (no api key, no internet needed): ```bash ollama pull llama3.2 +# then in .env: LLM_PROVIDER=ollama OLLAMA_MODEL=llama3.2 ``` -connect your phone (usb debugging on): +### connect your phone + +1. go to **settings → about phone → tap "build number" 7 times** to enable developer options +2. go to **settings → developer options → enable "usb debugging"** +3. plug in via usb and tap "allow" on the phone when prompted ```bash adb devices # should show your device -bun run src/kernel.ts ``` -that's the simplest way - just type a goal and let the agent figure it out. but for anything you want to run repeatedly, there are two modes: **workflows** and **flows**. +### run it -## workflows +```bash +bun run src/kernel.ts +# type your goal and press enter +``` -workflows are ai-powered. you describe goals in natural language, and the llm decides how to navigate, what to tap, what to type. use these when the ui might change, when you need the agent to think, or when chaining goals across multiple apps. 
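+
+when you hit enter, the loop from "how it works" takes over. here's that loop compressed into a typescript sketch, with illustrative names only (the real kernel.ts also does stuck detection, vision fallback, and logging):
+
+```ts
+// sketch: one run of the agent loop (names are illustrative, not the real api)
+type Action = { name: string; [arg: string]: unknown };
+type Decision = { think: string; plan: string; action: Action };
+
+// stand-ins for sanitizer.ts, llm-providers.ts and actions.ts
+declare function perceive(): Promise<string[]>;
+declare function reason(goal: string, screen: string[], history: string[]): Promise<Decision>;
+declare function act(action: Action): Promise<string>;
+
+export async function runAgent(goal: string, maxSteps = 30): Promise<string> {
+  const history: string[] = [];
+  for (let step = 1; step <= maxSteps; step++) {
+    const screen = await perceive();                              // 1. perceive: adb dump → ui elements
+    const { think, action } = await reason(goal, screen, history); // 2. reason: ask the llm
+    console.log(`think: ${think}`);
+    if (action.name === "done") return "goal reached";
+    const result = await act(action);                             // 3. act: execute via adb
+    history.push(`${action.name}: ${result}`);                    // feedback for the next step
+    await Bun.sleep(2000);                                        // STEP_DELAY: let the ui settle
+  }
+  throw new Error("goal not reached within MAX_STEPS");
+}
+```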
+
+---
+
+## three ways to use it
+
+droidclaw has three modes, each for a different use case:
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                                                                     │
+│   interactive mode        workflows              flows              │
+│   ─────────────────       ─────────────────      ─────────────────  │
+│                                                                     │
+│   type a goal and         chain goals            fixed sequences    │
+│   the agent figures       across multiple        of taps and types. │
+│   it out on the fly.      apps with ai.          no llm, instant.   │
+│                                                                     │
+│   $ bun run               --workflow             --flow             │
+│     src/kernel.ts           file.json              file.yaml        │
+│                                                                     │
+│   best for:               best for:              best for:          │
+│   one-off tasks,          multi-app tasks,       things you do      │
+│   exploration,            recurring routines,    exactly the same   │
+│   quick commands          morning briefings      way every time     │
+│                                                                     │
+│   uses llm: yes           uses llm: yes          uses llm: no       │
+│                                                                     │
+└─────────────────────────────────────────────────────────────────────┘
+```
+
+### interactive mode
+
+just type what you want:
+
+```bash
+bun run src/kernel.ts
+# enter your goal: open settings and turn on dark mode
+```
+
+### workflows (ai-powered, multi-app)
+
+workflows are json files describing a sequence of sub-goals. each step can optionally switch to a different app. the llm decides how to navigate, what to tap, what to type.

 ```bash
 bun run src/kernel.ts --workflow examples/workflows/research/weather-to-whatsapp.json
 ```

-each workflow is a json file - just a name and a list of steps:
-
 ```json
 {
   "name": "weather to whatsapp",
   "steps": [
-    { "app": "com.google.android.googlequicksearchbox", "goal": "search for chennai weather today" },
-    { "goal": "share the result to whatsapp contact Sanju" }
+    {
+      "app": "com.google.android.googlequicksearchbox",
+      "goal": "search for chennai weather today"
+    },
+    {
+      "goal": "share the result to whatsapp contact Sanju"
+    }
   ]
 }
 ```

-you can also pass form data into steps when you need to inject specific text:
+you can inject specific data into steps using `formData`:

 ```json
 {
@@ -106,65 +241,17 @@ you can also pass form data into steps when you need to inject specific text:
   {
     "app": "com.Slack",
     "goal": "open #standup channel, type the message and send it",
-    "formData": { "Message": "yesterday: api integration\ntoday: tests\nblockers: none" }
+    "formData": {
+      "Message": "yesterday: api integration\ntoday: tests\nblockers: none"
+    }
   }
 ]
 }
 ```
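+
+the whole format fits in two small types. a sketch in typescript (the field names are taken from the json examples above; `loadWorkflow` is a hypothetical helper, not the repo's actual loader):
+
+```ts
+// sketch: the shape of a workflow file (fields from the json examples above)
+type WorkflowStep = {
+  app?: string;                      // package to switch to before this step (optional)
+  goal: string;                      // natural-language sub-goal handed to the llm
+  formData?: Record<string, string>; // exact text to inject into named fields
+};
+
+type Workflow = { name: string; steps: WorkflowStep[] };
+
+// hypothetical loader; the real workflow.ts may validate more strictly
+async function loadWorkflow(path: string): Promise<Workflow> {
+  return (await Bun.file(path).json()) as Workflow;
+}
+```
+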
-### examples
+### flows (no ai, instant execution)

-35 ready-to-use workflows organised by category:
-
-**[messaging](examples/workflows/messaging/)** - whatsapp, telegram, slack, email
-- [slack-standup](examples/workflows/messaging/slack-standup.json) - post daily standup to a channel
-- [whatsapp-broadcast](examples/workflows/messaging/whatsapp-broadcast.json) - send a message to multiple contacts
-- [telegram-send-message](examples/workflows/messaging/telegram-send-message.json) - send a telegram message
-- [email-reply](examples/workflows/messaging/email-reply.json) - draft and send an email reply
-- [whatsapp-to-email](examples/workflows/messaging/whatsapp-to-email.json) - forward whatsapp messages to email
-- [slack-check-messages](examples/workflows/messaging/slack-check-messages.json) - read unread slack messages
-- [email-digest](examples/workflows/messaging/email-digest.json) - summarise recent emails
-- [telegram-channel-digest](examples/workflows/messaging/telegram-channel-digest.json) - digest a telegram channel
-- [whatsapp-reply](examples/workflows/messaging/whatsapp-reply.json) - reply to a whatsapp message
-- [send-whatsapp-vi](examples/workflows/messaging/send-whatsapp-vi.json) - send whatsapp to a specific contact
-
-**[social](examples/workflows/social/)** - instagram, youtube, cross-posting
-- [social-media-post](examples/workflows/social/social-media-post.json) - post across platforms
-- [social-media-engage](examples/workflows/social/social-media-engage.json) - like/comment on posts
-- [instagram-post-check](examples/workflows/social/instagram-post-check.json) - check recent instagram posts
-- [youtube-watch-later](examples/workflows/social/youtube-watch-later.json) - save videos to watch later
-
-**[productivity](examples/workflows/productivity/)** - calendar, notes, github, notifications
-- [morning-briefing](examples/workflows/productivity/morning-briefing.json) - read messages, calendar, weather across apps
-- [github-check-prs](examples/workflows/productivity/github-check-prs.json) - check open pull requests
-- [calendar-create-event](examples/workflows/productivity/calendar-create-event.json) - create a calendar event
-- [notes-capture](examples/workflows/productivity/notes-capture.json) - capture a quick note
-- [notification-cleanup](examples/workflows/productivity/notification-cleanup.json) - clear and triage notifications
-- [screenshot-share-slack](examples/workflows/productivity/screenshot-share-slack.json) - screenshot and share to slack
-- [translate-and-reply](examples/workflows/productivity/translate-and-reply.json) - translate a message and reply
-- [logistics-workflow](examples/workflows/productivity/logistics-workflow.json) - multi-app logistics coordination
-
-**[research](examples/workflows/research/)** - search, compare, monitor
-- [weather-to-whatsapp](examples/workflows/research/weather-to-whatsapp.json) - get weather via google ai mode, share to whatsapp
-- [multi-app-research](examples/workflows/research/multi-app-research.json) - research across multiple apps
-- [price-comparison](examples/workflows/research/price-comparison.json) - compare prices across shopping apps
-- [news-roundup](examples/workflows/research/news-roundup.json) - collect news from multiple sources
-- [google-search-report](examples/workflows/research/google-search-report.json) - search google and save results
-- [check-flight-status](examples/workflows/research/check-flight-status.json) - check flight status
-
-**[lifestyle](examples/workflows/lifestyle/)** - food, transport, music, fitness
-- [food-order](examples/workflows/lifestyle/food-order.json) - order food from a delivery app
-- [uber-ride](examples/workflows/lifestyle/uber-ride.json) - book an uber ride
-- [spotify-playlist](examples/workflows/lifestyle/spotify-playlist.json) - create or add to a spotify playlist
-- [maps-commute](examples/workflows/lifestyle/maps-commute.json) - check commute time
-- [fitness-log](examples/workflows/lifestyle/fitness-log.json) - log a workout
-- [expense-tracker](examples/workflows/lifestyle/expense-tracker.json) - log an expense
-- [wifi-password-share](examples/workflows/lifestyle/wifi-password-share.json) - share wifi password
-- [do-not-disturb](examples/workflows/lifestyle/do-not-disturb.json) - toggle do not disturb with exceptions
-
-## flows
-
-for tasks where you don't need ai thinking at all - just a fixed sequence of taps and types. no llm calls, instant execution. good for things you do exactly the same way every time.
+for tasks where you don't need ai thinking — just a fixed sequence of taps and types. no llm calls, instant execution. think of it like a macro.

 ```bash
 bun run src/kernel.ts --flow examples/flows/send-whatsapp.yaml
 ```

@@ -184,17 +271,7 @@ name: Send WhatsApp Message
 - done: "Message sent"
 ```
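+
+under the hood, running a flow needs no intelligence at all: parse the yaml, execute each step, wait a beat, repeat. a simplified sketch (the step keys and the `yaml` package are assumptions based on the example above; the real flow.ts may differ):
+
+```ts
+// sketch: a deterministic flow runner; parse yaml, run each step via adb, no llm
+import { parse } from "yaml"; // assumption: the `yaml` npm package is available
+
+type FlowStep = Record<string, string>; // each step is a single { action: arg } pair
+
+async function runFlow(path: string): Promise<void> {
+  const flow = parse(await Bun.file(path).text()) as { name: string; steps: FlowStep[] };
+  for (const step of flow.steps) {
+    const [action, arg] = Object.entries(step)[0];
+    if (action === "done") { console.log(arg); return; }
+    execute(action, arg);
+    await Bun.sleep(500); // give the ui a moment to settle between steps
+  }
+}
+
+// illustrative stand-in for the adb-backed implementations in actions.ts
+function execute(action: string, arg: string): void {
+  if (action === "tap") {
+    const [x, y] = arg.split(",");
+    Bun.spawnSync(["adb", "shell", "input", "tap", x, y]);
+  } else if (action === "type") {
+    Bun.spawnSync(["adb", "shell", "input", "text", arg.replaceAll(" ", "%s")]);
+  } else if (action === "launch") {
+    Bun.spawnSync(["adb", "shell", "monkey", "-p", arg, "1"]);
+  }
+}
+```
+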
 
-### examples
-
-5 flow templates in [`examples/flows/`](examples/flows/):
-
-- [send-whatsapp](examples/flows/send-whatsapp.yaml) - send a whatsapp message
-- [google-search](examples/flows/google-search.yaml) - run a google search
-- [create-contact](examples/flows/create-contact.yaml) - add a new contact
-- [clear-notifications](examples/flows/clear-notifications.yaml) - clear all notifications
-- [toggle-wifi](examples/flows/toggle-wifi.yaml) - toggle wifi on/off
-
-## quick comparison
+### quick comparison

 | | workflows | flows |
 |---|---|---|
@@ -204,80 +281,214 @@ name: Send WhatsApp Message
 | speed | slower (llm calls) | instant |
 | best for | complex/multi-app tasks | simple repeatable tasks |

+---
+
+## example workflows
+
+36 ready-to-use workflows organised by category:
+
+**[messaging](examples/workflows/messaging/)** — whatsapp, telegram, slack, email
+- [slack-standup](examples/workflows/messaging/slack-standup.json) — post daily standup to a channel
+- [whatsapp-broadcast](examples/workflows/messaging/whatsapp-broadcast.json) — send a message to multiple contacts
+- [telegram-send-message](examples/workflows/messaging/telegram-send-message.json) — send a telegram message
+- [email-reply](examples/workflows/messaging/email-reply.json) — draft and send an email reply
+- [whatsapp-to-email](examples/workflows/messaging/whatsapp-to-email.json) — forward whatsapp messages to email
+- [slack-check-messages](examples/workflows/messaging/slack-check-messages.json) — read unread slack messages
+- [email-digest](examples/workflows/messaging/email-digest.json) — summarise recent emails
+- [telegram-channel-digest](examples/workflows/messaging/telegram-channel-digest.json) — digest a telegram channel
+- [whatsapp-reply](examples/workflows/messaging/whatsapp-reply.json) — reply to a whatsapp message
+- [send-whatsapp-vi](examples/workflows/messaging/send-whatsapp-vi.json) — send whatsapp to a specific contact
+
+**[social](examples/workflows/social/)** — instagram, youtube, cross-posting
+- [social-media-post](examples/workflows/social/social-media-post.json) — post across platforms
+- [social-media-engage](examples/workflows/social/social-media-engage.json) — like/comment on posts
+- [instagram-post-check](examples/workflows/social/instagram-post-check.json) — check recent instagram posts
+- [youtube-watch-later](examples/workflows/social/youtube-watch-later.json) — save videos to watch later
+
+**[productivity](examples/workflows/productivity/)** — calendar, notes, github, notifications
+- [morning-briefing](examples/workflows/productivity/morning-briefing.json) — read messages, calendar, weather across apps
+- [github-check-prs](examples/workflows/productivity/github-check-prs.json) — check open pull requests
+- [calendar-create-event](examples/workflows/productivity/calendar-create-event.json) — create a calendar event
+- [notes-capture](examples/workflows/productivity/notes-capture.json) — capture a quick note
+- [notification-cleanup](examples/workflows/productivity/notification-cleanup.json) — clear and triage notifications
+- [screenshot-share-slack](examples/workflows/productivity/screenshot-share-slack.json) — screenshot and share to slack
+- [translate-and-reply](examples/workflows/productivity/translate-and-reply.json) — translate a message and reply
+- [logistics-workflow](examples/workflows/productivity/logistics-workflow.json) 
— multi-app logistics coordination
+
+**[research](examples/workflows/research/)** — search, compare, monitor
+- [weather-to-whatsapp](examples/workflows/research/weather-to-whatsapp.json) — get weather via google, share to whatsapp
+- [multi-app-research](examples/workflows/research/multi-app-research.json) — research across multiple apps
+- [price-comparison](examples/workflows/research/price-comparison.json) — compare prices across shopping apps
+- [news-roundup](examples/workflows/research/news-roundup.json) — collect news from multiple sources
+- [google-search-report](examples/workflows/research/google-search-report.json) — search google and save results
+- [check-flight-status](examples/workflows/research/check-flight-status.json) — check flight status
+
+**[lifestyle](examples/workflows/lifestyle/)** — food, transport, music, fitness
+- [food-order](examples/workflows/lifestyle/food-order.json) — order food from a delivery app
+- [uber-ride](examples/workflows/lifestyle/uber-ride.json) — book an uber ride
+- [spotify-playlist](examples/workflows/lifestyle/spotify-playlist.json) — create or add to a spotify playlist
+- [maps-commute](examples/workflows/lifestyle/maps-commute.json) — check commute time
+- [fitness-log](examples/workflows/lifestyle/fitness-log.json) — log a workout
+- [expense-tracker](examples/workflows/lifestyle/expense-tracker.json) — log an expense
+- [wifi-password-share](examples/workflows/lifestyle/wifi-password-share.json) — share wifi password
+- [do-not-disturb](examples/workflows/lifestyle/do-not-disturb.json) — toggle do not disturb with exceptions
+
+**[flows](examples/flows/)** — 5 deterministic flow templates (no ai)
+- [send-whatsapp](examples/flows/send-whatsapp.yaml) — send a whatsapp message
+- [google-search](examples/flows/google-search.yaml) — run a google search
+- [create-contact](examples/flows/create-contact.yaml) — add a new contact
+- [clear-notifications](examples/flows/clear-notifications.yaml) — clear all notifications
+- [toggle-wifi](examples/flows/toggle-wifi.yaml) — toggle wifi on/off
+
+---
+
+## actions
+
+the agent has 30 actions it can use: 24 primitives and 6 multi-step skills. the primitives are the building blocks — each one maps to an adb command.
+
+**basic interactions:**
+`tap` `type` `enter` `longpress` `clear` `paste` `swipe` `scroll`
+
+**navigation:**
+`home` `back` `launch` `switch_app` `open_url` `open_settings` `notifications`
+
+**clipboard:**
+`clipboard_get` `clipboard_set`
+
+**multi-step skills** (compound actions that handle common patterns):
+`read_screen` `submit_message` `copy_visible_text` `wait_for_content` `find_and_tap` `compose_email`
+
+**system:**
+`screenshot` `shell` `keyevent` `pull_file` `push_file` `wait` `done`
+
+the multi-step skills are interesting — they replace 5-10 manual actions with a single call. for example, `read_screen` auto-scrolls through the entire screen, collects all text, and copies it to clipboard. `compose_email` fills To, Subject, and Body fields in the correct order using android intents. these dramatically reduce the number of llm decisions needed.
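+
+to make that concrete, here's roughly what an auto-scrolling read has to do. a simplified sketch, not the actual skills.ts code (the real version works on the parsed accessibility tree instead of regexing raw xml, and copies the result to the clipboard):
+
+```ts
+// sketch: a read_screen-style skill; scroll, collect text, stop when nothing new shows up
+function visibleText(): string[] {
+  // uiautomator writes the current accessibility tree to a file as xml
+  Bun.spawnSync(["adb", "shell", "uiautomator", "dump", "/sdcard/ui.xml"]);
+  const xml = Bun.spawnSync(["adb", "shell", "cat", "/sdcard/ui.xml"]).stdout.toString();
+  // crude extraction for the sketch; real code parses the xml into elements
+  return [...xml.matchAll(/text="([^"]*)"/g)].map((m) => m[1]).filter(Boolean);
+}
+
+async function readScreen(maxScrolls = 5): Promise<string> {
+  const seen = new Set<string>(visibleText());
+  for (let i = 0; i < maxScrolls; i++) {
+    // swipe up one screenful (coordinates assume a roughly 1080x2400 display)
+    Bun.spawnSync(["adb", "shell", "input", "swipe", "540", "1600", "540", "600", "300"]);
+    await Bun.sleep(800);
+    const before = seen.size;
+    for (const t of visibleText()) seen.add(t);
+    if (seen.size === before) break; // bottom reached: no new text appeared
+  }
+  return [...seen].join("\n");
+}
+```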
+
+---
+
 ## providers

 | provider | cost | vision | notes |
 |---|---|---|---|
-| groq | free tier | no | fastest to start |
-| ollama | free (local) | yes* | no api key, runs on your machine |
-| openrouter | per token | yes | 200+ models |
-| openai | per token | yes | gpt-4o |
-| bedrock | per token | yes | claude on aws |
+| groq | free tier | no | fastest to start, great for most tasks |
+| ollama | free (local) | yes* | no api key, runs entirely on your machine |
+| openrouter | per token | yes | 200+ models, single api |
+| openai | per token | yes | gpt-4o, strong reasoning |
+| bedrock | per token | yes | claude/llama on aws |

-*ollama vision requires a vision model like `llama3.2-vision` or `llava`
+*ollama vision requires a vision-capable model like `llama3.2-vision` or `llava`
+
+---

 ## config

-all in `.env`:
+all configuration lives in `.env`. here's what you can tweak:

-| key | default | what |
+| key | default | what it does |
 |---|---|---|
-| `MAX_STEPS` | 30 | steps before giving up |
-| `STEP_DELAY` | 2 | seconds between actions |
-| `STUCK_THRESHOLD` | 3 | steps before stuck recovery |
-| `VISION_MODE` | fallback | `off` / `fallback` / `always` |
-| `MAX_ELEMENTS` | 40 | ui elements sent to llm |
+| `LLM_PROVIDER` | groq | which llm to use (groq/openai/ollama/bedrock/openrouter) |
+| `MAX_STEPS` | 30 | how many steps before the agent gives up |
+| `STEP_DELAY` | 2 | seconds to wait between actions (lets the ui settle) |
+| `STUCK_THRESHOLD` | 3 | how many unchanged steps before stuck recovery kicks in |
+| `VISION_MODE` | fallback | `off` / `fallback` (only when accessibility tree is empty) / `always` |
+| `MAX_ELEMENTS` | 40 | max ui elements sent to the llm per step (scored & ranked) |
+| `MAX_HISTORY_STEPS` | 10 | how many past steps to keep in conversation context |
+| `STREAMING_ENABLED` | true | stream llm responses (shows progress dots) |
+| `LOG_DIR` | logs | directory for session json logs |

-## how it works
+---

-each step: dump accessibility tree → filter elements → send to llm → execute action → repeat.
+## source code

-the llm thinks before acting - returns `{ think, plan, action }`. if the screen doesn't change for 3 steps, stuck recovery kicks in. when the accessibility tree is empty (webviews, flutter), it falls back to screenshots.
-
-## source
+the entire agent is ~10 files in `src/`:

 ```
 src/
-  kernel.ts          main loop
-  actions.ts         22 actions + adb retry
-  skills.ts          6 multi-step skills
-  workflow.ts        workflow orchestration
-  flow.ts            yaml flow runner
-  llm-providers.ts   5 providers + system prompt
-  sanitizer.ts       accessibility xml parser
-  config.ts          env config
-  constants.ts       keycodes, coordinates
-  logger.ts          session logging
+├── kernel.ts          the main perception → reasoning → action loop
+├── actions.ts         24 action implementations (tap, type, swipe, etc.)
+├── skills.ts          6 multi-step skills (read_screen, compose_email, etc.)
+├── workflow.ts        workflow orchestration engine (multi-app sub-goals)
+├── flow.ts            yaml flow runner (deterministic, no llm)
+├── llm-providers.ts   5 providers + the system prompt that teaches the llm
+├── sanitizer.ts       accessibility xml parser → structured ui elements
+├── config.ts          env config loader with validation
+├── constants.ts       keycodes, swipe coordinates, defaults
+└── logger.ts          session logging (json, crash-safe partial writes)
 ```

+### data flow through the codebase
+
+```
+                    kernel.ts
+                        │
+           ┌────────────┼────────────────┐
+           │            │                │
+           ▼            ▼                ▼
+    sanitizer.ts  llm-providers.ts   actions.ts
+   (parse screen)   (ask the llm)  (execute via adb)
+                                         │
+                                         ├── skills.ts
+                                         │   (multi-step compound actions)
+                                         │
+   config.ts ◄────── all files read config
+   constants.ts ◄─── keycodes, coordinates
+
+   workflow.ts ── calls kernel.runAgent() per sub-goal
+   flow.ts ────── calls actions.executeAction() directly (no llm)
+   logger.ts ◄─── kernel writes step logs here
+```
+
+---
+
 ## remote control with tailscale

-the default setup is usb - phone plugged into your laptop. but you can go further.
+the default setup is usb — phone plugged into your laptop. but you can go much further.

-install [tailscale](https://tailscale.com) on both your android device and your laptop/vps. once they're on the same tailnet, connect adb over the network:
+install [tailscale](https://tailscale.com) on both your android device and your laptop/server. once they're on the same tailnet, connect adb over the network:

 ```bash
-# on your phone: enable wireless debugging (developer options → wireless debugging)
-# note the ip:port shown on the screen
+# on your phone: enable wireless debugging
+# settings → developer options → wireless debugging
+# note the ip:port shown

-# from your laptop/vps, anywhere in the world:
+# from anywhere in the world:
 adb connect <ip>:<port>
 adb devices   # should show your phone
 bun run src/kernel.ts
 ```

-now your phone is a remote ai agent. leave it on a desk, plugged into power, and control it from your vps, your laptop at a cafe, or a cron job running workflows at 8am every morning. the phone doesn't need to be on the same wifi or even in the same country.
+now your phone is a remote ai agent. leave it on a desk plugged into power, and control it from a vps, your laptop at a cafe, or a cron job running workflows every morning at 8am. the phone doesn't need to be on the same wifi or even in the same country.

-this is what makes old android devices useful again - they become always-on agents that can do things on apps that don't have api's.
+this is what makes old android devices useful again — they become always-on agents that can do things on apps that don't have apis.
+
+---
+
+## commands
+
+```bash
+bun run src/kernel.ts                        # interactive mode (prompts for goal)
+bun run src/kernel.ts --workflow file.json   # run a workflow
+bun run src/kernel.ts --flow file.yaml       # run a deterministic flow
+bun install                                  # install dependencies
+bun run build:agent                          # compile the agent to dist/
+bun run typecheck                            # type-check (tsc --noEmit)
+```
+
+---
+
 ## troubleshooting

-**"adb: command not found"** - install adb or set `ADB_PATH` in `.env`
+**"adb: command not found"** — install adb (`brew install android-platform-tools` on mac) or set `ADB_PATH` in `.env` to point to your adb binary.

-**"no devices found"** - check usb debugging is on, tap "allow" on the phone
+**"no devices found"** — make sure usb debugging is enabled, you've tapped "allow" on the phone, and the cable supports data transfer (not just charging). 
-**agent repeating** - stuck detection handles this. if it persists, use a better model +**agent keeps repeating the same action** — stuck detection should handle this automatically. if it persists, try a stronger model (groq's llama-3.3-70b or openai's gpt-4o). + +**empty accessibility tree** — some apps (flutter, webviews, games) don't expose accessibility info. set `VISION_MODE=always` in `.env` to send screenshots every step instead. + +**swipe coordinates seem off** — droidclaw auto-detects screen resolution at startup. if your device has an unusual resolution, check the console output on step 1 for the detected resolution. + +--- ## contributors @@ -288,7 +499,7 @@ built by [unitedby.ai](https://unitedby.ai) — an open ai community ## acknowledgements -droidclaw's workflow orchestration was influenced by [android action kernel](https://github.com/Action-State-Labs/android-action-kernel) from action state labs. we took the core idea of sub-goal decomposition and built a different system around it — with stuck recovery, 22 actions, multi-step skills, and vision fallback. +droidclaw's workflow orchestration was influenced by [android action kernel](https://github.com/Action-State-Labs/android-action-kernel) from action state labs. we took the core idea of sub-goal decomposition and built a different system around it — with stuck recovery, 28 actions, multi-step skills, and vision fallback. ## license diff --git a/package.json b/package.json index a647183..4192edc 100644 --- a/package.json +++ b/package.json @@ -1,12 +1,20 @@ { "name": "droidclaw", - "workspaces": ["packages/*", "server"], + "private": true, + "packageManager": "bun@1.3.5", + "workspaces": ["packages/*", "server", "web"], "version": "1.0.0", "description": "AI agent that takes control of your Android phone — give it a goal, it figures out the taps", "type": "module", "scripts": { - "start": "bun run src/kernel.ts", - "build": "bun build src/kernel.ts --outdir dist --target bun", + "start": "bun run --cwd web start", + "build": "bun run --cwd web build", + "start:agent": "bun run src/kernel.ts", + "build:agent": "bun build src/kernel.ts --outdir dist --target bun", + "build:web": "bun run --cwd web build", + "dev:web": "bun run --cwd web dev", + "dev:server": "bun run --cwd server dev", + "db:push": "bun run --cwd web db:push", "typecheck": "tsc --noEmit" }, "dependencies": {