Android Action Kernel — Capabilities & Limitations
Built-in Actions (15)
| # | Action | What it does | ADB Command |
|---|---|---|---|
| 1 | tap | Tap at x,y coordinates | input tap x y |
| 2 | longpress | Long press at x,y (context menus, drag) | input swipe x y x y 1000 |
| 3 | type | Type text into focused field | input text "..." |
| 4 | enter | Press Enter/Submit key | input keyevent 66 |
| 5 | clear | Select all + delete in focused field | keyevent MOVE_END → MOVE_HOME → DEL |
| 6 | swipe | Swipe up/down/left/right (scrolling) | input swipe x1 y1 x2 y2 300 |
| 7 | home | Go to home screen | input keyevent KEYCODE_HOME |
| 8 | back | Press back button | input keyevent KEYCODE_BACK |
| 9 | launch | Open app by package, activity, or URI | am start / monkey -p |
| 10 | clipboard_get | Read clipboard contents | cmd clipboard get-text |
| 11 | clipboard_set | Write text to clipboard | cmd clipboard set-text "..." |
| 12 | screenshot | Capture screen and pull to local file | screencap -p + adb pull |
| 13 | wait | Sleep 2 seconds (wait for UI to load) | Bun.sleepSync(2000) |
| 14 | shell | Run any arbitrary ADB shell command | adb shell <command> |
| 15 | done | Signal task completion, stop the loop | (internal) |
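
To make the table concrete, here is a minimal sketch of how a few of these actions could be turned into the words that follow `adb shell`. The `Action` type and the `toShellArgs` and `encodeInputText` helpers are hypothetical names invented for this example, not the kernel's actual code. The `%s` substitution is the escape that `input text` recognizes for a space; other shell metacharacters are deliberately left unhandled.

```typescript
// Hypothetical action shapes for illustration; the kernel's real types are not shown here.
type Action =
  | { kind: "tap"; x: number; y: number }
  | { kind: "longpress"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "enter" }
  | { kind: "swipe"; x1: number; y1: number; x2: number; y2: number }
  | { kind: "home" }
  | { kind: "back" };

// Literal spaces tend to be split or dropped by the device shell;
// `input text` accepts %s as a stand-in for a space.
// Quotes, &, ; and other shell metacharacters are NOT handled in this sketch.
function encodeInputText(text: string): string {
  return text.replace(/ /g, "%s");
}

// Builds the words that follow `adb shell` for one action, following the table.
function toShellArgs(action: Action): string[] {
  switch (action.kind) {
    case "tap":
      return ["input", "tap", String(action.x), String(action.y)];
    case "longpress": {
      // A swipe that starts and ends on the same point, held for 1000 ms.
      const x = String(action.x);
      const y = String(action.y);
      return ["input", "swipe", x, y, x, y, "1000"];
    }
    case "type":
      return ["input", "text", encodeInputText(action.text)];
    case "enter":
      return ["input", "keyevent", "66"];
    case "swipe":
      return [
        "input", "swipe",
        String(action.x1), String(action.y1),
        String(action.x2), String(action.y2),
        "300",
      ];
    case "home":
      return ["input", "keyevent", "KEYCODE_HOME"];
    case "back":
      return ["input", "keyevent", "KEYCODE_BACK"];
  }
}

console.log(toShellArgs({ kind: "type", text: "hello world" }).join(" "));
// prints "input text hello%sworld"
```

Returning an argument array rather than a single string keeps quoting decisions in one place when the arguments are later handed to a process spawner.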
Cannot Do At All
| Limitation | Reason |
|---|---|
| Read screen content directly | Must rely on uiautomator dump (accessibility XML) — if an app doesn't expose accessibility nodes, the agent is blind |
| Interact with secure/banking apps | Windows marked FLAG_SECURE come back blank in screenshots, and such apps often expose little or nothing in UI dumps, so the agent gets empty data |
| Handle biometrics | Cannot simulate fingerprint or face unlock (hardware-level security) |
| Bypass lock screen | Cannot enter PIN/pattern via ADB on encrypted devices (pre-boot state) |
| Access other apps' private data | /data/data/pkg/ requires root access |
| Install from Play Store | Can sideload APKs via pm install, but cannot interact with Play Store purchase/install flow |
| Control phone calls | Can open dialer (am start tel:...) but cannot control the call itself (answer, hang up, conference) |
| Read SMS content | Restricted since Android 10 without default SMS app permission |
| Access camera/microphone streams | Can trigger camera app but cannot capture or process the live feed |
| Modify system partitions | /system is read-only without root |
| Grant all permissions silently | Some runtime permissions require on-device user interaction |
| Multi-finger gestures | ADB input only supports single-touch — no pinch-to-zoom, two-finger swipe, or rotation gestures |
Unreliable / Partially Working
| Limitation | Reason |
|---|---|
| Custom UI frameworks (Flutter, React Native, games) | uiautomator dump returns empty or useless XML — falls back to vision-based coordinate tapping |
| WebViews | Accessibility tree is often incomplete inside embedded browsers |
| Precise timing gestures | Double-tap, quick swipe, fling — timing is inconsistent over ADB |
| Notification interaction | Can expand the shade, but reading/tapping individual notifications is flaky via accessibility tree |
| Drag and drop | input draganddrop exists but is unreliable across Android versions |
| Clipboard on Android 12+ | cmd clipboard is increasingly restricted, and the system shows a toast when an app reads the clipboard |
| Fast typing | input text is slow for long strings, and some keyboards intercept or modify input |
| CAPTCHAs / bot detection | Some apps detect ADB-driven input patterns and block interaction |
| Screen state on custom launchers | Some launchers produce non-standard accessibility trees that confuse element parsing |
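
One way to decide when a dump is "empty or useless" is to count how many nodes carry a label. The sketch below is a heuristic under stated assumptions: `countLabeledNodes` and `shouldFallBackToVision` are hypothetical helpers, the threshold is arbitrary, and the regex scan assumes the usual `<node ... />` layout of uiautomator XML rather than doing a full XML parse.

```typescript
// Counts <node> tags in a uiautomator dump that have a non-empty
// text or content-desc attribute. Regex-based, so it is only a heuristic.
function countLabeledNodes(xml: string): number {
  const nodeTags = xml.match(/<node\b[^>]*>/g) ?? [];
  return nodeTags.filter((tag) =>
    /\s(?:text|content-desc)="[^"]+"/.test(tag),
  ).length;
}

// Hypothetical policy: if fewer than `minLabeled` nodes are labeled,
// treat the accessibility tree as unusable and switch to vision-based tapping.
function shouldFallBackToVision(xml: string, minLabeled = 1): boolean {
  return countLabeledNodes(xml) < minLabeled;
}

// A Flutter-style dump: one opaque view with no labels.
const opaque =
  '<hierarchy rotation="0"><node index="0" text="" class="android.view.View" content-desc="" /></hierarchy>';

// A native-style dump: a labeled button.
const labeled =
  '<hierarchy rotation="0"><node index="0" text="Sign in" class="android.widget.Button" content-desc="" /></hierarchy>';

console.log(shouldFallBackToVision(opaque));  // prints true
console.log(shouldFallBackToVision(labeled)); // prints false
```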
Needs Root (Not Available by Default)
| Capability | What root unlocks |
|---|---|
| Read/write /data/data/ | Access any app's private storage, databases, shared preferences |
| Read/write /system/ | Modify system files, replace system apps |
| Capture network traffic | Run tcpdump for packet capture |
| Change MAC address | Spoof network hardware identity |
| Modify hosts file | Block domains, redirect traffic |
| Access keystore/credentials | Read stored accounts and tokens |
| Disable SELinux | Remove Android's mandatory access control |
| Full logcat from all apps | Read logs from all processes without filtering |
| Install as system app | Survive factory resets, gain system-level permissions |
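
Before attempting any of the above, an agent could check whether root is available at all. A minimal sketch, assuming the `shell` action is used to run `su -c id` and the output is passed to a hypothetical `isRootShell` helper; a rooted shell reports `uid=0`, while an unrooted device typically prints an error or the unprivileged `shell` user.

```typescript
// Interprets the output of `adb shell su -c id`.
// Returns true only when the reported uid is 0 (root).
function isRootShell(idOutput: string): boolean {
  return /^uid=0\(/.test(idOutput.trim());
}

console.log(isRootShell("uid=0(root) gid=0(root) groups=0(root) context=u:r:su:s0"));
console.log(isRootShell("uid=2000(shell) gid=2000(shell)"));
console.log(isRootShell("/system/bin/sh: su: inaccessible or not found"));
```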
Architecture-Level Gaps
| Gap | Impact |
|---|---|
| No OCR | Cannot read text from screenshots natively — relies entirely on accessibility XML text fields. If text isn't in the XML, it's invisible to the agent |
| No audio processing | Cannot hear, record, or process audio output from the device |
| No real-time streaming | Screen state is polled (dump → parse → act), not continuous — misses animations, transient toasts, loading states |
| Single device only | The kernel controls one device at a time, no multi-device orchestration |
| Resolution assumptions | Swipe coords use ratio-based calculation from computeSwipeCoords(), but LLM-suggested tap coordinates may be wrong on unusual resolutions or aspect ratios |
| No state persistence across runs | Each run starts fresh — no memory of previous sessions, learned app layouts, or cached element positions |
| Network latency to LLM | Each perception → reasoning → action cycle includes an LLM API round-trip (200ms–2s), making the agent slow for time-sensitive interactions |
| No parallel actions | Actions execute sequentially — cannot tap two things simultaneously or perform background monitoring while acting |
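
The resolution gap is easier to see with numbers. The sketch below is not the kernel's `computeSwipeCoords()`, whose signature is not shown in this document; `parseWmSize` and `swipeByRatio` are hypothetical helpers, and the 30%/70% ratios are assumed values. It reads the screen size from the `Physical size: WxH` line that `adb shell wm size` prints and derives swipe endpoints as fractions of that size, so the same call scales across devices.

```typescript
type Direction = "up" | "down" | "left" | "right";

interface Size { width: number; height: number }
interface Swipe { x1: number; y1: number; x2: number; y2: number }

// Parses the "Physical size: WxH" line printed by `adb shell wm size`.
function parseWmSize(output: string): Size | null {
  const m = /Physical size:\s*(\d+)x(\d+)/.exec(output);
  if (!m) return null;
  return { width: Number(m[1]), height: Number(m[2]) };
}

// Direction is the direction the finger moves (assumed convention).
// `near` and `far` are fractions of the screen dimension (assumed defaults).
function swipeByRatio(size: Size, direction: Direction, near = 0.3, far = 0.7): Swipe {
  const cx = Math.round(size.width * 0.5);
  const cy = Math.round(size.height * 0.5);
  const nearX = Math.round(size.width * near);
  const farX = Math.round(size.width * far);
  const nearY = Math.round(size.height * near);
  const farY = Math.round(size.height * far);
  switch (direction) {
    case "up": // finger travels from lower screen toward upper screen
      return { x1: cx, y1: farY, x2: cx, y2: nearY };
    case "down":
      return { x1: cx, y1: nearY, x2: cx, y2: farY };
    case "left":
      return { x1: farX, y1: cy, x2: nearX, y2: cy };
    case "right":
      return { x1: nearX, y1: cy, x2: farX, y2: cy };
  }
}

const size = parseWmSize("Physical size: 1080x2400\n");
if (size) {
  console.log(swipeByRatio(size, "up"));
  // { x1: 540, y1: 1680, x2: 540, y2: 720 }
}
```

Tap coordinates suggested by an LLM have no such scaling step, which is why they are the weaker link on unusual resolutions.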
What the shell Escape Hatch Unlocks
The shell action can run any adb shell command, which extends the agent beyond what the other 14 built-in actions cover. See adb-commands.md for the full reference. Key categories:
- Input simulation — all key events, swipes, text input
- App management — launch, kill, install, uninstall, clear data
- Package management — list apps, grant/revoke permissions
- System inspection — battery, wifi, memory, CPU, notifications
- Settings — brightness, volume, airplane mode, rotation
- File system — list, read, copy, move, delete files on /sdcard/
- Networking — enable/disable wifi/bluetooth/data, ping, DNS lookup
- Screen recording — capture video, screenshots
- Content providers — query contacts, SMS, call log, media, calendar
- Process management — list, kill, monitor processes
- Device info — model, Android version, carrier, serial number
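
As a rough illustration of these categories, the sketch below pairs several of them with one standard Android shell command each and wraps a command into an `adb` argument list. The `shellExamples` map and `adbArgv` helper are hypothetical names for this example, `com.example.app` and `emulator-5554` are placeholders, and command availability and output vary by Android version and vendor; adb-commands.md remains the reference.

```typescript
// One example command per category. These are standard Android shell
// commands, but behavior differs across Android versions and vendors.
const shellExamples: Record<string, string> = {
  "App management": "am force-stop com.example.app", // placeholder package
  "Package management": "pm list packages",
  "System inspection": "dumpsys battery",
  "Settings": "settings put system screen_brightness 128",
  "File system": "ls /sdcard/",
  "Networking": "svc wifi enable",
  "Screen recording": "screenrecord --time-limit 10 /sdcard/demo.mp4",
  "Process management": "ps -A",
  "Device info": "getprop ro.product.model",
};

// Builds the argument list for the `adb` binary. If `serial` is given,
// the command targets that device via `adb -s <serial>`.
function adbArgv(command: string, serial?: string): string[] {
  const target: string[] = serial ? ["-s", serial] : [];
  return [...target, "shell", command];
}

for (const [category, command] of Object.entries(shellExamples)) {
  console.log(`${category}: adb ${adbArgv(command).join(" ")}`);
}
```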