diff --git a/.env.example b/.env.example index 55e33ca..4d1814c 100644 --- a/.env.example +++ b/.env.example @@ -39,7 +39,7 @@ MAX_HISTORY_STEPS=10 # How many past steps to keep in conversation context STREAMING_ENABLED=true # Stream LLM responses (shows progress dots) # =========================================== -# LLM Provider: "groq", "openai", "bedrock", or "openrouter" +# LLM Provider: "groq", "openai", "bedrock", "openrouter", or "ollama" # =========================================== LLM_PROVIDER=groq @@ -84,3 +84,18 @@ OPENROUTER_MODEL=anthropic/claude-3.5-sonnet # meta-llama/llama-3.3-70b-instruct (open source) # mistralai/mistral-large-latest (European) # deepseek/deepseek-chat (cost efficient) + +# =========================================== +# Ollama Configuration (local LLMs, no API key needed) +# Install: https://ollama.com then: ollama pull llama3.2 +# =========================================== +OLLAMA_BASE_URL=http://localhost:11434/v1 +OLLAMA_MODEL=llama3.2 +# Vision models (for screenshot support): +# llava (7B, good vision) +# llama3.2-vision (11B, best open-source vision) +# Text-only models: +# llama3.2 (3B, fast) +# llama3.1 (8B, balanced) +# qwen2.5 (7B, strong reasoning) +# mistral (7B, fast) diff --git a/CLAUDE.md b/CLAUDE.md index cebefdc..d9aacd0 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -27,14 +27,14 @@ Seven source files in `src/`, no subdirectories: - **kernel.ts** — Entry point and main agent loop. Reads goal from stdin, runs up to MAX_STEPS iterations of: capture screen → diff with previous → call LLM → execute action → track history. Handles stuck-loop detection and vision fallback when the accessibility tree is empty. - **actions.ts** — 15 action implementations (tap, type, enter, swipe, home, back, wait, done, longpress, screenshot, launch, clear, clipboard_get, clipboard_set, shell). Each wraps ADB commands via `Bun.spawnSync()`. `runAdbCommand()` provides exponential backoff retry. 
-- **llm-providers.ts** — LLM abstraction with `LLMProvider` interface and factory (`getLlmProvider()`). Four providers: OpenAI, Groq (OpenAI-compatible endpoint), AWS Bedrock (Anthropic + Meta model formats), OpenRouter (Vercel AI SDK). Contains the full SYSTEM_PROMPT with all 15 action definitions and rules. +- **llm-providers.ts** — LLM abstraction with `LLMProvider` interface and factory (`getLlmProvider()`). Five providers: OpenAI, Groq (OpenAI-compatible endpoint), Ollama (local LLMs, OpenAI-compatible), AWS Bedrock (Anthropic + Meta model formats), OpenRouter (Vercel AI SDK). Contains the full SYSTEM_PROMPT with all 15 action definitions and rules. - **sanitizer.ts** — Parses Android Accessibility XML (via `fast-xml-parser`) into `UIElement[]`. Depth-first walk extracting bounds, center coordinates, state flags (enabled, checked, focused, etc.), and parent context. `computeScreenHash()` used for stuck-loop detection. - **config.ts** — Singleton `Config` object reading from `process.env` with defaults from constants. `Config.validate()` checks required API keys at startup. - **constants.ts** — All magic values: ADB keycodes, swipe coordinates (hardcoded for 1080px-wide screens), default models, file paths, agent defaults. ## Key Patterns -- **Provider factory:** `getLlmProvider()` returns the appropriate `LLMProvider` based on `Config.LLM_PROVIDER`. Groq reuses the `OpenAIProvider` class with a different base URL. +- **Provider factory:** `getLlmProvider()` returns the appropriate `LLMProvider` based on `Config.LLM_PROVIDER`. Groq and Ollama reuse the `OpenAIProvider` class with different base URLs. - **Screen state diffing:** Hash-based comparison (id + text + center + state). After STUCK_THRESHOLD unchanged steps, recovery hints are injected into the LLM prompt. - **Vision fallback:** When `getInteractiveElements()` returns empty (custom UI, WebView, Flutter), a screenshot is captured and the LLM gets a fallback context suggesting coordinate-based taps. 
- **LLM response parsing:** `parseJsonResponse()` handles both clean JSON and markdown-wrapped code blocks. Falls back to "wait" action on parse failure. @@ -56,7 +56,7 @@ Seven source files in `src/`, no subdirectories: ## Environment Setup -Requires: Bun 1.0+, ADB (Android SDK Platform Tools) in PATH, an Android device connected via USB/WiFi with accessibility enabled, and an API key for at least one LLM provider (Groq, OpenAI, Bedrock, or OpenRouter). +Requires: Bun 1.0+, ADB (Android SDK Platform Tools) in PATH, an Android device connected via USB/WiFi with accessibility enabled, and either a local Ollama install or an API key for a cloud LLM provider (Groq, OpenAI, Bedrock, or OpenRouter). Copy `.env.example` to `.env` and configure `LLM_PROVIDER` + the corresponding API key. diff --git a/README.md b/README.md index 63065f6..b703bde 100644 --- a/README.md +++ b/README.md @@ -32,7 +32,7 @@ action: done (412ms) ## setup -you need **bun**, **adb**, and an api key for any llm provider. +you need **bun**, **adb**, and either [ollama](https://ollama.com) for local models or an api key for a cloud provider. 
```bash # install adb if you don't have it @@ -42,9 +42,15 @@ bun install cp .env.example .env ``` -edit `.env` - fastest way to start is with groq (free tier): +edit `.env` - fastest way to start is with ollama (fully local, no api key): ```bash +# option a: local with ollama (no api key needed) +ollama pull llama3.2 +LLM_PROVIDER=ollama +OLLAMA_MODEL=llama3.2 + +# option b: cloud with groq (free tier) LLM_PROVIDER=groq GROQ_API_KEY=gsk_your_key_here ``` @@ -189,11 +195,14 @@ name: Send WhatsApp Message | provider | cost | vision | notes | |---|---|---|---| -| groq | free tier | no | fastest to start | +| ollama | free (local) | yes* | no api key, runs on your machine | +| groq | free tier | no | fastest cloud option | | openrouter | per token | yes | 200+ models | | openai | per token | yes | gpt-4o | | bedrock | per token | yes | claude on aws | +*ollama vision requires a vision model like `llama3.2-vision` or `llava` + ## config all in `.env`: @@ -221,7 +230,7 @@ src/ skills.ts 6 multi-step skills workflow.ts workflow orchestration flow.ts yaml flow runner - llm-providers.ts 4 providers + system prompt + llm-providers.ts 5 providers + system prompt sanitizer.ts accessibility xml parser config.ts env config constants.ts keycodes, coordinates diff --git a/site/index.html b/site/index.html index ad2f815..1be92aa 100644 --- a/site/index.html +++ b/site/index.html @@ -772,13 +772,19 @@ cp .env.example .env
-edit .env - fastest way to start is groq (free tier):
-LLM_PROVIDER=groq
+edit .env - fastest way is ollama (fully local, no api key):
+# local (no api key needed)
+ollama pull llama3.2
+LLM_PROVIDER=ollama
+
+# or cloud (free tier)
+LLM_PROVIDER=groq
GROQ_API_KEY=gsk_your_key_here
| provider | cost | vision | notes |
|---|---|---|---|
-| groq | free | no | fastest to start |
+| ollama | free (local) | yes* | no api key, runs on your machine |
+| groq | free | no | fastest cloud option |
| openrouter | per token | yes | 200+ models |
| openai | per token | yes | gpt-4o |
| bedrock | per token | yes | claude on aws |
+
+*ollama vision requires a vision model like llama3.2-vision or llava
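The provider-factory change described in the CLAUDE.md hunk (Groq and Ollama reusing `OpenAIProvider` with different base URLs) can be sketched roughly like this. The class shape, model names, and env-var handling below are illustrative assumptions for review purposes, not the repo's actual `llm-providers.ts`:

```typescript
// Sketch of getLlmProvider() routing (assumed shapes, not the real implementation).
interface LLMProvider {
  complete(prompt: string): Promise<string>;
}

class OpenAIProvider implements LLMProvider {
  constructor(
    readonly baseUrl: string,
    readonly model: string,
    readonly apiKey?: string, // undefined for Ollama: local, no key needed
  ) {}

  async complete(prompt: string): Promise<string> {
    // Would POST to `${this.baseUrl}/chat/completions`; Ollama serves the same
    // OpenAI-compatible endpoint at http://localhost:11434/v1. Stubbed here.
    return "";
  }
}

function getLlmProvider(provider: string): LLMProvider {
  switch (provider) {
    case "openai":
      return new OpenAIProvider("https://api.openai.com/v1", "gpt-4o", process.env.OPENAI_API_KEY);
    case "groq":
      // Groq exposes an OpenAI-compatible endpoint, so the same class is reused.
      return new OpenAIProvider(
        "https://api.groq.com/openai/v1",
        "llama-3.3-70b-versatile",
        process.env.GROQ_API_KEY,
      );
    case "ollama":
      // New branch added by this PR: same class again, no API key required.
      return new OpenAIProvider(
        process.env.OLLAMA_BASE_URL ?? "http://localhost:11434/v1",
        process.env.OLLAMA_MODEL ?? "llama3.2",
      );
    default:
      throw new Error(`Unknown LLM_PROVIDER: ${provider}`);
  }
}
```

The point of the pattern is that adding Ollama costs one `case` and two env vars rather than a new provider class, since Ollama's `/v1` endpoint speaks the OpenAI chat-completions protocol.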