Architecture
Design Philosophy
AgenTest is an MCP (Model Context Protocol) server that gives an AI agent the ability to interact with Android apps running on emulators or physical devices. The core principle is "The AI is the brain, AgenTest is the fingers."
The MCP server is intentionally a dumb execution engine. It does not generate test cases, interpret results, or contain any AI logic. All intelligence lives in the LLM that calls the tools. AgenTest only knows how to:
- Read the accessibility tree (what's on screen)
- Inject input events (tap, type, swipe)
- Wait for the UI to settle
- Report what happened
This separation means the MCP server stays small, predictable, and testable, while the LLM handles the creative work of deciding what to test and interpreting failures.
System Diagram
+--------------------------------------------+
| AI Agent (LLM) |
| |
| 1. Reads app source code |
| 2. Generates test scenarios |
| 3. Calls AgenTest MCP tools |
| 4. Interprets results |
| 5. Reports findings to developer |
+--------------------------------------------+
|
| MCP Protocol (stdio, JSON-RPC 2.0)
v
+--------------------------------------------+
| AgenTest MCP Server |
| (server.ts) |
| |
| +--------------------------------------+ |
| | 10 MCP Tools | |
| +--------------------------------------+ |
| | |
| +--------------------------------------+ |
| | DeviceClient | |
| | (3-backend router) | |
| | helper > grpc > adb | |
| +--------------------------------------+ |
| | | | |
| +-------------+ +-----------+ +-----------+
| | Helper HTTP | | gRPC | | ADB |
| | Client | | Client | | Client |
| +-------------+ +-----------+ +-----------+
| | | | |
| +--------------------------------------+ |
| | Shell Executor (DI) | |
| +--------------------------------------+ |
+--------------------------------------------+
| | |
v v v
fetch HTTP child_process grpc-js
| | |
v v v
localhost:8765 adb binary localhost:8554
(emulator gRPC)
| | |
| v |
| +--------------+ |
| | ADB (host) | |
| +--------------+ |
| | |
| USB / TCP |
v v v
+--------------------------------------------+
| Android Emulator / Device |
| |
| +--------------------------------------+ |
| | Helper APK process | |
| | (am instrument, shell UID) | |
| | - NanoHTTPD on :8765 | |
| | - TreeDumper (UiAutomation) | |
| | - InputInjector | |
| | - IdleWaiter (event listener) | |
| | - ScreenshotEncoder | |
| | - FrameworkDetector | |
| +--------------------------------------+ |
| |
| +--------------------------------------+ |
| | App Under Test | |
| | - View hierarchy | |
| | - Accessibility tree | |
| +--------------------------------------+ |
+--------------------------------------------+
Layer Breakdown
Layer 1: MCP Server (server.ts)
The entry point. Registers 10 tools with the MCP SDK and connects over stdio transport. Manages shared state (active device ID, active package name, active gRPC client). Delegates all work to the tool handlers.
Key decisions:
- Uses McpServer from @modelcontextprotocol/sdk with StdioServerTransport
- All tool inputs validated with Zod schemas before reaching handler code
- Errors are caught and returned as structured JSON, never as thrown exceptions that would crash the MCP connection
- State (active device, active package, gRPC client) persists between tool calls within a session
- agentest_connect accepts a backend parameter: "auto" (default, try gRPC then ADB), "adb", "grpc"
Layer 2: Tool Handlers (tools/)
Ten files, each exporting a single handle* function. These orchestrate the Android layer — they compose ADB/gRPC calls, tree parsing, idle detection, and input injection into higher-level operations. They contain no business logic of their own. All handlers accept an optional GrpcEmulatorClient and create a DeviceClient facade.
| Tool | Handler | Responsibility |
|---|---|---|
| agentest_connect | handleConnect | Verify device → auto-detect gRPC → launch app → wait for idle → return tree + backend |
| agentest_get_ui_tree | handleGetUiTree | Single tree snapshot → serialize for LLM |
| agentest_run_flow | handleRunFlow | Loop over steps with pre-validation gate: detect screen changes between steps, stop on failure |
| agentest_reset_app | handleResetApp | Force-stop → relaunch → wait for idle → return tree |
| agentest_get_logs | handleGetLogs | Logcat capture filtered by app PID |
| agentest_screenshot | handleScreenshot | Screenshot via gRPC (emulator) or ADB screencap |
| agentest_device_info | handleDeviceInfo | Screen size, density, Android version, model |
| agentest_get_shared_prefs | handleGetSharedPrefs | Read SharedPreferences XML via run-as cat |
| agentest_query_db | handleQueryDb | SQL queries against app SQLite databases via run-as sqlite3 |
| agentest_set_network | handleSetNetwork | Network condition simulation (speed/delay/wifi/airplane mode) |
Layer 3: Android Layer (android/)
Sixteen files that encapsulate all Android-specific knowledge:
| File | Responsibility |
|---|---|
| adb.ts | Thin wrapper over ADB shell commands. Builds command strings from constants, executes via injected ShellExecutor. Includes app state inspection (run-as cat/sqlite3), network simulation (adb emu network, svc wifi/data), helper-lifecycle commands (install, uninstall, forward, am instrument), and the opt-in idling bridge query (queryIdlingBridge). |
| device-client.ts | DeviceClient facade: composes AdbClient + optional GrpcEmulatorClient + optional HelperClient + optional FrameworkSync. Three-backend router: tree/idle prefer helper, input prefers gRPC > helper > ADB on emulators. Per-method graceful fallback on failure. Exposes sync for the idle pipeline to tail-probe framework-specific channels. |
| helper-client.ts | HTTP client for the on-device helper APK. Methods: status, getTree, screenshot, waitForIdle, detectFramework, tap, swipe, longPress, key, typeText, shutdown. |
| helper-installer.ts | ensureHelper(): auto-install + launch the helper on first connect with zero user input. Locates prebuilt APKs (env var or relative path), checks installed versionCode, installs both APKs if missing/stale, sets up adb forward, spawns am instrument as a background child, polls /status. |
| grpc-client.ts | GrpcEmulatorClient: connects to emulator's gRPC port with JWT bearer token auth. Wraps sendTouch, sendKey, getScreenshot, clipboard RPCs. |
| grpc-discovery.ts | Auto-discovers emulator JWT token from pid_*.ini files in platform-specific temp dirs (macOS, Linux, Windows). |
| grpc-touch.ts | Pure gesture functions: tap, swipe (interpolated 60fps), long press, pinch (two-finger), rotate (two-finger). |
| hermes-cdp.ts | Phase 3.5. React Native sync backend. Discovers Metro's /json/list inspector targets, picks the one matching the package, opens the Hermes CDP WebSocket, enables Runtime, and implements waitForHermesJsIdle as a JS-event-loop idle probe. Debug builds only; every failure returns undefined silently. |
| dart-vm-service.ts | Phase 3.6 / 3.7. Flutter sync backend. Scrapes logcat for The Dart VM service is listening on ..., parses the URL, sets up adb forward, opens a JSON-RPC 2.0 WebSocket, and exposes callExtension (ext.flutter.*), ensureFlutterSemantics, and waitForFlutterFrameIdle. Debug+profile builds only. |
| framework-sync.ts | Phase 3.9 / 3.6. Orchestrator that composes Hermes CDP + Dart VM Service + idling bridge behind a single FrameworkSync object. attach() opens all applicable channels (non-fatal per-channel); waitForSync() runs them in sequence after the helper's a11y-event idle. Also exposes snapshotFiberLabels(), the Phase 3.6 one-shot fiber walker that extracts React component names for unlabeled icon buttons. Fingerprint-keyed cache. Gated on AGENTEST_DISABLE_FRAMEWORK_SYNC=1 and AGENTEST_DISABLE_FIBER_INFERENCE=1 for unit tests. |
| fiber-extractor.ts | Phase 3.6. React Fiber walker. Sends a synchronous JS blob to Hermes via CDP Runtime.evaluate, walks __REACT_DEVTOOLS_GLOBAL_HOOK__.getFiberRoots(), collects {tag, host, component, ancestors, props} for every HostComponent fiber. SKIP_NAMES set walks past generic React ancestors (Svg, Path, View, Pressable, RCTView) to find the real icon component. Two-call pattern to work around Hermes's unreliable awaitPromise: walker kicks off stateNode.measureInWindow callbacks into globalThis.__agentest_measures, host waits 150ms, second call drains the bag. Debug builds only. |
| fiber-merger.ts | Phase 3.6. Fiber ↔ a11y correlation. mergeFiberLabels runs two-stage matching: (A) exact testID / accessibilityLabel prop match, (B) containment matching (a11y-first iteration, clickables sorted by area ascending, tightest-inside-wins). calibrateOffset auto-detects fiber↔a11y coordinate delta by voting on size-matching pairs (no hardcoded constants). GENERIC_HOSTS set ensures layout wrappers never win over meaningful components. Each fiber labels at most one a11y node so Camera / Photo / Microphone don't all inherit the same label. |
| ref-registry.ts | Phase 3.5. Session-scoped map from @ref tokens (@b1, @f2, …) to UnifiedUINode instances. rebuild(tree) walks once, assigns refs by kind (btn/field/check/link/scroll/generic), computes fingerprint, stores compact text; resolve(ref) is O(1) and throws ElementNotFoundError with a stale-ref recovery hint. Rebuilt on every tree snapshot. Lives in server.ts session state and is threaded through all 4 tool handlers. |
| tree-parser.ts | Parses uiautomator dump XML AND helper JSON into UnifiedUINode tree. LLM serialization with tree pruning (collapse single-child wrappers, skip invisible/system UI). Phase 3.5 serializeTreeCompact / hoistClickableLabels / computeScreenFingerprint / computeIdleFingerprint. Flexible className matching. Propagates Phase 3.8 Compose extras (hintText / stateDescription / paneTitle / tooltipText) end-to-end. |
| input.ts | Resolves element selectors to screen coordinates. Ref-aware: @ref selectors short-circuit via RefRegistry before falling back to legacy selector fields. Executes 20 action types including pinch and rotate. Checks assertions against the tree via effectiveTextOf which traverses text → description → hint → tooltip → descendant text for RN container nodes. |
| idle.ts | Two execution paths. Fast (helper available): event-driven via helper's /wait-idle endpoint, ~150-300ms, now followed by an optional FrameworkSync.waitForSync() tail probe when Hermes / Dart VM / idling bridge is attached. Polling (no helper): legacy fingerprint stability, ~600-2000ms. Both with visibility-checked loading indicator detection. |
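The per-method graceful fallback described for device-client.ts can be illustrated with a small sketch. Everything here is hypothetical (the names Backend, Routed, and routeWithFallback do not come from the codebase); it only shows the idea of trying backends in priority order and falling through on failure.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not AgenTest's actual API.
type BackendName = "helper" | "grpc" | "adb";

interface Backend<T> {
  name: BackendName;
  // Left undefined when this backend is not attached in the current session.
  run?: () => Promise<T>;
}

interface Routed<T> {
  backend: BackendName;
  value: T;
}

// Try each backend in the order given; on failure, fall through to the next one.
async function routeWithFallback<T>(backends: Backend<T>[]): Promise<Routed<T>> {
  const failures: string[] = [];
  for (const backend of backends) {
    if (!backend.run) continue; // backend not available
    try {
      return { backend: backend.name, value: await backend.run() };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${backend.name}: ${reason}`);
    }
  }
  throw new Error(`All backends failed (${failures.join("; ") || "none available"})`);
}
```

The caller decides the order per method, so a tree read could pass helper first while an input event on an emulator could pass gRPC first.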
Layer 5: On-Device Helper APK (android-helper/, Phase 3)
A separate Gradle project that produces two prebuilt APKs committed to android-helper/prebuilt/ and shipped inside the npm package, plus an opt-in idling bridge AAR users can add to their own debug builds:
- Main APK (com.agentest.helper, ~815 KB): contains the HTTP server source and all helper logic. No launcher activity, no service; it exists purely as the target package for the test APK's instrumentation.
- Test APK (com.agentest.helper.test, ~952 KB): single JUnit @Test method (HelperEntryPoint) that's invoked by androidx.test.runner.AndroidJUnitRunner. It starts the embedded NanoHTTPD server bound to 127.0.0.1:8765 and blocks on a CountDownLatch for up to 24h.
- Idling Bridge AAR (com.agentest.bridge, Phase 3.10): opt-in library. Users add debugImplementation to expose a ContentProvider at <app-package>.agentest.idling/state that reports pending Espresso IdlingResources and custom IdleSources. The host-side AdbClient.queryIdlingBridge reads the provider via adb shell content query and FrameworkSync drains it between actions. Framework-agnostic: works for RN, Flutter, Compose, and native.
Why two APKs? Android requires instrumentation tests to live in a separate APK signed with the same key as the target package. This is the same pattern Appium's appium-uiautomator2-server uses. Launching via am instrument -w -r com.agentest.helper.test/androidx.test.runner.AndroidJUnitRunner gives the helper process the shell UID, which holds INJECT_EVENTS permission — the only way to call UiAutomation.injectInputEvent without a signature-level grant.
Auto-install flow (ensureHelper in helper-installer.ts, runs inside agentest_connect):
1. Check AGENTEST_DISABLE_HELPER env var — short-circuit return null
2. Locate prebuilt APKs (AGENTEST_HELPER_APK_DIR env, or repo-relative path)
3. adb shell pm list packages com.agentest.helper{,.test}
4. adb shell dumpsys package com.agentest.helper | grep versionCode
5. If versionCode missing/stale: adb uninstall + adb install -r -t (both APKs)
6. adb forward tcp:8765 tcp:8765
7. spawn('adb', ['shell', 'am', 'instrument', '-w', '-r',
'com.agentest.helper.test/androidx.test.runner.AndroidJUnitRunner'])
8. Poll http://127.0.0.1:8765/status every 250ms until { ok: true, protocolVersion: 1 }
9. Return HelperHandle { client, status, shutdown() }
Every failure mode (helper APK missing, install failed, port forward failed, instrumentation crashed, /status timeout) returns null silently with a single stderr log — the rest of the connect flow degrades to ADB+gRPC and the user never sees a hard error.
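As a rough illustration of step 8 and the "return null silently" behavior, the polling loop might look like the sketch below. The function name, the injected probe and sleep parameters, and the maxAttempts budget are assumptions made for this example; the actual timeout is not stated here.

```typescript
// Hypothetical sketch of the /status polling step. `probe` and `sleep` are
// injected so the loop can be exercised without a device.
interface HelperStatus {
  ok: boolean;
  protocolVersion: number;
}

async function pollHelperStatus(
  probe: () => Promise<HelperStatus>,
  sleep: (ms: number) => Promise<void>,
  intervalMs = 250,
  maxAttempts = 40, // assumed budget, not taken from the real installer
): Promise<HelperStatus | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const status = await probe();
      if (status.ok && status.protocolVersion === 1) return status;
    } catch {
      // Helper is not listening yet: keep polling.
    }
    await sleep(intervalMs);
  }
  return null; // caller degrades to ADB + gRPC instead of raising an error
}
```

In production the probe would presumably be an HTTP GET against http://127.0.0.1:8765/status; in tests it can be a counter-driven stub.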
Layer 4: ADB Discovery + Shell Executor (adb-path.ts, shell.ts)
adb-path.ts auto-discovers the adb binary from standard Android SDK locations (ANDROID_HOME, ANDROID_SDK_ROOT, ~/Library/Android/sdk on macOS, ~/Android/Sdk on Linux, %LOCALAPPDATA%\Android\Sdk on Windows). The resolved path is cached for the session and used by both AdbClient (shell commands) and spawnInstrumentation (helper APK launch). Users never need to configure PATH manually.
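A minimal sketch of how the candidate list might be assembled from those locations follows. The function name adbCandidates and the precedence order (environment variables before platform defaults) are assumptions; a real implementation would also check that each file exists and cache the first hit.

```typescript
// Hypothetical sketch: builds the ordered candidate list only.
type Platform = "darwin" | "linux" | "win32";

function adbCandidates(
  env: Record<string, string | undefined>,
  platform: Platform,
  homeDir: string,
): string[] {
  const sep = platform === "win32" ? "\\" : "/";
  const exe = platform === "win32" ? "adb.exe" : "adb";
  const sdkRoots: string[] = [];

  // Explicit SDK locations first (assumed precedence).
  if (env.ANDROID_HOME) sdkRoots.push(env.ANDROID_HOME);
  if (env.ANDROID_SDK_ROOT) sdkRoots.push(env.ANDROID_SDK_ROOT);

  // Platform-specific default install locations.
  if (platform === "darwin") sdkRoots.push(`${homeDir}/Library/Android/sdk`);
  if (platform === "linux") sdkRoots.push(`${homeDir}/Android/Sdk`);
  if (platform === "win32" && env.LOCALAPPDATA) {
    sdkRoots.push(`${env.LOCALAPPDATA}\\Android\\Sdk`);
  }

  // Drop duplicates while keeping order; adb ships in <sdk>/platform-tools.
  return sdkRoots
    .filter((root, i) => sdkRoots.indexOf(root) === i)
    .map((root) => [root, "platform-tools", exe].join(sep));
}
```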
shell.ts is the dependency injection boundary. All external process execution flows through the ShellExecutor interface. In production, ProcessShellExecutor calls child_process.exec. In tests, a mock executor returns canned responses.
ShellExecutor (interface)
├── ProcessShellExecutor (production — real adb calls)
└── MockShellExecutor (testing — fixture responses)
Data Flow: Executing a Test Step
Here is the exact sequence when the AI calls agentest_run_flow with a tap step:
1. MCP SDK receives JSON-RPC call
2. Zod validates the input (steps array, element selectors)
3. handleRunFlow() is called with validated ActionStep[]
4. For each step:
a. snapshotTree() → adb.dumpUiTree()
→ shell.exec("adb shell uiautomator dump /sdcard/window_dump.xml")
→ shell.exec("adb shell cat /sdcard/window_dump.xml")
→ parseUiAutomatorXml(xml) → UnifiedUINode tree
b. executeAction(adb, tree, step, screenBounds)
→ resolveTarget(tree, step.target)
→ findElements(tree, selector) — depth-first search
→ returns first matching UnifiedUINode
→ adb.tap(element.center.x, element.center.y)
→ shell.exec("adb shell input tap 540 960")
c. waitForIdle(adb)
→ loop:
→ dumpUiTree() → parseUiAutomatorXml() → computeFingerprint()
→ compare with previous fingerprint
→ if stable for 2 consecutive snapshots, return tree
→ if timeout, return last tree anyway
d. Record StepResult { stepIndex, action, success, durationMs }
5. After all steps (or on first failure):
→ serializeTreeForLlm(currentTree) — compact JSON
→ Return FlowTrace { success, stepsCompleted, results, finalUiTree }
6. MCP SDK serializes result as JSON-RPC response
Element Resolution Pipeline
When a step targets an element (e.g., { id: "email", className: "android.widget.EditText" }), the resolution follows this pipeline:
ElementSelector
│
▼
findElements(tree, selector)
│ Depth-first traversal of UnifiedUINode tree
│ Each node checked against ALL specified criteria (AND logic)
│
│ Matching rules:
│ - id: substring match against resourceId
│ - text: exact match
│ - textContains: substring match
│ - className: exact match
│ - description: substring match against content description
│
▼
Matching nodes (array)
│
│ If selector.index is set: pick the Nth match
│ Otherwise: return all matches
│
▼
resolveTarget() picks first match
│
│ If no matches → throw ElementNotFoundError
│
▼
UnifiedUINode.center → { x, y } screen coordinates
│
▼
adb.tap(x, y) → "adb shell input tap {x} {y}"
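The matching rules in the pipeline above can be sketched as follows. UiNode and Selector are trimmed stand-ins for UnifiedUINode and ElementSelector, the zero-based index is an assumption, and a plain Error stands in for ElementNotFoundError.

```typescript
// Hypothetical, trimmed-down types for illustration only.
interface UiNode {
  resourceId?: string;
  text?: string;
  className?: string;
  description?: string;
  center: { x: number; y: number };
  children: UiNode[];
}

interface Selector {
  id?: string;
  text?: string;
  textContains?: string;
  className?: string;
  description?: string;
  index?: number; // assumed zero-based
}

const contains = (haystack: string | undefined, needle: string): boolean =>
  haystack !== undefined && haystack.indexOf(needle) !== -1;

// AND logic: every criterion that is present must match.
function matches(node: UiNode, s: Selector): boolean {
  if (s.id !== undefined && !contains(node.resourceId, s.id)) return false;
  if (s.text !== undefined && node.text !== s.text) return false;
  if (s.textContains !== undefined && !contains(node.text, s.textContains)) return false;
  if (s.className !== undefined && node.className !== s.className) return false;
  if (s.description !== undefined && !contains(node.description, s.description)) return false;
  return true;
}

// Depth-first, pre-order traversal collecting every matching node.
function findElements(root: UiNode, s: Selector): UiNode[] {
  const found: UiNode[] = [];
  const visit = (node: UiNode): void => {
    if (matches(node, s)) found.push(node);
    node.children.forEach(visit);
  };
  visit(root);
  if (s.index === undefined) return found;
  const nth = found[s.index];
  return nth ? [nth] : [];
}

function resolveTarget(root: UiNode, s: Selector): UiNode {
  const first = findElements(root, s)[0];
  if (!first) throw new Error(`No element found matching selector: ${JSON.stringify(s)}`);
  return first;
}
```

The returned node's center is what would then be handed to the tap call.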
Idle Detection Strategy
The idle detection system determines when the UI has finished updating after an action. This is the hardest problem in the system because Android apps may never fully "stabilize" — clocks tick, cursors blink, animations loop.
Algorithm
1. Dump the UI tree
2. Compute a fingerprint of the tree
3. Compare with the previous fingerprint
4. If they match, increment stableCount
5. If stableCount >= 2 (configurable), UI is idle
6. If they differ, reset stableCount to 0
7. If timeout exceeded, return the last tree anyway
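The seven steps above can be sketched as a generic loop. This is an illustration, not the code in idle.ts: the function name, the injected sleep and now parameters, and the choice to start stableCount at zero after the first snapshot are all assumptions.

```typescript
// Hypothetical sketch of fingerprint-stability polling.
async function waitForStableTree<T>(
  snapshot: () => Promise<T>,
  fingerprint: (tree: T) => string,
  sleep: (ms: number) => Promise<void>,
  now: () => number,
  requiredStable = 2,
  timeoutMs = 10000,
  pollMs = 200,
): Promise<{ tree: T; idle: boolean }> {
  const deadline = now() + timeoutMs;
  let tree = await snapshot();
  let previous = fingerprint(tree);
  let stableCount = 0;

  while (now() < deadline) {
    await sleep(pollMs);
    tree = await snapshot();
    const current = fingerprint(tree);
    // A match extends the stable streak; any difference resets it.
    stableCount = current === previous ? stableCount + 1 : 0;
    if (stableCount >= requiredStable) return { tree, idle: true };
    previous = current;
  }
  // Timed out: hand back the last tree anyway and let the caller decide.
  return { tree, idle: false };
}
```

Injecting the clock and sleep keeps the loop deterministic in unit tests, in the same spirit as the ShellExecutor boundary.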
Noise Filtering
The fingerprint intentionally ignores known-noisy properties:
| Property | Why it's noisy | Mitigation |
|---|---|---|
| focused state | Cursor blink on text fields | Excluded from fingerprint entirely |
| Timestamp-like text | Clock widgets, "Last updated" labels | Regex filter: /^\d{1,2}:\d{2}(:\d{2})?(\s?(AM|PM))?$/ |
| Bounds jitter | Sub-pixel rendering differences | Rounded to nearest 2px before comparison |
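A sketch of a fingerprint that applies the three mitigations from the table is shown below. The node shape and the function name sketchFingerprint are hypothetical, and the result is a plain string rather than whatever hash the real implementation may use.

```typescript
// Hypothetical node shape for illustration only.
interface FpNode {
  className: string;
  text: string;
  focused: boolean;
  bounds: { left: number; top: number; right: number; bottom: number };
  children: FpNode[];
}

// Same pattern as in the table: clock-like text such as "10:41" or "9:05:30 PM".
const TIMESTAMP_RE = /^\d{1,2}:\d{2}(:\d{2})?(\s?(AM|PM))?$/;

// Round to the nearest multiple of 2px to absorb sub-pixel jitter.
const round2 = (n: number): number => Math.round(n / 2) * 2;

function sketchFingerprint(node: FpNode): string {
  // Timestamp-like text is blanked so a ticking clock does not break stability.
  const text = TIMESTAMP_RE.test(node.text) ? "" : node.text;
  const b = node.bounds;
  // `focused` is deliberately left out of the fingerprint.
  const self = [
    node.className,
    text,
    round2(b.left),
    round2(b.top),
    round2(b.right),
    round2(b.bottom),
  ].join("|");
  return `${self}[${node.children.map(sketchFingerprint).join(",")}]`;
}
```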
Timing Budget
- Each batched uiautomator dump call (rm + dump + cat in single shell): ~500ms-1s
- Poll interval: 200ms
- Minimum time to declare idle: ~1-1.5s (dump + wait + dump + compare)
- Default timeout: 10s
- Loading indicator wait: up to 8s extra if spinners/shimmer detected
- Lightweight actions (type, press_key, clear_text, *_coordinates): single snapshot (~500ms), no polling
- Heavy actions (tap, swipe, long_press, double_tap): full idle + loading detection (~1-2s)
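The lightweight/heavy split in the last two bullets amounts to a small classification step. A sketch, assuming action names are plain strings and that any action not listed as lightweight gets the full idle treatment (that default is an assumption):

```typescript
// Hypothetical sketch: which post-action wait strategy an action receives.
type IdleStrategy = "single_snapshot" | "full_idle";

const LIGHTWEIGHT_ACTIONS = ["type", "press_key", "clear_text"];

function idleStrategyFor(action: string): IdleStrategy {
  // Any *_coordinates action is treated as lightweight.
  const isCoordinateAction = /_coordinates$/.test(action);
  if (isCoordinateAction || LIGHTWEIGHT_ACTIONS.indexOf(action) !== -1) {
    return "single_snapshot";
  }
  // tap, swipe, long_press, double_tap, and anything unrecognized.
  return "full_idle";
}
```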
Error Handling Strategy
Errors flow through three layers:
1. Custom Error Classes (errors.ts)
Every anticipated failure has a dedicated error class extending AgenTestError, each carrying an error code string:
| Error Class | Code | When |
|---|---|---|
| AdbConnectionError | ADB_CONNECTION_ERROR | No device connected, specified device not found |
| AdbCommandError | ADB_COMMAND_ERROR | ADB command returned unexpected output |
| ElementNotFoundError | ELEMENT_NOT_FOUND | Selector matched zero elements |
| IdleTimeoutError | IDLE_TIMEOUT | UI didn't stabilize within timeout |
| AssertionFailedError | ASSERTION_FAILED | Assertion condition not met |
| AppNotInstalledError | APP_NOT_INSTALLED | Target package not on device |
| TreeParseError | TREE_PARSE_ERROR | XML parsing failed or invalid structure |
2. Tool Handler Level
Each tool handler wraps its entire body in try/catch. Errors are never thrown to the MCP SDK — they're serialized as structured JSON in the tool response:
{
"error": "No element found matching selector: {\"id\":\"login_btn\"}",
"code": "ELEMENT_NOT_FOUND"
}
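One way the error classes and the catch-all wrapper could fit together is sketched below. The class bodies, the toToolError and runTool helpers, and the INTERNAL_ERROR fallback code are assumptions for illustration; only the class names, the codes from the table, and the response shape come from this document.

```typescript
// Hypothetical sketch of the error-to-JSON path.
class AgenTestError extends Error {
  constructor(message: string, public readonly code: string) {
    super(message);
  }
}

class ElementNotFoundError extends AgenTestError {
  constructor(selector: unknown) {
    super(`No element found matching selector: ${JSON.stringify(selector)}`, "ELEMENT_NOT_FOUND");
  }
}

interface ToolError {
  error: string;
  code: string;
}

function toToolError(err: unknown): ToolError {
  if (err instanceof Error) {
    // Read `code` structurally so the check does not depend on the prototype chain.
    const code = (err as { code?: unknown }).code;
    return { error: err.message, code: typeof code === "string" ? code : "INTERNAL_ERROR" };
  }
  return { error: String(err), code: "INTERNAL_ERROR" };
}

// Wraps a handler body so nothing is ever thrown to the caller.
async function runTool<T>(handler: () => Promise<T>): Promise<T | ToolError> {
  try {
    return await handler();
  } catch (err) {
    return toToolError(err);
  }
}
```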
3. Flow Execution Level
run_flow has special error handling: on any step failure, it captures the current UI tree (for debugging context), records the failure in the step results, and returns the full trace up to the failure point. The LLM receives enough context to diagnose what went wrong.
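The stop-on-first-failure behavior could be sketched like this. The StepResult and FlowTrace fields mirror the ones named in the data flow section; the error field, the function name runFlowSketch, and the injected executeStep and captureTree callbacks are assumptions.

```typescript
// Hypothetical sketch of a flow loop that returns a partial trace on failure.
interface FlowStep {
  action: string;
}

interface StepResult {
  stepIndex: number;
  action: string;
  success: boolean;
  durationMs: number;
  error?: string; // assumed field for the failure message
}

interface FlowTrace {
  success: boolean;
  stepsCompleted: number;
  results: StepResult[];
  finalUiTree: string;
}

async function runFlowSketch(
  steps: FlowStep[],
  executeStep: (step: FlowStep) => Promise<void>,
  captureTree: () => Promise<string>,
): Promise<FlowTrace> {
  const results: StepResult[] = [];
  let stepIndex = 0;
  let stepsCompleted = 0;

  for (const step of steps) {
    const startedAt = Date.now();
    try {
      await executeStep(step);
      results.push({ stepIndex, action: step.action, success: true, durationMs: Date.now() - startedAt });
      stepsCompleted++;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ stepIndex, action: step.action, success: false, durationMs: Date.now() - startedAt, error: message });
      break; // remaining steps are not attempted
    }
    stepIndex++;
  }

  // The tree is captured on both paths so the caller always gets debugging context.
  return {
    success: stepsCompleted === steps.length,
    stepsCompleted,
    results,
    finalUiTree: await captureTree(),
  };
}
```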
Dependency Injection
The ShellExecutor interface is the single point where external I/O enters the system. Every class and function that needs to run shell commands receives a ShellExecutor instance, never imports child_process directly.
interface ShellExecutor {
  exec(command: string, options?: ShellExecOptions): Promise<string>;
}

interface ShellExecOptions {
  timeoutMs?: number;
  signal?: AbortSignal;
}
This enables:
- Unit testing with MockShellExecutor that returns pre-recorded ADB output
- Integration testing with a real ProcessShellExecutor against a live emulator
- Future platform support (iOS) by creating alternative executors or swapping the Android layer entirely
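To make the unit-testing point concrete, here is a minimal sketch of a fixture-backed executor and a function that consumes it. FixtureShellExecutor and listDevices are hypothetical names chosen for this example (the real MockShellExecutor may look different); the adb devices output format is the standard one.

```typescript
// The interface is repeated from above so the sketch is self-contained.
interface ShellExecOptions {
  timeoutMs?: number;
  signal?: AbortSignal;
}

interface ShellExecutor {
  exec(command: string, options?: ShellExecOptions): Promise<string>;
}

// Hypothetical test double: returns canned output and records every command.
class FixtureShellExecutor implements ShellExecutor {
  readonly calls: string[] = [];

  constructor(private readonly fixtures: Record<string, string>) {}

  async exec(command: string): Promise<string> {
    this.calls.push(command);
    const output = this.fixtures[command];
    if (output === undefined) throw new Error(`No fixture for: ${command}`);
    return output;
  }
}

// Example consumer: depends only on the interface, never on child_process.
async function listDevices(shell: ShellExecutor): Promise<string[]> {
  const output = await shell.exec("adb devices");
  return output
    .split("\n")
    .slice(1) // skip the "List of devices attached" header
    .map((line) => line.trim())
    .filter((line) => /\tdevice$/.test(line)) // ignore offline/unauthorized entries
    .map((line) => line.slice(0, line.indexOf("\t")));
}
```

Because listDevices only sees the interface, the same function runs unchanged against a real executor in integration tests.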
Why No Screenshots
The entire system operates on the accessibility tree, not screenshots. This is a deliberate choice:
- Accessibility trees are structured data. The LLM can reason about element types, labels, and relationships without vision capabilities.
- Trees are fast. A uiautomator dump + parse takes ~500ms. A screenshot capture + base64 encoding + vision model analysis takes 2-5s.
- Trees enable precise interaction. Element bounds give exact tap coordinates. Screenshots require coordinate inference.
- Trees work across frameworks. React Native, Flutter, Compose, and native Android all produce accessibility trees. Screenshots look different for each.
- Trees support assertions natively. "Is element X visible?" is a tree search. With screenshots, it requires OCR or vision.
The tradeoff: custom-drawn content (Canvas, OpenGL, games) has no accessibility nodes and is invisible to AgenTest. A screenshot fallback tool is planned for Phase 2.