Overview
An agent is a lightweight in-process HTTP server embedded inside a target application. External clients — CLIs, test drivers, AI tools, IDE extensions — connect to it to:
- Inspect the visual tree and element properties
- Perform UI actions (tap, fill, scroll, gesture, navigate)
- Capture screenshots
- Monitor network traffic, logs, and performance
- Interact with embedded web content (WebViews)
- Read and write application storage
The protocol is framework-agnostic. The same client code works whether the target app is built with .NET MAUI, native Android, React Native, Flutter, WinUI, or anything else. Each framework provides its own agent implementation that translates the standard protocol into framework-specific operations.
Reference docs & downloads
The human-readable protocol page is paired with generated reference docs and downloadable bundles produced from the canonical spec during the website build. The canonical authoring files stay in docs/openapi.yaml, docs/asyncapi.yaml, docs/schemas/*.json, and docs/examples/*.json; generated bundles are published for consumers and code generators.
OpenAPI reference
Generated HTTP API docs with request and response schemas.
WebSocket reference
Generated AsyncAPI docs for streaming channels and message envelopes.
| Artifact | Download |
|---|---|
| Complete spec package | ailoha-devflow-agent-protocol-v1.zip |
| OpenAPI 3.1 YAML bundle | openapi.yaml |
| OpenAPI 3.1 JSON bundle | openapi.json |
| OpenAPI 3.0 client compatibility JSON | openapi-client-3.0.json |
| AsyncAPI YAML and JSON bundles | asyncapi.yaml · asyncapi.json |
Architecture
The agent runs in-process inside the target application, giving it direct access to the UI framework's APIs, the visual tree, and application state. All communication is local — the agent binds to localhost:{port} over HTTP and WebSocket.
- Broker handles discovery. A separate broker daemon manages port assignment and agent registration. Clients ask the broker "where is my app?" and get back a port number.
- Any client language works. Standard HTTP + JSON. Anything that can make HTTP requests can be a client.
For Ailoha's CLI and MCP host, the practical resolution order is:
explicit command args → environment or local project selection (AILOHA_AGENT_HOST, AILOHA_AGENT_PORT, .devflow) → broker discovery → default localhost port.
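That resolution order can be sketched as a small helper. This is a hedged illustration: the default port value (9222), the `broker_lookup` callable, and the omission of `.devflow` file parsing are all assumptions for the sketch, not part of the spec.

```python
import os

DEFAULT_HOST, DEFAULT_PORT = "localhost", 9222  # default port is an assumption


def resolve_agent(cli_host=None, cli_port=None, broker_lookup=None):
    """Resolve the agent address in the documented order:
    explicit args -> environment -> broker discovery -> default."""
    # 1. Explicit command arguments win outright.
    if cli_host and cli_port:
        return cli_host, int(cli_port)
    # 2. Environment (a .devflow project file would be consulted here too).
    host = os.environ.get("AILOHA_AGENT_HOST")
    port = os.environ.get("AILOHA_AGENT_PORT")
    if host and port:
        return host, int(port)
    # 3. Ask the broker daemon, if a lookup callable was provided.
    if broker_lookup is not None:
        found = broker_lookup()
        if found:
            return found
    # 4. Fall back to the default local port.
    return DEFAULT_HOST, DEFAULT_PORT
```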
URL structure & versioning
All HTTP routes share a common shape:
http://localhost:{port}/api/v1/{namespace}/{resource}
ws://localhost:{port}/ws/v1/{channel}
Routes are organized under versioned namespaces:
| Namespace | Purpose | Example |
|---|---|---|
| agent | Identity & capabilities | /api/v1/agent/status |
| ui | Visual tree, actions, screenshots | /api/v1/ui/tree |
| webview | Embedded web content | /api/v1/webview/contexts |
| profiler | Performance monitoring | /api/v1/profiler/sessions |
| network | HTTP traffic monitoring | /api/v1/network/requests |
| logs | Application log access | /api/v1/logs |
| device | Hardware, sensors, permissions | /api/v1/device/info |
| storage | Preferences & secure storage | /api/v1/storage/preferences |
| ext | Third-party extensions | /api/v1/ext/{namespace}/... |
URL path versioning is used everywhere. The version segment (v1) appears in both HTTP and WebSocket URLs. Future breaking changes increment the version (v2, v3) while maintaining backward compatibility on previous versions.
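Because every route shares this shape, clients can centralize URL construction in two small helpers. A minimal sketch; the port value is whatever discovery returned:

```python
def api_url(port: int, namespace: str, resource: str, version: int = 1) -> str:
    """Build a versioned HTTP route of the documented shape."""
    return f"http://localhost:{port}/api/v{version}/{namespace}/{resource}"


def ws_url(port: int, channel: str, version: int = 1) -> str:
    """Build a versioned WebSocket channel URL."""
    return f"ws://localhost:{port}/ws/v{version}/{channel}"
```

Centralizing the version segment here means a future v2 migration touches one place in the client, not every call site.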
Quick start
1. Discover the agent
Verify the agent is running and inspect its identity:
GET /api/v1/agent/status
{
"agent": {
"name": "devflow-maui",
"version": "1.0.0",
"framework": "maui",
"frameworkVersion": "10.0"
},
"platform": "ios",
"device": { "model": "iPhone 15 Pro", "manufacturer": "Apple", "osVersion": "18.0" },
"app": { "name": "MyApp", "packageId": "com.example.myapp", "version": "2.1.0" },
"running": true,
"uptime": 42.5
}
2. Check capabilities
Find out what this agent supports before calling feature endpoints:
GET /api/v1/agent/capabilities
{
"capabilities": {
"ui.tree": { "version": 1, "features": ["css-selectors", "hit-test", "native-layer"] },
"ui.actions": { "version": 1, "features": ["tap", "fill", "scroll", "navigate", "batch"] },
"ui.screenshot": { "version": 1, "features": ["element-capture", "window-capture"] },
"webview": { "version": 1, "features": ["evaluate", "dom-query", "source"] },
"profiler": { "version": 1, "features": ["samples", "markers", "spans"] },
"network": { "version": 1, "features": ["capture", "stream"] },
"logs": { "version": 1, "features": ["query", "stream"] }
}
}
3. Inspect → act → verify
The most common workflow is three steps: get the tree, perform an action, and check the result. Bundle them in two HTTP calls using the include parameter:
GET /api/v1/ui/tree?depth=5
POST /api/v1/ui/actions/tap
Content-Type: application/json
{ "elementId": "btn_submit", "include": ["screenshot", "tree"] }
The action response now contains the success flag, the resulting screenshot, and the new visual tree — no follow-up calls needed.
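The two-call loop can be sketched against any HTTP client that returns parsed JSON. The `client` object, its `.get`/`.post` interface, and the pick-by-text target selection are illustrative assumptions, not part of the protocol:

```python
def iter_elements(nodes):
    """Depth-first walk over a list of ElementInfo dicts."""
    for node in nodes:
        yield node
        yield from iter_elements(node.get("children", []))


def inspect_act_verify(client, target_text="Submit"):
    """One pass of the inspect -> act -> verify loop in two HTTP calls.
    `client` is any object with .get(path) and .post(path, body) -> dict."""
    # 1. Inspect: fetch the visual tree (depth-limited to keep payloads small).
    tree = client.get("/api/v1/ui/tree?depth=5")["tree"]
    # 2. Pick a target -- here, the first element whose visible text matches.
    target = next(e for e in iter_elements(tree) if e.get("text") == target_text)
    # 3. Act, bundling the verification data into the same round-trip.
    return client.post("/api/v1/ui/actions/tap",
                       {"elementId": target["id"], "include": ["screenshot", "tree"]})
```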
Capability discovery
Before calling any feature endpoint, clients should check what the agent supports. Not every agent implements every capability: a Flutter agent may not expose `storage.secure`, and the native Android agent currently returns explicit unsupported responses for secure storage, BLE, jobs, and WebSocket streams.
Capabilities are organized into namespaces with a version number and a list of supported features:
{
"capabilities": {
"ui.tree": {
"version": 1,
"features": ["css-selectors", "hit-test", "native-layer"]
},
"ui.actions": {
"version": 1,
"features": ["tap", "fill", "clear", "scroll", "focus", "navigate", "resize", "back", "key", "gesture", "batch"]
},
"device.sensors": {
"version": 1,
"features": ["accelerometer", "gyroscope", "compass"]
}
}
}
Client usage pattern
caps = client.get("/api/v1/agent/capabilities")

if "ui.actions" in caps["capabilities"]:
    actions = caps["capabilities"]["ui.actions"]
    if "batch" in actions["features"]:
        client.post("/api/v1/ui/actions/batch", { ... })
    else:
        client.post("/api/v1/ui/actions/tap", { ... })

if "webview" not in caps["capabilities"]:
    print("WebView features not available for this framework")
Extension marker on status
GET /api/v1/agent/status includes a lightweight extension marker so clients can skip the heavier capabilities request when nothing's registered:
{ "extensions": { "count": 2, "hash": "a1b2c3d4..." } }
Cache extension metadata by hash and only re-fetch GET /api/v1/agent/capabilities when the hash changes.
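A minimal caching sketch, assuming a `client` with a `.get(path)` method returning parsed JSON; the module-level cache dict is just for illustration:

```python
# Cache keyed by the extension hash reported on /agent/status.
_ext_cache = {"hash": None, "capabilities": None}


def capabilities_with_cache(client):
    """Re-fetch /agent/capabilities only when the status extension hash changes."""
    status = client.get("/api/v1/agent/status")
    ext_hash = status.get("extensions", {}).get("hash")
    if _ext_cache["capabilities"] is None or _ext_cache["hash"] != ext_hash:
        _ext_cache["capabilities"] = client.get("/api/v1/agent/capabilities")
        _ext_cache["hash"] = ext_hash
    return _ext_cache["capabilities"]
```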
Element queries & locator strategies
Elements in the visual tree are located using named locator strategies, inspired by the W3C WebDriver specification.
| Strategy | Description | Example |
|---|---|---|
| accessibility-id | Match by automation/accessibility ID | ?strategy=accessibility-id&value=submit-btn |
| css-selector | CSS selector adapted for native UI trees | ?strategy=css-selector&value=Button:visible.primary |
| type | Match by element type name | ?strategy=type&value=Entry |
| text | Match by visible text | ?strategy=text&value=Submit |
| xpath | XPath expression over the tree | ?strategy=xpath&value=//Button[@text='Submit'] |
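Selector values often contain characters that are unsafe in URLs (quotes, brackets, slashes), so clients should percent-encode them. A small sketch; the strategy whitelist mirrors the table above:

```python
from urllib.parse import urlencode

LOCATOR_STRATEGIES = {"accessibility-id", "css-selector", "type", "text", "xpath"}


def element_query(strategy: str, value: str) -> str:
    """Build the path + query string for GET /api/v1/ui/elements,
    percent-encoding the selector value."""
    if strategy not in LOCATOR_STRATEGIES:
        raise ValueError(f"unsupported locator strategy: {strategy}")
    return "/api/v1/ui/elements?" + urlencode({"strategy": strategy, "value": value})
```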
Element model
Every element in the tree conforms to the ElementInfo schema — a cross-framework model with consistent fields regardless of the underlying UI framework:
{
"id": "elem_abc123",
"parentId": "window_0",
"type": "Button",
"fullType": "Microsoft.Maui.Controls.Button",
"framework": "maui",
"automationId": "submit-btn",
"text": "Submit",
"role": "button",
"traits": ["interactive", "focusable"],
"state": {
"displayed": true, "enabled": true, "selected": false, "focused": false, "opacity": 1.0
},
"bounds": { "x": 10, "y": 100, "width": 200, "height": 50, "coordinate": "window" },
"gestures": ["tap"],
"nativeView": {
"type": "android.widget.Button",
"properties": { "elevation": "4.0" }
},
"frameworkProperties": {
"maui:bindingContext": "SubmitViewModel"
},
"children": []
}
| Field | Purpose |
|---|---|
| id | Globally unique identifier — use this to target actions |
| type / fullType | Short and fully-qualified type name |
| role | Semantic role (button, textbox, checkbox, list, window, …) |
| traits | Behavioral traits (interactive, focusable, scrollable, adjustable) |
| state | Current interactive state — displayed, enabled, selected, focused |
| bounds | Bounding rectangle in window or screen coordinates |
| nativeView | Underlying platform view type and properties |
| frameworkProperties | Framework-specific data not captured by standard fields |
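Clients typically fetch the tree once and resolve elements locally. A sketch of a depth-first lookup by `automationId` over the ElementInfo shape above:

```python
def find_by_automation_id(nodes, automation_id):
    """Depth-first search of an ElementInfo tree (a list of dicts with
    optional `children`) for a matching automationId. Returns the element
    dict, or None when nothing matches."""
    for node in nodes:
        if node.get("automationId") == automation_id:
            return node
        hit = find_by_automation_id(node.get("children", []), automation_id)
        if hit is not None:
            return hit
    return None
```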
Tree layers
Agents may support multiple tree representations:
GET /api/v1/ui/tree?layer=framework # Default: framework component tree
GET /api/v1/ui/tree?layer=native # Underlying platform views
GET /api/v1/ui/tree?layer=render # Render objects (Flutter-specific)
Windows as tree nodes
Windows are root-level elements in the visual tree, not a separate API resource. There is no ?window=N parameter — windows appear naturally as top-level nodes:
{
"tree": [
{ "id": "window_0", "type": "Window", "role": "window",
"state": { "focused": true, "displayed": true, "enabled": true, "selected": false, "opacity": 1.0 },
"children": [ { "id": "page_main", "type": "NavigationPage", "children": [ ... ] } ] },
{ "id": "window_1", "type": "Window", "role": "window",
"state": { "focused": false, "displayed": true, "enabled": true, "selected": false, "opacity": 1.0 },
"children": [ ... ] }
]
}
- One call shows everything. The full tree includes all windows.
- Element IDs are globally unique. Tap `elem_abc123` and it targets the correct window automatically.
- CSS selectors work across windows. `Window:focused Button.submit` naturally scopes to the focused window.
- Filter when needed. `GET /api/v1/ui/tree?rootId=window_0` returns just one window's subtree.
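Since windows are just root nodes, scoping to the active window is a plain filter over the tree, with no separate window API involved:

```python
def focused_window(tree):
    """Return the root-level window node whose state.focused is true,
    or None when no window reports focus."""
    return next((w for w in tree if w.get("state", {}).get("focused")), None)
```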
AI-optimized patterns
The protocol is designed for AI agents that loop inspect → act → verify. Every unnecessary HTTP round-trip slows the AI down, so two patterns minimize calls.
Composite responses
Every action endpoint accepts an include parameter to bundle additional data in the response:
POST /api/v1/ui/actions/tap
Content-Type: application/json
{ "elementId": "btn_login", "include": ["screenshot", "tree"] }
{
"success": true,
"screenshot": "iVBORw0KGgoAAAANSUhEUgAA...",
"tree": [ { "id": "window_0", "type": "Window", "children": [ ... ] } ]
}
screenshot and tree are only populated when requested — no wasted bandwidth if you don't need them.
Batch actions
Execute multiple actions in a single request with POST /api/v1/ui/actions/batch:
POST /api/v1/ui/actions/batch
Content-Type: application/json
{
"actions": [
{ "action": "fill", "elementId": "txt_email", "text": "user@example.com" },
{ "action": "fill", "elementId": "txt_password", "text": "s3cret!" },
{ "action": "tap", "elementId": "btn_login" }
],
"include": ["screenshot", "tree"],
"continueOnError": false
}
One request fills two fields, taps a button, and returns the resulting screenshot and tree. Two calls (tree + batch) instead of five.
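A client can combine batching with capability discovery: use the batch endpoint when the agent advertises the `batch` feature, and degrade to one call per action otherwise. The `client.post(path, body)` interface and the assumption that each action name maps to its convenience route are illustrative:

```python
def run_actions(client, caps, actions):
    """Execute a list of action dicts ({"action": ..., "elementId": ..., ...}),
    batching when supported and falling back to per-action calls otherwise."""
    features = caps.get("capabilities", {}).get("ui.actions", {}).get("features", [])
    if "batch" in features:
        # One round-trip for the whole sequence.
        return [client.post("/api/v1/ui/actions/batch",
                            {"actions": actions, "continueOnError": False})]
    # Fallback: one round-trip per action, via the convenience endpoints.
    return [client.post(f"/api/v1/ui/actions/{a['action']}",
                        {k: v for k, v in a.items() if k != "action"})
            for a in actions]
```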
WebSocket channels
Real-time streaming uses WebSocket connections. All channels share a standard message envelope.
{
"type": "event_type",
"timestamp": "2024-01-15T10:30:00.123Z",
"data": { ... }
}
| Channel | Content | Event types |
|---|---|---|
| /ws/v1/network | HTTP traffic stream | replay, request |
| /ws/v1/logs | Application log stream | replay, log |
| /ws/v1/device/sensors | Sensor readings | subscribed, reading |
| /ws/v1/profiler | Performance data | samples, marker, span |
| /ws/v1/ui/events | UI lifecycle events | navigation, lifecycle, treeChange |
Subscription model
After connecting, send a subscribe message to configure filtering:
// Client → Agent
{ "type": "subscribe", "filter": { "sensor": "accelerometer", "throttleMs": 100 } }
// Agent → Client (confirmation)
{ "type": "subscribed", "timestamp": "2024-01-15T10:30:00.123Z",
"sensor": { "name": "accelerometer", "type": "accelerometer", "available": true } }
// Agent → Client (live data)
{ "type": "reading", "timestamp": "2024-01-15T10:30:00.223Z",
"reading": { "sensor": "accelerometer", "x": 0.02, "y": -9.81, "z": 0.01 } }
Channels that support historical replay (network, logs) send a replay batch of existing entries immediately after subscription, followed by live events.
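Because every channel shares the same envelope, one dispatcher can serve all of them. The handler signature used here, a callable taking the timestamp and the remaining payload fields, is an illustrative choice, not part of the spec:

```python
import json


def dispatch(raw: str, handlers: dict) -> bool:
    """Route one WebSocket envelope by its `type` field. `handlers` maps
    event types ('replay', 'log', 'reading', ...) to callables taking
    (timestamp, payload). Unknown types are ignored so a client stays
    forward-compatible when new event types appear."""
    msg = json.loads(raw)
    handler = handlers.get(msg.get("type"))
    if handler is None:
        return False
    payload = {k: v for k, v in msg.items() if k not in ("type", "timestamp")}
    handler(msg.get("timestamp"), payload)
    return True
```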
Error handling
All errors use the RFC 7807 Problem Details format with a machine-readable errorCode field.
{
"type": "https://devflow.dev/errors/element-not-found",
"title": "Element Not Found",
"status": 404,
"detail": "No element found with id 'btn_submit'. The element may have been removed from the tree.",
"errorCode": "element-not-found",
"instance": "/api/v1/ui/actions/tap"
}
Standard error codes
| Error code | HTTP status | Description |
|---|---|---|
| element-not-found | 404 | No element matches the given ID or selector |
| stale-element-reference | 404 | Element ID was valid but the UI has changed and it no longer exists |
| element-not-interactable | 400 | Element exists but cannot be acted on (hidden, disabled, obscured) |
| invalid-selector | 400 | Malformed selector or unsupported locator strategy |
| timeout | 408 | Operation did not complete within the allowed time |
| unknown-command | 404 | Route or action type not recognized |
| unsupported-capability | 501 | Agent does not implement the requested capability |
| internal-error | 500 | Unexpected agent-side failure |
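A client-side sketch that converts Problem Details bodies into typed exceptions. Which codes are worth retrying is a judgment call shown here as an example, not something the spec mandates:

```python
class AgentError(Exception):
    """Raised for any RFC 7807 error response from the agent."""
    def __init__(self, problem: dict):
        super().__init__(problem.get("detail") or problem.get("title", "agent error"))
        self.error_code = problem.get("errorCode", "internal-error")
        self.status = problem.get("status", 500)


# Codes that usually succeed on retry after re-fetching the tree (an assumption).
RETRYABLE = {"stale-element-reference", "timeout"}


def raise_for_problem(status: int, body: dict):
    """Raise AgentError for any 4xx/5xx response body; no-op otherwise."""
    if status >= 400:
        raise AgentError(body)
```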
Cross-framework element mapping
The ElementInfo schema provides a unified model across frameworks. Here's how key concepts map:
| Concept | .NET MAUI | React Native | Flutter | Android | iOS |
|---|---|---|---|---|---|
| Element type | Button | TouchableOpacity | ElevatedButton | Button | UIButton |
| Automation ID | AutomationId | testID | ValueKey<String> | resource ID, setAilohaAutomationId(...), or Compose Modifier.testTag(...) | accessibilityIdentifier |
| Role | Inferred from type | accessibilityRole | Semantics role | Accessibility role | accessibilityTraits |
| Native view | Platform handler view | Host view | RenderObject / PlatformView | Self | Self |
Framework-specific data that doesn't fit the standard model goes in frameworkProperties:
{
"id": "elem_abc",
"type": "Button",
"framework": "maui",
"frameworkProperties": {
"maui:bindingContext": "LoginViewModel",
"maui:visualStateGroup": "CommonStates",
"maui:currentVisualState": "Normal"
}
}
Extensions
Third-party libraries can register their own tools without forking the protocol. Extensions own a reverse-domain namespace (e.g. com.acme.featureflags) and declare self-describing tool descriptors that mirror MCP tool metadata, so CLI and MCP clients can discover and invoke them with no extension-specific code.
GET /api/v1/ext/com.example.analytics/events
POST /api/v1/ext/com.example.analytics/flush
GET /api/v1/ext/io.sentry.devflow/breadcrumbs
{
"extensions": {
"com.example.analytics": {
"version": "1.0.0",
"description": "Analytics event inspector and replayer",
"tools": [
{
"name": "list_events",
"description": "List recent analytics events captured in-app",
"method": "GET",
"path": "/api/v1/ext/com.example.analytics/events",
"parameters": {
"type": "object",
"properties": {
"limit": { "type": "integer", "default": 50 }
}
},
"annotations": { "readOnly": true, "idempotent": true, "category": "analytics" }
}
]
}
}
}
Each tool descriptor carries name, description, method, path, optional parameters/returns JSON schemas, and behavioral annotations (readOnly, idempotent, destructive, category). Extension namespaces must use reverse-domain notation to prevent collisions.
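An MCP host can project these descriptors into its own tool list mechanically. The dot-to-underscore name mangling shown here is an assumed convention for producing a flat, collision-free tool name, not something the protocol mandates:

```python
def to_mcp_tool(namespace: str, tool: dict) -> dict:
    """Project one extension tool descriptor into an MCP-style listing:
    a namespaced name, description, input schema, and annotations."""
    return {
        "name": f"{namespace.replace('.', '_')}_{tool['name']}",
        "description": tool.get("description", ""),
        "inputSchema": tool.get("parameters", {"type": "object", "properties": {}}),
        "annotations": tool.get("annotations", {}),
    }
```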
Implementation checklist
MUST implement (core)
| Endpoint | Purpose |
|---|---|
| GET /api/v1/agent/status | Identity, platform, app info, uptime |
| GET /api/v1/agent/capabilities | Declare supported capabilities and features |
Without these, clients cannot discover or interact with the agent.
SHOULD implement (UI inspection & interaction)
| Endpoint | Purpose |
|---|---|
| GET /api/v1/ui/tree | Visual tree with windows as root nodes |
| GET /api/v1/ui/elements/{id} | Single element lookup by ID |
| GET /api/v1/ui/elements?strategy=… | Element query with locator strategies |
| GET /api/v1/ui/screenshot | Capture screenshot |
| POST /api/v1/ui/actions/tap | Tap element or coordinates |
| POST /api/v1/ui/actions/fill | Fill text input |
| POST /api/v1/ui/actions/scroll | Scroll |
| POST /api/v1/ui/actions/batch | Batch actions |
MAY implement (extended capabilities)
Implement as appropriate for the framework and use case: profiler, network, logs, device info & sensors, preference & secure storage, WebView, all WebSocket channels, and extensions.
Complete route reference
Agent
GET /api/v1/agent/status
GET /api/v1/agent/capabilities
UI inspection
GET /api/v1/ui/tree
GET /api/v1/ui/elements/{id}
GET /api/v1/ui/elements?strategy={strategy}&value={value}
GET /api/v1/ui/hit-test?x={x}&y={y}
GET /api/v1/ui/screenshot
UI actions
POST /api/v1/ui/actions/tap
POST /api/v1/ui/actions/fill
POST /api/v1/ui/actions/clear
POST /api/v1/ui/actions/focus
POST /api/v1/ui/actions/scroll
POST /api/v1/ui/actions/navigate
POST /api/v1/ui/actions/resize
POST /api/v1/ui/actions/back
POST /api/v1/ui/actions/key
POST /api/v1/ui/actions/gesture
POST /api/v1/ui/actions/batch
UI element properties
GET /api/v1/ui/elements/{id}/properties/{name}
PUT /api/v1/ui/elements/{id}/properties/{name}
WebView
GET /api/v1/webview/contexts
POST /api/v1/webview/evaluate
GET /api/v1/webview/dom
POST /api/v1/webview/dom/query
GET /api/v1/webview/source
POST /api/v1/webview/navigate
POST /api/v1/webview/input/click
POST /api/v1/webview/input/fill
POST /api/v1/webview/input/text
GET /api/v1/webview/screenshot
Profiler
GET /api/v1/profiler/capabilities
POST /api/v1/profiler/sessions
DELETE /api/v1/profiler/sessions/{id}
GET /api/v1/profiler/sessions/{id}/samples
POST /api/v1/profiler/markers
POST /api/v1/profiler/spans
GET /api/v1/profiler/hotspots
Network
GET /api/v1/network/requests
GET /api/v1/network/requests/{id}
DELETE /api/v1/network/requests
Logs
GET /api/v1/logs
Device
GET /api/v1/device/info
GET /api/v1/device/display
GET /api/v1/device/battery
GET /api/v1/device/connectivity
GET /api/v1/device/app
GET /api/v1/device/sensors
POST /api/v1/device/sensors/{name}/start
POST /api/v1/device/sensors/{name}/stop
GET /api/v1/device/permissions
GET /api/v1/device/permissions/{name}
GET /api/v1/device/geolocation
Storage
GET /api/v1/storage/preferences
GET /api/v1/storage/preferences/{key}
PUT /api/v1/storage/preferences/{key}
DELETE /api/v1/storage/preferences/{key}
DELETE /api/v1/storage/preferences
GET /api/v1/storage/secure/{key}
PUT /api/v1/storage/secure/{key}
DELETE /api/v1/storage/secure/{key}
DELETE /api/v1/storage/secure
Extensions
GET /api/v1/ext/{namespace}/{path}
POST /api/v1/ext/{namespace}/{path}
PUT /api/v1/ext/{namespace}/{path}
DELETE /api/v1/ext/{namespace}/{path}
Extension routes are defined by registered extensions using reverse-domain namespaces. Use GET /api/v1/agent/capabilities to discover available extensions and their tools.
WebSocket channels
WS /ws/v1/network
WS /ws/v1/logs
WS /ws/v1/device/sensors
WS /ws/v1/profiler
WS /ws/v1/ui/events
Inspired by WebDriver & Appium
The protocol incorporates battle-tested patterns from the W3C WebDriver specification and the Appium ecosystem:
- Formalized locator strategies instead of ad-hoc query parameters — unambiguous and extensible.
- Standard error codes familiar to anyone using Selenium or Appium (`element-not-found`, `stale-element-reference`, `element-not-interactable`).
- W3C Actions for gestures — a simplified pointer-action model lets you express any gesture (swipe, long-press, drag-and-drop, pinch-zoom) without a dedicated endpoint per gesture. Simple actions (tap, fill, scroll) remain as convenience endpoints.
- WebDriver state model — every element exposes `displayed`, `enabled`, `selected`, `focused` consistently across frameworks.
What's next
- Pick a framework and add the agent to your app.
- Build a custom extension to expose your library's tools to AI agents.
- Skip manual setup and let your AI coding agent run `ailoha init`.