DeepMind's Running Guide Agent: On-Device Gemma 4 for Blind Athletes

DeepMind's chest-mounted AI system lets blind runners navigate independently using dual-path on-device inference—no cloud, no tether.

Creeta

May 30, 2026

What the Running Guide Agent Actually Does

The Running Guide agent is an AI system developed by Google DeepMind that enables blind and low-vision (BLV) athletes to run without physical guide lines or sighted human partners. According to the announcement, published on May 20, 2026 and updated May 28, 2026, the agent runs entirely on a chest-mounted Pixel 10 Pro, with no cellular connection required during an active run, and communicates through bone-conduction headphones: ticking audio cues encode direction, while spoken alerts flag hazards. The word "unbounded" refers strictly to physical independence from tethered assistance; it is not an AI safety or capability claim. The system is a research-grade prototype tested with SG Enable in Singapore, not a consumer release.

Quick Answer: Google DeepMind's Running Guide agent lets BLV athletes run independently using a chest-mounted Pixel 10 Pro with no cellular dependency during the run. It pairs offline on-device segmentation with Gemma 4 E4B on-device reasoning, coordinated by three specialized subagents. As of May 2026, it is a research prototype trialed with SG Enable in Singapore — not a publicly available product.

The project was led by Robin Dua, Senior Director of AI Innovation & Research for Platforms & Devices at Google DeepMind, with Dr. Ramine Tinati of DeepMind APAC also publicly associated. Google DeepMind is piloting the system with SG Enable, Singapore's national disability and inclusion agency, making Singapore the initial deployment geography for real-world BLV athlete trials.

Prior assistance options for BLV runners fell into three categories: sighted human guides running alongside the athlete, physical guide wires fixed to a track, or GPS-only systems that provided route data but no real-time hazard awareness. The first two require another person or permanent infrastructure; the third offers independence but cannot perceive hazards. The Running Guide agent replaces these dependencies with continuous on-device camera inference: a segmentation model handles immediate directional output, and an on-device multimodal model (Gemma 4 E4B) interprets scene context such as track curves, surface changes, and nearby runners.

The audio interface is deliberately minimal. Ticking sounds vary in rhythm and tone to encode direction. Verbal alerts from the Coach Agent are kept short and telegraphic: a severity tier followed by a brief descriptor. This design maps directly to the cognitive load constraints of running at race pace — dense spoken language would compete with physical concentration, so the output vocabulary is tightly constrained by intent, not by technical limitation.

"A step towards running unbounded," wrote Robin Dua, Senior Director of AI Innovation & Research for Platforms & Devices at Google DeepMind — framing the project as directional research rather than a finished deployment. That distinction matters for developers evaluating this as a production reference.

This is a research-grade prototype. No public availability date, SDK, or pricing has been announced. Developers should treat it as a technical reference demonstrating what is now feasible on commodity smartphone hardware — not as an immediately buildable platform.

Dual-Path Inference: Safety vs. Reasoning

The Running Guide agent uses two concurrent inference paths that never block each other. Path 1 — a custom segmentation model running on Pixel 10 Pro's dedicated silicon — handles safety-critical output: immediate STOP commands and directional audio cues. It runs fully offline, with no dependency on cellular connectivity or the reasoning path completing. Path 2 — Gemma 4 E4B running via AICore — provides higher-level multimodal scene understanding, combining image and text inputs to interpret context the segmentation model cannot: track layout, distant obstacles, and environmental conditions.

The critical design constraint is isolation: the safety path must never stall waiting for the reasoning path to complete. On commodity smartphone hardware, where NPU time is shared across thermal and battery budgets, a single inference queue would introduce latency variance on the safety-critical path. The dual-path architecture resolves this by partitioning compute at the hardware level — the segmentation model runs on dedicated custom silicon, while Gemma 4 E4B uses the NPU via AICore, each with its own execution context.
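The isolation constraint can be illustrated in software with a non-blocking hand-off. The sketch below is a minimal illustration and not the Running Guide implementation: it uses Python threads rather than separate silicon, and every name in it (`safety_cue`, `offer_latest`, `reasoning_worker`, the `blocked` flag) is hypothetical. It shows one way to guarantee that the fast path never waits: the slow path reads from a one-slot mailbox, and the producer overwrites a stale frame instead of queuing behind it.

```python
import queue
import threading
import time


def safety_cue(frame):
    """Fast-path stand-in: a cheap per-frame decision that never waits."""
    return "STOP" if frame["blocked"] else "tick"


def offer_latest(mailbox, frame):
    """Hand a frame to the slow path without ever blocking the caller."""
    try:
        mailbox.put_nowait(frame)
    except queue.Full:
        try:
            mailbox.get_nowait()  # drop the stale frame nobody picked up
        except queue.Empty:
            pass  # the worker grabbed it in the meantime
        mailbox.put_nowait(frame)  # single producer, so the slot is free


def reasoning_worker(mailbox, advisories, stop):
    """Slow-path stand-in: processes whichever frame is newest when free."""
    while True:
        try:
            frame = mailbox.get(timeout=0.01)
        except queue.Empty:
            if stop.is_set() and mailbox.empty():
                return
            continue
        time.sleep(0.02)  # simulated heavy multimodal inference
        advisories.append(f"frame {frame['id']}: scene summarized")


def run_demo(n_frames=20):
    mailbox = queue.Queue(maxsize=1)  # one slot: the newest frame wins
    cues, advisories = [], []
    stop = threading.Event()
    worker = threading.Thread(
        target=reasoning_worker, args=(mailbox, advisories, stop)
    )
    worker.start()
    for i in range(n_frames):
        frame = {"id": i, "blocked": i == 7}  # frame 7 simulates an obstacle
        cues.append(safety_cue(frame))  # every frame gets a safety decision
        offer_latest(mailbox, frame)    # the slow path may skip frames
    stop.set()
    worker.join()
    return cues, advisories
```

Calling `run_demo()` yields one safety cue per frame regardless of how far the simulated reasoning path falls behind, while the advisory list is typically much shorter because stale frames are dropped rather than queued.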

A mechanism called Smarter Frame Selection controls which frames reach the Gemma 4 reasoning path. Rather than passing every captured frame through the heavier model, the system filters for high-entropy frames: those where the scene has changed significantly since the prior frame. A new obstacle appearing, a sudden surface change, or a junction in the track path triggers frame selection; a straight empty lane at steady pace does not. This reduces compute load on the reasoning path without sacrificing situational awareness, since routine frames add no new information to the model's assessment.

This architecture reflects a broader principle for real-time edge agents: the latency requirement of the response determines which path handles it, not the complexity of the input. A STOP command needed within one inference cycle belongs on the fast path regardless of how complex the scene is. A "track curves left in 40 meters" advisory can tolerate higher latency and benefits from richer model reasoning. Separating these concerns at architecture design time — rather than trying to make a single model handle both at different speeds — is what makes the system viable on a single smartphone.

Inference Path	Model / Engine	Connectivity	Output Type	Latency Priority
Safety Path (Path 1)	Custom segmentation model, Pixel 10 Pro dedicated silicon	Fully offline	STOP commands, directional ticking audio cues	Ultra-low latency (safety-critical, never blocked)
Reasoning Path (Path 2)	Gemma 4 E4B via AICore NPU	Fully on-device	Verbal scene context, hazard classification, route advisories	Higher latency tolerable (contextual, advisory)

The Smarter Frame Selection mechanism sits between the camera feed and Path 2. It evaluates each incoming frame for entropy delta relative to the prior processed frame. Only frames exceeding the entropy threshold enter the Gemma 4 inference queue. The segmentation path (Path 1) receives all frames continuously and is unaffected by this filter — hazard detection never misses a frame because the reasoning path is busy processing a prior one.
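The published description does not say how the entropy delta is computed, so the following is only a sketch under stated assumptions: it substitutes mean absolute pixel difference for the unspecified entropy measure, treats a frame as a flat sequence of grayscale values, and uses a made-up threshold. The `FrameGate` class and its fields are hypothetical names. What it does mirror from the description is the comparison against the last frame that was actually processed, not simply the previous frame.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


def mean_abs_diff(a: Sequence[int], b: Sequence[int]) -> float:
    """Average per-pixel change between two equally sized grayscale frames."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


@dataclass
class FrameGate:
    threshold: float = 12.0  # hypothetical value; would need tuning on real data
    reference: Optional[Sequence[int]] = None  # last frame sent to the model

    def should_forward(self, frame: Sequence[int]) -> bool:
        """Return True if the frame differs enough from the last forwarded one."""
        if (
            self.reference is None
            or mean_abs_diff(frame, self.reference) > self.threshold
        ):
            self.reference = frame  # the forwarded frame becomes the baseline
            return True
        return False
```

Because the baseline only moves when a frame is forwarded, slow cumulative drift eventually crosses the threshold even if no single frame-to-frame step does.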

Three Specialized Subagents and Their Roles

The Running Guide system is structured as three stateless, event-driven subagents that hand off to each other rather than sharing persistent state through a single monolithic loop. Each subagent owns a distinct phase of a running session — before, during, and at rest — with well-defined handoff boundaries. Coordination is event-driven: agent activation is triggered by session state transitions, not by periodic polling.

The Planner Agent operates pre-run. It uses Gemma 4's function-calling capability to pull live weather data and Google Maps context, then conducts a conversational exchange with the runner to establish workout goals and route expectations. It also calibrates a digital starting line — a reference point the system uses throughout the session to maintain spatial orientation. The Planner is the only agent that requires internet access; all subsequent agents operate entirely on-device with no network dependency.

The Coach Agent takes over once the run starts and is the highest-throughput component in the stack. It processes dual-path inference output and distills it into alerts using a strict three-tier hierarchy: DANGER (requires immediate evasive action), WARNING (nearby runners or obstacles within threat range), and NOTICE (informational — upcoming track curves, route features, environmental conditions). Each tier maps to a different expected response latency: DANGER triggers immediate audio, WARNING provides several seconds of lead time, NOTICE is advisory only. The verbal output is deliberately telegraphic — short noun phrases, not complete sentences — to minimize cognitive interruption during physical exertion.
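A tiered alert scheme like this maps naturally onto a priority queue. The sketch below is an assumption-laden illustration, not the Coach Agent's code: the `Tier` and `AlertQueue` names, the four-word cap, and the "TIER: descriptor" output format are all hypothetical choices made to reflect the severity-first, telegraphic style described above.

```python
import heapq
from enum import IntEnum
from itertools import count


class Tier(IntEnum):
    DANGER = 0   # lowest number is spoken first
    WARNING = 1
    NOTICE = 2


class AlertQueue:
    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker keeps same-tier alerts in order

    def push(self, tier, descriptor):
        heapq.heappush(self._heap, (Tier(tier), next(self._arrival), descriptor))

    def pop_utterance(self, max_words=4):
        """Return the most severe pending alert as a short spoken phrase."""
        if not self._heap:
            return None
        tier, _, descriptor = heapq.heappop(self._heap)
        brief = " ".join(descriptor.split()[:max_words])  # keep it telegraphic
        return f"{tier.name}: {brief}"
```

Pushing a NOTICE, then a WARNING, then a DANGER and popping three times returns them in the order DANGER, WARNING, NOTICE, so a late-arriving hazard always preempts queued advisories.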

The Break Agent manages rest intervals. When the runner pauses, it tracks the rest period and preserves session context so the Coach Agent can resume seamlessly without re-initialization. If session state were lost at each rest stop, the Planner would need to recalibrate on every resumption — adding setup friction that accumulates over a full training session.

Subagent	Phase	Primary Tools / Capabilities	Internet Required?	Key Output
Planner Agent	Pre-run setup	Gemma 4 function calling, Google Maps API, weather data, conversational input	Yes (pre-run only)	Route context, digital starting line, workout parameters
Coach Agent	Active run	Dual-path inference output, DANGER / WARNING / NOTICE alert hierarchy	No	Real-time ticking cues, telegraphic verbal hazard alerts
Break Agent	Rest intervals	Session state preservation, timer management	No	Seamless session resumption context for Coach Agent

The stateless handoff model has direct implications for developers. Each subagent's scope is well-defined: the Planner doesn't attempt real-time coaching, and the Coach doesn't attempt to manage rest intervals. This decomposition limits the blast radius of a failure in any single subagent — if the Break Agent's state serialization fails, only session resumption is affected, not the safety-critical directional cues. It also enables independent iteration: the team can refine the Coach Agent's alert hierarchy without touching the Planner's function-calling schema.
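One way to picture event-driven handoffs with well-defined boundaries is a small transition table. This is a sketch under assumptions, not the documented design: the event names (`run_started`, `runner_paused`, `runner_resumed`), the `Session` class, and the idea of keeping shared context in a session object outside the agents are all hypothetical.

```python
from enum import Enum


class Phase(Enum):
    PLANNING = "planner"
    RUNNING = "coach"
    RESTING = "break"


# Only these (phase, event) pairs cause a handoff; everything else is ignored.
TRANSITIONS = {
    (Phase.PLANNING, "run_started"): Phase.RUNNING,
    (Phase.RUNNING, "runner_paused"): Phase.RESTING,
    (Phase.RESTING, "runner_resumed"): Phase.RUNNING,
}


class Session:
    def __init__(self):
        self.phase = Phase.PLANNING
        self.context = {}  # handoff payload that survives rest intervals

    def handle(self, event, **payload):
        """Apply an event; return the phase that is active afterwards."""
        next_phase = TRANSITIONS.get((self.phase, event))
        if next_phase is None:
            return self.phase  # event is not meaningful in this phase
        self.context.update(payload)
        self.phase = next_phase
        return next_phase
```

Because no transition leads back to `Phase.PLANNING`, a pause followed by a resume returns control to the coach phase with the calibration payload still in `context`, which is the property the Break Agent is described as protecting.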

Gemma 4 E4B: Edge Deployment Profile

Gemma 4 E4B is one of the smaller variants in Google DeepMind's Gemma 4 model family, sitting above E2B in the edge-optimized lineup. It is designed specifically for edge deployment across mobile (Android via AICore, plus iOS), desktop, and IoT hardware rather than server inference. Despite its size, it supports multi-step planning, autonomous action, function calling, and audio-visual reasoning entirely on-device. According to the Google Developers Blog, Gemma 4 achieves approximately 3,700 prefill tokens per second and 31 decode tokens per second on Qualcomm Dragonwing IQ8 with NPU acceleration. The smaller E2B variant runs under 1.5 GB RAM at 2-bit quantization.

E4B sits above E2B and trades some RAM efficiency for improved multimodal accuracy — specifically the image-text joint reasoning that the Running Guide's scene interpretation relies on. For developers targeting current flagship Android hardware, E4B occupies a practical performance/capability balance point: capable enough for function calling and multimodal input, small enough to coexist with other on-device processes within a shared NPU budget.

The function-calling capability in E4B is what enables the Planner Agent to query live weather and Maps APIs before a run without a separate server-side orchestration layer. On-device function calling — where the model generates structured tool invocations, executes them against registered functions, and incorporates the results — eliminates the round-trip latency and privacy exposure of cloud-side tool use. For applications where sensor data or user context is privacy-sensitive (health monitoring, accessibility tools, location-aware agents), this architectural choice has consequences beyond raw performance.
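The generate, execute, incorporate loop can be sketched without any model at all by focusing on the dispatch step. Everything here is an assumption for illustration: the JSON shape with `name` and `arguments` keys is not Gemma 4's documented output format, and `get_weather` is a stub that returns placeholder data instead of calling a real service.

```python
import json

TOOLS = {}


def tool(fn):
    """Register a function so the dispatcher can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_weather(city: str) -> dict:
    # Stub only: a real tool would query a weather service here.
    return {"city": city, "condition": "unknown (stub)"}


def dispatch(model_output: str) -> dict:
    """Parse a structured tool call and run it against the registry."""
    call = json.loads(model_output)
    if not isinstance(call, dict):
        return {"error": "tool call must be a JSON object"}
    name = call.get("name")
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}  # never run unregistered names
    try:
        result = TOOLS[name](**call.get("arguments", {}))
    except TypeError as exc:  # typically wrong or missing argument names
        return {"error": str(exc)}
    return {"tool": name, "result": result}
```

The returned dictionary is what would be serialized back into the model's context for the next turn. Restricting execution to an explicit registry matters on-device as much as in the cloud, since the model's output is untrusted text until it has been validated.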

The Google Developers Blog frames Gemma 4 as bringing "state-of-the-art agentic skills to the edge," a characterization that maps directly to the Running Guide's requirements: scene understanding, external tool calls at setup time, and subagent coordination, all on hardware a runner carries on their chest. Full deployment details are available on the Google Developers Blog.

Developers building on Gemma 4 E4B should note what "on-device via AICore" means in practice: AICore is the Android system service that manages on-device ML models, providing a shared runtime so multiple apps can use the same model without each bundling its own weights. This lowers per-app storage overhead and allows Google to push model weight updates centrally. The iOS path uses a comparable on-device inference stack, but AICore is specific to Android.

Hardware Configuration and Prototype Roadmap

The current Running Guide configuration uses a chest-mounted Pixel 10 Pro as the sole compute host. Chest mounting provides a stable, forward-facing camera field of view that is consistent across different runners and running styles — an important consistency property for a model calibrated to a fixed perspective. The device handles both inference paths, audio output processing, and session state management within the thermal and battery constraints of a run that could last an hour or more.

A second-generation prototype under active development targets intelligent eyewear as the primary visual input source. Eyewear provides a wider and steadier field of view than a chest-mounted phone — eye-level perspective eliminates the parallax error introduced by chest height and captures a more natural forward-looking scene. Crucially, the eyewear streams video to the Pixel rather than replacing it: the Pixel remains the compute host for both inference paths. This keeps the architecture change incremental — only the input pipeline changes, not the inference stack.

Battery life is the hardware constraint that most directly shaped the dual-path architecture. On-device inference is thermally and energetically expensive. Running Gemma 4 E4B continuously on every frame would drain a Pixel battery well before most training sessions end. The Smarter Frame Selection mechanism is partly a compute optimization but also a power management strategy: by limiting reasoning path activations to high-entropy frames, the system extends the window of viable continuous operation. Developers building agents for extended mobile sessions should treat battery as a first-class constraint alongside latency — not as an afterthought.
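The power argument reduces to simple duty-cycle arithmetic. The sketch below is a back-of-envelope model only: the function name is hypothetical, the linear power model is an assumption, and the numbers in the usage note are round placeholders, not measurements of a Pixel 10 Pro or of Gemma 4 E4B.

```python
def runtime_hours(battery_wh, base_w, reasoning_w, duty_cycle):
    """Estimate runtime when the reasoning path is active a fraction of the time.

    battery_wh  -- usable battery energy in watt-hours
    base_w      -- always-on draw (camera, safety path, audio) in watts
    reasoning_w -- extra draw while the reasoning model is running, in watts
    duty_cycle  -- fraction of time the reasoning path is active, 0.0 to 1.0
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    average_w = base_w + duty_cycle * reasoning_w
    return battery_wh / average_w
```

With placeholder inputs of 18 Wh, 3 W base draw, and 6 W of additional reasoning draw, `runtime_hours(18, 3, 6, 1.0)` gives 2.0 hours and `runtime_hours(18, 3, 6, 0.25)` gives 4.0 hours. The specific figures are invented; the point is that runtime scales with how often the gate lets a frame through.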

Thermal management is the related constraint. NPU-intensive workloads on mobile hardware can trigger thermal throttling after sustained load, degrading inference latency — precisely the opposite of what a safety-critical path requires. The segmentation model running on dedicated custom silicon is less exposed to NPU thermal state, which provides another reason the safety path and the NPU-backed reasoning path are architecturally separated.

What 'First Unbounded Consumer AI Agent in Production' Actually Means

Neither the Google DeepMind blog post nor any official source describes the Running Guide agent as "the first unbounded consumer AI agent in production." That phrase is a commentator label — an interpretive characterization applied after the fact, not an official claim from the team. The actual subtitle of the DeepMind blog post is "A step towards running unbounded" — forward-looking language that explicitly positions this as ongoing research.

The official description is research-grade prototype. No public availability date has been announced. No SDK, API, or consumer app has been released. The deployment is a structured partner trial with SG Enable in Singapore — a methodology consistent with responsible accessibility research, where real-world testing requires controlled conditions and collaboration with the target community, but does not constitute a product launch.

What is technically accurate — and what makes this noteworthy for developers — is more specific than the inflated label suggests: this is one of the first publicly documented systems to combine fully on-device multi-subagent architecture with multimodal reasoning in a real-world deployment trial outside a lab. The emphasis belongs on "multi-subagent" and "multimodal on-device" together. Single-model on-device inference has existed for several years. Multi-agent coordination on a single smartphone with no cloud dependency during operation is what the system demonstrates as newly viable on commodity hardware.

Dr. Ramine Tinati of DeepMind APAC, publicly associated with the project on LinkedIn, framed the work under #AIforScience and #AIforGood — signaling its positioning as applied research with a social mission, not a competitive product announcement. That framing is relevant when assessing what "production" means in this context.

The practical implication for developers: do not plan a product roadmap around building on top of the Running Guide system today. There is no public API. The value is in the architectural patterns it demonstrates — dual-path inference, Smarter Frame Selection, on-device function calling via Gemma 4 — which are composable and applicable now using publicly available tools. The system itself is inaccessible for third-party development as of May 2026.

One specific gap worth noting: no latency figures in milliseconds for either inference path have been published. No quantitative benchmarks comparing this system against prior guide systems or other agent deployments appear in the public record. The architecture is technically sound and the design choices are well-motivated — but quantitative validation is absent.

Implications for Developers Building On-Device Agents

The Running Guide agent is a useful reference architecture because its constraints — commodity mobile hardware, no cloud fallback during critical operation, real-time sensor input, mixed latency requirements — map directly to a wide class of practical edge agent applications. The patterns it demonstrates are not accessibility-specific; they generalize across robotics, automotive, AR overlays, warehouse automation, and consumer health applications.

Dual-path partitioning is the most transferable pattern. Any agent that receives real-time sensor input and must produce both immediate reflexive outputs and higher-level contextual judgments benefits from separating these into concurrent paths with independent execution contexts. The operational rule: if a response must arrive within one inference cycle of the triggering event, it belongs on a dedicated low-latency path. If it can tolerate several hundred milliseconds and benefits from richer model reasoning, it belongs on the contextual path. Merging these onto a single path creates latency variance that is acceptable for advisory outputs and unacceptable for safety-critical ones.
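The operational rule can be written down as a routing function that looks only at the response deadline. This is a sketch with hypothetical names (`Response`, `route`) and hypothetical numbers; the `scene_complexity` field exists purely to show that the router ignores it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    name: str
    deadline_ms: float       # how soon the output must reach the user
    scene_complexity: float  # deliberately unused by the router


def route(response: Response, fast_cycle_ms: float) -> str:
    """Pick a path from the deadline alone, never from input complexity."""
    if response.deadline_ms <= fast_cycle_ms:
        return "fast"
    return "contextual"
```

Assuming a fast-path cycle of 33 ms for illustration, `route(Response("STOP", 30, 0.9), 33)` returns `"fast"` even though the scene is complex, while `route(Response("curve advisory", 2000, 0.1), 33)` returns `"contextual"` even though the scene is simple.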

High-entropy frame selection is a practical optimization for any video-input agent running on constrained hardware. Passing every frame to a large model is wasteful when most consecutive frames in a steady-state scenario are nearly identical. Building a lightweight entropy estimator — frame differencing, optical flow magnitude, or a small binary classifier — to gate model invocations directly reduces inference cost proportional to scene stability. The Running Guide implementation demonstrates this is worth implementing even on NPU-accelerated hardware, where power and thermal savings over an extended session compound significantly.

On-device function calling via Gemma 4 removes the architectural assumption that structured tool use requires a server-side LLM. Mobile developers who have previously routed tool orchestration through a cloud model now have a path to fully local tool use — with the associated privacy and latency benefits. The Planner Agent's pre-run weather and Maps queries are a minimal example; the same pattern applies to local health data queries, device calendar lookups, or on-device sensor aggregation.

Partner-piloted accessibility research as a deployment methodology is a practical template for teams building assistive or safety-critical technology. The SG Enable partnership demonstrates how to conduct real-world trials with a target community before broader deployment: a defined partner organization, a specific geography, a controlled trial population, and explicit research framing. This structure is both ethically appropriate for assistive technology and practically useful — it generates real-world feedback without the support overhead of a public release.

Developers looking to build on these patterns today have concrete starting points: Gemma 4 E4B is available via AICore on Android and through standard model hubs; on-device function calling is documented in the Google Developers Blog; and the dual-path inference pattern can be implemented using any combination of on-device model runtimes (AICore, Core ML, ONNX Runtime) alongside a dedicated low-latency processing pipeline.

Frequently Asked Questions

Does the Running Guide agent require an internet connection during a run?

No. Both active inference paths — the segmentation model on Pixel 10 Pro's dedicated silicon and Gemma 4 E4B on the NPU — run entirely on-device with no cellular dependency during a run. Internet access is only required pre-run by the Planner Agent, which calls external APIs (weather data, Google Maps) to establish route context and calibrate the digital starting line before the session begins. Once a run starts, the system operates fully offline.

What is Gemma 4 E4B and how does it differ from larger Gemma 4 variants?

Gemma 4 E4B is one of the smallest edge-optimized variants in Google DeepMind's Gemma 4 model family. It targets mobile, desktop, and IoT hardware (on Android through AICore, with a comparable on-device stack on iOS) and supports multimodal input (image + text), multi-step planning, function calling, and audio-visual reasoning entirely on-device. The smaller E2B variant runs under 1.5 GB RAM at 2-bit quantization; E4B trades some RAM efficiency for improved multimodal accuracy. Larger variants in the Gemma 4 lineup target higher-compute edge hardware and server deployments and are not suited for single-smartphone operation in this configuration.

Is the Running Guide agent available to download or use?

Not currently. As of May 2026, the Running Guide agent is a research-grade prototype being trialed with SG Enable in Singapore. No public SDK, consumer app, or developer API has been announced by Google DeepMind. Developers cannot build on top of the system directly today. The architectural patterns it demonstrates — dual-path inference, Smarter Frame Selection, Gemma 4 on-device function calling — are available independently through existing tools and model hubs.

How does Smarter Frame Selection reduce latency without losing safety coverage?

Smarter Frame Selection applies only to the Gemma 4 reasoning path (Path 2), not to the safety-critical segmentation path (Path 1). Path 1 receives every frame continuously and is unaffected by the filter. The selection mechanism evaluates incoming frames for entropy delta — how much the scene has changed relative to the last processed frame. Only frames exceeding the entropy threshold (indicating a new obstacle, terrain shift, or sudden scene change) are forwarded to Gemma 4 for reasoning. Routine frames from a stable, unchanged environment are skipped, reducing NPU load and power draw without any reduction in hazard detection coverage on the safety path.

Can the dual-path inference pattern be applied to non-accessibility use cases?

Yes. The core principle — separate a latency-critical reactive path from a slower reasoning path, run them concurrently with independent execution contexts — is applicable wherever you have real-time sensor input with mixed response latency requirements. Concrete examples: warehouse robotics (collision avoidance on the fast path, route optimization on the reasoning path); automotive ADAS (emergency braking on the fast path, lane-change advisories on the reasoning path); AR overlays (object anchoring on the fast path, semantic labeling on the reasoning path); consumer health monitoring (alert thresholds on the fast path, trend analysis on the reasoning path). The accessibility framing of the Running Guide is domain-specific; the architectural pattern is not.

What Comes Next for On-Device Multi-Agent Systems

The Running Guide agent demonstrates something that has been theoretically feasible for several years but rarely implemented in practice: a multi-agent, multimodal AI system operating entirely on commodity smartphone hardware in a real-world environment. The combination of on-device function calling (Gemma 4 E4B), hardware-partitioned dual-path inference (Pixel 10 Pro silicon + AICore NPU), and event-driven subagent coordination represents a deployable template for edge agents that need both immediate reflexes and contextual judgment — without cloud dependency during critical operation.

For developers, the near-term takeaway is practical: the tools to replicate these architectural patterns are already available. Gemma 4 E4B on Android AICore is accessible today. The dual-path inference pattern is implementable with existing runtimes. High-entropy frame selection is a lightweight optimization any video-input agent team can add. The distance between the Running Guide's architecture and what a well-resourced development team can build today is narrower than the research framing might suggest.

The longer-term signal is about the trajectory of on-device agents as a class. The second-generation eyewear prototype points toward a hardware evolution that provides richer sensor input without restructuring the compute architecture. As Gemma 4 variants reach broader hardware targets — Qualcomm, MediaTek, Apple Silicon — the patterns demonstrated here apply to a far larger installed base. The accessibility use case is specific to its domain; the engineering decisions transfer directly to any developer building agents where latency, privacy, and offline operation are constraints that matter.

Last updated: 2026-05-29. Article based on the Google DeepMind Running Guide announcement published May 20, 2026 (updated May 28, 2026) and the Google Developers Blog Gemma 4 edge deployment documentation, both published May 2026.