LLM #gemini #managed-agents #google #agentic-ai

Google Managed Agents API: Sandbox, Skills, and Agentic Stack Analysis

One API call provisions a hosted Linux agent with persistent state and GCS mounts. Here's what developers need to know.

Creeta

May 28, 2026

What the Managed Agents API Provisions Out of One Call

The Managed Agents API is Google's hosted agent runtime: a single API call provisions a complete, stateful Linux environment with Bash, Python, and Node.js runtimes, along with web browsing, file handling, persistent state across calls, and optional repository or Google Cloud Storage (GCS) mounts. The agent identifier is antigravity-preview-05-2026, a preview-channel designation that signals breaking changes are possible before a stable release lands. Google announced this alongside Gemini 3.5 Flash at Google I/O on May 19, 2026, positioning the API as infrastructure for agentic workloads rather than a consumer feature.

Quick Answer: A single Managed Agents API call provisions a hosted Linux sandbox with Bash, Python, and Node.js runtimes, persistent state, web browsing, GCS and repo mounts, and markdown-defined custom skills — all under the antigravity-preview-05-2026 preview identifier powered by Gemini 3.5 Flash. Breaking changes before stable release should be expected.

The sandbox eliminates the scaffolding developers typically assemble around a raw LLM: no separate container orchestration, no SSH sessions for code execution, no custom sync layer between cloud storage and agent workspace. Repository mounts let agents operate directly against version-controlled codebases; GCS mounts provide a shared artifact layer for pipelines where multiple agents or downstream steps need to read from the same file tree. This positions the API as infrastructure-inclusive rather than model-serving only.

Markdown-defined custom skills are the extension point for domain-specific behavior. A skill is a structured markdown document that instructs the agent how to handle a particular class of task — effectively prompt engineering elevated to a first-class API surface. The design keeps skill authoring accessible to domain experts who lack the infrastructure to fine-tune a model but can describe task workflows in natural language. The limitation is that skills are soft guidance: they shape behavior without hard behavioral guarantees, which matters for compliance-sensitive workloads.

The preview designation carries real operational weight. The identifier antigravity-preview-05-2026 names a specific month, which is a strong signal that this channel is not frozen. Parameter names, response shapes, or runtime behavior may change before a stable channel is published. According to Google's Managed Agents announcement, developers building production workloads should treat this as an architecture exploration window, not a stable dependency.

"Frontier Intelligence with Action" — the product positioning tagline for the Gemini 3.5 generation, per the Google DeepMind launch post authored by Koray Kavukcuoglu, VP of Research. The framing signals Google's intent to compete on execution capability, not just model quality.

The practical boundary of the current preview is clear enough: code-native agentic workflows that don't require a stable external contract are well-suited to building on this API today. Anything that needs a versioned, contractual endpoint should monitor the channel status closely and wait for a stable identifier to be published.

Antigravity 2.0: Runtime Architecture and Parallel Execution

Antigravity 2.0 is a full agent-first development platform built around Gemini 3.5 Flash. It ships a desktop application, CLI, and developer SDK, which makes it accessible outside the cloud console, and runs Gemini 3.5 Flash at 12× the speed of the public API. The public API throughput of approximately 280 tokens per second already exceeds comparable frontier models; the 12× multiplier inside the Antigravity runtime makes latency-sensitive agentic loops practically viable at scale.

Capability	Antigravity 2.0 / Managed Agents API
Model	Gemini 3.5 Flash (`gemini-3.5-flash`)
Runtime speed	12× public API throughput
Public API throughput	~280 tokens/sec
Context window	1,048,576 tokens (~1M)
Max output tokens	65,536
Developer access	Desktop app, CLI, SDK
Parallel subagents	Yes — I/O keynote demonstrated 93 concurrent agents
Scheduled background tasks	Yes
I/O demo (OS build)	93 agents, ~12 hrs, 15,000+ API calls, 2.6B tokens, <$1,000

The architecture natively supports parallel subagent orchestration. Google's I/O keynote demo made this concrete: 93 parallel subagents were dispatched to build a functioning operating system in approximately 12 hours, consuming over 15,000 API requests and 2.6 billion tokens at under $1,000 in API credits. That works out to at most roughly $0.38 per million tokens ($1,000 divided by 2,600 million tokens), which is what makes parallel agentic pipelines economically viable rather than just technically feasible.
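The per-token figure from the demo can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted above; the function name is illustrative, and because the spend was reported as "under $1,000" and the request count as "over 15,000", both results are upper bounds rather than exact values.

```python
# Figures quoted from the I/O keynote demo; spend and request count are bounds.
TOTAL_TOKENS = 2_600_000_000
MAX_SPEND_USD = 1_000.00
API_REQUESTS = 15_000


def blended_rate_per_million(spend_usd: float, tokens: int) -> float:
    """Average dollars spent per one million tokens."""
    return spend_usd / (tokens / 1_000_000)


rate = blended_rate_per_million(MAX_SPEND_USD, TOTAL_TOKENS)
tokens_per_request = TOTAL_TOKENS / API_REQUESTS

print(f"blended rate: at most ${rate:.3f} per 1M tokens")
print(f"average tokens per request: at most {tokens_per_request:,.0f}")
```

A blended rate near $0.38 per million tokens is only arithmetically possible if most of those tokens were billed well below the uncached list rates, for example as cached input; the keynote figures do not break that down.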

The 1,048,576-token context window and 65,536 max output tokens are relevant for agents that need to hold large codebases or long document sets in context during a single reasoning pass. In practice, this means fewer context-truncation errors in loops that read large files and longer uninterrupted code generation passes before an output is cut off. For most repo-scale tasks, the 1M-token window is sufficient to hold an entire moderately sized codebase in context.
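Whether a given repository actually fits in the window is easy to check before committing to a single-pass design. The sketch below is a rough estimate only: the four-characters-per-token ratio is a common rule of thumb, not the Gemini tokenizer, and the function names, file extensions, and prompt reserve are assumptions chosen for illustration.

```python
import os

CONTEXT_WINDOW_TOKENS = 1_048_576  # context window quoted in this article
CHARS_PER_TOKEN = 4                # assumption: rough heuristic, not the real tokenizer


def estimate_repo_tokens(root: str, extensions: tuple = (".py", ".md", ".ts")) -> int:
    """Approximate token count of all matching text files under root."""
    total_chars = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total_chars += len(handle.read())
    return total_chars // CHARS_PER_TOKEN


def fits_in_context(repo_tokens: int, reserve_for_prompt: int = 50_000) -> bool:
    """True if the repo plus a reserve for instructions fits in the window."""
    return repo_tokens + reserve_for_prompt <= CONTEXT_WINDOW_TOKENS
```

For a real decision, replace the heuristic with a token count from the provider's own tooling, since code and non-English text can tokenize very differently from prose.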

Scheduled background tasks are one of the less-discussed capabilities in the initial announcement. Agents can be configured to run on a timer, which is relevant for monitoring, data ingestion, or report generation workflows that currently require external cron infrastructure. Combined with persistent state across calls, this shifts the Managed Agents API closer to a managed job runtime than a stateless inference endpoint.

The desktop application and CLI complement the SDK by providing interactive testing surfaces. Developers can iterate on agent prompts and skill definitions locally before embedding them in production pipelines, a feedback loop that cloud-console-only development environments cannot match. According to Google's I/O developer highlights, the full Antigravity 2.0 surface is designed for teams that want to build and test agentic systems without managing underlying cloud infrastructure.

Pricing Model: What Managed Agents Costs at Production Scale

Gemini 3.5 Flash is priced at $1.50 per million input tokens, $9.00 per million output tokens, and $0.15 per million cached input tokens, a 90% discount on cache hits. Non-global regions (EU, APAC) carry a 10% premium at $1.65/$9.90 input/output, a relevant factor for deployments with data residency requirements. The Managed Agents API cost layers on top: sandbox compute and orchestration overhead are currently undisclosed beyond token costs, which means production cost modeling remains incomplete until Google publishes the full pricing schedule.

The headline tension in the pricing is that Gemini 3.5 Flash is approximately 5.5× more expensive per token than Gemini 3 Flash. Whether that premium is justified depends entirely on task compression: how much does higher per-call capability reduce total call count? Simple code generation tasks may not compress; complex multi-step reasoning tasks likely compress more. The I/O keynote demo implicitly made this argument: 93 agents completing an OS build in 12 hours suggests the per-task cost can be competitive even at higher per-token rates.
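The compression argument has a simple break-even point that is worth making explicit. The following sketch assumes total cost scales linearly with token volume and uses the 5.5× figure quoted above; the function names are illustrative.

```python
PRICE_MULTIPLE = 5.5  # per-token premium over Gemini 3 Flash, as quoted above


def breakeven_token_fraction(price_multiple: float) -> float:
    """Fraction of the old token volume the new model may use at equal total cost."""
    return 1.0 / price_multiple


def relative_cost(price_multiple: float, token_fraction: float) -> float:
    """New total cost divided by old total cost for the same task."""
    return price_multiple * token_fraction


print(f"break-even: {breakeven_token_fraction(PRICE_MULTIPLE):.1%} of prior tokens")
for fraction in (0.5, 0.25, 0.1):
    print(f"using {fraction:.0%} of prior tokens -> {relative_cost(PRICE_MULTIPLE, fraction):.2f}x cost")
```

Under that assumption a task has to finish in under about 18% of the tokens it previously consumed before the newer model is cheaper overall; halving token usage still leaves the task 2.75× more expensive.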

"The higher per-token investment is offset by fewer total calls to complete an agentic task, making total cost competitive at the workload level." — Google's stated pricing rationale for Gemini 3.5 Flash, as reported in Latent Space's Google I/O 2026 analysis.

Thought preservation across multi-turn calls is the mechanism that makes the cache discount practically valuable. Reasoning state persists across API calls, meaning agent loops don't need to re-inject the same prompt preamble and context on every turn. For workloads with a stable system prompt and repeated tool descriptions, cache-hit rates of 70–80% or higher are achievable. At $0.15/1M for cached input versus $1.50/1M uncached, the input cost drops by 10× on the tokens that hit the cache; the blended rate across all input depends on the hit rate. Designing prompts for cache reuse from the start is the single highest-leverage cost optimization available at the current pricing tier.
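The 10× figure applies only to tokens that hit the cache, so the blended input rate depends on the hit rate. A minimal sketch of that calculation, using the global-region prices quoted above and an illustrative function name:

```python
UNCACHED_INPUT_PER_M = 1.50  # USD per 1M uncached input tokens (global region)
CACHED_INPUT_PER_M = 0.15    # USD per 1M cached input tokens


def effective_input_rate(cache_hit_rate: float) -> float:
    """Blended USD per 1M input tokens at the given cache hit rate (0.0 to 1.0)."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0.0 and 1.0")
    cached_part = cache_hit_rate * CACHED_INPUT_PER_M
    uncached_part = (1.0 - cache_hit_rate) * UNCACHED_INPUT_PER_M
    return cached_part + uncached_part


for hit_rate in (0.0, 0.7, 0.8, 1.0):
    rate = effective_input_rate(hit_rate)
    saving = 1.0 - rate / UNCACHED_INPUT_PER_M
    print(f"hit rate {hit_rate:.0%}: ${rate:.3f} per 1M input tokens ({saving:.0%} saving)")
```

At a 70–80% hit rate the blended input rate lands between roughly $0.56 and $0.42 per million tokens, a 63–72% saving rather than a full 90%. Output tokens are unaffected by caching.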

The Gemini 3.5 Flash vs. Gemini 3.1 Pro comparison is the other pricing lever worth tracking. Flash is approximately 40% cheaper than 3.1 Pro on both input and output. If Flash's benchmark improvements over 3.1 Pro translate to your workload (see "Benchmark Signals Most Relevant to Agentic Use Cases" below), you can move down in price and up in raw agentic capability simultaneously. The one exception: Flash trails 3.1 Pro on MRCR v2 128k long-context retrieval. Workloads that depend heavily on document retrieval over very long contexts should validate this tradeoff explicitly before migrating.

Competitive Map: Managed Agents vs. Codex Cloud vs. Anthropic Computer Use

Three providers now offer distinct approaches to managed agent execution: Google's Managed Agents API, OpenAI's Codex Cloud environment, and Anthropic's computer use API. The architectures diverge in what "managed" means and where the infrastructure boundary sits. Google provisions a full Linux runtime with deep GCS and repo integrations; OpenAI provides a sandboxed Python execution environment focused on code; Anthropic provides pixel-level GUI interaction primitives and no hosted runtime at all — developers bring their own infrastructure.

Feature	Google Managed Agents API	OpenAI Codex Cloud	Anthropic Computer Use
Hosted runtime	Linux sandbox (Bash, Python, Node.js)	Python sandbox (code-focused)	None — bring your own infrastructure
Cloud storage integration	Native GCS mount	No	No
Repo mounts	Native	No native support	No
Web browsing	Yes	Limited	Via GUI screenshot
GUI / pixel interaction	No	No	Yes (primary capability)
Custom skills	Markdown-defined	N/A	N/A
Tool use benchmark	MCP Atlas 83.6%	Not publicly disclosed	Not publicly disclosed
Infrastructure portability	Low — tightly coupled to Google Cloud	Moderate	High

Google's differentiator at launch is the depth of single-call provisioning: full Linux sandbox plus MCP tool use at an 83.6% Atlas score plus GCS and repository integration in one package. OpenAI's Codex Cloud is well-suited for Python code generation and execution loops, but lacks native cloud storage integration and the broader OS-level tool surface. Anthropic's computer use API occupies a different category: it's the appropriate choice when agents need to interact with existing desktop GUI software (legacy apps, browsers, IDEs), not when they need a clean code execution environment.

The portability tradeoff deserves direct treatment. GCS mounts and Google Cloud repository integration are powerful when your infrastructure is already on Google Cloud. They become a lock-in surface if it isn't. A workflow built around antigravity-preview-05-2026 with GCS artifact pipelines creates operational dependencies that are non-trivial to migrate. Anthropic's computer use primitives require your own infrastructure (higher implementation cost upfront, but no cloud lock-in), and Codex's Python sandbox can be replicated with standard cloud compute at comparable cost.

For teams already running on Google Cloud — Vertex AI, BigQuery, GCS-backed data pipelines — the Managed Agents API's integration story is compelling on day one. For teams with multi-cloud requirements or a deliberate vendor-independence policy, the tighter Google Cloud coupling is a concrete factor in the architecture decision, not just an abstract concern.

"The new class of agentic tasks requires models that can execute code, browse the web, and call external APIs — not just generate text." — Google Cloud, on the design intent behind the Managed Agents API, per the Google Cloud I/O 2026 blog.

SDK Migration: thinking_budget Is Removed, and Here Is the Replacement

The most immediate breaking change in the new SDK is the removal of thinking_budget. Any existing call that passes thinking_budget as an integer raises a hard error at call time instead of emitting a deprecation warning. The replacement is thinking_level, a string enum accepting minimal, low, medium, or high. The migration is a one-line change at each callsite; locating every callsite in a large codebase is the actual work.

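from google import genai

client = genai.Client()  # reads the API key from the environment

# Before: removed parameter, raises a hard error on the new SDK
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Summarize the failing tests.",  # placeholder prompt
    # Thinking options are nested under thinking_config, as in the google-genai SDK
    config={"thinking_config": {"thinking_budget": 1024}},  # Hard error: parameter removed
)

# After: correct for the new SDK
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Summarize the failing tests.",  # placeholder prompt
    config={"thinking_config": {"thinking_level": "high"}},  # Set explicitly; do not rely on defaults
)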

The subtler issue is the default change. The prior SDK defaulted to high reasoning; the new SDK defaults to medium. Code ported from gemini-3-flash-preview without an explicit thinking_level argument will silently reason less than before. Outputs look plausible, so this isn't an obvious failure mode detectable from log inspection alone. The practical fix: set thinking_level="high" explicitly at every migrated callsite first, then tune down per-agent once you have empirical quality measurements.
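Because the silent default change cannot be seen in logs, one option is to reject any request whose thinking settings are not explicit before it reaches the SDK. The guard below is a hypothetical helper, not part of any Google SDK; it takes whichever dict holds the thinking parameters in your SDK version and checks it against the four enum values listed above.

```python
ALLOWED_LEVELS = ("minimal", "low", "medium", "high")


def check_thinking_settings(settings: dict) -> dict:
    """Validate thinking parameters before a request is sent (hypothetical helper).

    Rejects the removed thinking_budget key and requires an explicit
    thinking_level so the new "medium" default is never applied by accident.
    """
    if "thinking_budget" in settings:
        raise ValueError("thinking_budget was removed; use thinking_level instead")
    level = settings.get("thinking_level")
    if level is None:
        raise ValueError("thinking_level must be set explicitly")
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"thinking_level must be one of {ALLOWED_LEVELS}, got {level!r}")
    return settings


# Usage: wrap the settings at every migrated callsite.
safe_settings = check_thinking_settings({"thinking_level": "high"})
```

Failing fast in your own code turns a silent quality regression into a visible error during testing, which is the cheaper place to find it.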

Multi-turn thought preservation is the positive architectural addition in this SDK version. Reasoning state now persists across API calls in multi-turn conversations, meaning agents don't need to re-inject the same reasoning context on every turn. For agentic loops with 10 or more turns, eliminating redundant reasoning-context re-injection can reduce total input tokens by 20–40% depending on prompt structure — a meaningful reduction when multiplied across high-volume pipelines.

The migration checklist for teams porting existing Gemini code:

Search all callsites for thinking_budget — remove the parameter or replace with the thinking_level string enum
Set thinking_level="high" explicitly at all migrated callsites initially to preserve prior reasoning behavior
Run existing eval suites or spot checks after migration — validate output quality hasn't silently degraded
Update integration tests that mock the config object; tests that hardcode the old parameter schema will keep passing against the mock while the real SDK rejects the same call
Review multi-turn conversation handling to take advantage of thought persistence and reduce context re-injection overhead
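The first checklist item, locating every callsite, can be automated with a short scan. This is a minimal sketch that searches Python files by text match; the function name is illustrative, and a plain regex will also flag comments and strings, which is usually acceptable for a one-off migration audit.

```python
import re
from pathlib import Path

LEGACY_PARAM = re.compile(r"\bthinking_budget\b")


def find_legacy_callsites(root: str) -> list:
    """Return (path, line number, line text) for every mention of thinking_budget."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if LEGACY_PARAM.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for file_path, lineno, line in find_legacy_callsites("."):
        print(f"{file_path}:{lineno}: {line}")
```

Running the scan in CI and failing the build on any hit keeps the removed parameter from creeping back in through copied snippets.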

Benchmark Signals Most Relevant to Agentic Use Cases

Benchmark scores are only useful when they predict real task performance. For agentic workloads, the benchmarks that matter measure code execution, tool use, and context handling — not general knowledge or text fluency. Gemini 3.5 Flash's profile is strongest precisely in the dimensions relevant to the Managed Agents API: shell execution, tool calls, and coding tasks. That alignment is not accidental — Google positioned this generation explicitly around agentic execution capability.

Terminal-Bench 2.1: Gemini 3.5 Flash scores 76.2%, compared to Gemini 3.1 Pro's 70.3%. Terminal-Bench directly measures the model's ability to complete shell-level tasks, which is exactly the capability exercised inside a Bash, Python, and Node.js sandbox. A 5.9-point improvement over the prior flagship on this specific benchmark is the most direct available evidence that Flash performs as well as or better than Pro for agentic shell workflows, despite being the lower-tier model.

MCP Atlas tool use: 83.6%. Treat this as a planning figure for the reliability of external tool calls made within managed agents, not a guarantee. If the score carried over directly to per-call success, roughly 1 in 6 tool invocations (16.4%) would fail or require a retry. In practice, tool call reliability depends heavily on how well the tools are described in the prompt, so a benchmark run with well-described tools is closer to a ceiling than a floor. Build retry logic into any tool call in a production agent loop.
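A retry wrapper and the matching probability estimate can be sketched together. Two assumptions are baked in and should be read as such: the benchmark score is treated as a per-call success rate, and attempts are treated as independent, neither of which the benchmark itself establishes. The helper names are illustrative.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")
ATLAS_SUCCESS_RATE = 0.836  # assumption: benchmark score used as per-call success rate


def success_within(attempts: int, p: float = ATLAS_SUCCESS_RATE) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p) ** attempts


def call_with_retry(tool: Callable[[], T], max_attempts: int = 3,
                    base_delay: float = 0.5, sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `tool`, retrying with exponential backoff when it raises."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return tool()
        except Exception as error:  # narrow this to the tool's real error types
            last_error = error
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"tool failed after {max_attempts} attempts") from last_error


for n in (1, 2, 3):
    print(f"{n} attempt(s): {success_within(n):.1%} chance of success")
```

Under those assumptions, two attempts raise the success probability to about 97.3% and three to about 99.6%. Retries are only safe for idempotent tools; a tool that writes or charges money needs deduplication before it is wrapped this way.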

MRCR v2 128k long-context retrieval: 77.3%. Gemini 3.1 Pro still leads here. If your agent pipeline relies on deep document retrieval within very long contexts, such as scanning full codebases or long legal documents in a single pass, this score indicates measurable quality degradation versus 3.1 Pro. For pipelines that chunk documents or use retrieval augmentation with shorter context windows, the gap is unlikely to matter in practice.

Text and Code Arena: Gemini 3.5 Flash ranks #9 overall with an Elo score of 1,507, up 70 points from Gemini 3 Flash. Arena rankings come from blind human preference comparisons, not Google-run benchmark suites. A 70-point Elo improvement at this score range is meaningful and provides independent cross-model validation of the capability increase that isn't subject to the same reproducibility concerns as proprietary benchmark suites.
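To put the 70-point gap in concrete terms, the standard Elo expected-score formula converts a rating difference into a head-to-head preference rate. This sketch assumes the Arena score behaves like a conventional Elo rating on the 400-point logistic scale; the Arena's exact rating method is not described in this article.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Expected score of the higher-rated model under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400))


print(f"70-point gap: preferred in {expected_win_rate(70):.1%} of comparisons")
```

Under that assumption, a 70-point advantage means the newer model would be preferred in roughly 60% of blind pairwise comparisons against its predecessor: a clear edge, but far from a sweep.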

GDPval-AA agentic Elo: 1,656. This score comes from Google's own agentic evaluation framework; independent third-party reproduction had not been published as of the May 2026 announcement. Treat it as directional signal for now, and weight the Arena and Terminal-Bench scores more heavily until external reproductions appear.

Gemini 3.5 Pro: What to Track Before the June 2026 Release

Gemini 3.5 Pro is in private testing as of May 2026, with general availability targeted for June 2026. No public benchmarks, pricing, or full API specification have been released. The key open questions for the Managed Agents API: whether Pro extends the same API surface, whether it introduces its own pricing tier, and whether the Pro GA coincides with publishing a stable (non-preview) agent identifier. These unknowns should gate production architecture decisions for any team planning to build on the Pro tier.

"Gemini 3.5 Pro is in testing and will be available in June 2026." — Google DeepMind, per the official Gemini 3.5 announcement .

The most important open question is the API contract. Flash ships under antigravity-preview-05-2026. It is unclear whether Pro will share that identifier, get its own, or whether the preview designation will be lifted alongside the Pro GA. If Google uses the Pro launch to also publish a stable Managed Agents API identifier, it could function as the de-facto stabilization event for the entire API surface — making the Pro GA date more significant than the model release itself for teams evaluating production readiness.

Developers on waitlists should use the time before Pro GA to validate their agent architecture against Flash. If workflows perform well against Flash's benchmark profile — strong shell task execution, reliable tool use, acceptable long-context retrieval — the Flash-to-Pro transition may offer incremental improvement without requiring architectural changes. If you're hitting limits on reasoning depth or long-context retrieval, Pro may close gaps that matter to your use case specifically.

The Flash-to-Pro capability gap will also determine whether the Managed Agents API becomes a general platform or remains a Flash-tier product. If Pro significantly outperforms Flash on complex reasoning but ships at a substantially higher price, the platform bifurcates into a high-volume Flash tier and a high-complexity Pro tier — a more nuanced procurement decision than a simple upgrade path.

Build Now vs. Wait: An Adoption Decision Framework

The adoption decision reduces to two questions: what is your use case's tolerance for API instability, and what is the cost of re-architecting if the contract changes before stable release? For internal tooling, developer automation, and non-production pipelines, the answer points clearly toward building now — the preview is stable enough to iterate on architecture. For customer-facing agents, compliance workloads, or anything requiring a versioned API contract, the antigravity-preview-05-2026 identifier is an explicit signal to wait.

Build now if your workload fits:

Internal developer tooling, CI/CD automation, or code review pipelines — workloads that can absorb a breaking change without customer impact
Architecture validation where you need empirical data from a real API to compare against OpenAI and Anthropic alternatives
Workflows where GCS or repo mounts directly solve an existing integration problem you're currently handling with custom sync code
Prompt design work where you want to architect for cache reuse from day one — the $0.15/1M cached rate rewards repeated-prompt patterns substantially over uncached tokens

Wait if your workload requires:

Customer-facing agents where a breaking API change requires coordinated incident response and user-visible downtime
Compliance workloads requiring a stable, versioned API contract (SOC 2, HIPAA, GDPR-regulated processing)
Cost modeling that depends on knowing Pro's pricing and capabilities before committing to infrastructure investment
Long-context document retrieval at scale where Gemini 3.1 Pro's MRCR v2 score remains the relevant performance benchmark

The key signal to track is not the Pro GA date itself, but whether that launch includes a stable Managed Agents API identifier. A stable identifier published alongside Pro GA would be the most actionable green light for production deployments. A Pro launch that maintains the preview designation would extend the wait window regardless of model capability improvements.

One design decision worth making now regardless of your deployment timeline: structure agent prompts for cache reuse. The $0.15/1M cached input rate versus $1.50/1M uncached is a 10× cost difference on those tokens. System prompts, tool descriptions, and any static context that appears in every call are prime candidates for caching. Investing in cache-friendly prompt architecture during the preview period means you capture that discount from the first production request, not after a refactor forced by a large bill.
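One way to structure prompts for cache reuse is to keep every static element in a byte-identical prefix and append per-request content last. The sketch below assumes caching keys on a stable leading portion of the input, which is how prompt caching commonly works but is not confirmed here for this API; the function names and prompt layout are illustrative.

```python
import hashlib
import json


def build_prompt(system_prompt: str, tools: list, turn_input: str) -> tuple:
    """Return (static_prefix, full_prompt) with all stable content placed first."""
    # Sort tools and keys so the prefix is identical regardless of input order.
    ordered_tools = sorted(tools, key=lambda tool: tool["name"])
    tool_block = json.dumps(ordered_tools, sort_keys=True)
    static_prefix = f"{system_prompt}\n\nTOOLS:\n{tool_block}\n"
    full_prompt = f"{static_prefix}\nREQUEST:\n{turn_input}\n"
    return static_prefix, full_prompt


def prefix_fingerprint(static_prefix: str) -> str:
    """Short hash of the static prefix, useful for spotting accidental drift."""
    return hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()[:16]
```

Logging the fingerprint with each request makes cache-breaking changes visible: if it varies between calls that should share a prefix, something dynamic, such as a timestamp or an unordered tool list, has leaked into the static section.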

Frequently Asked Questions

What does the Google Managed Agents API provision out of the box?

A single API call provisions a hosted Linux sandbox with Bash, Python, and Node.js runtimes; persistent state across calls; file handling; web browsing; optional Google Cloud Storage mounts; optional repository mounts; and markdown-defined custom skills for domain-specific behavior. All of this runs on Gemini 3.5 Flash under the identifier antigravity-preview-05-2026, which is a preview-channel designation indicating that breaking changes should be expected before a stable release is published.

What breaks when migrating existing Gemini code to the new SDK?

The thinking_budget integer parameter raises a hard error in the new SDK — it is not deprecated with a warning, it throws immediately at call time. Replace it with thinking_level, a string enum accepting minimal, low, medium, or high. The second issue is the default change: the prior SDK defaulted to high reasoning; the new SDK defaults to medium. Code ported without an explicit thinking_level argument will silently reason less than before. Set thinking_level="high" explicitly at all migrated callsites initially, then tune down based on empirical output quality measurements.

How does Google's Managed Agents API compare to OpenAI Codex Cloud?

Both provision a hosted sandbox for agentic code execution. Google adds native GCS and repo mounts, web browsing, and markdown-defined custom skills on top of the Linux runtime. OpenAI Codex Cloud is more narrowly focused on Python and code-execution tasks, without native cloud storage integration or a full OS-level tool surface. Anthropic has no managed runtime equivalent: its computer use API provides pixel-level GUI interaction primitives for interacting with existing desktop software, but developers must supply and manage their own execution infrastructure.

Is the Managed Agents API ready for production workloads?

Not for customer-facing or compliance-sensitive workloads. The agent identifier antigravity-preview-05-2026 is a preview-channel designation — parameter names, response shapes, and runtime behavior may change before a stable release. For internal tooling, developer automation, and non-production pipelines, the API is stable enough to build on and iterate architecture. The clearest production-readiness signal to watch: whether Google publishes a stable, non-preview agent identifier alongside the Gemini 3.5 Pro GA, currently targeted for June 2026.

Why is Gemini 3.5 Flash 5.5× more expensive per token than Gemini 3 Flash?

Google's stated argument is that higher capability per call means fewer total calls to complete an agentic task, making total cost competitive at the workload level even at a higher per-token rate. Thought persistence across multi-turn calls reduces context re-injection overhead, further compressing total token spend for agentic loops. The $0.15/1M cached input rate — approximately 90% cheaper than the $1.50/1M uncached rate — provides additional savings for workloads with stable, repeated system prompts and tool descriptions. Whether the per-call compression holds in practice depends on task complexity; simple tasks compress less than complex multi-step reasoning workflows.

What Comes Next for the Agentic Stack

The Managed Agents API's preview status is a temporary state, not a permanent limitation. The two events most likely to shift the adoption calculus are the Gemini 3.5 Pro GA in June 2026 and the publication of a stable agent identifier. Until those land, the most productive use of the preview period is validating whether the GCS and repo integrations and parallel subagent architecture actually solve problems in your specific workload — documentation-level evaluation cannot substitute for running real tasks against a real API.

The SDK migration from thinking_budget to thinking_level should be treated as urgent, not discretionary. Any pipeline running against gemini-3-flash-preview that hasn't been updated carries a latent breakage: the next SDK update will fail those callsites without a deprecation window. Migrating now, with an explicit thinking_level="high" override to preserve prior reasoning behavior, is considerably lower-risk than discovering the break under production load or a scheduled pipeline run.

The broader picture — where Antigravity 2.0, the Managed Agents API, and Gemini 3.5 Pro will intersect once Pro is generally available — is the most important open question in the Google AI developer ecosystem heading into the second half of 2026. Flash's benchmark improvements over 3.1 Pro on Terminal-Bench and MCP Atlas are meaningful signals, but the Pro release will determine whether this platform scales across a serious range of task complexity or remains optimized primarily for high-throughput Flash-tier use cases. The benchmark profile and pricing Google publishes for Pro will be the clearest indicator of which direction this goes.

Last updated: 2026-05-28. Based on Google I/O 2026 announcements from May 19, 2026 and subsequent developer documentation published by Google DeepMind and Google Cloud. Pricing, API specifications, and availability details are subject to change; verify against current Google AI Studio and Vertex AI documentation before production use.