Gemini 3.5 Flash: Benchmarks, Pricing, and API Changes for 2026

Gemini 3.5 Flash is GA: 1M-token context, a breaking thinking_level change, and full pricing breakdown.

Creeta

May 28, 2026

Gemini 3.5 Flash: Benchmarks, Pricing, and API Changes for 2026

What Google Shipped at I/O 2026: The Gemini 3.5 Family

Gemini 3.5 Flash is Google DeepMind's new production model for agentic workloads, available under the stable model ID gemini-3.5-flash since May 19, 2026 . The launch, part of Google I/O 2026, introduced a two-model family: Flash is GA now, while Gemini 3.5 Pro is in private testing with a June 2026 GA target via the Gemini API . Google's stated positioning for the series — "Frontier Intelligence with Action" — marks a deliberate architectural pivot away from benchmark leaderboard optimization toward multi-step agentic execution that reduces total API call count per completed task. Alongside the Gemini 3.5 models, Google shipped Gemini Omni (multimodal video editing) for paid Gemini app users on the same day .

Quick Answer: Gemini 3.5 Flash (gemini-3.5-flash) went GA on May 19, 2026, priced at $1.50/1M input tokens and $9.00/1M output tokens — roughly 40% cheaper than Gemini 3.1 Pro but approximately 5.5× more expensive than Gemini 3 Flash. The thinking_budget parameter is deprecated and throws a runtime error; all callers must migrate to thinking_level before switching to this model ID.

The release was authored by Koray Kavukcuoglu, VP of Research at Google DeepMind, whose framing for the generation was direct:

"Frontier Intelligence with Action." — Koray Kavukcuoglu, VP of Research at Google DeepMind, introducing the Gemini 3.5 series at Google I/O 2026 (source: Google DeepMind Blog)

The higher per-token cost of 3.5 Flash relative to Gemini 3 Flash follows directly from this positioning. Google's rationale: if the model completes a task in two API calls instead of eight, the per-token premium can still reduce total spend for agentic pipelines where task completion — not token count — is the unit of value. Whether that math holds for your specific workload is empirical, not guaranteed, and explored in the pricing section below.

The broader I/O 2026 developer surface also includes the Managed Agents API (hosted execution environments built on the Antigravity 2.0 runtime), Gemini Spark (a 24/7 personal background agent available exclusively for Google AI Ultra subscribers in the US at $100/month with no developer API access announced), and a revamped Antigravity 2.0 platform shipping as desktop app, CLI, and developer SDK. According to the Google I/O 2026 developer highlights, gemini-3.5-flash is also now the default model in the Gemini consumer app and Google Search AI Mode worldwide.

Context Window, Throughput, and Runtime Architecture

Gemini 3.5 Flash supports a 1,048,576-token (~1M) context window and up to 65,536 max output tokens, with full multimodal input support across text, image, video, and audio . Public API throughput sits at approximately 280 tokens per second, which Google claims is 4× faster than comparable frontier models at this capability tier . Inside the Antigravity 2.0 runtime, throughput climbs to 12× the public API rate — a meaningful gap if you run workloads on Google's own orchestration layer rather than calling the inference API directly .

The 1M context window is large enough to hold multi-session conversation histories, full repository codebases, or concatenated document corpora without chunking. In practice, the primary constraint at scale is cost, not capacity: feeding a million tokens per call in a high-volume pipeline accumulates at $1.50/1M input. The prompt caching tier at $0.15/1M cached input (a ~90% discount on cache hits) significantly changes the economics when large shared prefixes repeat across calls — explored in the pricing section.

The most concrete throughput data point from the launch is the I/O keynote demo: 93 parallel sub-agents built a functional operating system in approximately 12 hours, consuming over 15,000 API requests and 2.6 billion tokens at under $1,000 in total API credits . The implied cache hit rate is substantial: 2.6B tokens at the standard $1.50/1M input rate would approach $3,900 before discounts, so the under-$1,000 bill requires heavy cache utilization. Treat this demo as a cost-at-scale calibration point, not a directly reproducible benchmark.

Antigravity 2.0 sits architecturally between raw API access and a fully managed execution environment. It underlies the Managed Agents API, runs Gemini 3.5 Flash at the 12× speed advantage, and adds first-class parallel subagent orchestration and scheduled background task support. These are platform primitives, not user-built wrappers around the inference API — which matters if you are weighing build-versus-buy for multi-agent orchestration infrastructure. Details on the full Antigravity 2.0 platform surface are covered in the Managed Agents section.

Benchmark Results: Where 3.5 Flash Leads and Where It Falls Short

Gemini 3.5 Flash clears Gemini 3.1 Pro — the prior flagship — on most agentic and coding benchmarks, while carrying one disclosed weakness in long-context retrieval. Terminal-Bench 2.1 comes in at 76.2% versus 3.1 Pro's 70.3%; GDPval-AA agentic Elo reaches 1,656; MCP Atlas tool use scores 83.6%; and MMMU-Pro and CharXiv Reasoning land at 83.6–84.2% . The one disclosed weakness: MRCR v2 128k (long-context retrieval) at 77.3%, where 3.1 Pro still leads without a disclosed score . The practical takeaway: 3.5 Flash is a strong default for agentic tool-use pipelines; long-context RAG workloads warrant validation before migrating off 3.1 Pro.

Benchmark	Gemini 3.5 Flash	Gemini 3.1 Pro	Notes
Terminal-Bench 2.1	76.2%	70.3%	3.5 Flash leads by +5.9 pp
GDPval-AA (Agentic Elo)	1,656	—	Google-reported; no independent verification as of launch
MCP Atlas (Tool Use)	83.6%	—	Google-reported; no independent verification as of launch
MMMU-Pro	83.6%	—	Multimodal understanding
CharXiv Reasoning	84.2%	—	Chart comprehension and reasoning
Text + Code Arena	#9 / 1,507	—	+70 pts vs. Gemini 3 Flash; independent crowd ranking
MRCR v2 128k	77.3%	Leads (score undisclosed)	Long-context retrieval — 3.1 Pro still ahead

Before weighting these numbers, two caveats are worth flagging. GDPval-AA and MCP Atlas are Google-designed, Google-reported benchmarks; as of the May 19 announcement, no independent third-party reproductions had been published, according to Latent Space. The Text and Code Arena rank (#9, score 1,507) is the most neutral cross-model data point available — it is based on crowd-evaluated pairwise comparisons using real tasks, not a controlled benchmark designed by any of the model vendors. The +70 point improvement over Gemini 3 Flash (implied baseline score: ~1,437) offers a rough generational calibration.

The MRCR v2 128k gap has direct production implications for retrieval-augmented generation workloads that pass large document corpora in context and expect precise span-level retrieval. If your pipeline concatenates multiple reference documents into a single long prompt and queries across them, stay on 3.1 Pro until 3.5 Pro benchmark data is public. For tool-call-heavy workflows — function calling, MCP tool use, agent loops with multiple planning steps — the Terminal-Bench and MCP Atlas results make 3.5 Flash the more capable option at lower cost than 3.1 Pro. The two workload types warrant separate migration timelines.

Pricing: Cost Trade-offs vs. Gemini 3 Flash and 3.1 Pro

Gemini 3.5 Flash is priced at $1.50 per 1M input tokens, $9.00 per 1M output tokens, and $0.15 per 1M cached input tokens for global region endpoints — a ~90% discount on cache hits . Non-global region access runs slightly higher at $1.65/1M input and $9.90/1M output . Compared to the prior flagship, 3.5 Flash is approximately 40% cheaper than Gemini 3.1 Pro on both dimensions while posting stronger agentic benchmark scores — the price-to-capability case versus 3.1 Pro is the cleaner comparison .

Model	Input ($/1M tokens)	Output ($/1M tokens)	Cached Input ($/1M)	vs. 3.5 Flash
Gemini 3.5 Flash (global)	$1.50	$9.00	$0.15	—
Gemini 3.5 Flash (non-global)	$1.65	$9.90	n/a	+10%
Gemini 3.1 Pro (est.)	~$2.50	~$15.00	n/a	+67% (3.5 Flash ~40% cheaper)
Gemini 3 Flash (est.)	~$0.27	~$1.64	n/a	−82% (3.5 Flash ~5.5× pricier)

Gemini 3.1 Pro and 3 Flash figures are estimated from stated multipliers: 3.5 Flash is ~40% cheaper than 3.1 Pro and ~5.5× more expensive than 3 Flash . Verify against the Google Cloud blog and the official pricing page before building cost models.

The comparison versus Gemini 3 Flash demands more scrutiny. At ~5.5× higher per-token cost, cost neutrality only holds if your agentic tasks complete in significantly fewer total API calls. Google's argument is that a model with stronger planning and tool-use capability requires fewer retry calls, plan-correction loops, and validation steps — reducing total token consumption per completed task. That logic is plausible; whether it materializes in your specific pipeline is something to measure, not assume. If your current Gemini 3 Flash pipeline already has high first-attempt task success rates and low call counts, the 5.5× premium is difficult to justify on cost grounds alone.

The prompt caching tier at $0.15/1M — the ~90% discount — is the highest-leverage cost lever for high-volume deployments. If your system prompt, reference context, and persona definitions total 50,000 tokens and repeat across 10,000 daily calls, that prefix costs $0.75/day at the cached rate versus $7.50/day at the standard input rate. At 100,000 calls per day the difference is $675/day. The prerequisite: your prompt prefix must remain stable across requests to maintain cache key consistency. Frequent modifications to the shared prefix invalidate the cache and eliminate the discount.

Breaking API Change: Migrating from `thinking_budget` to `thinking_level`

The thinking_budget parameter — an integer controlling token allocation for reasoning traces — is deprecated and throws a runtime error in the current SDK when the target model is gemini-3.5-flash . Any code that passes thinking_budget fails immediately on model ID switch. There is no grace period, no silent fallback, and no deprecation warning phase — it raises at call time. Fix this before you change the model ID in your configuration, not after.

The replacement is thinking_level, a string enum accepting four values: minimal, low, medium, or high . The more impactful change is not the parameter rename — it is the default shift from high to medium. Code ported from gemini-3-flash-preview that omits an explicit thinking_level value will silently produce less thorough reasoning per call without any error, warning, or behavioral signal that something changed. This regression is easy to miss in testing if your eval suite focuses on output format correctness rather than reasoning depth.

# Before — throws RuntimeError when model="gemini-3.5-flash"
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=prompt,
    config={"thinking_budget": 8192}  # Deprecated — raises immediately
)

# After — set explicitly to match prior default behavior
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=prompt,
    config={"thinking_level": "high"}  # Restores former default
)

# Accept the new default only if you've validated quality at medium
response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents=prompt,
    config={"thinking_level": "medium"}  # New default — explicit is safer
)

How to calibrate thinking_level values: high allocates more tokens to internal reasoning traces and increases latency proportionally; minimal executes fastest with the least deliberation. For agentic tool-use tasks — multi-step planning, function calling sequences, complex code generation — high produces demonstrably better results because the additional reasoning steps catch planning errors before they propagate through a tool-call chain. For straightforward classification, extraction, summarization, or short-form generation tasks, medium or low typically produces equivalent outputs at lower cost and latency. The right setting is task-specific; run both levels against your eval set and measure the quality delta before accepting a default.

The second new capability — thought preservation across multi-turn conversations — is additive rather than breaking. When enabled, the model's reasoning state persists between API calls without requiring you to re-prompt the reasoning chain at the start of each turn . In multi-turn agent loops, re-establishing context across turns has been a consistent latency and token overhead. Thought preservation eliminates that overhead by allowing the model to continue from its prior reasoning state rather than reconstructing it from conversation history.

Migration checklist — in order — before switching your model ID:

Search your codebase for thinking_budget; replace every instance with thinking_level="high" to match prior behavior
Add explicit thinking_level values to all call sites that currently omit it; do not rely on the default at any call site
Run your eval suite with thinking_level="medium" to quantify the quality delta versus "high" before accepting the new default for any task type
Validate thought preservation behavior in your multi-turn flows and ensure your orchestration layer handles session state correctly
Change the model ID last, after the above steps pass

Managed Agents API and the Antigravity 2.0 Platform

The Managed Agents API (agent identifier: antigravity-preview-05-2026) provisions a complete hosted execution environment from a single API call . Unlike calling the Gemini inference API directly — which returns model outputs and nothing else — the Managed Agents API gives your agent a persistent Linux sandbox running Bash, Python, and Node.js, with file handling, web browsing, and the ability to mount Google Cloud Storage buckets or code repositories directly into the agent environment. State persists across calls within a session, so the sandbox does not need to be re-provisioned or re-initialized on each turn.

Additional built-in capabilities available inside a managed agent session, per the Managed Agents API announcement:

Markdown-defined custom skills — extend agent capabilities by providing skill definitions in markdown; no custom orchestration code required
GCS and repository mounts — attach storage or source trees directly; the agent reads, writes, and executes against them natively
Web browsing — fetch and parse external URLs without a separate browser automation layer
Parallel subagent orchestration — dispatch multiple agents concurrently and aggregate results; this is a platform primitive, not a pattern you build on top of the raw API
Scheduled background tasks — register agents to execute on a schedule without maintaining your own cron infrastructure

This positions Google directly against OpenAI's Codex execution environment and Anthropic's computer use primitives — both of which provision managed compute for agent tasks. Google's architectural distinction: the Antigravity 2.0 runtime underlies the Managed Agents API, so the 12× throughput advantage over the public API rate applies inside managed execution contexts. Antigravity 2.0 ships as a desktop application, CLI, and developer SDK, making it accessible as both an end-user product and a programmatic integration target.

Two practical limitations to account for before adopting this in production. First, sandbox compute pricing beyond Flash token costs has not been disclosed as of launch — the full cost model for managed execution is unknown. Second, the antigravity-preview-05-2026 identifier signals a preview product; API surface changes between preview and GA are likely. Evaluate capabilities now, but treat the Managed Agents API as a capability assessment window rather than a stable production dependency until the stable identifier and full pricing are published by Google.

Gemini 3.5 Pro Timeline and Developer Migration Path

Gemini 3.5 Pro entered private testing on May 19, 2026, with a June 2026 GA target via the Gemini API . Benchmarks and pricing have not been announced. Based on the disclosed 3.5 Flash weakness — MRCR v2 128k at 77.3%, where 3.1 Pro still leads — the expected role for 3.5 Pro is closing that long-context retrieval gap and leading on reasoning-intensive workloads where Flash currently trails . Pricing will almost certainly sit above 3.1 Pro given the "full-capability flagship" positioning — expect it to exceed 3.1 Pro rates when the official API pricing page updates.

The practical implication for teams building now: route model access through an abstraction layer rather than hard-coding "gemini-3.5-flash" at every call site. A configuration variable or model resolver means the transition to 3.5 Pro for high-complexity tasks becomes a config change, not a codebase refactor. Apply the same principle to thinking_level defaults — centralize them so that any default behavior changes at 3.5 Pro GA (expected to follow the same medium-default pattern as Flash) do not silently degrade task quality across your pipeline without a clear audit trail.

What to monitor through June 2026:

The official Google Cloud blog for the 3.5 Pro GA date and pricing announcement
MRCR v2 128k benchmark numbers for 3.5 Pro — the primary technical signal for long-context RAG migration decisions
Independent third-party reproductions of GDPval-AA (Elo 1,656) and MCP Atlas (83.6%) figures for 3.5 Flash — validate before committing agentic pipeline architecture to those numbers
Managed Agents API stable identifier and full compute pricing disclosure

Decision rule for the current window: if you are running Gemini 3.1 Pro primarily for long-context retrieval workloads, hold migration until 3.5 Pro MRCR v2 numbers are public. If you are running 3.1 Pro for agentic tool use, coding, or multi-step planning tasks, 3.5 Flash already offers better benchmark performance at ~40% lower cost — migrate now, with the thinking API changes applied first.

Frequently Asked Questions

Is Gemini 3.5 Flash backward compatible with gemini-3-flash-preview?

Not fully. The model ID has changed to gemini-3.5-flash, and the thinking API is a hard breaking change: the old thinking_budget integer parameter throws a runtime error on the new SDK and must be replaced with thinking_level, a string enum accepting minimal, low, medium, or high . The default reasoning level shifted from high to medium, so code ported from gemini-3-flash-preview that omits an explicit thinking_level will reason less per call without any error, warning, or visible behavioral signal. Set thinking_level="high" explicitly at every call site to preserve prior behavior, then evaluate whether the medium default is acceptable for each task type.

What does thinking_level='medium' vs 'high' actually change in practice?

thinking_level controls how many tokens the model allocates to internal reasoning traces before generating output. Higher levels consume more tokens and add latency; lower levels execute faster with less deliberation. medium is the new SDK default — adequate for straightforward generation tasks (classification, extraction, summarization, short-form Q&A) but likely insufficient for complex multi-step planning or agentic tool-use chains where deeper reasoning traces catch planning errors before they propagate through a call sequence. high restores the prior default behavior and is the recommended starting point for agentic pipelines, complex code generation, and multi-hop reasoning tasks. Calibrate by running both levels against your task distribution in eval: if medium meets your quality bar, you gain reduced latency and lower token cost without a meaningful output difference.

How does Gemini 3.5 Flash compare to Claude Sonnet or GPT-4o on agentic benchmarks?

Direct cross-vendor comparison is complicated by benchmark design. Google reports GDPval-AA (Elo 1,656) and MCP Atlas (83.6%) for Gemini 3.5 Flash, but as of the May 19, 2026 announcement, no independent third-party reproductions of those figures had been published . The most neutral cross-model signal is the Text and Code Arena ranking: Gemini 3.5 Flash sits at #9 with a score of 1,507, based on crowd-evaluated pairwise comparisons using real tasks — not a vendor-designed benchmark . For a production comparison against Claude Sonnet or GPT-4o, run both models on identical tasks in your actual environment; vendor-reported figures from different benchmark designs are not directly comparable across vendors.

What is the Managed Agents API and how does it differ from calling the Gemini API directly?

The Managed Agents API provisions a full hosted execution environment — a persistent Linux sandbox with Bash, Python, and Node.js runtimes, plus file handling, web browsing, GCS and repository mounts, and parallel subagent orchestration — from a single API call under the agent identifier antigravity-preview-05-2026 . Calling the Gemini inference API directly returns model outputs only — the compute environment, state management, and execution infrastructure are your responsibility to build and maintain. The Managed Agents API is closer to an operator-managed agent platform than a raw model API; it competes with OpenAI's Codex execution environment in concept. It is currently in preview; compute pricing beyond Flash token costs has not been disclosed.

When will Gemini 3.5 Pro be available and what will it cost?

Gemini 3.5 Pro entered private testing on May 19, 2026, with a June 2026 GA target via the Gemini API . Pricing has not been announced. Given its "full-capability flagship" positioning above both 3.5 Flash and 3.1 Pro, expect pricing to exceed 3.1 Pro rates when the official announcement lands. The benchmark to watch on Pro release is MRCR v2 128k long-context retrieval — this is the disclosed gap in 3.5 Flash (77.3%), and whether 3.5 Pro closes it determines whether long-context RAG workloads should migrate off 3.1 Pro. Monitor the Google DeepMind blog for the GA date, pricing, and benchmark disclosure.

What to Build On, What to Wait For

Gemini 3.5 Flash gives developers a cleaner upgrade path than most generational transitions: stronger agentic benchmark performance, ~40% lower cost than the prior flagship, and GA availability from day one. The thinking API migration is the primary operational risk — a silent behavioral regression if you port code without explicitly setting thinking_level="high". The correct sequence is to fix the thinking_budget replacement and add explicit thinking_level values to all call sites first, run your evals, then change the model ID. Reversing that order means you are debugging a parameter deprecation error and a reasoning depth regression simultaneously.

The Managed Agents API and Antigravity 2.0 platform are worth a hands-on evaluation if you currently maintain custom orchestration infrastructure for multi-agent pipelines. First-class parallel subagent support and scheduled background task execution, running at 12× the public API throughput rate, could meaningfully reduce the infrastructure surface you own and operate. The preview designation and undisclosed compute pricing argue against committing production workloads to it yet — build against it in a staging environment and watch for the stable release and full cost disclosure.

For teams currently on Gemini 3.1 Pro: split your migration decision by workload type. Agentic tool use and coding tasks should move to 3.5 Flash now — the benchmarks are better, the cost is lower, and the GA status is confirmed. Long-context RAG workloads should hold until June 2026 when 3.5 Pro MRCR v2 data is available to make an informed decision. Structuring your code around model-ID abstraction layers now means that decision is a configuration update, not a refactor.

Last updated: 2026-05-28. Article reflects the Gemini 3.5 Flash GA release as announced at Google I/O 2026 on May 19, 2026. Gemini 3.5 Pro benchmark data, pricing, and GA date are pending official announcement, expected June 2026.