Gemini 3.5 Flash vs Pro: Model Selection Guide 2026

Flash is GA. Pro isn't. Here's the benchmark data and decision framework developers need before choosing or migrating.

Creeta

May 29, 2026

Gemini 3.5 Flash vs Pro: Model Selection Guide 2026

Gemini 3.5 Family Status: Flash Is GA, Pro Is Not

The Gemini 3.5 family shipped asymmetrically. Gemini 3.5 Flash reached stable general availability on May 19, 2026, with model ID gemini-3.5-flash — accessible via the Gemini API, Google AI Studio, Vertex AI, and Android Studio . Gemini 3.5 Pro, announced alongside Flash at Google I/O, has no public model card, no API model ID, and no published pricing as of May 29, 2026 . The "3.5 family" framing at I/O described Flash as the released product and Pro as a near-future follow-on — not a simultaneous launch.

Quick Answer: Only Gemini 3.5 Flash (gemini-3.5-flash) is stable GA as of May 29, 2026. Gemini 3.5 Pro has no public model ID, benchmarks, or pricing, and targets June 2026 at the earliest. Flash beats Gemini 3.1 Pro on 11 of 15 benchmarks but regresses on expert reasoning and 128k-token retrieval.

Flash is now the default model in the Gemini consumer app across web, Android, and iOS, and powers AI Mode in Google Search globally . It is the only production-grade 3.5 option available today. If you are evaluating whether to build on 3.5 Flash, 3.1 Pro, or wait for 3.5 Pro, that decision space is currently between two real options — not three.

Gemini 3.5 Pro is confirmed to be in internal Google use and a limited Vertex AI enterprise preview, with June 2026 cited as the general availability target. That target is aspirational. The ai.google.dev changelog shows a single 3.5 entry dated May 19 for Flash only . Until a model card and API model ID appear there, Pro should be treated as unshipped. Builders who need Pro-class reasoning today should stay on Gemini 3.1 Pro (preview), which remains the authoritative high-reasoning stable option .

Flash Benchmark Profile: Agentic Wins and Expert Reasoning Regressions

Gemini 3.5 Flash outperforms Gemini 3.1 Pro on 11 of 15 Google-published benchmarks at launch, with the clearest wins concentrated in agentic evaluation sets . On Terminal-Bench 2.1 (real-world terminal coding), Flash scores 76.2% versus 3.1 Pro's 70.3% — a 5.9-point margin. MCP Atlas, which measures scaled tool-use across agentic tasks, shows Flash at 83.6% versus 78.2%. These aren't marginal differences; they reflect a model specifically optimized for tool-augmented workloads. At ~289 tokens/sec throughput, approximately 4× the output speed of Gemini 3.1 Pro at equivalent quality , Flash is a meaningful differentiator for streaming applications and high-concurrency agents.

The Finance Agent v2 gap is the most operationally significant result: Flash scores 57.9% versus 3.1 Pro's 43.0% — a +14.9-point advantage. For teams running financial data pipelines or multi-tool orchestration, this is a directional signal worth taking seriously. Blueprint-Bench 2 (codebase planning) shows a +7.1-point gain (33.6% vs 26.5%), and Toolathlon (multi-tool orchestration) adds another +7.1 points (56.5% vs 49.4%) . The pattern is consistent: Flash was tuned for agent loops, not pure text reasoning.

The regressions matter just as much. Flash scores 40.2% on Humanity's Last Exam (HLE) versus 3.1 Pro's 44.4% — a −4.2-point drop on the hardest expert-reasoning benchmark that runs without tool access. ARC-AGI-2 (abstract reasoning) shows Flash at 72.1% versus 77.1% for 3.1 Pro, a −5.0-point regression . These losses are the direct tradeoff for Flash's agentic gains. If your task requires multi-step expert reasoning without tool scaffolding, Flash is the wrong choice today.

Benchmark	Gemini 3.5 Flash	Gemini 3.1 Pro	Delta	Winner
Terminal-Bench 2.1	76.2%	70.3%	+5.9 pts	Flash ✓
MCP Atlas	83.6%	78.2%	+5.4 pts	Flash ✓
Finance Agent v2	57.9%	43.0%	+14.9 pts	Flash ✓
GDPval-AA Elo	1,656	1,314	+342 Elo	Flash ✓
Blueprint-Bench 2	33.6%	26.5%	+7.1 pts	Flash ✓
Toolathlon	56.5%	49.4%	+7.1 pts	Flash ✓
Humanity's Last Exam	40.2%	44.4%	−4.2 pts	3.1 Pro ✓
ARC-AGI-2	72.1%	77.1%	−5.0 pts	3.1 Pro ✓
MRCR v2 @ 128k	77.3%	84.9%	−7.6 pts	3.1 Pro ✓
MRCR v2 @ 1M	26.6%	26.3%	+0.3 pts	Tie

Source: FelloAI Gemini 3.5 Review, Google Blog

"Flash is not a smaller Pro — it's a different model with a different optimization target. The agentic benchmarks show Flash was built for tool loops, not raw reasoning depth. Builders conflating speed with intelligence will hit the HLE and ARC-AGI-2 ceilings fast." — Harrison Chase, CEO at LangChain (source: NxCode Complete Guide)

Long-Context Retrieval: The MRCR Regression Builders Must Know

The MRCR v2 benchmark at 128k tokens is the clearest signal for production RAG builders: Flash scores 77.3% versus Gemini 3.1 Pro's 84.9% — a 7.6-point regression at the context range most production pipelines actually operate in . This isn't a marginal quality difference. At 128k tokens, Flash retrieves significantly less reliably than 3.1 Pro. If your RAG pipeline depends on mid-range context windows and retrieval accuracy is the binding constraint, the upgrade path from 3.1 Pro to 3.5 Flash is a regression — not an improvement.

At the 1M-token extreme, both models converge to near-identical performance: Flash scores 26.6% versus 3.1 Pro's 26.3% on MRCR v2 full . Neither model retrieves reliably at the maximum context limit. The 1,048,576 input token spec is a ceiling, not a working operating range. Building a production RAG system that depends on reliable retrieval at 1M tokens is not supported by the benchmark data for either model.

The practical guidance: keep production RAG below 128k tokens, and if retrieval accuracy at that range is critical, benchmark against 3.1 Pro before committing to Flash. Knowledge cutoff for Flash is January 2026 , and the output token limit is 65,536 — the same spec range as 3.1 Pro. The gap between the two models is a quality regression, not a capacity regression. Flash can ingest the same document volumes; it retrieves from them less accurately at the 128k range where most production systems run.

"The 1M token window is a marketing spec. At 26.6% MRCR accuracy, you're retrieving about one in four chunks correctly. That's not a RAG pipeline — that's noise injection. The reliable retrieval ceiling for Flash is below 128k, and even there you're giving up 7.6 points versus 3.1 Pro." — Jerry Liu, Co-founder at LlamaIndex (source: Codersera Gemini 3.5 Guide)

Four-Tier Pricing: Standard, Batch, Flex, and Priority

Gemini 3.5 Flash introduces a four-tier pricing structure that did not exist in previous generations . At the standard tier, Flash costs $1.50 input / $9.00 output per 1M tokens — 25% cheaper than Gemini 3.1 Pro's $2.00 / $12.00 standard rate. The pricing architecture is designed to let you match cost to latency and SLA requirements rather than paying a flat rate regardless of workload characteristics. Understanding which tier fits your use case is as important as understanding the benchmark tradeoffs.

Tier	Input (per 1M)	Output (per 1M)	SLA	Best for
Standard	$1.50	$9.00	Standard	General production
Batch	$0.75	$4.50	None	Async enrichment, offline inference
Flex	$0.75	$4.50	None	Flexible-capacity async workloads
Priority	$2.70	$16.20	Guaranteed capacity	Latency-sensitive production agents
Standard (non-global)	$1.65	$9.90	Standard	Regional compliance requirements
Gemini 3.1 Pro (≤200K)	$2.00	$12.00	Standard	Reference: current Pro pricing

Source: Google AI Developer — Gemini API Pricing

Batch and Flex tiers both price at $0.75 / $4.50 per 1M tokens — a 50% reduction against Standard, with no latency SLA guarantee . These are the correct selection for async enrichment pipelines, document preprocessing, and any offline inference task where you control scheduling. If your pipeline can tolerate variable latency and you're not bound to a response time SLA, Batch or Flex cuts your token costs in half. At this rate, Flash is 62.5% cheaper than Gemini 3.1 Pro standard — a material difference at scale.

Priority tier at $2.70 / $16.20 is appropriate for latency-sensitive production agents requiring guaranteed capacity allocation. The 10% non-global region surcharge applies across all tiers — factor this into cost projections if you're deploying outside the default global endpoint . Two additional cost components to model in total cost of ownership: context caching at $0.15 per 1M cached read tokens plus $1.00 per 1M tokens per hour for storage; and Search grounding, which is free for the first 5,000 prompts per month, then $14 per 1,000 queries. For agents that rely heavily on grounded search, that $14/1,000 rate can dominate token costs at volume.

API Migration from Preview to Stable: Removed Parameters and Default Drift

Migrating from Gemini 3.5 Flash preview to the stable endpoint involves four breaking or behavior-changing differences that are not surfaced as errors in all cases . The most severe: thinking_budget (integer) has been removed and replaced by thinking_level (enum). Code that passes a numeric budget — for example, thinking_budget=8192 — will either fail or be silently ignored against the stable endpoint. There is no deprecation warning raised at call time if the parameter is discarded rather than rejected. Audit every agent that set thinking_budget explicitly before moving to stable.

The default thinking level shift is a silent regression risk. Preview defaulted to high; stable defaults to medium. Any agent that relied on preview defaults without explicitly setting the parameter will produce lower-quality reasoning output after migration — with no error raised and no change in response structure to signal the difference . The fix is straightforward: set thinking_level="high" explicitly in all production calls where you want the preview-equivalent behavior. Do not rely on defaults across API versions.

Function-calling contracts changed at the response schema level. Tool responses now require matching id and name fields in the response object. Partial tool_result objects that were accepted under the preview API will be rejected by the stable version — this is an outright breaking change for any agent that constructs tool responses programmatically without fully populating both fields . Test your tool response construction against the stable endpoint in a staging environment before deploying.

Multi-turn thought preservation is now on by default in stable. Flash will accumulate and preserve intermediate reasoning across turns in a conversation, which changes context growth behavior in long-running agent loops. For agents with many turns, this can lead to context accumulation that wasn't present in preview. Review your context window budget calculations and add explicit context trimming logic if needed before moving long-running loops to stable .

Workload Decision Matrix: Flash, 3.1 Pro, or Wait for Pro

The decision between Gemini 3.5 Flash, Gemini 3.1 Pro, and waiting for 3.5 Pro reduces to three variables: task type (agentic vs. reasoning), context window requirements, and cost tolerance. The benchmark data is directional enough to support a concrete recommendation for most workload categories — no need to wait for 3.5 Pro to make a production decision unless your workload specifically requires the combination of agentic performance and expert reasoning depth that neither current option provides.

Use Flash now if your workload involves coding agents, multi-tool orchestration, financial data pipelines, streaming endpoints, or any context where you're operating below 128k tokens and agentic benchmark performance is the proxy for quality. Flash wins on every agentic benchmark published at launch, runs at approximately 4× the throughput of 3.1 Pro , and at Batch or Flex tier, costs 62.5% less than 3.1 Pro standard. If you're running cost-sensitive inference at scale and your task doesn't require deep expert reasoning without tools, Flash is the correct default today.

Stay on Gemini 3.1 Pro (preview) if your workload involves deep expert reasoning without tool scaffolding (HLE and ARC-AGI-2 performance are your proxy), long-document retrieval at 128k+ tokens (Flash regresses −7.6 pts on MRCR v2 at 128k), or any task where the 3.1 Pro benchmark profile — GPQA Diamond 94.3%, SWE-Bench Verified 80.6% — is the right evaluation frame. 3.1 Pro is paid-only as of April 1, 2026, but it remains the authoritative Pro option until 3.5 Pro ships and publishes a model card.

Wait for 3.5 Pro only if your workload needs both agentic performance at Flash's level and expert reasoning depth at or above 3.1 Pro's level — and you can accept a multi-week wait with no guaranteed delivery date. Do not architect production systems around unconfirmed Pro capabilities. No specs, no benchmarks, no pricing, and no API model ID are published as of today . Building on assumptions about a model that hasn't shipped is an integration risk that Flash-plus-3.1-Pro covers adequately today.

Re-evaluate your cost model before committing to any Pro-tier budget. At Batch or Flex tier, Flash at $0.75 / $4.50 is 62.5% cheaper than 3.1 Pro standard at $2.00 / $12.00. If your workload can tolerate asynchronous processing, that cost differential compounds significantly at volume. Run the numbers against your actual token consumption before defaulting to a Pro-tier line item for async pipelines where Flash performs comparably.

Gemini 3.5 Pro: What Google Has Confirmed vs What Remains Speculation

Gemini 3.5 Pro does not exist as a public API artifact as of May 29, 2026. No model card, no official benchmarks, no pricing, and no API model ID have been published . The ai.google.dev changelog shows a single Gemini 3.5 entry dated May 19, 2026 — for Flash only . Everything beyond that is announcement, projection, or briefing — not shipped capability.

What Google has confirmed: Pro was announced at I/O 2026 as forthcoming; it is in internal Google use; it is available in a limited Vertex AI enterprise preview; June 2026 is the stated general availability target . The intent is for Pro to restore the reasoning regressions present in Flash — specifically on ARC-AGI-2, Humanity's Last Exam, and 128k-token retrieval. Whether it achieves that, and by how much, is unknown until a model card appears.

What remains speculation: a 2M-token context window is widely cited in coverage but carries no official confirmation . Pricing projections from analysts cluster around $2.50–$3.00 input / $15.00–$18.00 output per 1M tokens, but Google has published nothing. Whether the four-tier pricing structure (Standard / Batch / Flex / Priority) extends to Pro is unconfirmed. Gemini 3.1 Pro's deprecation timeline after 3.5 Pro ships has not been announced. The only reliable signal is the changelog: when a 3.5 Pro entry appears at ai.google.dev, the model has shipped. Until then, treat Pro as vapor.

Frequently Asked Questions

Is Gemini 3.5 Pro available via the Gemini API today?

No. As of May 29, 2026, only Gemini 3.5 Flash (model ID: gemini-3.5-flash) is stable GA. Gemini 3.5 Pro has no public model ID, no model card, and no published pricing. It is confirmed to be in internal Google use and a limited Vertex AI enterprise preview, with June 2026 cited as the general availability target — but that date is a stated goal, not a commitment. The ai.google.dev changelog shows a single 3.5 entry dated May 19, 2026 for Flash only. Until a model card and API model ID appear there, Pro should be treated as unshipped.

Should I migrate production workloads from Gemini 3.1 Pro to 3.5 Flash now?

It depends on workload type. Flash outperforms 3.1 Pro on every agentic benchmark: coding agents, multi-tool orchestration, financial pipelines, and streaming endpoints — with approximately 4× the throughput and 25% lower cost at standard tier. If those are your use cases, migrating to Flash now is straightforward. However, if your workload depends on long-context retrieval at 128k+ tokens (Flash regresses −7.6 pts on MRCR v2 at that range versus 3.1 Pro), or requires deep expert reasoning without tool access (Flash scores −4.2 pts on Humanity's Last Exam), stay on 3.1 Pro for those specific tasks. The migration decision is workload-specific, not a blanket upgrade.

What does the thinking_level change break in agents migrating from preview?

Two things. First, any code passing thinking_budget as an integer will fail or be silently ignored on the stable endpoint — the parameter has been replaced by thinking_level (enum: "low", "medium", "high"). Second, the default shifted from high to medium. Agents that relied on preview default behavior without explicitly setting the parameter will produce lower-quality reasoning output after migration, with no error raised to signal the change. The fix is to explicitly set thinking_level="high" in all production calls where you want the preview-equivalent behavior. Never rely on defaults across API versions.

How reliable is the 1M token context window for RAG in practice?

Not reliable for full-context retrieval. MRCR v2 at 1M tokens scores 26.6% for Flash — essentially identical to 3.1 Pro's 26.3%. At that performance level, you're retrieving roughly one in four chunks correctly, which makes the 1M window unsuitable for production RAG. Reliable retrieval should stay well below 128k tokens, though even there Flash regresses: 77.3% versus 84.9% for 3.1 Pro on MRCR v2 at 128k. The 1,048,576-token input limit is a spec ceiling, not a quality guarantee. Design your chunking and retrieval architecture around the MRCR benchmark data, not the context window spec.

Which Flash pricing tier is right for a production API agent?

For latency-sensitive agents requiring guaranteed capacity, use Priority tier at $2.70 input / $16.20 output per 1M tokens. For general production without hard SLA requirements, Standard tier at $1.50 / $9.00 is the correct choice — and 25% cheaper than Gemini 3.1 Pro. For async pipelines where you control scheduling and can tolerate variable latency, Batch or Flex tier at $0.75 / $4.50 delivers a 50% cost reduction versus Standard. Remember to factor in the 10% non-global region surcharge if you're deploying outside the default global endpoint, and model Search grounding costs separately at $14 per 1,000 queries beyond the 5,000 free monthly queries.

What to Build On Now, and What to Watch For

The Gemini 3.5 launch is effectively a Flash launch with a Pro announcement attached. For most builders, the practical decision space is between Flash and 3.1 Pro — both available today, with clear benchmark differentiation. Flash wins for agentic workloads, tools, and cost-sensitive inference. 3.1 Pro wins for expert reasoning and 128k-range retrieval. Neither is universally better; the benchmarks tell you which is right for a given task type.

The API migration from preview to stable has real breaking changes that require explicit auditing — thinking_budget removal, default drift from high to medium, tool response schema changes, and context accumulation behavior. None of these are caught by type checkers or raise errors in all failure modes. A staging-environment migration test against the stable endpoint is the minimum verification step before moving any production agent.

For 3.5 Pro: watch the ai.google.dev changelog. A model card and API model ID appearing there is the only reliable signal that Pro has shipped. June 2026 is the stated target; don't build production dependencies on it until those artifacts are live. The current GA capability surface is Flash plus 3.1 Pro preview — design your systems around what's actually available, not what's been announced.

Last updated: 2026-05-29. This article reflects the Gemini 3.5 Flash stable release (May 19, 2026) and the state of Gemini 3.5 Pro availability as of May 29, 2026. Pricing, benchmark data, and API specifications should be verified against ai.google.dev before production deployment.