What Ships in ChatGPT Images 2.0: Five Capability Changes
gpt-image-2 is the API model identifier for ChatGPT Images 2.0, the direct successor to GPT Image 1.5, released on April 21, 2026. It is the first OpenAI image model to ship with a native reasoning layer: a pre-generation planning pass that runs before any pixels are produced. There is no versioned alias, so callers must pass the literal string "gpt-image-2". According to OpenAI's launch announcement, the model delivers "significantly enhanced world knowledge, instruction following, and generating detail and complexity such as dense text."
gpt-image-2 launched April 21, 2026 with five new capabilities over GPT Image 1.5: a native reasoning layer, live web search grounding (thinking mode only), 2560×1440 max resolution, up to 10-image coherent batching, and print-quality multilingual text. DALL-E 2 and DALL-E 3 were retired May 12, 2026 — swap the model string and update billing logic to migrate.
The five headline additions over GPT Image 1.5 are distinct enough that each requires its own integration decision:
- Thinking mode: A reasoning pass runs before generation to plan layouts, validate composition, and optionally invoke live web search. Access is gated by plan tier: Plus, Pro, Business, and Enterprise only. Free-tier users receive instant mode exclusively.
- Web search grounding: Operates only within thinking mode. The model can fetch current data mid-generation, which is relevant for event posters, live-data infographics, and time-sensitive visual assets. This is a text-side fetch; it does not pull reference images from the web.
- 2K resolution and flexible aspect ratios: Maximum output climbs to 2560×1440 (marked experimental at launch), with aspect ratios from 3:1 ultra-wide to 1:3 ultra-tall.
- Multi-image batching with continuity: A single prompt can generate up to 10 coherent images with consistent character identity and object continuity across the batch, which is useful for storyboard, sequence, and campaign asset work.
- Multilingual text and layout rendering: Accurate text in Japanese, Korean, Chinese, Hindi, and Bengali; scannable QR code generation; print-ready multi-panel layouts.
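The resolution, aspect-ratio, and batch limits listed above lend themselves to a client-side pre-flight check. The sketch below is illustrative only: `validate_request` is a hypothetical helper, and treating 2560 as a long-edge cap is an assumption, since the text gives only the 2560×1440 maximum and the 3:1 to 1:3 ratio range.

```python
MAX_LONG_EDGE = 2560  # assumption: long-edge cap inferred from the 2560x1440 maximum
MAX_RATIO = 3.0       # 3:1 ultra-wide down to 1:3 ultra-tall
MAX_BATCH = 10        # up to 10 coherent images per prompt


def validate_request(width: int, height: int, n: int = 1) -> list[str]:
    """Return a list of problems; an empty list means the request looks in range."""
    if width <= 0 or height <= 0:
        return ["width and height must be positive"]
    problems = []
    long_edge = max(width, height)
    if long_edge > MAX_LONG_EDGE:
        problems.append(f"long edge {long_edge} exceeds {MAX_LONG_EDGE}")
    ratio = long_edge / min(width, height)
    if ratio > MAX_RATIO:
        problems.append(f"aspect ratio {ratio:.2f}:1 is outside 3:1 to 1:3")
    if not 1 <= n <= MAX_BATCH:
        problems.append(f"n={n} is outside 1..{MAX_BATCH}")
    return problems
```

Running the check before the API call keeps obviously out-of-range requests from consuming a billed round trip; the server remains the authority on what it accepts.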
The reasoning layer is the structural differentiator from every prior OpenAI image model. Previous models produced output from a single forward pass without a separate planning step. gpt-image-2 decouples planning from generation, enabling compositional validation before pixel output begins. The tradeoff is latency: thinking mode adds 15–30 seconds per request — a cost addressed in detail in the next section.
The absence of a versioned alias matters for lifecycle management. When OpenAI eventually ships a successor, your code will not automatically upgrade; you control the migration. Hardcode "gpt-image-2" explicitly and track it as a named dependency. Azure OpenAI Service supports the model from the April 21, 2026 launch date.
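One way to treat the model string as a named dependency is to define it once and resolve it through a single function. This is a minimal sketch; `IMAGE_MODEL` is a hypothetical environment variable chosen for illustration, not an OpenAI convention.

```python
import os

# Pinned explicitly: the text notes there is no versioned alias to float on.
DEFAULT_IMAGE_MODEL = "gpt-image-2"


def image_model() -> str:
    """Resolve the image model ID, allowing a deliberate override at deploy time."""
    # IMAGE_MODEL is a hypothetical variable name used only in this sketch.
    return os.environ.get("IMAGE_MODEL", DEFAULT_IMAGE_MODEL)
```

Every call site then passes `model=image_model()`, so a future migration becomes a one-line change that is easy to find in review.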
Thinking Mode and Web Search Grounding: Architecture Details
Thinking mode in gpt-image-2 is a pre-generation reasoning pass, structurally separate from the pixel generation step, that performs layout planning, compositional validation, and optional live web data retrieval before the model produces any output. It is available only to Plus, Pro, Business, and Enterprise plan subscribers. Free-tier users cannot access it; the instant mode path is their only option. This is a hard infrastructure gate, not a soft feature flag.
Web search grounding is architecturally nested inside thinking mode and cannot be triggered from the standard instant mode path. When the model executes a web fetch during a thinking-mode generation run, it retrieves current text data (venue details, event dates, live statistics) and incorporates that into the compositional plan. The fetch happens mid-generation, not as a post-processing step. It is text-only: the model does not pull reference images from the web to inform visual rendering.
"The model can invoke live web search" exclusively within thinking mode, enabling "accurate event posters, news-style infographics, and time-sensitive visual assets" — from OpenAI's ChatGPT Images 2.0 Safety System Card, OpenAI Deployment Safety Hub, April 2026.
The latency cost is real and non-trivial: 15 to 30 seconds of additional processing per request, on top of the base generation time. This makes thinking mode structurally incompatible with synchronous user-facing generation pipelines. If your product shows a spinner and waits for an image before rendering the next UI state, a 15–30 second addition is not a workable latency budget for most consumer-facing contexts. Thinking mode is better suited for async batch jobs, offline asset pipelines, or operator-side creation flows where users are already expecting a processing delay.
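An async batch pattern of the kind described above can be sketched with `asyncio`. The `generate_image` coroutine here is a stand-in that only sleeps briefly; in a real pipeline it would wrap the actual API request, and the concurrency limit of 4 is an arbitrary illustrative value.

```python
import asyncio


async def generate_image(prompt: str) -> str:
    """Stand-in for a slow thinking-mode call; replace with the real API request."""
    await asyncio.sleep(0.01)  # the real call may take tens of seconds
    return f"image-for:{prompt}"


async def run_batch(prompts: list[str], max_concurrent: int = 4) -> list[str]:
    """Run generation jobs concurrently, capped at max_concurrent in flight."""
    gate = asyncio.Semaphore(max_concurrent)

    async def one(prompt: str) -> str:
        async with gate:
            return await generate_image(prompt)

    # gather returns results in the same order as the input prompts
    return await asyncio.gather(*(one(p) for p in prompts))
```

Because the jobs run off the request path, the added seconds per image affect batch throughput rather than what a user waits on.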
A second operational implication that many developers will underweight: thinking mode reduces the safety detection rate. The combined detection rate published in OpenAI's safety card drops from 96.1% in standard mode to 87.5% in thinking mode. The card does not fully explain the gap. One plausible mechanism is that the reasoning pass introduces compositional edge cases the safety stack is less well calibrated to intercept. For regulated or high-risk content verticals, this 8.6-point gap in detection rate warrants additional application-layer moderation when thinking mode is enabled.
There is currently no public API parameter to explicitly toggle thinking mode per-request. The access mechanism is entirely plan-tier-based. If you run multi-tenant architectures where different tenants operate under different plan tiers, thinking mode availability is non-uniform across your user base. Architect this constraint into your access-tier logic: surface thinking mode capabilities only to tenants whose authenticating credentials carry the required plan tier, rather than exposing a feature that silently degrades or errors for lower-tier users.
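The access-tier logic described above can be as small as a single predicate. This is a sketch under stated assumptions: the `Tenant` type and its `plan_tier` field are hypothetical and would be populated from your own account records, since the text describes no API for querying plan tier.

```python
from dataclasses import dataclass

# Plan tiers the text lists as having thinking-mode access.
THINKING_TIERS = frozenset({"plus", "pro", "business", "enterprise"})


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    plan_tier: str  # hypothetical field, filled from your own billing records


def can_use_thinking_mode(tenant: Tenant) -> bool:
    """True only when the tenant's plan tier is one that includes thinking mode."""
    return tenant.plan_tier.strip().lower() in THINKING_TIERS
```

Gating the UI and job scheduler on this predicate keeps lower-tier tenants from being offered a capability their credentials cannot deliver.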
API Integration: Endpoints, Quality Tiers, and Token-Based Pricing
Integrating gpt-image-2 at the API level requires changing one string in existing calls. Pass model="gpt-image-2" to either the /v1/images/generations or /v1/images/edits endpoint; both are unchanged in URL structure and authentication scheme. No new route, no new auth token format, no SDK version requirement. The model is also available on Azure OpenAI Service from the April 21, 2026 launch date.
The quality parameter accepts three values: low, medium, and high. Per-image pricing at 1024×1024 resolution is $0.006, $0.053, and $0.211 respectively. Edit requests are a behavioral exception: the API processes them at maximum quality regardless of the quality value you pass. If you were using quality-tiered billing to reduce edit costs under DALL-E 3, that lever no longer exists, so factor this into cost modeling before migrating edit-heavy workflows.
The more significant pricing change is the token-based billing layer. Image input tokens are billed at $8.00 per million tokens; image output tokens at $32.00 per million tokens. This applies on top of the per-image flat rate, replacing DALL-E 3's flat-rate-only model. The API response schema now returns token counts, so you need to read and log response.usage from day one. For Azure deployments, update cost monitoring dashboards and any internal chargeback tooling to consume token-count fields rather than a fixed per-call multiplier.
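| Quality Tier | Per-Image Price (1024×1024) | Image Input Token Rate | Image Output Token Rate | Edit Behavior |
|---|---|---|---|---|
| low | $0.006 | $8.00 / M tokens | $32.00 / M tokens | Always runs at max quality; ignores quality parameter |
| medium | $0.053 | $8.00 / M tokens | $32.00 / M tokens | Always runs at max quality; ignores quality parameter |
| high | $0.211 | $8.00 / M tokens | $32.00 / M tokens | Always runs at max quality; ignores quality parameter |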
A minimal Python integration looks like the following. Note the response.usage field — it did not exist in DALL-E 3 responses and is the field you need to wire into observability and per-tenant billing logic:
```python
import openai

# Reads the API key from the OPENAI_API_KEY environment variable.
client = openai.OpenAI()

response = client.images.generate(
    model="gpt-image-2",
    prompt="Product shot of a matte black mechanical keyboard on a wooden desk",
    quality="high",
    size="1024x1024",
    n=1,
)

print(response.data[0].url)

# Token billing fields; log these:
print(response.usage)  # input_tokens, output_tokens
```
For high-resolution generation requests, image size contributes meaningfully to token consumption. A 2560×1440 output will carry a substantially higher token bill than a 1024×1024 output at the same quality tier. If your application exposes resolution as a user-selectable parameter, update your cost estimation logic to scale with output dimensions, not just the quality tier value.
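To make the token-based portion of the bill concrete, the arithmetic can be wrapped in a small helper. The rates are the ones quoted in the text; `token_cost_usd` is a hypothetical name, and the token counts in the usage note are made-up inputs rather than measured values for any resolution.

```python
INPUT_RATE_PER_M = 8.00    # USD per million image input tokens (from the text)
OUTPUT_RATE_PER_M = 32.00  # USD per million image output tokens (from the text)


def token_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Token-based portion of the bill for one request, in USD."""
    total = input_tokens * INPUT_RATE_PER_M + output_tokens * OUTPUT_RATE_PER_M
    return total / 1_000_000
```

For example, a request reporting 500 input tokens and 4,000 output tokens would carry about $0.132 in token charges. Feeding real values from response.usage into the same function gives a per-request figure you can compare across resolutions.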
DALL-E 2 and DALL-E 3 Retired May 12, 2026: Migration Path
DALL-E 2 and DALL-E 3 reached end-of-life on May 12, 2026. Any code passing model="dall-e-2" or model="dall-e-3" to the images API returns an error as of that date. This is not a soft deprecation with a fallback: those model strings are invalid. The migration is straightforward in the common case, but three semantic differences require explicit handling before promoting the change to production.
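Before migrating, it can help to locate every place the retired identifiers still appear. The following is a rough sketch of a scan over source or config text; `find_retired_models` is a hypothetical helper, and a plain string match like this will also flag comments and documentation.

```python
import re

RETIRED_MODELS = ("dall-e-2", "dall-e-3")
_PATTERN = re.compile("|".join(re.escape(m) for m in RETIRED_MODELS))


def find_retired_models(source: str) -> list[tuple[int, str]]:
    """Return (line_number, model_string) for each retired identifier in the text."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((lineno, match.group()))
    return hits
```

Run over each file in a repository, the function yields a checklist of call sites to update.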
Step 1 — Model string. Replace the model parameter. Endpoint URL, path, and Authorization header are unchanged. This is a one-line change in most codebases:
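```python
# Before
response = client.images.generate(model="dall-e-3", prompt=prompt)

# After: only the model string changes; remaining arguments stay as they were
response = client.images.generate(model="gpt-image-2", prompt=prompt)
```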
Step 2 — Quality value rename. DALL-E 3 used "standard" and "hd". gpt-image-2 uses "low", "medium", and "high". Passing the old values to the new model will error or silently default. A reasonable starting mapping: "standard" → "medium", "hd" → "high". Validate output quality against your acceptance criteria before promoting to production — the perceptual difference between quality tiers is model-specific and may not map 1:1 to what DALL-E 3 produced.
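The starting mapping above can be captured in a small translation function so that old values are converted in one place. This is a sketch; `migrate_quality` is a hypothetical helper, and the mapping is only the suggested starting point from the text, to be revised after comparing real outputs.

```python
# Starting point suggested in the text; tune after comparing real outputs.
QUALITY_MAP = {"standard": "medium", "hd": "high"}
VALID_QUALITIES = {"low", "medium", "high"}


def migrate_quality(value: str) -> str:
    """Translate a DALL-E 3 quality value; pass through values already valid."""
    if value in VALID_QUALITIES:
        return value
    try:
        return QUALITY_MAP[value]
    except KeyError:
        raise ValueError(f"unknown quality value: {value!r}") from None
```

Raising on unknown values, rather than defaulting, surfaces stale configuration during testing instead of in production.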
According to OpenAI's ChatGPT Images 2.0 announcement (April 2026), the migration is a model parameter swap; endpoint URLs and authentication are unchanged.
Step 3 — Billing logic update. DALL-E 3 priced on a flat per-image rate with two quality tiers. gpt-image-2 adds a token-based billing layer, with image input tokens at $8.00/M and output tokens at $32.00/M. The response schema now returns token counts in response.usage. Update cost monitoring before traffic switches: any cost-cap logic or internal chargeback system that assumed a fixed per-image rate will produce incorrect figures from the first request. One additional edge case: edit requests on gpt-image-2 always run at maximum quality regardless of the parameter passed. If you were previously using quality="standard" on DALL-E 3 edit calls to save cost, those savings disappear. Factor that into your migration cost model.
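For chargeback, the token counts need to be accumulated per tenant rather than only printed. The sketch below shows one possible in-memory shape; `UsageLedger` is a hypothetical class, and a production system would persist these totals rather than hold them in a dictionary.

```python
from collections import defaultdict


class UsageLedger:
    """Accumulates token counts per tenant for chargeback reporting."""

    def __init__(self) -> None:
        self._totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})

    def record(self, tenant_id: str, input_tokens: int, output_tokens: int) -> None:
        """Add one request's token counts to the tenant's running totals."""
        totals = self._totals[tenant_id]
        totals["input_tokens"] += input_tokens
        totals["output_tokens"] += output_tokens

    def totals(self, tenant_id: str) -> dict[str, int]:
        # Copy so callers cannot mutate the ledger's internal state.
        return dict(self._totals[tenant_id])
```

Calling `record` with the values read from response.usage after every request gives each tenant a running total that token rates can be applied to at invoice time.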
Safety System Card: Three-Layer Content Review Architecture
OpenAI published a formal safety system card for gpt-image-2 on April 21, 2026, available at deploymentsafety.openai.com with the full PDF linked directly from that page. The card documents a three-layer content review architecture, published detection rates split across standard and thinking modes, a formal biological risk evaluation, and a methodological shift from raw taxonomy-matching to outcome-based harmful-output risk assessment.
The three layers run in sequence for every request:
- Layer 1 — Upstream refusals: Specialized text classifiers evaluate every incoming request before it reaches the generation model. Policy-violating requests are refused at this stage without engaging the generation system.
- Layer 2 — Input blocking: A multimodal safety reasoning model inspects all text and image inputs. If any component violates policy, generation halts before any pixels are produced.
- Layer 3 — Output blocking: The same safety reasoning model reviews the final rendered image before it is returned to the caller. If the output violates policy despite passing layers 1 and 2, it is blocked at this final stage.
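The sequencing of the three layers can be pictured as a chain of gates around the generation step. The code below is a toy illustration of that control flow only: every function and parameter name is hypothetical, the checks are placeholder callables, and nothing here reflects how OpenAI implements its safety stack.

```python
from typing import Callable, Optional

Check = Callable[[str], bool]  # returns True when the content passes


def review(prompt: str, generate: Callable[[str], str],
           upstream: Check, input_check: Check, output_check: Check) -> Optional[str]:
    """Run three sequential gates; return the output, or None if any gate blocks."""
    if not upstream(prompt):       # Layer 1: refuse before generation is engaged
        return None
    if not input_check(prompt):    # Layer 2: block on inputs before any pixels
        return None
    image = generate(prompt)
    if not output_check(image):    # Layer 3: block the rendered output
        return None
    return image
```

The ordering matters for cost as well as safety: a request stopped at the first or second gate never reaches the generation step.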
"A multimodal model trained to reason about content policies" inspects both inputs and outputs as a "safety reasoning model," operating independently of the generation model — from the ChatGPT Images 2.0 Safety System Card, OpenAI Deployment Safety Hub, April 2026.
The published safety metrics reveal a meaningful gap between standard and thinking modes:
| Metric | Standard Mode (Images 2.0) | Thinking Mode |
|---|---|---|
| Combined Detection Rate | 96.1% | 87.5% |
| Safe Output Rate | 99.1% | 99.2% |
| Violative Outputs Generated | 22.0% | 6.7% |
The asymmetry here is worth parsing carefully. Thinking mode generates fewer violative outputs (6.7% vs 22.0%) but has a lower combined detection rate (87.5% vs 96.1%). OpenAI's card does not resolve this apparent paradox. The most operationally conservative reading: thinking mode is better at not generating policy-violating content, but the safety stack is less reliable at catching violations before generation begins — a distinction that matters if you are using detection rate as your primary risk metric for regulated content verticals. For healthcare imaging, legal documentation pipelines, or children's platforms, the 8.6-point detection gap warrants additional application-layer moderation when thinking mode is enabled.
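One way to read the detection figures is through their complements. The short calculation below assumes the undetected share is simply 100 minus the combined detection rate, measured over comparable evaluation sets; the card may define the metric differently, so treat the result as a rough framing rather than a published number.

```python
def miss_rate(detection_rate_pct: float) -> float:
    """Percentage not detected, assuming miss = 100 - detection."""
    return round(100.0 - detection_rate_pct, 1)


standard_miss = miss_rate(96.1)   # 3.9
thinking_miss = miss_rate(87.5)   # 12.5
ratio = thinking_miss / standard_miss
```

Under that assumption, the undetected share in thinking mode is roughly 3.2 times the share in standard mode, which is a sharper way to state the 8.6-point gap when sizing additional moderation.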
The card documents a biological risk evaluation using a 772-image set covering prompts designed to elicit infographics for biotoxin synthesis. A bioweapons expert reviewed outputs and rated the model's effective assistance as limited to "novice uplift." An image-specific variant of the biological safety policy now runs over all inputs and outputs via the safety reasoning model. All generated images carry C2PA metadata embedding and imperceptible, content-specific watermarking for downstream provenance verification.
Benchmark Results and Documented Limitations
Within 12 hours of the April 21, 2026 launch, gpt-image-2 reached the top position across every category on the Image Arena leaderboard by a margin of +242 points, described as the largest recorded lead on that platform. Image Arena uses human preference voting rather than algorithmic scoring, so the result captures aesthetic quality and instruction-following as perceived by evaluators, not a narrow technical capability metric. It is a useful directional signal, but your own evaluation against production-representative prompts matters more than a leaderboard position.
The documented limitations deserve precise description rather than a generic disclaimer:
Knowledge cutoff: December 2025. The model cannot accurately render post-cutoff products, events, or public figures. Web search grounding in thinking mode can retrieve current text data (a venue's address, a recent event date), but the model's visual rendering of post-cutoff subjects will not be accurate. A product released in January 2026 or a newly prominent public figure will not render correctly regardless of grounding, because the model has no visual training data for that subject. This matters for marketing asset generation and editorial illustration workflows targeting recent subjects.
Brand logo reproduction is inconsistent. This is a capability limitation, not a policy restriction: the model may approximate a logo but not reproduce it accurately. For text inside images, accuracy is strong across Japanese, Korean, Chinese, Hindi, and Bengali. QR codes are scannable. Latin-script text rendering is also accurate; the multilingual improvement is the notable addition relative to prior models.
Architecture undisclosed; fine-tuning unavailable. OpenAI has not stated whether the generation component is diffusion-based or autoregressive, and no fine-tuning pathway has been announced as of April 2026. If your application requires strong domain-specific style consistency (product photography with a defined visual identity, branded illustration with consistent stylistic rules), you are limited to prompt engineering. There is no adapter or LoRA pathway currently available.
Thinking mode latency is structural, not temporary. The 15–30 second addition reflects the reasoning pass and optional web fetch; it is an architectural property, not a performance regression that will improve in a patch release. Design your pipelines around this constraint rather than waiting for it to disappear.
When to Enable Thinking Mode: A Developer Decision Framework
The plan-tier gate on thinking mode creates a binary for most developers: access is available or it is not, based on the authenticating organization's subscription. Assuming access, the per-request decision reduces to three variables: latency tolerance, per-image cost sensitivity, and content risk profile.
Use thinking mode for:
- Print assets, event posters, and editorial infographics where layout correctness and compositional planning matter more than generation speed
- Content requiring current data — live venue details, recent event information, updated statistics — where web search grounding provides factual accuracy the model cannot derive from training data alone
- Multi-panel layouts, QR-embedded designs, storyboard sequences, and campaign asset batches where cross-image consistency and compositional validation are required
- Async pipelines and batch jobs where the generation is not blocking a synchronous user-facing interaction
Skip thinking mode for:
- Real-time user-facing generation where latency above 15 seconds is unacceptable — the reasoning pass is not configurable and cannot be shortened
- High-volume generation batches where per-image cost is a primary constraint and compositional planning adds no meaningful quality improvement for simple, well-defined prompts
- Straightforward product renders or icon generation where instant mode output already meets your quality threshold and thinking-mode overhead is pure waste
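The two lists above can be condensed into a routing heuristic. This is a sketch of one possible reading, not a rule from OpenAI: the function name and parameters are hypothetical, and using the upper end of the 15–30 second range as the cutoff is a deliberately conservative assumption.

```python
THINKING_OVERHEAD_S = 30.0  # upper end of the 15-30 second range in the text


def use_thinking_mode(has_plan_access: bool, latency_budget_s: float,
                      needs_current_data: bool, complex_layout: bool) -> bool:
    """Heuristic: route to thinking mode only when available, affordable, and useful."""
    if not has_plan_access:
        return False
    if latency_budget_s < THINKING_OVERHEAD_S:
        return False
    return needs_current_data or complex_layout
```

A simple icon render with a generous latency budget still returns False here, reflecting the point that thinking-mode overhead is wasted when instant mode already meets the quality bar.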
"The combined detection rate is 87.5% for thinking mode, compared to 96.1% for standard mode" — from the ChatGPT Images 2.0 Safety System Card PDF, OpenAI Deployment Safety Hub, April 2026. OpenAI does not explain why the gap exists in the published card.
The safety detection rate gap is an additional factor for regulated content verticals. Operating in thinking mode with an 87.5% detection rate versus 96.1% in standard mode means a meaningful fraction of policy-violating inputs that would be caught in standard mode will not be caught before generation begins. For healthcare, legal, or children's platforms, apply additional application-layer content moderation when thinking mode is active; do not rely solely on the API's built-in safety system.
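An application-layer check of this kind can be wired in as a conditional wrapper around the returned image. The sketch below assumes you supply your own `extra_check` callable (for example, a call into whatever moderation service you already use); both the function name and that callable are hypothetical.

```python
from typing import Callable, Optional


def moderated_result(image_ref: str, thinking_mode: bool,
                     extra_check: Callable[[str], bool]) -> Optional[str]:
    """Apply an additional application-layer check only when thinking mode was used."""
    if thinking_mode and not extra_check(image_ref):
        return None  # withhold the image; the caller can route it to review
    return image_ref
```

Keeping the extra check conditional avoids paying its cost on standard-mode traffic, though a stricter deployment could simply run it on everything.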
No public API parameter exists to force thinking mode on or off per-request. The current control surface is entirely at the account plan tier. If OpenAI introduces a per-request toggle in a future API version, the decision framework above still applies — but you would gain finer-grained control at the call level rather than the account level. For now, architect the constraint into your access-tier logic: expose thinking mode capabilities only where the underlying credentials support them, and make that conditional explicit and documented in your codebase.
Frequently Asked Questions
Is gpt-image-2 available on the free tier?
Free-tier OpenAI accounts can access gpt-image-2 in instant mode, but thinking mode and web search grounding are unavailable. These capabilities require a Plus, Pro, Business, or Enterprise plan. This is a hard infrastructure gate, not a usage limit: passing a thinking-mode request from a free-tier credential will not silently fall back to instant mode. Design your access control logic to check plan tier before surfacing thinking-mode features to end users.
How do I migrate from DALL-E 3 to gpt-image-2 in existing code?
Three changes are required. First, swap the model string from "dall-e-3" to "gpt-image-2"; endpoint URLs and Authorization headers are unchanged. Second, rename quality values: "standard" maps to "medium" and "hd" maps to "high"; the DALL-E 3 quality strings are not valid in the new API. Third, update billing logic: gpt-image-2 uses token-based pricing with image input tokens at $8.00/M and output tokens at $32.00/M on top of the per-image rate, replacing DALL-E 3's flat-rate model. The response schema now returns usage.input_tokens and usage.output_tokens; wire those into your cost monitoring before switching traffic. Also note that edit requests on gpt-image-2 always process at maximum quality regardless of the quality parameter, so if you relied on quality="standard" for cheaper edits under DALL-E 3, that saving is gone.
What does the gpt-image-2 safety system card actually measure?
The card measures two primary rates across standard and thinking modes. Combined detection rate is the percentage of policy-violating inputs that the three-layer safety stack successfully catches before or during generation: 96.1% for standard mode, 87.5% for thinking mode. Safe output rate is the percentage of returned images that are policy-compliant: 99.1% for standard mode, 99.2% for thinking mode. OpenAI shifted this evaluation from raw taxonomy-matching to outcome-based assessment of real harmful-output risk. The 8.6-point gap in detection rate between modes is documented in the published card but not fully explained. The full PDF is available at deploymentsafety.openai.com.
Does thinking mode add meaningful latency in production?
Yes: 15 to 30 seconds of additional latency per request, on top of base generation time. This is structural and reflects the pre-generation reasoning pass and optional web fetch; it is not a transient performance issue. Thinking mode is unsuitable for synchronous user-facing flows where sub-5-second generation is expected. It is well-suited for async batch jobs, scheduled content pipelines, or operator-side asset creation where users are already expecting a processing delay. There is no API parameter to shorten or skip the reasoning pass while remaining in thinking mode.
Can gpt-image-2 render accurate images of events after December 2025?
Not reliably. The model's knowledge cutoff is December 2025. Web search grounding in thinking mode can retrieve current text data (a recent event date, updated venue information) and incorporate it into the compositional plan. However, the model cannot accurately render the visual appearance of post-cutoff subjects: a product released in 2026, a newly prominent public figure, or a recently redesigned logo will not be visually accurate regardless of grounding, because the model has no training data for that subject's appearance. Use web search grounding to get current facts into your prompt; do not rely on it to make the visual rendering reflect post-cutoff reality.
Build Decisions and What to Track
gpt-image-2 closes the capability gap between text and image generation on OpenAI's platform: the reasoning layer, multi-image continuity, and instruction-following improvements address the most common developer complaints about DALL-E 3. The migration deadline has passed: DALL-E 2 and DALL-E 3 are retired as of May 12, 2026. If you have not migrated, the model string swap and quality value rename are the immediate blockers. The token billing migration is the part most likely to produce unexpected cost increases, so log response.usage from the first request and establish a baseline before routing production traffic.
The more interesting architectural question is where thinking mode fits in your stack. For async pipelines — scheduled content generation, operator-side asset creation, document-to-image workflows — thinking mode is the right default when plan tier allows it. For real-time user-facing generation, instant mode is the only viable path until latency characteristics improve. The safety detection rate differential (96.1% vs 87.5%) between modes is a risk factor worth actively monitoring: if OpenAI ships safety improvements specific to thinking mode, the calculus for high-risk content verticals shifts. OpenAI's explicit disclosure of this gap in the published card is itself useful — it provides a documented baseline to track against future safety card revisions.
Four things to watch: a per-request API toggle for thinking mode (not yet available; currently plan-tier-gated only); fine-tuning support for gpt-image-2 (not announced as of April 2026); thinking mode expansion to free-tier accounts; and any published explanation from OpenAI for the detection rate gap. The Deployment Safety Hub is the authoritative source for safety metric updates as the card is revised. Bookmark the PDF directly, since OpenAI has historically updated system cards with new evaluation results without major announcements.
Last updated: 2026-05-27. Based on OpenAI's April 21, 2026 launch announcement, the ChatGPT Images 2.0 Safety System Card published at deploymentsafety.openai.com, and OpenAI's image generation API documentation. DALL-E 2 and DALL-E 3 retirement confirmed as of May 12, 2026.

