LLM #xAI #grok-build-0.1 #agentic coding #LLM pricing

grok-build-0.1: Model Spec, Pricing, and the Beta Rollout Story

xAI's grok-build-0.1 hit public beta in May 2026. Here's what the spec says — and what the caching incident revealed.

Creeta

May 29, 2026

grok-build-0.1: Model Spec, Pricing, and the Beta Rollout Story

What grok-build-0.1 Is: Model Family and Versioning

grok-build-0.1 is the externally documented model slug for xAI's first purpose-built agentic coding model, callable via the xAI API with a standard API key and listed on OpenRouter as x-ai/grok-build-0.1 . Internal identifiers found in xAI documentation — grok-code-fast-1, grok-code-fast, and grok-code-fast-1-0825 — indicate this is a versioned checkpoint of a broader grok-code-fast model family, not a standalone one-off release . The version number (0.1) signals that subsequent iterations are planned, though no public roadmap has been disclosed as of May 2026.

Quick Answer: grok-build-0.1 is xAI's agentic coding model with a 256K-token context window, always-on reasoning, and text-plus-image input. Priced at $1.00/M input and $2.00/M output tokens with an 80% cache discount, it's available via the xAI API (API key only) or through the Grok Build CLI, which requires a SuperGrok or X Premium+ subscription.

The modality profile is text and image input with text output . In a coding agent context, image input primarily serves debugging workflows: you can pass an error screenshot, a UI mockup, or a terminal rendering anomaly directly into the session without manually converting it to text first. This is a practical difference from text-only coding models that require you to transcribe or describe visual artifacts before the model can act on them.

Reasoning in grok-build-0.1 is always-on: the model thinks before responding without a separate toggle or mode switch . This contrasts with extended thinking in Claude, which is opt-in via API parameter, and with OpenAI's o-series, which are separate model variants rather than a feature flag. Whether always-on reasoning produces meaningfully better code edits versus adding latency that compounds in tight agent loops is an open empirical question — no third-party benchmarks have been published for grok-build-0.1 as of May 2026.

Deployment is available in two regions: us-east-1 and eu-west-1 . The EU region is directly relevant for developers building under GDPR constraints. Routing inference to eu-west-1 keeps processing within EU jurisdiction — important when the model is handling source code that contains personal data or proprietary business logic subject to data residency requirements.

256K Token Window and Rate Limits in Practice

A 256,000-token input window is the headline capability that most directly separates grok-build-0.1 from competing models for coding workloads . In practical terms, 200,000 tokens covers a mid-sized codebase — a 30K-line TypeScript monorepo, or a Python service with a full test suite and configuration files — loaded in a single context pass. That eliminates the chunking logic most agent frameworks bolt on to handle 128K-limit models: no sliding windows, no smart truncation heuristics, no context prioritization schemes that inevitably drop files the agent needs later.

The rate limits are 1,800 requests per minute and 10,000,000 tokens per minute . The token-per-minute ceiling is the binding constraint for heavy agentic use. Running 8 parallel subagents, each consuming a 256K input context, sends 2,048,000 tokens per parallel turn — well within the 10M/min headroom, assuming turns don't all fire simultaneously. Under normal multi-turn agent patterns with staggered completion times, the 8-branch parallelism described in xAI documentation is achievable without hitting per-minute ceilings .

There is no documented output token cap per call for extended autonomous sessions . This is worth calling out explicitly: GPT-4o defaults to 4,096 output tokens per call and requires explicit parameter overrides to reach higher limits; some providers enforce 8K hard caps regardless of model capability. An uncapped output is meaningful for tasks like generating complete files, writing full test suites, or producing long migration scripts in a single call without splitting the work across multiple requests.

The combination of a large input window and uncapped output has a direct implication for agent loop design. Repeated system prompt injection — where a large base context (instructions, file tree, project conventions) is prepended on every turn — is feasible without aggressive trimming. You can maintain a stable, comprehensive context across a long multi-turn session and let the model see the full project state on each call, which reduces the class of bugs that arise from the agent losing track of earlier state.

Pricing Breakdown vs. Competing Agentic Models

grok-build-0.1 is priced at $1.00 per million input tokens, $0.20 per million cached input tokens, and $2.00 per million output tokens . The 80% discount on cached input is the number that matters most for agentic coding workloads, where the system prompt and file tree are re-injected on every call. A session with a 70% cache hit rate on input brings the blended input cost to approximately $0.44 per million tokens — a 56% reduction from the list price, which is meaningful at scale.

The table below compares grok-build-0.1 against three models commonly used in agentic coding pipelines. grok-build-0.1 pricing is sourced from the OpenRouter model listing; competitor figures are from respective provider pricing pages as of May 2026.

Pricing comparison: agentic coding models, May 2026. Competitor figures from provider pricing pages; grok-build-0.1 from OpenRouter.
Model	Input ($/M tokens)	Cached Input ($/M tokens)	Output ($/M tokens)	Context Window
grok-build-0.1	$1.00	$0.20 (80% off)	$2.00	256K
Claude 3.5 Haiku	$0.80	$0.08 (90% off)	$4.00	200K
GPT-4o mini	$0.15	$0.075 (50% off)	$0.60	128K
Claude 3.5 Sonnet	$3.00	$0.30 (90% off)	$15.00	200K

GPT-4o mini is substantially cheaper on raw input cost, but its 128K context window creates architectural pressure for any codebase beyond small services — you pay for chunking complexity in engineering time rather than token cost. Claude 3.5 Haiku is cheaper on input but significantly more expensive on output ($4.00 vs. $2.00 per million), which matters for output-heavy tasks like code generation where output tokens dominate the cost profile. Claude 3.5 Sonnet belongs in a different category; it applies when output quality is the binding constraint and throughput volume is low.

To make the pricing concrete: consider a mid-size feature request dispatched to 8 parallel subagents, each receiving a 256K input context (file tree + system prompt + task description), returning 40K output tokens, with a 65% cache hit rate on input :

Uncached input per agent: 256,000 × 0.35 × $1.00/M = $0.0896
Cached input per agent: 256,000 × 0.65 × $0.20/M = $0.0333
Output per agent: 40,000 × $2.00/M = $0.0800
Total per agent: ~$0.20
Total for 8 agents: ~$1.62

The same workload on Claude 3.5 Sonnet (at $3.00 input / $0.30 cached / $15.00 output, with 90% cache discount) runs approximately $7.35 — roughly 4.5× higher. The math shifts for shorter, lower-volume sessions where base model quality may outweigh throughput cost, but for high-frequency agentic pipelines where cost per run compounds across dozens or hundreds of daily sessions, grok-build-0.1's $2.00/M output pricing is its clearest competitive advantage.

The Staged Beta: Five Phases From May 14 to May 27

xAI's Grok Build launch followed a five-phase staged rollout over two weeks, progressing from a closed subscriber beta to broad API availability . The sequencing — SuperGrok Heavy first, then API exposure, then third-party integrations, then general subscriber access — is a standard canary pattern. The 14-day gate-to-broad-availability timeline is compressed relative to how Anthropic and OpenAI have paced comparable rollouts, and the caching bug that surfaced on May 26 is at least partly a consequence of that pace.

Phase 1 (May 14–15): The CLI beta launched exclusively for SuperGrok Heavy subscribers . Installation was a single shell command. The product introduced three capabilities not previously available in the Grok ecosystem: a terminal-based TUI, parallel subagent support spanning up to 8 branches running simultaneously, and a Plan-then-Approve workflow that gates code execution behind developer review. SuperGrok Heavy is xAI's highest subscription tier, giving xAI an initial cohort of high-engagement users positioned to surface bugs through active use before broader rollout.

Phase 2 (May 19–21): The model slug grok-build-0.1 appeared on the xAI API, opening access to developers with standard API keys independent of subscription tier. ReleaseBot.io places the listing on May 19 ; OpenRouter's metadata shows May 20 ; and a KuCoin flash note cites May 21 . The three-day spread across independent tracking sources is consistent with a staged canary rollout rather than a single global toggle.

Phase 3 (May 21): OpenCode integration shipped on the same day the API rollout window closed . SuperGrok and X Premium+ subscribers could connect their account inside OpenCode via browser OAuth or a headless token, with both paths routing to the same grok-build-0.1 backend. The auth tradeoffs between those two paths are addressed in a later section.

Phases 4–5 (May 24–27): Broader access was extended to all SuperGrok and X Premium+ subscribers globally between May 24 and 27 . On May 26, xAI pushed a caching fix and fully restored rate-limit quotas that had been consumed early due to a billing inefficiency in the caching layer . That episode is covered in detail in the next section.

The Caching Bug That Burned Beta Quotas Early

On May 26, 2026, xAI acknowledged that caching inefficiencies in the Grok Build beta had caused subscribers to exhaust their usage quotas significantly faster than expected . The mechanism: cache misses were being billed at the full input token rate ($1.00/M) rather than the cached rate ($0.20/M) — a 5× cost difference per token. In a coding agent session where the file tree and system prompt constitute the bulk of each input and should be cache hits on every turn after the first, users were effectively paying 5× the intended input cost per turn for the duration of the bug.

"We identified caching inefficiencies that caused some users to exhaust their beta quotas sooner than expected. We've pushed a fix and fully restored the affected rate-limit quotas." — xAI, communicated via the Grok Build /feedback channel, May 26, 2026 (source: Basenor)

The fix was communicated through the in-CLI /feedback channel rather than a public status page. If you weren't running the CLI interactively during that window, you may not have caught the notification. xAI fully restored quotas for affected users after deployment. The incident reveals that the quota and billing system wasn't exercised at the load of a broad public beta before launch — a data point about pre-launch stress testing practices worth weighing when evaluating production readiness of any early-access model.

This pattern isn't unique to xAI. Anthropic's initial prompt caching rollout for Claude had edge cases where hit rates were lower than expected on certain prompt structures; OpenAI's cached token feature launched with exact-match and prefix-length requirements that weren't immediately obvious from the documentation. The shared lesson: prompt caching implementations are sensitive to implementation details that don't surface until real, diverse workloads hit them at scale.

The operational takeaway for teams integrating any early-beta model with a caching layer: treat cache hit rate as a first-class metric from day one, not an afterthought. The usage object in the xAI API response exposes cached versus uncached token counts per request. Log those fields. If your blended input cost per session diverges from the expected blended rate by more than 20%, investigate immediately — don't assume prompt caching is functioning correctly just because the feature is listed as available.

Headless Invocation and the Agent Client Protocol

Grok Build ships with two headless execution mechanisms that make it practical to embed in automated pipelines: a -p flag for script-level invocation and an Agent Client Protocol (ACP) interface for structured orchestration by third-party systems . The combination means the CLI can function as a controlled subprocess in a CI/CD runner or as a callable service inside a custom IDE or agent framework — not exclusively as an interactive developer tool.

The invocation pattern for script-level use:

grok -p "Add input validation to the user registration endpoint" \
  --output-format streaming-json

With --output-format streaming-json, the CLI emits machine-readable token-by-token output rather than the TUI interface. This is directly embeddable in a pipeline that needs to parse streaming responses, collect structured metadata (token counts, cached vs. uncached tokens, intermediate plan steps), and pass the result downstream — for example, into a PR description generator or a test runner that validates the model's output before a commit is created .

"ACP exposes a structured interface for third-party orchestrators — custom IDEs, autonomous agent frameworks, or CI systems — to drive the CLI without a human in the loop." — xAI Grok Build documentation

The grok inspect command is the primary diagnostic tool for orchestration mismatches. It surfaces the active project configuration: instructions loaded, skills registered, plugins installed, hooks configured, and MCP (Model Context Protocol) servers connected . When an agent behaves unexpectedly in a complex setup — wrong tool called, MCP server not responding, custom instructions not applied — grok inspect externalizes the agent's configuration state without requiring you to instrument the model's own output or add debug logging to the session.

The custom model override via config is worth noting for teams running hybrid setups. You can substitute a different API endpoint or a locally-served model (via an OpenAI-compatible interface) while keeping the Grok Build orchestration layer — Plan mode, subagent routing, MCP integration — intact. This enables A/B testing a different backend against the same task definition, or running the full agentic scaffold against a staging model endpoint without rebuilding your orchestration from scratch.

OpenCode Integration: OAuth vs. Headless Token

OpenCode added grok-build-0.1 support on May 21, 2026, giving SuperGrok and X Premium+ subscribers two authentication paths into the same backend model . Both paths route to the same grok-build-0.1 endpoint, so model behavior — context handling, reasoning, output format — is identical regardless of which auth method you use. The choice between OAuth and headless token is an operational decision about your environment, not a model capability decision.

The two auth paths have distinct operational tradeoffs:

Browser OAuth: Best for local development environments where a browser redirect is available. The flow is standard, handles token refresh automatically, and doesn't require you to manually store or rotate credentials. Impractical for containerized environments, remote VMs, or CI runners where there is no browser available to complete the redirect.
Headless token: A static credential that works anywhere a browser is not — containers, remote VMs, CI runners. Requires secure storage (environment variable injection, secrets manager, or vault) and a manual rotation process on expiry. Inject at runtime rather than hardcoding in configuration files committed to version control.

For teams already using OpenCode as their primary environment, the integration is additive: you select grok-build-0.1 as the model backend without changing orchestration logic, prompt structure, or tool configurations. OpenCode's routing layer abstracts the endpoint, so the same task definitions and agent configurations that worked with a prior model work unchanged — you are not rewriting plumbing to add a new backend .

One quota consideration worth flagging: the headless token draws from subscription compute quotas rather than a separate API token quota pool. If you're running high-volume automated workflows through OpenCode with a headless token, those runs compete with your interactive sessions for the same quota ceiling. Teams planning both interactive and automated workloads should account for that when sizing their subscription tier, or route automated workloads through the direct xAI API using an API key to keep the quota pools separate .

Frequently Asked Questions

How does grok-build-0.1 differ from the Grok Build CLI product?

grok-build-0.1 is the underlying model slug, callable via the xAI API or through OpenRouter (x-ai/grok-build-0.1) with a standard API key — no CLI installation required . Grok Build is the full CLI product layered on top, adding the Plan-then-Approve workflow, the terminal TUI, parallel subagent orchestration up to 8 branches, MCP server integration, and the ACP interface for third-party orchestration. Developers who want direct model access for custom integrations can call grok-build-0.1 as a standard LLM API endpoint without touching the CLI. The CLI is the higher-level product for interactive agentic development; the model is the API primitive usable in any framework.

Is grok-build-0.1 accessible without a SuperGrok subscription?

Yes. Starting May 19–21, 2026, grok-build-0.1 became available on the xAI API independently of subscription tier . A standard xAI API key is sufficient to call the model via the OpenAI-compatible endpoint or through OpenRouter. The SuperGrok or X Premium+ subscription unlocks the full Grok Build CLI — Plan mode, TUI, and parallel subagents — along with higher beta usage quotas. If you only need the model for API-level integration within your own orchestration framework, the subscription is not required; the API key path gives you the model directly.

What caused the beta quota exhaustion issue in May 2026?

Caching inefficiencies caused cache misses to be billed at the full input token rate ($1.00/M) rather than the cached rate ($0.20/M) — an 80% cost difference per token . In agentic coding sessions where the file tree and system prompt make up the bulk of each input, this resulted in approximately 5× the expected input cost per turn. xAI pushed a fix on May 26, 2026 and fully restored quotas for affected users. To catch similar issues early in any LLM integration with prompt caching, monitor the cached_tokens field in the API response usage object on every request and alert when blended input cost diverges from your expected rate by more than 20%.

What are the internal model aliases for grok-build-0.1?

Internal identifiers found in xAI documentation include grok-code-fast-1, grok-code-fast, and grok-code-fast-1-0825 . The 0825 suffix suggests a checkpoint date (August 25), indicating the model was snapshotted at that point during training or evaluation. The grok-code-fast base name signals a model family specifically optimized for speed and coding tasks. The public-facing slug grok-build-0.1 is the stable identifier for API integrations; the internal aliases may change across model updates and should not be used in production integrations where stability is required.

Can I use grok-build-0.1 in a CI/CD pipeline without installing the Grok Build CLI?

Yes. Direct API calls to grok-build-0.1 work like any standard LLM HTTP endpoint — set the XAI_API_KEY environment variable and POST to the xAI OpenAI-compatible endpoint or route through OpenRouter . No CLI installation is needed for direct model access. If you want the CLI's Plan mode, subagent orchestration, and ACP interface available in your pipeline, the headless invocation path (grok -p "<prompt>" --output-format streaming-json) supports machine-driven execution in runners where no human is present. Both paths are viable: direct API calls are simpler for custom integrations using your own orchestration framework; CLI headless mode is better if you specifically need Plan mode's step-level approval workflow embedded in your automation.

What to Watch Next

grok-build-0.1 lands as a technically credible entry in the agentic coding market: a 256K context window that reduces chunking overhead for mid-sized codebases, pricing that favors high-throughput output-heavy workloads over smaller context models, and a headless protocol that makes it embeddable in CI/CD pipelines without custom wrapper code. For developers evaluating alternatives to existing coding agents, the API-first access path — no subscription required, callable like any LLM endpoint — lowers the integration bar enough to justify a focused technical test against your actual codebase and task types.

What the model currently lacks is published third-party benchmark data. xAI has not released SWE-bench or HumanEval results for grok-build-0.1 as of May 2026. Performance claims in the initial rollout were self-reported — standard practice at launch, but it means teams evaluating this model need to run their own evals on representative tasks before committing production workloads. The caching bug episode on May 26 also suggests the quota and billing system wasn't exercised at full beta load before opening access. The fix was prompt and quotas were restored, but real-world throughput behavior under heavy sustained load remains an open question worth tracking through the coming weeks.

Three signals worth monitoring: whether xAI publishes benchmark data or independent evaluations surface; how fast the model versioning cadence moves from 0.1 onward and whether rapid checkpoint releases create model stability concerns for production integrations locked to the current slug; and whether OpenRouter pricing stays aligned with the direct xAI API as xAI scales its direct developer base. For now, the model is available, the pricing is transparent, and the headless tooling is functional — a reasonable starting point for an informed technical evaluation rather than a production commitment.

Last updated: 2026-05-29. Article reflects grok-build-0.1 model specification and Grok Build beta rollout details as of the publication date. Model pricing, rate limits, and access paths may change; verify current figures against xAI model documentation and OpenRouter's model listing before production deployment.