
xAI Grok Build: Sub-Agents, MCP Compat, and the SWE-Bench Numbers

xAI shipped its terminal coding agent on May 14, 2026. Here's what the CLI actually does, where the benchmark numbers hold, and what $299/month buys.

Creeta

May 28, 2026


What Grok Build Actually Ships: CLI, Model, and Install

Grok Build is xAI's terminal-native autonomous coding agent, powered by a purpose-built model called grok-build-0.1. It installs with a single command (curl -fsSL https://x.ai/cli/install.sh | bash) and runs against a local codebase through either an interactive terminal UI or a headless grok -p flag for scripting and CI pipelines. The CLI is written in Rust. xAI launched the product on May 14, 2026, with grok-build-0.1 formally released on May 19–20, 2026, and access expanded to all SuperGrok and X Premium+ subscribers on May 25, 2026. The agent enters a competitive field that includes Anthropic's Claude Code, OpenAI's Codex CLI, GitHub Copilot's cloud agent, and Google's Jules.

Quick Answer: Grok Build is xAI's Rust-based CLI coding agent powered by grok-build-0.1, with a 256,000-token context window and image input support. It runs a structured Plan → Search → Build pipeline with up to 8 parallel sub-agents in isolated Git worktrees. xAI self-reports 70.8% on SWE-Bench Verified; secondary sources cite Claude Code (Opus 4.7) at 80.4–88% on the same benchmark, roughly 10–17 points higher.

The headline spec for grok-build-0.1 is its 256,000-token context window. A mid-sized codebase of 50,000–80,000 lines may fit within that window in a single pass, depending on line length and tokenization, which removes the chunking strategies and retrieval-augmented workarounds that smaller-context agents require. The model also accepts images alongside text (diagrams, UI mockups, and error screenshots), enabling it to reason from visual inputs without a separate vision step.

The context window comparison to Claude Code is worth stating plainly. Claude Code's 1M-token context provides roughly 4× the capacity of grok-build-0.1. For monorepos or large enterprise codebases exceeding several hundred thousand tokens, this gap becomes an architectural constraint. You'll hit chunking requirements with Grok Build in contexts where Claude Code wouldn't.
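Whether a given repository fits in 256K tokens can be estimated before a trial. The sketch below is a rough check only: the 4-characters-per-token ratio is a common rule of thumb and an assumption here (grok-build-0.1's tokenizer is not described in the sources), and the function names, extension list, and 25% reserve are illustrative choices.

```python
import os

CHARS_PER_TOKEN = 4        # assumption: rough heuristic, not grok-build-0.1's tokenizer
CONTEXT_WINDOW = 256_000   # context window quoted for grok-build-0.1

def estimate_tokens(root, extensions=(".py", ".rs", ".ts", ".go", ".md")):
    """Approximate the token count of source files under `root`."""
    total_chars = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control and dependency directories.
        dirnames[:] = [d for d in dirnames if d not in {".git", "node_modules", "target"}]
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                    total_chars += len(fh.read())
    return total_chars // CHARS_PER_TOKEN

def fits_in_context(tokens, window=CONTEXT_WINDOW, reserve=0.25):
    """Keep `reserve` of the window free for the prompt, plan, and model output."""
    return tokens <= window * (1 - reserve)
```

Calling fits_in_context(estimate_tokens(".")) from a project root gives a first-order answer; a result near the limit is a signal to measure with the real tokenizer once xAI documents one.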

The CLI ships with two primary interaction modes. The interactive terminal UI provides structured diff previews and an explicit approval gate between planning and execution. The headless -p flag enables non-interactive use in CI/CD pipelines and automation workflows. The approval semantics shift meaningfully in headless mode — addressed in the next section — but the dual-mode design is well-suited to the range of contexts where a coding agent gets used day-to-day.

| Parameter | grok-build-0.1 (Grok Build) | Claude Code (Opus 4.7) | OpenAI Codex CLI |
| --- | --- | --- | --- |
| Context window | 256K tokens | 1M tokens | ~200K tokens (o4-mini) |
| Image inputs | Yes (diagrams, mockups, screenshots) | Yes | Limited |
| CLI implementation | Rust | Node.js / TypeScript | Node.js |
| Max parallel sub-agents | 8 (Git worktree isolation) | 4 | 1 |
| SWE-Bench Verified | 70.8% (xAI self-reported) | 80.4–88% | Not publicly disclosed |
| Install command | `curl -fsSL https://x.ai/cli/install.sh \| bash` | `npm install -g @anthropic-ai/claude-code` | `npm install -g @openai/codex` |
| Headless / scripting | Yes (`grok -p`) | Yes | Yes |
| Workflow model | Structured: Plan → Search → Build | Turn-by-turn (conversational) | Task-based |

The Plan → Search → Build Pipeline

Grok Build executes through three explicit, sequential stages: Plan, Search, and Build. In the Plan stage, the agent generates a step-by-step implementation plan — not a high-level summary, but a concrete sequence of file changes, function additions, and dependency updates. No files are modified at this stage. In Search, the agent navigates the codebase to identify relevant files, symbols, and dependencies. Only after the developer explicitly approves the plan does the Build stage execute changes. This human approval gate between Plan and Build is mandatory in interactive mode and is the sharpest architectural distinction between Grok Build and Claude Code's more fluid, turn-by-turn model.

The practical upside of the structured pipeline is auditability before commit. Before a single byte changes on disk, you have a written plan to review, reject, or edit. For developers who have encountered agents that confidently refactor the wrong module, this is a meaningful workflow safety net. The trade-off is iteration speed: if the plan is wrong, you reject and re-prompt from the beginning. Claude Code, by contrast, allows mid-execution course corrections through continued conversation — which is faster when the initial direction is close but not quite right.
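The gate described above can be pictured as a small state machine. This is a conceptual model written for this article, not xAI's implementation: the Stage names and the next_stage function are hypothetical, and the only behavior taken from the description is that Build follows an explicit approval and a rejection sends you back to planning.

```python
from enum import Enum

class Stage(Enum):
    PLAN = "plan"
    SEARCH = "search"
    AWAITING_APPROVAL = "awaiting_approval"
    BUILD = "build"

def next_stage(stage, approved=None):
    """Advance the pipeline; Build is reachable only through an explicit approval."""
    if stage is Stage.PLAN:
        return Stage.SEARCH
    if stage is Stage.SEARCH:
        return Stage.AWAITING_APPROVAL
    if stage is Stage.AWAITING_APPROVAL:
        if approved is None:
            raise ValueError("a developer decision is required before Build")
        # A rejected plan restarts from planning, per the trade-off noted above.
        return Stage.BUILD if approved else Stage.PLAN
    raise ValueError("Build is the final stage")
```

The point of the model is that no transition skips AWAITING_APPROVAL, which is the property to confirm when evaluating the interactive mode.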

Headless mode (grok -p) introduces a nuance worth resolving before you wire Grok Build into CI/CD. In interactive mode, the approval gate is explicit and blocking: nothing executes without developer confirmation. In headless mode, the default behavior of this gate is not unambiguously documented; the gate may be bypassed automatically or may require explicit configuration to disable. This distinction is not a minor detail in automated pipelines. Test the boundary explicitly: run a dry-run that would trigger the Build stage in your CI environment before depending on the behavior in production. According to Build Fast With AI, the structured pipeline works best when requirements are stable before execution begins, and the headless mode is most reliably used for well-scoped tasks rather than exploratory ones.
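One way to test that boundary is to run the headless command against a throwaway copy of the repository and see which files change. The sketch below does not depend on any Grok Build behavior: it hashes the tree before and after an arbitrary command. The function names are illustrative, and the exact shape of the grok -p invocation (for example, passing the prompt as the next argument) is an assumption to confirm against the CLI's help output.

```python
import hashlib
import os
import subprocess

def snapshot(root):
    """Map each file path (relative to root) to a SHA-256 digest of its contents."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # ignore Git internals
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digests[os.path.relpath(path, root)] = hashlib.sha256(fh.read()).hexdigest()
    return digests

def files_touched_by(command, root):
    """Run `command` inside `root` and list files it added, removed, or modified."""
    before = snapshot(root)
    subprocess.run(command, cwd=root, check=False)
    after = snapshot(root)
    return sorted(p for p in before.keys() | after.keys() if before.get(p) != after.get(p))
```

If files_touched_by(["grok", "-p", "<your prompt>"], scratch_copy) returns a non-empty list without any approval step, the gate is not blocking in your configuration.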

Neither the structured pipeline nor the conversational model is categorically better. The choice maps to workflow style: if review-before-execute is how your team already operates — writing a spec before coding — the Plan → Search → Build model fits naturally. If your pattern is iterate-in-context, refining direction as the agent works, the mandatory gate adds friction per cycle. Both tools will handle a well-defined task; the difference shows up on ambiguous, multi-session work.

Eight Parallel Sub-Agents and Arena Mode

Grok Build's Build stage can spawn up to eight concurrent sub-agents, each isolated in its own Git worktree on a separate branch. xAI cites this as its primary differentiator against Claude Code, which supports a maximum of four parallel agents. In practice, eight-way parallelism means the system can explore eight divergent solution paths simultaneously on the same task, each on its own branch, with no risk of conflicts between agents while they work. The developer reviews the set of resulting branches, selects the winner, and merges or discards the rest.

Git worktree isolation is the technically sound mechanism for this. Each agent operates in a separate filesystem path pointing to the same underlying repository object store, so branches diverge safely without filesystem collisions. The merge decision stays with the developer — the parallelism is additive work, not autonomous decision-making. This matters for teams with code review requirements: you are still reviewing and approving a diff before it lands.
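The isolation mechanism is plain Git and can be reproduced by hand. The sketch below only builds the git worktree add commands rather than running them; the branch naming scheme, directory layout, and function name are hypothetical and are not taken from Grok Build.

```python
import os

def worktree_commands(repo, worktree_root, task, agents=8, base="HEAD"):
    """Build one `git worktree add` command per agent, each on its own new branch."""
    if not 1 <= agents <= 8:
        raise ValueError("the cited maximum is eight sub-agents")
    commands = []
    for i in range(1, agents + 1):
        branch = f"agent/{task}-{i}"                        # hypothetical naming scheme
        path = os.path.join(worktree_root, f"{task}-{i}")   # separate checkout directory
        commands.append(["git", "-C", repo, "worktree", "add", "-b", branch, path, base])
    return commands
```

Each command creates a separate working directory backed by the same object store, which is why the branches can diverge without filesystem collisions. Running git worktree list in the main repository shows the resulting checkouts.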

Arena Mode extends the parallel agent pattern with automatic ranking. When enabled, Grok Build scores the competing sub-agent solutions and presents them as an ordered leaderboard rather than requiring the developer to manually diff each branch. The exact scoring methodology is not documented in public materials as of the launch — whether it runs the project's test suite, applies static analysis, uses an LLM evaluator, or combines approaches is unspecified. For tasks with clear correctness criteria (a failing test that must pass), automated scoring is tractable. For exploratory feature work with ambiguous success criteria, the scoring function's assumptions become critical.

Arena Mode's reliability on complex multi-file tasks has not been validated by external reviewers as of May 2026. The feature is most defensibly useful for two scenarios: exploratory feature implementation where multiple valid architectures exist, and large-scale refactors where the best approach is genuinely unclear upfront. For a targeted bug fix with an obvious solution path, spawning eight agents and waiting for scoring adds latency without proportionate benefit. Treat Arena Mode as experimental until production case studies appear.

Benchmarks: What the 70.8% SWE-Bench Verified Score Means

xAI internally reports a 70.8% score on SWE-Bench Verified for grok-build-0.1. SWE-Bench Verified is the current standard benchmark for agentic coding systems: agents are given real GitHub issues from open-source Python repositories and must produce patches that pass the repository's test suite, without being told which files to edit. A 70.8% pass rate would place grok-build-0.1 well above early 2025 agentic baselines. However, secondary sources cite Claude Code (Opus 4.7) at 80.4–88% on the same benchmark, putting Grok Build roughly 10–17 points behind depending on the source, even on xAI's own numbers.

The critical issue is provenance. An independent technical review published on Medium was direct about this limitation: the benchmark claims "rest on xAI's own evaluation suite" and "genuine product wins are narrower than launch coverage suggests". No third-party lab has published an independent replication of the 70.8% figure as of May 28, 2026. This matters because SWE-Bench Verified results vary meaningfully based on evaluation harness choices: scaffolding design, retry budget, tool access, and whether hints are provided all affect the pass rate. Two evaluations of the same model with different harness configurations can produce legitimately different numbers.

"Benchmark claims rest on xAI's own evaluation suite, and genuine product wins are narrower than launch coverage suggests." — Independent technical review, Medium, May 2026

The practical implication of a 10-percentage-point benchmark gap, taking xAI's numbers at face value, is worth quantifying. On SWE-Bench Verified, a 10-point difference means Claude Code resolves roughly 10 more issues out of every 100 than grok-build-0.1 does. For individual developers running tens of tasks per week on focused codebases, this difference may not surface noticeably. For teams automating hundreds of tasks per week, the aggregate failure rate compounds in ways that show up in review queues and revert rates. The benchmark is not the product, but it is an informative prior for where edge cases are likely to fail.
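The scaling argument is simple arithmetic. The sketch below uses the cited pass rates and assumes, strongly, that benchmark pass rates transfer directly to your own task mix; the function name and the weekly task counts are illustrative.

```python
def expected_extra_failures(tasks_per_week, higher_rate, lower_rate):
    """Expected additional unresolved tasks per week for the lower-scoring agent."""
    return round(tasks_per_week * (higher_rate - lower_rate), 1)

GROK_BUILD = 0.708                 # xAI self-reported
CLAUDE_CODE_RANGE = (0.804, 0.88)  # range cited in secondary sources

for weekly_tasks in (20, 300):
    low, high = (expected_extra_failures(weekly_tasks, rate, GROK_BUILD)
                 for rate in CLAUDE_CODE_RANGE)
    print(f"{weekly_tasks} tasks/week: {low} to {high} extra failures")
```

At 20 tasks per week the expected difference is about two to three tasks; at 300 it is roughly 29 to 52, which is the scale at which review queues start to notice.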

| Tool / Model | SWE-Bench Verified Score | Source Provenance | As Of |
| --- | --- | --- | --- |
| Claude Code (Opus 4.7) | 80.4–88% | Anthropic / secondary sources | 2026-05 |
| Grok Build (grok-build-0.1) | 70.8% | xAI self-reported; no independent replication as of 2026-05-28 | 2026-05 |
| OpenAI Codex CLI | Not publicly disclosed on SWE-Bench Verified | n/a | n/a |
| GitHub Copilot agent | Not publicly disclosed on SWE-Bench Verified | n/a | n/a |

Protocol Compatibility: MCP, ACP, and Cross-Tool Config Reading

Grok Build natively supports the Model Context Protocol (MCP), which means any MCP server already configured for Claude Code or Cursor can be reused without modification. Database connectors, internal API bridges, custom search tools: if it runs as an MCP server, Grok Build can consume it. This is the most immediately actionable integration point for developers with an existing MCP toolchain: there is no migration cost, no protocol translation, and no new configuration files to maintain. The MCP ecosystem that Anthropic has built out across Claude Code, Cursor, and third-party tool vendors becomes directly reusable.
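Before testing reuse, it helps to inventory which MCP servers a project already declares. The sketch below parses the mcpServers JSON layout used by project-level MCP config files such as Claude Code's .mcp.json. Whether Grok Build reads that same file, or expects servers to be registered some other way, is not stated in the sources and is an assumption to verify; the function name and the example server entries are hypothetical.

```python
import json

def list_mcp_servers(config_text):
    """Return (name, launch command or URL) pairs from an `mcpServers`-style config."""
    servers = json.loads(config_text).get("mcpServers", {})
    result = []
    for name, spec in sorted(servers.items()):
        if "url" in spec:
            # Remote servers are addressed by URL rather than a local command.
            target = spec["url"]
        else:
            target = " ".join([spec.get("command", "")] + spec.get("args", []))
        result.append((name, target))
    return result

example = """
{"mcpServers": {
  "db":   {"command": "npx", "args": ["-y", "example-db-server"]},
  "docs": {"url": "https://example.com/mcp"}
}}
"""
print(list_mcp_servers(example))
```

The resulting list is a checklist: each entry is a server to exercise from Grok Build before treating the reuse claim as confirmed for your toolchain.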

Beyond MCP, xAI implements the Agent Client Protocol (ACP), enabling Grok Build to be embedded as a node in larger automation pipelines rather than running as a standalone terminal agent. For teams building multi-agent systems — where a coding agent is one stage in a broader orchestration graph — this means Grok Build can receive tasks from an orchestrator and return structured results without requiring a human in the terminal loop. The ACP implementation also means Grok Build can compose with other ACP-compatible agents in the same pipeline.

The cross-tool config reading claim deserves its own assessment. According to xAI's product documentation, Grok Build automatically reads CLAUDE.md, .claude/rules/ directories, Claude Code plugins, and AGENTS.md (the OpenCode format) at startup, with zero additional configuration. If this works as stated, dropping Grok Build into an existing Claude Code project gives the agent immediate access to your project instructions, coding style guides, and project-specific rules. Verify this against your actual configuration before relying on it: "reads automatically" can mean different things depending on parsing fidelity, instruction file size, and tool-specific syntax that may not translate cleanly between agents.
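A first verification step is knowing exactly which instruction files a project contains and how large they are, since large files consume context budget. The sketch below checks only the paths named above; it says nothing about how Grok Build parses them, and the function name is illustrative.

```python
import os

# Paths named in xAI's claim; plugin locations are omitted because they are not specified.
INSTRUCTION_PATHS = ("CLAUDE.md", "AGENTS.md", os.path.join(".claude", "rules"))

def instruction_inventory(project_root):
    """Return {relative path: size in bytes} for instruction files that exist."""
    found = {}
    for rel in INSTRUCTION_PATHS:
        path = os.path.join(project_root, rel)
        if os.path.isfile(path):
            found[rel] = os.path.getsize(path)
        elif os.path.isdir(path):
            for dirpath, _dirnames, filenames in os.walk(path):
                for name in filenames:
                    full = os.path.join(dirpath, name)
                    found[os.path.relpath(full, project_root)] = os.path.getsize(full)
    return found
```

With the inventory in hand, a practical test is to place a distinctive rule in each file and ask the agent a question whose answer depends on it.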

Enterprise deployment options include OIDC authentication, API-key authentication, sandbox profiles, granular permission rules, and a zero-data-retention mode. Zero-data-retention is relevant for regulated industries where sending production code to a third-party API is subject to data governance requirements. Whether these controls satisfy specific compliance frameworks (SOC 2, HIPAA, FedRAMP) is not addressed in available documentation. This is a due-diligence gap for enterprise procurement teams before committing to a production rollout.

Pricing: Where $299/Month Sits Against the Field

The primary tier for Grok Build is SuperGrok Heavy at $299/month, which bundles Grok Build with Grok 4.3 and Grok Imagine (xAI's image generation model). A promotional rate of $99/month was available for the first six months at launch, reverting to the full rate thereafter. For comparison, Claude Pro at $20/month includes Claude Code access, putting the cost differential at approximately 15× at the Heavy tier once the promotional window closes.

Standard SuperGrok and X Premium+ subscribers received Grok Build access in the May 25, 2026 rollout, at approximately $30–$40/month. The rate limits and feature constraints for these lower tiers (whether Arena Mode is available, how many parallel agents are permitted, daily request caps) are not specified in public documentation. This gap makes cost modeling for team use difficult. If the lower tier has substantially throttled access, the effective price-to-capability comparison shifts accordingly.

API pricing for direct grok-build-0.1 access is unresolved. Two secondary sources report conflicting figures: one cites $1.00 per million input tokens and $2.00 per million output tokens; another reports $0.20 per million input tokens and $1.50 per million output tokens. xAI's official API documentation did not include pricing at the time of this writing. The difference between these figures is a 5× variance in input cost projection, which is material for any team modeling the economics of API-level integration. Do not commit to API-based workflows until official pricing is confirmed from xAI's documentation.
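To see how much the conflicting figures matter, the sketch below prices the same workload under both reported rate cards. The two price pairs come from the secondary sources quoted above; the workload (500M input tokens and 50M output tokens per month) and all names are hypothetical.

```python
# (input $/M tokens, output $/M tokens) as reported by two secondary sources
REPORTED_PRICES = {
    "source_a": (1.00, 2.00),
    "source_b": (0.20, 1.50),
}

def compare_reported_prices(input_mtok, output_mtok):
    """Monthly cost in dollars under each reported rate card (token counts in millions)."""
    return {
        source: round(input_mtok * in_price + output_mtok * out_price, 2)
        for source, (in_price, out_price) in REPORTED_PRICES.items()
    }

print(compare_reported_prices(500, 50))
```

For this input-heavy workload the two rate cards give $600 versus $175 per month, so the overall spread is closer to 3.4× than 5×; the exact ratio depends on your input-to-output mix, which is another reason to wait for official pricing.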

| Plan | Monthly Cost | Grok Build Access | Additional Includes | Notes |
| --- | --- | --- | --- | --- |
| SuperGrok Heavy | $299/mo ($99/mo promo) | Full access | Grok 4.3, Grok Imagine | Promo rate for first 6 months; reverts to $299/mo |
| SuperGrok / X Premium+ | ~$30–$40/mo | Included; rate limits unspecified | Standard Grok access | Capability constraints not documented |
| Claude Pro (with Claude Code) | $20/mo | Claude Code included | Opus 4.7, Sonnet models | ~15× cheaper than Heavy tier at list price |
| GitHub Copilot Enterprise | $39/user/mo | Copilot agent included | IDE integration, org policy management | Per-seat pricing model |

The Broader xAI Developer Platform: Skills and Connectors

Grok Build did not ship as a standalone product. Two complementary releases launched alongside it that indicate the broader platform direction. Grok Skills, introduced on May 18, 2026, provides persistent, reusable capability bundles packaged in .zip, .skill, or .md formats. Built-in generators produce Word documents, PowerPoint presentations, Excel spreadsheets, and PDF files, extending the agent's output surface beyond code into productivity deliverables. For developers building internal tools or document automation workflows, this matters in ways a pure coding agent does not address.

Connectors shipped in two waves. The first wave, on May 6, 2026, integrated GitHub, Notion, Linear, Google Workspace, Microsoft 365, and a bring-your-own-MCP option. The second wave, on May 22, 2026, added Vercel, Canva, Gamma, and S&P Global, a notably varied set spanning deployment (Vercel), design tools (Canva, Gamma), and financial data (S&P Global). The S&P Global connector in particular signals a target audience beyond pure software development, toward finance and data-intensive workflows.

The platform positioning here is comparable to two established ecosystem efforts. Anthropic's MCP ecosystem positions Claude as a universal tool-using agent through protocol standardization. OpenAI's Responses API tool layer enables structured tool use within API calls. xAI is building a similar MCP-compatible integration layer across productivity, data, and deployment surfaces, with the additional element that the consumer X platform and Grok Imagine sit in the same subscription tier, potentially lowering activation friction for non-developer users of the same tools. According to Codersera, this positions xAI as building an end-to-end agentic stack rather than a narrow coding tool.

Whether the breadth of connectors translates into a cohesive developer platform depends on integration quality, not connector count. Adding S&P Global and Canva to the list is a signal; developers will make adoption decisions based on whether the GitHub, Vercel, and Linear integrations perform reliably under real workloads — the integrations that sit in the daily engineering path. No independent quality assessments of Connector reliability have been published as of this writing.

Open Questions and What to Watch

Four uncertainties remain after the Grok Build launch, and each one bears directly on an adoption decision. These are not documentation gaps that a quick trial will resolve; they are points where the available information is insufficient to make confident engineering or procurement decisions at this stage.

SWE-Bench replication. Until an independent lab evaluates grok-build-0.1 with a published methodology, xAI's 70.8% figure cannot be treated as a confirmed data point. The performance gap versus Claude Code could be exactly as stated, narrower, or wider; the data to arbitrate does not exist. The independent review flagged the issue directly: benchmark claims "rest on xAI's own evaluation suite". Watch for evaluations from groups that publish both scores and methodology. Aider Leaderboard contributors and independent researchers who open-source their harness configurations are the most credible sources.

API pricing. A 5× variance between the two secondary source figures means cost projections for API-level integration are unreliable. Check the official xAI developer release notes for a published pricing page before building any cost model for production API use. This is a blocker for financial planning, not a minor detail.

Arena Mode scoring methodology. Eight competing sub-agents sounds useful; the missing piece is documentation of how Grok Build decides which solution wins. Does it run the project's test suite? Apply a static analysis pass? Use an LLM-as-judge approach? For tasks with clear correctness criteria, automated scoring is tractable. For complex feature work with ambiguous success criteria, the scoring function's assumptions drive the outcome. No production case studies or failure mode documentation have been published as of May 28, 2026.

Cross-tool config compatibility in practice. The claim that Grok Build reads CLAUDE.md, .claude/rules/, and AGENTS.md automatically needs field-testing before you rely on it. Likely edge cases include nested config hierarchies, conflicting instruction sets, large instruction files that consume meaningful context budget, and Claude Code plugin syntax that may not translate semantically to a different execution environment. Test against your actual project configuration before treating cross-tool compatibility as a production assumption. According to Build Fast With AI, developers have reported the cross-config reading works on standard setups but edge cases remain undocumented.

Frequently Asked Questions

How does Grok Build compare to Claude Code in practice?

Grok Build supports up to 8 parallel sub-agents in isolated Git worktrees, versus Claude Code's 4. It uses a structured Plan → Search → Build pipeline with a mandatory human approval gate before any file is modified. Arena Mode provides automatic ranking of competing agent solutions. Claude Code leads significantly on SWE-Bench Verified: 80.4–88% cited in secondary sources versus Grok Build's self-reported 70.8%, a gap of roughly 10–17 points that rests on a figure no third party has replicated. The pricing gap is substantial: $299/month for SuperGrok Heavy versus $20/month for Claude Pro with Claude Code access included, approximately 15× at list price. Both tools support MCP servers natively, and Grok Build claims to read Claude Code's existing project config files automatically. Context window capacity is also a factor: grok-build-0.1 carries 256K tokens versus Claude Code's 1M tokens, a roughly 4× difference relevant for large codebase work.

Can I reuse my existing MCP server configuration with Grok Build?

Yes. Grok Build natively supports MCP servers, and any MCP server already configured for Claude Code or Cursor should work without reconfiguration. Beyond MCP compatibility, xAI states that Grok Build automatically reads CLAUDE.md and .claude/rules/ directories at startup, meaning existing Claude Code project instructions should be available to the agent without manual porting. This cross-tool compatibility claim is xAI-stated and should be verified against your actual project configuration before relying on it in production — particularly for setups with complex instruction hierarchies, large instruction files, or tool-specific plugin syntax that may not transfer cleanly between agent environments.

What is Arena Mode in Grok Build?

Arena Mode is an optional feature in Grok Build's Build stage. When activated, the agent spawns multiple sub-agents — up to 8 — that work concurrently on the same task, each in its own isolated Git worktree on a separate branch. Grok Build automatically scores and ranks the competing solutions, presenting results as an ordered leaderboard for the developer to review and select from. This removes the need to manually diff multiple branches. Arena Mode is most valuable for exploratory or high-ambiguity tasks where multiple valid implementations exist — for example, "implement a caching layer" with no prescribed architecture. For targeted bug fixes with a clear correct solution, the overhead of spawning and ranking multiple agents typically adds latency without proportionate benefit. As of May 2026, the scoring methodology is not publicly documented and no production case studies have been published.

Is the 70.8% SWE-Bench Verified score independently verified?

No. As of May 28, 2026, xAI's 70.8% SWE-Bench Verified figure for grok-build-0.1 has not been replicated by a third-party evaluator. An independent review published on Medium flagged that the benchmark claims rely on xAI's own evaluation suite, and that genuine product wins appear narrower than the launch coverage implied. Secondary sources cite Claude Code (Opus 4.7) at 80.4–88% on SWE-Bench Verified, roughly 10 to 17 percentage points higher. SWE-Bench Verified results can vary based on evaluation harness configuration choices, so the methodology matters alongside the score. Until a neutral third party runs the benchmark with a published harness and methodology, both the absolute accuracy of xAI's figure and the precise performance gap versus competing tools remain estimated rather than confirmed.

What does the $299/month SuperGrok Heavy plan include beyond Grok Build?

SuperGrok Heavy at $299/month bundles three xAI products: Grok Build (the agentic coding CLI), Grok 4.3 (xAI's frontier language model), and Grok Imagine (image generation). A promotional rate of $99/month was available for the first six months at launch, reverting to $299/month thereafter. Standard SuperGrok and X Premium+ subscribers at approximately $30–$40/month also received Grok Build access in the May 25, 2026 rollout, though the rate limits, parallel agent caps, and feature availability at these lower tiers were not specified in public documentation at launch. API-level access to grok-build-0.1 is reported at conflicting price points by secondary sources; official pricing was absent from xAI documentation at time of writing.

Where Grok Build Fits — and What Still Needs Answering

Grok Build is a technically coherent first release from xAI. The structured Plan → Search → Build pipeline with a mandatory approval gate, Git-worktree-isolated parallel agents, native MCP compatibility, and ACP support are real architectural choices — not surface-level feature matching. The cross-tool config reading and 8-agent parallelism are differentiators that Claude Code does not currently offer. For developers who prefer review-before-execute workflows or who need to explore multiple implementation approaches simultaneously, these are worth evaluating against a real project.

The gaps are equally real and should not be discounted. A SWE-Bench deficit of roughly 10–17 points on xAI's own numbers, before any independent verification, is a meaningful prior for task failure rates at scale. The 256K-token context window is a concrete constraint for large codebase work, sitting at roughly a quarter of Claude Code's capacity. API pricing remains unresolved between conflicting secondary sources. The $299/month Heavy tier is a significant commitment relative to Claude Pro at $20/month, particularly for a tool whose benchmark claims have not been independently replicated. Arena Mode's scoring methodology is opaque enough that relying on it for high-stakes decisions would require treating it as a black box.

The practical path for most developers is a scoped trial on a real project — ideally one where Claude Code results already exist for comparison — to evaluate whether the structured pipeline and parallel agent behavior fit your actual workflow. Run a task that has a clear correctness criterion, compare the Plan output to what you'd write manually, and examine how the worktree isolation holds up on a multi-file change. The benchmark questions will be answered by independent researchers over the coming months. The pricing will likely stabilize as xAI moves from beta toward general availability. What you can learn now, at low cost, is whether the architectural decisions behind Grok Build suit how you actually work.

Last updated: 2026-05-28. Research conducted using publicly available sources as of May 28, 2026. xAI API pricing documentation was unavailable at time of writing — verify against official xAI documentation before making API-level integration decisions. SWE-Bench figures should be treated as provisional until independent third-party replication is published.