LLM #cohere #command-a-plus #open-weight-llm #enterprise-ai

Command A+ 2026: Benchmark Results, Citation Tags, and Enterprise Fit

Cohere's first open-weight frontier model: benchmark gaps, native citation design, and the enterprise sovereignty case.

Creeta

May 29, 2026

Command A+: Cohere's Strategic Open-Weight Pivot

Command A+ is Cohere's first fully open-weight frontier model and the first model the company has ever released under a full Apache 2.0 license. Released on May 20, 2026 and hosted on Hugging Face under the CohereLabs/command-a-plus-05-2026 namespace, it marks a clear strategic break: Cohere has historically kept all frontier weights proprietary and monetized access through API subscriptions and enterprise contracts. That business model is now changing.

Quick Answer: Command A+ is Cohere's first Apache 2.0-licensed open-weight model — a 218B sparse MoE architecture with only 25B active parameters per token. Released May 20, 2026 on Hugging Face, it runs on as few as two NVIDIA H100 80GB GPUs at W4A4 quantization, targeting regulated-enterprise and sovereign deployment rather than general consumer inference.

The competitive framing is direct. By releasing under Apache 2.0, Cohere enters a market that Mistral, Meta (Llama), and DeepSeek have been building for two-plus years. The difference in positioning: Command A+ is not aimed at developers who want a capable model for local experimentation. It is aimed at enterprise IT and government buyers who require on-premises or air-gapped deployment — sectors where cloud-API alternatives from OpenAI and Anthropic create regulatory friction that no amount of security certification resolves. Efficiency and sovereignty are the value proposition, not raw composite benchmark rank.

Three quantization tiers are published on Hugging Face: -bf16 for full precision, -fp8 as a mid-tier option, and -w4a4, which Cohere recommends for production deployments. Inference is supported through Hugging Face Transformers and vLLM using the cohere_melody parser. Managed inference is available through Cohere's Model Vault service for teams that prefer not to operate their own stack.

Nick Frosst, co-founder of Cohere, frames the release explicitly around sovereign infrastructure rather than consumer AI:

"Command A+ is part of our effort to make that possible. We're delivering a sovereign, open-source model built for critical infrastructure that gives people, enterprises, and governments the trust, performance, and efficiency they need to run real-world systems at scale." — Nick Frosst, Co-founder at Cohere

Notable Cohere investors include NVIDIA, AMD Ventures, Salesforce Ventures, Oracle, and Cisco — an investor roster that reflects enterprise infrastructure alignment rather than consumer product ambitions. This context matters when evaluating the model's design choices: every architectural decision in Command A+ traces back to the sovereign-deployment use case.

Sparse MoE Internals: 218B Total, 25B Active per Token

Command A+ is a sparse Mixture-of-Experts model with 218 billion total parameters, of which only 25 billion activate per inference token. In a dense model, every parameter participates in every forward pass. In a sparse MoE, a learned routing layer selects a small subset of "expert" sublayers per token, leaving the rest dormant. The practical consequence is that per-token compute cost scales with the 25B active count rather than the 218B headline figure, which is a substantial reduction in inference FLOPs relative to what the total parameter count implies.

This architecture is not novel (Mixtral and DeepSeek V3 use the same pattern), but Command A+ applies it at a scale where the active-to-total ratio is approximately 1:9, which is more aggressive than most open-weight contemporaries. Developers evaluating the model should be careful with raw parameter comparisons against dense models. A 70B dense model and a 218B MoE with 25B active parameters are not equivalent in compute cost: per token, the MoE does roughly the work of a 25B dense model, even though all 218B parameters must still be held in memory.
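A rough sketch of that arithmetic, using the parameter counts quoted above. The "about 2 FLOPs per active parameter per token" rule of thumb is an assumption for illustration (it ignores attention and routing overhead) and is not a figure from Cohere.

```python
# Back-of-envelope MoE arithmetic using the parameter counts quoted in the text.
TOTAL_PARAMS = 218e9    # all experts; every one must be resident in GPU memory
ACTIVE_PARAMS = 25e9    # parameters actually exercised per token
DENSE_70B = 70e9        # a dense model uses every parameter for every token


def flops_per_token(active_params: float) -> float:
    """ASSUMPTION: roughly 2 FLOPs per active parameter per generated token."""
    return 2.0 * active_params


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
moe_vs_dense = flops_per_token(ACTIVE_PARAMS) / flops_per_token(DENSE_70B)

print(f"active fraction: {active_fraction:.1%} "
      f"(about 1:{TOTAL_PARAMS / ACTIVE_PARAMS:.1f})")
print(f"MoE per-token FLOPs relative to a 70B dense model: {moe_vs_dense:.2f}x")
```

Under these assumptions the MoE needs roughly a third of the per-token compute of a 70B dense model, while its memory footprint is still governed by the 218B total.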

The -w4a4 variant uses quantization-aware distillation rather than post-training quantization. In standard post-training quantization, full-precision weights are compressed after training completes, and quality degradation is accepted as an inherent cost. In quantization-aware distillation, the model is trained (or distilled) with the target quantization in the loop, allowing it to learn representations that are robust to the precision loss. Cohere describes the result as 'near-lossless', though that claim had not been independently verified at launch. Teams with strict quality thresholds should benchmark W4A4 against FP8 on their specific task distribution before committing to the more aggressive quantization tier in production.
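To make the precision loss concrete, here is a minimal sketch of plain post-training round-to-nearest quantization onto a signed 4-bit integer grid. It illustrates only the baseline that quantization-aware distillation tries to improve on. It is not Cohere's method, and the numeric format actually used by the -w4a4 checkpoint is not specified in the material above; the function names and sample weights are hypothetical.

```python
def quantize_4bit(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric round-to-nearest onto the signed 4-bit grid [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0   # one scale per tensor (simplification)
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Map 4-bit codes back to approximate real values."""
    return [c * scale for c in codes]


weights = [0.70, -0.33, 0.12, 0.04, -0.01]        # made-up example values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("max round-trip error:", round(max_error, 4), "bound:", round(scale / 2, 4))
```

Small weights collapse to zero and the worst-case error is half a quantization step. Training with the quantizer in the loop lets the model adapt its weights to that grid instead of absorbing the error after the fact.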

The -fp8 tier sits between BF16 and W4A4 on the precision-efficiency curve. For teams that need to validate quantization impact before moving to W4A4, particularly in regulated workloads where output consistency is auditable, FP8 provides a transition point that uses significantly fewer GPUs than full BF16 without the uncertainty of aggressive 4-bit compression of weights and activations.

Cohere has also tuned speculative decoding for the MoE topology, yielding a further 1.5–1.6× throughput multiplier on top of W4A4 baseline figures. Speculative decoding works by running a smaller draft model to predict upcoming tokens, which the larger model then verifies in a single batched pass. The MoE topology creates specific opportunities here: expert routing for near-future tokens can be predicted with higher confidence than the full next-token distribution, making the speculative acceptance rate higher. The net effect compounds: W4A4 reduces compute per token, and speculative decoding reduces the number of sequential large-model passes needed per generated token.
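The draft-then-verify loop can be sketched with toy models. This is a generic greedy speculative decoding illustration, not Cohere's MoE-specific tuning: `target_model` and `draft_model` are hypothetical stand-ins, and the verification that a real system performs in one batched forward pass is simulated here position by position.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns tokens and the number of target passes."""
    tokens = list(prompt)
    limit = len(prompt) + max_new
    target_passes = 0
    while len(tokens) < limit:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position (one batched pass in practice).
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)          # always emit the target's own token
            if proposal[i] != expected:
                break                          # first mismatch ends this round
        else:
            accepted.append(target(tokens + proposal))  # bonus token if all matched
        tokens.extend(accepted)
    return tokens[:limit], target_passes


def target_model(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10                  # toy "large" model: counts upward


def draft_model(ctx: List[int]) -> int:
    # Toy "small" model: agrees with the target except at every third position.
    return 0 if len(ctx) % 3 == 0 else (ctx[-1] + 1) % 10


out, passes = speculative_decode(target_model, draft_model, [0], max_new=12)
print(out, "target passes:", passes)
```

The output is identical to plain greedy decoding with the target model, but it takes fewer target passes than generated tokens. How much fewer depends on how often the draft agrees with the target, which is the acceptance rate the paragraph above refers to.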

Native In-Generation Citations: How the `<co>` Tag System Works

Command A+'s citation system has the model emit <co> and </co> tags wrapping factual claims directly during the forward pass. Each tagged span carries a reference to a source document index provided in the input context. Attribution is a trained behavior baked into generation, not a post-processing layer appended after the model finishes producing text.

The distinction from standard RAG pipelines matters operationally. The typical approach runs a retrieval step to fetch relevant documents, generates a response, and then runs a secondary pass (a highlight-extraction or retrieval-scoring model) to map generated claims back to source passages. That secondary pass introduces latency, potential misalignment between what the generation says and what the attribution layer links to, and an additional failure mode to monitor in production. Command A+ collapses those steps: citation mapping is produced once, during generation, within a single inference call.
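On the application side, consuming inline citations might look like the sketch below. The exact serialization of the document index is not given above, so the `<co>claim</co: 0,2>` form used here (indices carried on the closing tag) is an assumption for illustration only; check the model card or tokenizer for the real format before relying on it. `Citation` and `extract_citations` are hypothetical names.

```python
import re
from dataclasses import dataclass

# ASSUMED format: <co>claim text</co: 0,2>, with document indices on the closing tag.
CITATION = re.compile(r"<co>(.*?)</co:\s*([\d,\s]+)>", re.DOTALL)


@dataclass
class Citation:
    text: str            # the cited span
    start: int           # character offsets into the cleaned text
    end: int
    doc_ids: list[int]   # indices of the source documents supplied in the prompt


def extract_citations(raw: str) -> tuple[str, list[Citation]]:
    """Strip citation tags and return (clean_text, citations with offsets)."""
    parts: list[str] = []
    citations: list[Citation] = []
    cursor = 0       # position in the raw string
    clean_len = 0    # length of the cleaned text built so far
    for match in CITATION.finditer(raw):
        before = raw[cursor:match.start()]
        parts.append(before)
        clean_len += len(before)
        span = match.group(1)
        doc_ids = [int(x) for x in match.group(2).split(",") if x.strip()]
        citations.append(Citation(span, clean_len, clean_len + len(span), doc_ids))
        parts.append(span)
        clean_len += len(span)
        cursor = match.end()
    parts.append(raw[cursor:])
    return "".join(parts), citations


raw = "Revenue rose <co>12% in Q1</co: 0> while <co>costs fell</co: 1,2>."
clean, cites = extract_citations(raw)
print(clean)
for c in cites:
    print(repr(clean[c.start:c.end]), "->", c.doc_ids)
```

Keeping character offsets alongside document indices lets a UI highlight each claim and lets an audit job check that every cited index refers to a document that was actually in the prompt.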

For regulated-industry RAG deployments — healthcare documentation, legal research, financial compliance — this distinction carries concrete compliance weight. Unsourced AI-generated claims in an output that a human professional relies on create liability exposure. A model that tags every factual claim with a source document index gives compliance teams an auditable evidence trail from the first inference call, without requiring a second model or a custom attribution pipeline built on top. Technical analysis of Command A+'s architecture notes the inline citation system is designed specifically for regulated enterprise pipelines — not as a general convenience feature.

The citation behavior is part of a broader structured-output system. Command A+ also emits tags for reasoning (<|START_THINKING|>), tool calls (<|START_ACTION|>), and tool result ingestion (<|START_TOOL_RESULT|>). This tag-based interface enables deterministic multi-step agentic workflows: downstream application code can parse structured outputs reliably without prompt-engineering hacks or brittle regex over free-form text. Attribution and tool use are both testable at the integration boundary.
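A minimal sketch of splitting a raw generation into its tagged sections. Only the START tags are named above; the matching `<|END_...|>` delimiters and the payload shown inside the action block are assumptions made for illustration, and `split_sections` is a hypothetical helper. Because the delimiters are fixed special tokens, a small pattern like this is matching a known grammar rather than guessing at free-form text.

```python
import re

# ASSUMPTION: each START tag is closed by a matching <|END_...|> tag.
SECTION = re.compile(
    r"<\|START_(THINKING|ACTION|TOOL_RESULT)\|>(.*?)<\|END_\1\|>",
    re.DOTALL,
)


def split_sections(raw: str) -> dict[str, list[str]]:
    """Group tagged blocks by kind; whatever is left over is the user-facing text."""
    sections: dict[str, list[str]] = {"THINKING": [], "ACTION": [], "TOOL_RESULT": []}
    for match in SECTION.finditer(raw):
        sections[match.group(1)].append(match.group(2).strip())
    sections["TEXT"] = [SECTION.sub("", raw).strip()]
    return sections


sample = (
    "<|START_THINKING|>Need the filing date.<|END_THINKING|>"
    '<|START_ACTION|>lookup("10-K")<|END_ACTION|>'   # hypothetical payload
    "Done."
)
parsed = split_sections(sample)
print(parsed)
```

In an agent loop, the ACTION entries would be dispatched to tools and the THINKING entries logged or discarded, so each category can be tested separately at the integration boundary.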

VentureBeat's coverage of the release frames native citations alongside W4A4 quantization as the two features that most differentiate Command A+ from other open-weight contemporaries: a model that is simultaneously more deployable in constrained hardware environments and more suitable for regulated use cases where attribution is a requirement rather than a convenience. For teams building RAG pipelines in those verticals, the citation system directly reduces the engineering surface area of a compliant deployment.

Benchmark Scorecard: Task Gains and Composite Score Gaps

Command A+ shows substantial gains over its predecessor, Command A Reasoning, on task-specific benchmarks — particularly agentic and reasoning benchmarks that reflect real enterprise workloads. On the Artificial Analysis Intelligence Index composite, it scores below all major closed frontier models. Both data points are accurate, and misreading either one leads to incorrect deployment decisions. The correct frame is: Command A+ is not a general-purpose frontier competitor — it is an efficient, sovereign-deployable model with strong performance on specific high-value tasks.

Command A+ vs. Command A Reasoning — Task Benchmark Comparison
Benchmark	Command A Reasoning	Command A+	Change	Notes
τ²-Bench Telecom (agentic)	37%	85%	+48 pp	Multi-step agentic task completion
Terminal-Bench Hard (coding)	3%	25%	+22 pp	Hard-tier agentic coding
AIME 25 (math reasoning)	57%	90%	+33 pp	Competition math
MMMU (multimodal)	—	75.1%	New capability	First multimodal Command model
MMMU Pro	—	63.0%	New capability
MathVista	—	80.6%	New capability	Visual math reasoning
GPQA Diamond	—	76.0%	New capability	Graduate-level science questions

Sources: Cohere Blog and mer.vin analysis. All benchmark figures are Cohere-reported; no independent third-party audit was published at launch.

The τ²-Bench Telecom gain from 37% to 85% is the most striking figure: the score more than doubles on a benchmark designed for multi-step agentic task completion in a domain-specific context. Terminal-Bench Hard's jump from 3% to 25% is a large gain on an explicitly hard-tier agentic coding benchmark, though a 25% absolute score still leaves significant headroom against closed frontier models on complex software engineering tasks.
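Percentage-point change and relative change tell different stories for these rows, so it can help to compute both. The sketch below uses only the three before/after pairs from the table; the helper name `compare` is hypothetical.

```python
# Before/after scores (percent) taken from the comparison table above.
SCORES = {
    "τ²-Bench Telecom": (37.0, 85.0),
    "Terminal-Bench Hard": (3.0, 25.0),
    "AIME 25": (57.0, 90.0),
}


def compare(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point change, ratio of new score to old score)."""
    return after - before, after / before


for name, (before, after) in SCORES.items():
    delta_pp, ratio = compare(before, after)
    print(f"{name}: +{delta_pp:.0f} pp, {ratio:.2f}x the predecessor's score")
```

A small starting score inflates the ratio: Terminal-Bench Hard shows the largest multiple but the smallest point gain, which is why the absolute 25% matters more than the multiple when judging headroom.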

On the composite Artificial Analysis Intelligence Index, Command A+ scores 37. The competitive context across open-weight and closed models at launch:

Artificial Analysis Intelligence Index — Composite Scores, May 2026
Model	Intelligence Index	Access Type	License
GPT-5.5	60	Closed API	Proprietary
Claude Opus 4.7	57	Closed API	Proprietary
Gemini 3.1 Pro	57	Closed API	Proprietary
Mistral Medium 3.5	39	Open-weight / API	Mistral Research
Command A+	37	Open-weight / API	Apache 2.0

Source: ChatForest independent review

The composite score gap is real and should not be dismissed as a benchmark artifact. A score of 37 against GPT-5.5's 60 and Claude Opus 4.7's 57 reflects the trade-off inherent in the architecture: 25B active parameters at W4A4 quantization on two GPUs cannot match the reasoning depth of closed frontier models running at full precision with significantly higher active parameter counts. Teams whose primary requirement is maximum general reasoning quality should use a frontier closed-API model. Teams whose primary requirement is data residency, on-premises control, and strong performance on specific agentic tasks — with the freedom to self-host and modify the model freely — have a technically coherent option here.

Hardware Requirements and Inference Throughput

Command A+ is a data-center model with no consumer GPU deployment path. At W4A4, the recommended production tier, the minimum hardware requirement is two NVIDIA H100 80GB GPUs or a single NVIDIA B200. Full BF16 precision requires eight H100s or four B200s. This is not a model you run on a workstation, a single A100, or a standard cloud GPU instance.

Command A+ Hardware Requirements and Self-Reported Throughput
Quantization Tier	Minimum GPU Configuration	Output Tokens/s (low concurrency)	TTFT (ms)
W4A4 (recommended)	2× H100 80GB or 1× B200	~375	113
FP8	Middle tier between W4A4 and BF16	Not published at launch	Not published at launch
BF16	8× H100 80GB or 4× B200	Not published at launch	Not published at launch

Throughput figures are Cohere's own reported numbers under their specific lab conditions at low concurrency. The approximately 375 output tokens per second and 113 ms time-to-first-token at W4A4 represent approximately 63% higher throughput than Command A Reasoning at matched hardware and concurrency settings. These numbers will not hold at high concurrency, on older GPU generations, or with significantly different prompt-to-generation ratios. Treat them as a directional baseline for initial sizing; validate against your own concurrency profiles before capacity planning.
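For initial sizing, the two reported numbers can be combined into a first-order latency estimate. The model below (time-to-first-token plus output tokens divided by steady-state throughput) is a simplification assumed for illustration; it ignores queueing, prompt length, and concurrency, and the implied predecessor throughput is derived from the 63% figure rather than reported directly.

```python
TTFT_S = 0.113          # reported time-to-first-token at W4A4, low concurrency
TOKENS_PER_S = 375.0    # reported output throughput at W4A4, low concurrency
UPLIFT = 0.63           # reported throughput gain over Command A Reasoning


def estimate_latency_s(output_tokens: int,
                       ttft_s: float = TTFT_S,
                       tokens_per_s: float = TOKENS_PER_S) -> float:
    """First-order estimate: wait for the first token, then stream at a fixed rate."""
    return ttft_s + output_tokens / tokens_per_s


# Derived, not reported: the predecessor throughput implied by a 63% uplift.
implied_predecessor_tps = TOKENS_PER_S / (1 + UPLIFT)

for n in (100, 750, 4000):
    print(f"{n:>5} output tokens -> about {estimate_latency_s(n):.2f} s")
print(f"implied predecessor throughput: about {implied_predecessor_tps:.0f} tokens/s")
```

Swapping in throughput measured at your own concurrency level gives a more realistic figure than the low-concurrency lab number.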

The two-H100 W4A4 entry point is meaningful in enterprise terms. A two-GPU on-premises node is a routine procurement for large enterprise IT departments, not a special hyperscaler arrangement. For comparison: the BF16 tier at 8×H100 requires a dedicated NVLink interconnect configuration (DGX H100 or equivalent), which is a substantially higher capital and operational commitment. W4A4 compression is what makes the 2-GPU floor possible: every expert must be resident in memory, and all 218B parameters at 4-bit weight precision come to roughly 109 GB, which fits within the 160 GB of two 80GB HBM3 devices. MoE routing then keeps per-token compute at the 25B active-parameter level.
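The GPU floor can be sanity-checked with weights-only arithmetic. The sketch assumes that every one of the 218B parameters must sit in GPU memory (routing decides which experts compute, not which are loaded) and that 1 GB is 10^9 bytes; it deliberately ignores KV cache, activations, and runtime overhead, so real deployments need headroom beyond these numbers.

```python
import math

TOTAL_PARAMS = 218e9                       # all experts must be resident
BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "w4a4": 4}
H100_MEMORY_GB = 80                        # per-GPU memory, from the text


def weight_footprint_gb(bits: int) -> float:
    """Weights-only size in GB (1 GB = 1e9 bytes); no KV cache or activations."""
    return TOTAL_PARAMS * bits / 8 / 1e9


def min_h100s_for_weights(bits: int) -> int:
    """Smallest number of 80 GB GPUs whose combined memory holds the weights."""
    return math.ceil(weight_footprint_gb(bits) / H100_MEMORY_GB)


for tier, bits in BITS_PER_WEIGHT.items():
    print(f"{tier}: about {weight_footprint_gb(bits):.0f} GB of weights, "
          f"at least {min_h100s_for_weights(bits)} x H100 80GB")
```

The W4A4 result matches the published two-GPU minimum. The BF16 weights-only floor comes out below the published eight-GPU configuration; the gap is plausibly headroom for KV cache and a power-of-two parallel layout, though that reading is an inference and not something stated in the release material.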

The additional 1.5–1.6× throughput from speculative decoding compounds with W4A4 to produce Cohere's headline throughput numbers. Both optimizations must be active simultaneously to realize the combined gain; teams using vLLM with default settings may need explicit configuration to enable speculative decoding for the MoE topology.

For teams evaluating cost on the managed API: Cohere prices Command A+ at $2.50 per million input tokens and $10.00 per million output tokens, higher than Mistral Medium 3.5 at $1.50/$7.50 and substantially higher than DeepSeek V4 at $0.27/$1.10. Apache 2.0 licensing means teams can bypass Cohere's API entirely and self-host. The managed pricing is relevant only for teams that prefer not to operate their own inference infrastructure.
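A small calculator makes the price gap concrete for a given traffic mix. The per-million-token prices are the ones quoted above; the monthly volume in the example (100M input tokens, 20M output tokens) is made up for illustration.

```python
# (input, output) price in USD per million tokens, as quoted in the text.
PRICES_PER_M = {
    "Command A+": (2.50, 10.00),
    "Mistral Medium 3.5": (1.50, 7.50),
    "DeepSeek V4": (0.27, 1.10),
}


def api_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Managed-API cost for a given token volume."""
    input_price, output_price = PRICES_PER_M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical monthly workload: 100M input tokens, 20M output tokens.
for model in PRICES_PER_M:
    print(f"{model}: ${api_cost_usd(model, 100_000_000, 20_000_000):,.2f} per month")
```

Self-hosting replaces this per-token bill with hardware and operations cost, so the comparison that matters for a sovereign deployment is against the amortized cost of the GPU node, not against another vendor's API price.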

Sovereign and Regulated Industry Deployment

Apache 2.0 licensing is the legal foundation of Command A+'s sovereign deployment story. Unlike Meta's Llama community license, which imposes restrictions above certain user-count and revenue thresholds, Apache 2.0 permits unrestricted commercial deployment, modification, fine-tuning, and redistribution with no royalty obligations. Organizations can download the weights, fine-tune the model on proprietary data, package it into a product, and distribute it commercially; none of that requires Cohere's involvement.

The primary deployment targets are organizations where data cannot leave a controlled perimeter: defense agencies, healthcare systems operating under HIPAA or equivalent frameworks, financial institutions with data residency requirements, and government entities with classified or sensitive infrastructure. For all of these, cloud-hosted API models create a data egress vector that compliance teams will not approve regardless of the vendor's security attestations. Self-hosted deployment on air-gapped or private-cloud infrastructure eliminates that constraint structurally.

The W4A4 two-H100 entry point is the key enabler for on-premises adoption at scale. Large enterprise IT departments and government agencies already operate GPU infrastructure for ML workloads — adding a Command A+ deployment to an existing two-GPU node does not require a hyperscaler data-center arrangement. According to Cohere's official release documentation, the model is explicitly designed around this deployment profile: the architecture choices (MoE, W4A4, speculative decoding) exist in service of making a capable model run on the smallest reasonable on-premises hardware footprint.

Cohere frames Command A+ as part of a sovereign AI strategy targeting "people, enterprises, and governments" that need to "run real-world systems at scale" with guaranteed data control, per Nick Frosst's statement at release. This language (sovereign AI, critical infrastructure) signals that Cohere is positioning itself as enterprise infrastructure rather than a developer tools company competing on API features.

One gap worth flagging at launch: Cohere did not announce named enterprise or government customer deployments of Command A+. The sovereign deployment positioning is clear as a product and sales narrative; whether it translates to verified production adoption by the target verticals within two to four quarters remains to be demonstrated. Developers evaluating the model for regulated-industry pipelines should treat it as a technically strong candidate with an unproven production track record in the most sensitive deployment contexts.

Multilingual Coverage: 48 Languages and Tokenizer Efficiency

Command A+ supports 48 languages, more than double the 23 languages in Command A. For the sovereign-enterprise deployment profile the model targets, this expansion meaningfully broadens viable deployment geography. A model that handles Arabic, Japanese, Korean, and a broad set of European and South Asian languages can serve regional government entities and multinational enterprises without requiring separate specialized models per locale or a language-specific fine-tuning pipeline on top of the base model.

The tokenizer improvements are as commercially relevant as the language count expansion. Cohere reports token count reductions of 20% for Arabic, 18% for Japanese, and 16% for Korean relative to Command A. At scale, with millions of requests against long-document inputs, a 20% reduction in token count for Arabic translates to a 20% reduction in token-priced input cost and a roughly comparable reduction in per-request compute. For teams running high-volume non-English RAG pipelines on managed inference or self-hosted infrastructure, this is not a minor quality-of-life improvement. It affects the unit economics of the deployment.
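The effect on input spend can be sketched as follows. The reduction percentages and the $2.50 per million input-token price come from the text; the one-billion-token monthly volume is a made-up example, and the calculation holds the price constant so that only the tokenizer change is measured.

```python
INPUT_PRICE_PER_M = 2.50    # USD per million input tokens (managed API price above)

# Reported token-count reductions relative to Command A's tokenizer.
TOKEN_REDUCTION = {"Arabic": 0.20, "Japanese": 0.18, "Korean": 0.16}


def input_cost_usd(tokens: float) -> float:
    return tokens * INPUT_PRICE_PER_M / 1_000_000


def monthly_saving_usd(language: str, old_tokens: float) -> float:
    """Input-cost saving when the same text needs fewer tokens."""
    new_tokens = old_tokens * (1 - TOKEN_REDUCTION[language])
    return input_cost_usd(old_tokens) - input_cost_usd(new_tokens)


# Hypothetical workload: text that used to tokenize to 1B input tokens per month.
for language in TOKEN_REDUCTION:
    saving = monthly_saving_usd(language, 1_000_000_000)
    print(f"{language}: about ${saving:,.0f} less input spend per month")
```

On self-hosted infrastructure the same reduction shows up as shorter prompts to prefill and less KV cache per request instead of a smaller invoice.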

The 128K input token context window with up to 64K generation output covers the document lengths most common in regulated-industry RAG pipelines: multi-page contracts, regulatory filings, technical manuals, and legislative texts. The context length is sufficient to ingest a substantial document in a single call rather than requiring chunking strategies that introduce retrieval complexity.

The multilingual improvements compound with the native citation system in a specific way worth noting for compliance use cases. For non-English RAG pipelines where attribution accuracy in the source language is a compliance requirement — a Japanese financial disclosure analyzer, an Arabic legal document summarizer — native citations produced during the forward pass are more reliable than a secondary attribution layer trained primarily on English. Stronger multilingual representations mean the <co> tag attribution should hold quality across the supported language set rather than degrading sharply outside English.

Frequently Asked Questions

How is Command A+ different from the original Command A?

Command A+ differs from Command A in five concrete ways. First, the weights are fully open under Apache 2.0; Command A was API-only with no public weight access. Second, Command A+ is the first Command model to accept image inputs alongside text, enabling analysis of charts, PDFs, and slides. Third, language support expands from 23 to 48 languages, with tokenizer efficiency improvements for Arabic, Japanese, and Korean. Fourth, Command A+ introduces native in-generation citations via <co> tags: attribution is produced during the forward pass, not by a secondary post-processing model. Fifth, the model adds structured reasoning and tool-use tags (<|START_THINKING|>, <|START_ACTION|>, <|START_TOOL_RESULT|>) that enable deterministic multi-step agentic workflows without custom prompt engineering.

Can Command A+ run on a single consumer GPU?

No. The minimum hardware requirement for Command A+ at W4A4, the most efficient quantization tier, is two NVIDIA H100 80GB GPUs or a single NVIDIA B200. Full BF16 precision requires eight H100s or four B200s. Consumer GPUs lack sufficient VRAM to load even the most compressed variant. This model is not comparable to Llama 3.1 8B or Mistral 7B for local inference; it is a data-center model designed for enterprise on-premises or private-cloud deployment with no consumer GPU path.

What makes native citations different from standard RAG citation?

Standard RAG pipelines generate text in one step, then run a separate retrieval-scoring or highlight-extraction model to map generated claims back to source documents: two sequential inference calls, two potential failure points, and added latency. Command A+ emits <co> and </co> tags wrapping factual claims during the forward pass itself. Attribution is a trained model behavior, not an appended layer. This eliminates the secondary inference step, produces an auditable citation trail in a single generation call, and ties attribution quality to the primary model's training rather than a separate system's performance.

Why does Command A+ score below GPT-5.5 and Claude Opus on composite benchmarks?

The composite Artificial Analysis Intelligence Index score of 37, versus GPT-5.5 at 60 and Claude Opus 4.7 at 57, reflects the trade-off embedded in the architecture. Running 25B active parameters at W4A4 quantization on two GPUs delivers efficient, sovereign-deployable inference. It does not match the reasoning depth of closed frontier models running at full precision with significantly higher active parameter counts and proprietary post-training pipelines. Command A+ improves meaningfully on its predecessor on specific agentic benchmarks, but the composite score reflects the general reasoning gap that is inherent in the efficiency-first design.

Does Apache 2.0 allow unrestricted commercial use?

Yes. Apache 2.0 permits commercial deployment, modification, fine-tuning, and redistribution without royalties or license fees. Unlike Meta's Llama community license, there are no user-count thresholds or revenue restrictions that activate additional terms at scale. Organizations can download the weights, fine-tune the model on proprietary data, deploy it in any commercial product, and distribute it, all without Cohere's permission or any ongoing contractual relationship with Cohere.

Deployment Decision: What to Evaluate Now

Command A+ is a coherent technical package for a specific, well-defined problem: open-weight, efficient, multilingual, sovereign-deployable inference with native attribution for regulated RAG pipelines. The architecture choices — sparse MoE at 25B active parameters, W4A4 quantization-aware distillation, speculative decoding, in-generation citation tags — are individually well-motivated and collectively consistent with the sovereign-enterprise positioning. This model is not attempting to compete on every dimension simultaneously. It is narrowly optimized for a deployment context where closed API models cannot operate.

The open questions that matter for practical evaluation: first, whether the W4A4 'near-lossless' claim holds on domain-specific task distributions that differ from Cohere's benchmark suite — quantization-aware distillation is a stronger approach than post-training quantization, but degradation on specialized corpora (legal, medical, code) requires independent measurement before production commitment. Second, whether the Intelligence Index composite gap versus closed frontier models surfaces in the specific agentic tasks your pipelines require — the τ²-Bench and AIME gains are large, but most enterprise projects are not running agentic telecom benchmarks; run your own evals on representative workloads. Third, whether the absence of named production deployments at launch resolves quickly — the sovereign deployment story is technically credible, but field validation in the most sensitive regulated environments is still outstanding.

For developers ready to evaluate: weights are available on Hugging Face now, vLLM inference support with the cohere_melody parser is live, and Apache 2.0 removes any legal friction for testing. The lowest-risk path is a two-H100 test deployment measured against your actual task distribution and compared against the closed model you currently use or are evaluating. The composite benchmark gap is real — but whether it appears in your specific workload is an empirical question, not one the composite score answers for you.

Last updated: 2026-05-29. Based on Cohere's official release materials, third-party benchmark data from Artificial Analysis, and independent technical analysis published at launch. Command A+ was released May 20, 2026; benchmark data, pricing, and hardware availability may be updated as external evaluations are published.