Inception Labs has unveiled Mercury 2, a new large language model that replaces the traditional word-by-word generation method with a diffusion-based reasoning process. Announced this week, the model is already being positioned as one of the fastest reasoning systems available, with independent testing reporting 1,196 tokens per second on NVIDIA Blackwell GPUs.
The company says Mercury 2 delivers more than five times the speed of competing compact reasoning models such as Anthropic’s Claude 4.5 Haiku, while matching top-tier performance on coding and instruction-following tasks. The launch signals a broader shift in how AI systems generate answers—and where the next wave of real-time enterprise applications may emerge.
Key Summary
- Mercury 2 uses a diffusion process that refines full answers in parallel instead of generating text one word at a time.
- Independent tests report 1,196 tokens per second, making it significantly faster than many compact reasoning models.
- Pricing is listed at $0.38 per million tokens, positioning it as a cost-efficient option for high-volume workloads.
- Performance is competitive on coding and instruction-following benchmarks, targeting agent-style workflows.
- Available via AWS and Microsoft Azure, aimed at enterprise deployment and real-time applications like voice AI.
- Why this matters: Faster inference means AI agents can respond instantly in customer service, coding copilots, and automation pipelines.
From Typewriter to Editor
Most large language models today—including those from OpenAI, Anthropic, and Google—generate responses sequentially. They predict the next word, then the next, building sentences step by step. That approach works, but it introduces latency.
Mercury 2 takes a different path. It begins with noise and refines the entire answer simultaneously, using a diffusion process. In plain language, instead of typing left to right like a typewriter, it drafts and edits the whole response at once.
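The difference can be sketched in a toy simulation. This is not Mercury 2’s actual algorithm (Inception has not published implementation details); it only illustrates the cost structure: an autoregressive model spends one forward pass per token, while a diffusion-style model spends a small, fixed number of parallel passes over the whole draft. The vocabulary, mask token, and pass counts below are all illustrative.

```python
import random

random.seed(0)
VOCAB = ["the", "model", "refines", "every", "token", "in", "parallel"]
MASK = "<mask>"

def autoregressive_generate(length):
    """Typewriter style: one model call per token, strictly left to right."""
    out, calls = [], 0
    for _ in range(length):
        calls += 1                      # each new token costs a full forward pass
        out.append(random.choice(VOCAB))
    return out, calls

def diffusion_generate(length, steps=4):
    """Editor style: start from pure noise, refine the whole draft at once."""
    draft, calls = [MASK] * length, 0
    for _ in range(steps):              # a fixed, small number of parallel passes
        calls += 1                      # one forward pass updates *all* positions
        draft = [random.choice(VOCAB) if t == MASK or random.random() < 0.3 else t
                 for t in draft]
    return draft, calls

_, seq_calls = autoregressive_generate(50)
_, par_calls = diffusion_generate(50)
print(seq_calls, "sequential calls vs", par_calls, "parallel passes")
```

The key point is the call count: the sequential model’s cost grows with output length, while the parallel refiner’s number of passes does not, which is why the approach maps well onto GPUs built for parallel work.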
That architectural shift is not cosmetic. Parallel refinement allows significantly higher throughput, particularly on modern GPUs optimized for parallel workloads. On NVIDIA’s Blackwell architecture, Mercury 2 reportedly sustains 1,196 tokens per second. Tokens are the word fragments language models read and write; tokens per second is the standard measure of generation speed, and higher rates translate directly into lower latency for users.
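The throughput figure converts directly into wait time. A quick back-of-the-envelope calculation, using the reported 1,196 tokens per second against a hypothetical 240 tokens-per-second baseline (the baseline and response length are illustrative, not from the announcement):

```python
def response_latency(tokens: int, tokens_per_second: float) -> float:
    """Seconds to stream a full response at a given sustained throughput."""
    return tokens / tokens_per_second

reply = 600  # a typical multi-paragraph answer, in tokens
print(f"Mercury 2: {response_latency(reply, 1196):.2f}s")  # ~0.50s
print(f"Baseline:  {response_latency(reply, 240):.2f}s")   # 2.50s
```

At that throughput, a full answer arrives in about half a second rather than several seconds, which is the gap between a conversational agent and a noticeably laggy one.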
For real-time AI agents, latency is everything.
Mercury 2 is live 🚀🚀
The world’s first reasoning diffusion LLM, delivering 5x faster performance than leading speed-optimized LLMs.
Watching the team turn years of research into a real product never gets old, and I’m incredibly proud of what we’ve built.
We’re just getting… pic.twitter.com/McrQG4PFLZ
— Stefano Ermon (@StefanoErmon) February 24, 2026
Speed as Strategy
Inception Labs is framing Mercury 2 less as a benchmark trophy and more as an infrastructure play.
At $0.38 per million tokens, the model targets high-volume usage scenarios—automated coding assistants, voice agents, and enterprise workflow automation. These are not casual chat applications. They are persistent systems generating continuous output.
In agentic coding, for example, a model may produce thousands of tokens in rapid back-and-forth reasoning loops. If each step is slow, the user experience collapses. A fivefold speed improvement changes the economics and usability of such systems.
That positioning places Mercury 2 in direct competition with compact, high-speed models like Claude 4.5 Haiku and other optimized reasoning variants. It does not appear aimed at replacing the largest frontier models used for research-level reasoning. Instead, it is targeting the “always-on” AI infrastructure layer.
This is an increasingly crowded field. Smaller, faster models have become the backbone of enterprise deployments, especially as companies look to control costs while maintaining responsiveness.
Matching on Coding and Instruction Tasks
Inception claims Mercury 2 matches top-tier models on agentic coding and instruction-following benchmarks. While detailed public benchmark breakdowns remain limited, the emphasis is clear: parity in reasoning quality combined with superior speed.
For developers, the first test will be practical. Does it handle multi-step code generation without logical drift? Can it manage tool-calling workflows? How well does it maintain context in iterative loops?
Speed without reasoning stability is useless in production.
The company’s CEO, Stanford professor Stefano Ermon—known for contributions to diffusion modeling—argues this marks a fundamental shift in model design. Industry observers, including Andrew Ng, have publicly praised diffusion-based approaches as promising alternatives for AI agents and voice systems.
But market validation will depend on more than endorsements. Enterprises will look for reproducible benchmarks and real-world performance metrics.
Enterprise Deployment and Cloud Positioning
Mercury 2 is available via Amazon Web Services and Microsoft Azure, signaling a direct enterprise distribution strategy rather than a consumer-first rollout.
Cloud availability lowers adoption friction for Fortune 500 firms already operating within those ecosystems. It also suggests Inception is aiming for integration into existing enterprise AI stacks rather than building a standalone consumer platform.
From a business model perspective, pricing at $0.38 per million tokens undercuts many premium reasoning models, especially when paired with high throughput. For companies deploying AI agents at scale, inference cost becomes a line item measured in millions of dollars annually.
Lower cost per token combined with faster processing creates a compound advantage. Enterprises can either reduce expenses or deploy more AI workloads within the same budget envelope.
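The per-token arithmetic makes the scale of that line item concrete. The workload size and the comparison price below are hypothetical, chosen only to show how quickly token volumes compound at the $0.38-per-million-tokens rate:

```python
def annual_inference_cost(tokens_per_day: int, price_per_million: float) -> float:
    """Yearly spend in dollars for a steady-state token workload."""
    return tokens_per_day * 365 * price_per_million / 1_000_000

# Hypothetical always-on agent fleet emitting 2 billion tokens per day.
daily_tokens = 2_000_000_000
mercury = annual_inference_cost(daily_tokens, 0.38)  # listed Mercury 2 price
premium = annual_inference_cost(daily_tokens, 3.00)  # illustrative premium-model price
print(f"${mercury:,.0f} vs ${premium:,.0f} per year")
```

At that volume, the difference between the two price points is measured in millions of dollars per year, which is why inference pricing has become a board-level concern for heavy AI adopters.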
Diffusion Enters the LLM Race
Diffusion models are not new. They are widely used in image generation systems like Stable Diffusion. Applying diffusion to language, however, has been technically challenging due to the discrete nature of text.
Mercury 2 suggests that barrier is weakening.
If diffusion-based language models can consistently match autoregressive systems in reasoning quality while outperforming them in speed, the competitive landscape shifts. Major incumbents may need to rethink architecture decisions long optimized around sequential generation.
Still, there are open questions.
Diffusion approaches can require careful tuning to maintain coherence and prevent instability during refinement. The long-term scaling behavior remains less battle-tested compared to traditional autoregressive transformers.
In other words, Mercury 2 may represent a credible alternative—but not yet a proven replacement.
Broader Timing
The launch arrives at a moment when the AI industry is pivoting from capability demos to deployment economics.
Over the past year, the focus has been on frontier intelligence gains. Now, enterprises are asking different questions: How fast? How much does it cost? Can it run reliably 24/7?
Mercury 2 is a response to that shift.
Rather than chasing the largest parameter count or the most complex reasoning benchmark, Inception is competing on throughput efficiency. That aligns with the rise of AI agents, voice assistants, and embedded automation—systems where responsiveness shapes adoption.
What Could Limit Growth
Despite strong performance claims, Mercury 2 faces structural challenges.
First, diffusion-based text generation is still less familiar to developers than traditional models. Tooling ecosystems, optimization frameworks, and fine-tuning workflows are largely built around autoregressive architectures.
Second, enterprise buyers may hesitate without transparent, widely replicated benchmark data beyond speed metrics.
Finally, incumbents are unlikely to ignore the performance claims. If larger providers integrate parallel decoding improvements or hybrid diffusion techniques into existing models, Mercury’s speed advantage could narrow.
Mercury 2 enters the market with a clear value proposition: faster reasoning at lower cost for real-time enterprise applications. Whether that advantage holds will depend on independent validation, developer experimentation, and how quickly competitors adapt.
For now, it marks one of the most direct architectural challenges to the dominant way language models generate text.