Air Automations
diffusion-models · May 25, 2026 · 5 min read

Diffusion LLMs Are Fast. Your Agent Is Still Slow.

Nemotron-style diffusion LLMs cut decoding time, but agent latency lives in retrieval and tool calls. Here's what actually changes in production.

Where agent latency actually lives

We've spent enough time in production traces to say it plainly: decoding speed is almost never your bottleneck in an agent system. Here's what a real p50 latency waterfall looks like:

  • Vector embedding: 80ms
  • Retrieval + rerank (Pinecone/pgvector): 600ms
  • LLM call (including network RTT to Anthropic/OpenAI): 80ms + 400ms decoding = 480ms
  • Tool execution (HTTP out, database queries, external APIs): 900ms
  • Orchestration overhead (serialization, state management, retries): 500ms
  • Total: ~2.56 seconds

Decoding is 400ms of that, roughly 16% of wall time. Retrieval and tool execution together account for almost 60%. Network round-trip to your inference endpoint often costs 80–200ms before a single token arrives.
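To make the breakdown easy to re-run against your own traces, here is a minimal sketch that sums the stages from the waterfall above and reports each stage's share of wall time. The stage names are labels chosen for this example, and the LLM call is split into its network and decoding parts as the list describes.

```python
# Stage timings in milliseconds, copied from the p50 waterfall above.
# The dictionary keys are illustrative labels, not names from any tracing tool.
STAGES_MS = {
    "embedding": 80,
    "retrieval_rerank": 600,
    "llm_network_rtt": 80,
    "llm_decoding": 400,
    "tool_execution": 900,
    "orchestration": 500,
}


def stage_shares(stages_ms: dict[str, int]) -> dict[str, float]:
    """Return each stage's fraction of total wall time."""
    total = sum(stages_ms.values())
    return {name: ms / total for name, ms in stages_ms.items()}


if __name__ == "__main__":
    print(f"total: {sum(STAGES_MS.values())} ms")
    ranked = sorted(stage_shares(STAGES_MS).items(), key=lambda kv: -kv[1])
    for name, share in ranked:
        print(f"{name:<18} {share:6.1%}")
```

Swap in timings from your own spans; the ranking matters more than the exact percentages.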

Throughput benchmarks (tokens per second) measure something different than what matters for agents: time-to-useful-output. A diffusion model might generate a full response in 600ms where an autoregressive model takes 1.2 seconds. But if retrieval already took 600ms and the tool call is pending, your user experiences no difference. The tail latency story is worse: tool timeouts and cold Lambda starts dwarf any decoder improvement at p99.

What diffusion LLMs actually change about inference

Diffusion-based language models work differently enough that you need a clear mental model before shipping them into production.

Autoregressive models generate tokens left-to-right. Each token conditions on all prior tokens. You get one token, then request the next, in a tight loop. Diffusion models are trained with a forward noising process and generate by reverse sampling: tokens emerge in parallel across multiple denoising steps, refined iteratively toward a final output. There is no natural token-by-token stream.
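The difference in output shape is easier to see in a toy. The sketch below is not a model: both functions copy from a known target instead of predicting anything, and the diffusion side is simplified to progressive unmasking, which is an assumption made for illustration rather than a description of any specific Nemotron model. It only shows what intermediate states look like in each regime.

```python
import random

MASK = "_"


def autoregressive_decode(target: list[str]) -> list[list[str]]:
    """Emit one token per step, left to right. Returns the sequence after each step."""
    out: list[str] = []
    snapshots = []
    for token in target:  # a real model would predict this token from `out`
        out.append(token)
        snapshots.append(list(out))
    return snapshots


def masked_diffusion_decode(target: list[str], steps: int, seed: int = 0) -> list[list[str]]:
    """Start fully masked; each step fills a batch of positions in parallel."""
    rng = random.Random(seed)
    order = list(range(len(target)))
    rng.shuffle(order)  # positions are filled in no fixed order
    seq = [MASK] * len(target)
    per_step = -(-len(target) // steps)  # ceiling division
    snapshots = []
    for step in range(steps):
        for pos in order[step * per_step:(step + 1) * per_step]:
            seq[pos] = target[pos]  # a real model would predict this token
        snapshots.append(list(seq))
    return snapshots


if __name__ == "__main__":
    words = "the tool returned four rows".split()
    for snap in autoregressive_decode(words):
        print("AR  ", " ".join(snap))
    for snap in masked_diffusion_decode(words, steps=2):
        print("DIFF", " ".join(snap))
```

The autoregressive snapshots are always a valid prefix you can show to a user. The diffusion snapshots are full-length but full of holes until the last step, which is why the streaming story changes.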

Nemotron-Labs positions this as faster wall-clock time for full responses. They're right about the math. But "faster" masks a real architectural shift: you either materialize the entire output at once, or stream coarse-to-fine refinement passes instead of tokens. There's no middle ground. That changes everything about how you ship it.

The hallucination profile also shifts. Autoregressive models confabulate when prior-token conditioning goes wrong. Diffusion models refine the entire output jointly, which can reduce some errors but introduces different failure modes — the whole sample may diverge if early denoising steps go off track. Confabulation patterns differ in ways we haven't fully characterized yet.

Streaming UX breaks when there's nothing to stream

Most agent frontends rely on server-sent events (SSE) or WebSocket streaming to feel snappy. Vercel AI SDK, OpenAI's stream=true, Anthropic's messages.stream: all assume autoregressive token emission.

Diffusion breaks this pattern. You can't stream tokens that don't exist until the reverse sampling finishes. Your options are: stream denoising passes (draft output → refined → final) or give up streaming and render once.

The UX trade-off is real. Perceived latency matters more than actual latency in chat. Users tolerate a 1-second spinner before output appears, then watch tokens stream in. With diffusion, you either show a skeleton and a 600ms jump-cut to the full response, or you stream refinement passes and hope they're visually interesting enough to not feel like a stall. Neither matches the token-streaming UX users expect.
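If you do stream refinement passes, the wire contract changes from "append this delta" to "replace what you have with this draft". Here is a minimal sketch of both event shapes using the standard SSE text format. The event names (`delta`, `snapshot`, `done`) and payload fields are hypothetical choices for this example, not part of any SDK.

```python
import json


def sse_event(event: str, payload: dict) -> str:
    """Format one server-sent event: an event name, one JSON data line, a blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def token_stream_events(tokens: list[str]):
    """Autoregressive style: each event carries a delta the client appends."""
    for tok in tokens:
        yield sse_event("delta", {"append": tok})
    yield sse_event("done", {})


def refinement_stream_events(passes: list[str]):
    """Diffusion style: each event carries a full draft the client swaps in."""
    last = len(passes) - 1
    for i, draft in enumerate(passes):
        yield sse_event("snapshot", {"pass": i, "replace": draft, "final": i == last})
    yield sse_event("done", {})


if __name__ == "__main__":
    for chunk in refinement_stream_events(["rough draft", "final answer"]):
        print(chunk, end="")
```

A client built for deltas will concatenate snapshots into nonsense, so the frontend needs to know which contract it is consuming before you flip a backend.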

Tool-call detection becomes harder too. In a streamed autoregressive flow, you can interrupt as soon as you see a tool-call token. Diffusion gives you the entire structured output at once. If the model got the tool call wrong, you find out after sampling completes, not mid-stream.
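Since the whole structured output arrives at once, a validation gate between sampling and execution does the job that mid-stream interruption used to do. A minimal sketch, assuming the model returns a JSON object with `tool` and `arguments` fields; that shape and the `search_orders` tool are hypothetical and will differ by provider.

```python
import json

# Hypothetical tool registry: tool name -> required argument names and types.
TOOLS = {
    "search_orders": {"customer_id": str, "limit": int},
}


def validate_tool_call(raw_output: str) -> tuple[bool, str]:
    """Check a completed model output before executing anything."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "expected a JSON object"
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in TOOLS:
        return False, f"unknown tool: {tool!r}"
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False, "missing arguments object"
    for name, expected in TOOLS[tool].items():
        # Exact type match, so a JSON boolean is not accepted where an int is required.
        if type(args.get(name)) is not expected:
            return False, f"argument {name!r} must be {expected.__name__}"
    return True, "ok"
```

On failure you re-sample with the error message in context rather than executing a malformed call; the cost is one full sampling pass, not a few wasted tokens.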

Checkpointing and the orchestration loop need a rewrite

ReAct-style agent loops assume you can interrupt mid-generation on a tool-call token, execute the tool, feed results back, and resume from there. Diffusion forces a different checkpoint model.

With autoregressive generation, checkpoint granularity is per-token. You can pause, call a tool, and resume. With diffusion, the unit is the full denoising pass or the full sample. That means:

  • Retry strategy shifts: Re-sampling a diffusion model is cheaper per call, but you can't resume from a partial decode. You re-sample the whole thing or you lose work.
  • Idempotency keys matter more: A retry regenerates the entire output, including tool calls that may already have executed, so tool calls from diffusion outputs need stronger deduplication.
  • State storage changes: Postgres and Redis now hold entire intermediate outputs, not token-by-token state. Dead-letter queues for stuck reverse-sampling jobs become necessary.
  • Model routing: Consider using a cheap autoregressive model (Claude Haiku) for streaming, user-facing turns, and diffusion for silent backend planning steps where streaming doesn't matter.
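The idempotency point can be sketched like this: derive the key from the logical call (run, step, tool, arguments) rather than from the sampling attempt, so a re-sampled output that asks for the same tool call hits the cache instead of the tool. The class and field names are hypothetical, and the in-memory dict stands in for whatever store you actually use.

```python
import hashlib
import json


def idempotency_key(run_id: str, step: int, tool: str, arguments: dict) -> str:
    """Derive a stable key from the logical call, not from the sampling attempt."""
    canonical = json.dumps(
        {"run": run_id, "step": step, "tool": tool, "args": arguments},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ToolExecutor:
    """Runs each logical tool call at most once, even if the sample is regenerated."""

    def __init__(self) -> None:
        self._results: dict[str, object] = {}  # stand-in for Redis or Postgres
        self.executions = 0

    def run(self, run_id: str, step: int, tool: str, arguments: dict, fn):
        key = idempotency_key(run_id, step, tool, arguments)
        if key not in self._results:
            self._results[key] = fn(**arguments)
            self.executions += 1
        return self._results[key]
```

Sorting keys before hashing means two samples that emit the same arguments in a different order still deduplicate. If a re-sample produces different arguments, the key changes and the call runs again, which is usually what you want.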

This is where model routing strategies pay off. You're not choosing one model for all agent turns anymore; you're matching inference curves to use cases.
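A routing rule along these lines can start very small. The sketch below is one possible policy under the assumptions in this section; the `Turn` fields and the backend labels are hypothetical and would map to real model identifiers in your config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    user_facing: bool        # will a person watch this output arrive?
    needs_streaming: bool    # does the UI render tokens as they arrive?
    structured_output: bool  # is a partial result useless (for example JSON)?


def choose_backend(turn: Turn) -> str:
    """Return a backend label for one agent turn."""
    if turn.user_facing and turn.needs_streaming:
        return "autoregressive"
    if turn.structured_output or not turn.user_facing:
        return "diffusion"
    return "autoregressive"


if __name__ == "__main__":
    print(choose_backend(Turn(user_facing=True, needs_streaming=True, structured_output=False)))
    print(choose_backend(Turn(user_facing=False, needs_streaming=False, structured_output=True)))
```

Streaming chat turns win the first check on purpose: if a person is watching tokens arrive, the UX constraint outranks the decode-time saving.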

Where diffusion actually wins

We're not saying diffusion is wrong. It's genuinely useful in specific contexts.

Batch summarization, embedding-time rewriting, offline eval generation — none of these need streaming. Structured JSON output where a partial response is useless anyway: diffusion shines. Planning steps in multi-turn workflows where a full 600ms plan beats a 1.2-second streamed plan: pick diffusion.

The savings add up. If diffusion cuts p50 decoding time by 40% per call, that saving repeats on every silent step of an 8-step agent loop. Silent backend turns (research, planning, grading) are cheaper and faster with diffusion. The user-facing turn stays with GPT-4o or Claude.
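To put rough numbers on a loop, here is a small sketch. It assumes the 40% reduction applies to decoding time only, which is one reading of the claim above, and that steps run sequentially. The example inputs (400ms decoding, 2160ms for everything else per step, 7 silent steps out of 8) reuse the waterfall figures from earlier and are assumptions, not measurements.

```python
def loop_latency_ms(
    steps: int,
    decode_ms: float,
    other_ms: float,
    silent_steps: int = 0,
    decode_speedup: float = 0.0,
) -> float:
    """Total latency of a sequential agent loop.

    `silent_steps` run on a backend whose decode time is reduced by
    `decode_speedup` (0.4 means 40% less decode time). The remaining
    steps keep the baseline decode time.
    """
    if not 0 <= silent_steps <= steps:
        raise ValueError("silent_steps must be between 0 and steps")
    fast_decode = decode_ms * (1 - decode_speedup)
    streamed_steps = steps - silent_steps
    return silent_steps * (fast_decode + other_ms) + streamed_steps * (decode_ms + other_ms)


if __name__ == "__main__":
    baseline = loop_latency_ms(8, decode_ms=400, other_ms=2160)
    routed = loop_latency_ms(8, decode_ms=400, other_ms=2160, silent_steps=7, decode_speedup=0.4)
    print(f"baseline: {baseline:.0f} ms, routed: {routed:.0f} ms, saved: {baseline - routed:.0f} ms")
```

The absolute saving grows linearly with the number of silent steps, while the relative saving stays capped by how much of each step is decoding in the first place.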

This is the pattern we see in production: use diffusion for the workflow backbone, autoregressive for the chat interface.

How to actually evaluate a diffusion model for production

Don't rely on vendor benchmarks. Build a harness that routes the same agent workload to both autoregressive and diffusion backends, and measure end-to-end.

Track these metrics: time-to-first-useful-output (not tokens/sec), p50/p95/p99 latency, task success rate, and cost per successful task completion (not $/million-tokens). Run hallucination tests on retrieval-grounded tasks — confabulation patterns shift with diffusion sampling. Verify tool-call schema adherence under structured output mode. Shadow-deploy on 5% of traffic for a week before committing.
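The summary side of such a harness fits in a few lines. This sketch assumes you already record one result per task with end-to-end latency, total cost, and a pass/fail flag; the `TaskResult` fields are hypothetical, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskResult:
    latency_ms: float  # end-to-end, from request to usable output
    cost_usd: float    # all model and tool spend for the task
    success: bool


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate the metrics listed above for one backend."""
    latencies = [r.latency_ms for r in results]
    successes = sum(r.success for r in results)
    total_cost = sum(r.cost_usd for r in results)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "success_rate": successes / len(results),
        # Failed tasks still cost money, so divide total spend by successes only.
        "cost_per_success_usd": total_cost / successes if successes else math.inf,
    }
```

Run `summarize` once per backend on the same task set and compare the dictionaries side by side. Cost per success penalizes a cheap model that fails often, which $/million-tokens hides.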

Then ask: does this model's inference curve match our latency budget and streaming UX? If retrieval is already your bottleneck, faster decoding is invisible. If you need token-by-token streaming, diffusion breaks your frontend. Be honest about which problem you're solving.

If you're deciding whether diffusion fits your agent stack — or just want a second pair of eyes on a latency waterfall — talk to us at /contact. We've shipped both architectures and have opinions.