Agents · May 21, 2026 · 7 min read

Stop Defaulting to GPT-4o: Right-Sizing Your Agent's Reasoning Loop

OlmoEarth v1.1's efficiency gains expose a habit: overprovisioning frontier models. Here's how to audit token spend and route only the hard cases upstream.

By the airautomations team

OlmoEarth v1.1 is a signal, not a product recommendation

OlmoEarth v1.1 landed in early 2025 with smaller model variants and lower inference costs. We won't tell you to switch to it. What we will say: the release is a forcing function to ask a question most teams skip entirely.

The broader pattern matters more than any single model. Llama 3.1 8B, Mistral Small, Claude 3.5 Haiku, Gemini 2.0 Flash — the gap between frontier and small models has narrowed in the last 18 months. Frontier models still win on hard multi-step reasoning and domain expertise. But here's what we've found: most teams don't know what percentage of their agent's calls actually need that firepower.

You default to GPT-4o or Claude 3.5 Sonnet for every decision in the loop because it's safe, because the vendor shipped it first, because you inherited it. Not because you measured.

Most agent loops are 80% routing and 20% reasoning

Walk through a typical agent execution. The loop usually looks like: classify intent → pick tool → format arguments → call external API → summarize result → decide next step. That's six steps, five of them LLM calls, and most of those calls are lightweight.

Intent classification works fine on Haiku or Gemini Flash. Tool selection with a structured JSON output schema — same. Argument formatting when your schema is defined? Same. Summarizing an API response that came back as JSON? Same. Only the final step — "do we have enough to answer, or do we need another tool?" — sometimes needs real chain-of-thought reasoning.

We've instrumented support agents that make seven LLM calls per ticket. One of those calls is hard: parsing a messy email and extracting entities when the customer wrote three paragraphs across two separate threads. The other six are routing decisions. You can serve the six for $0.0005 each on Haiku and the one hard call for $0.01 on Sonnet. Or you can serve all seven on Sonnet and pay around twelve times as much for each of the six easy calls, at the list prices quoted below, for work that never needed the bigger model.

Structured outputs and JSON mode compress the reasoning you actually need. When your model is forced to return valid JSON, it stops generating prose. When you define the schema upfront, you've removed degrees of freedom the model doesn't need. That shrinks the gap between small and frontier models for these tasks.

Audit before you optimize: instrument token spend per decision

You can't route what you haven't measured. Start by logging every LLM call: model, input tokens, output tokens, latency in milliseconds, the decision type (planner, tool_select, summarizer, judge), and whether the downstream action succeeded.

Ship those logs to Postgres or ClickHouse. Or use an off-the-shelf trace store like Langfuse, Helicone, or Phoenix — they're built for exactly this. Tag each call with the loop step. Then ask: what's the cost per successful task, not per call? Where does p95 latency exceed three seconds? Which step types generate more than two thousand output tokens on the regular?
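As a rough illustration of the two questions above (cost per successful task and p95 latency per step), here is a minimal sketch in plain Python. The `LLMCall` record, its field names, and the sample rows are hypothetical stand-ins for whatever your trace store holds. The percentile uses the simple nearest-rank method.

```python
import math
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class LLMCall:
    # Hypothetical log record; field names are illustrative, not a standard schema.
    task_id: str
    step: str              # e.g. "planner", "tool_select", "summarizer", "judge"
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    task_succeeded: bool   # did the downstream action for this task succeed?

def cost_per_successful_task(calls):
    """Total spend (including failed tasks) divided by the number of successful tasks."""
    total = sum(c.cost_usd for c in calls)
    succeeded = {c.task_id for c in calls if c.task_succeeded}
    return total / len(succeeded) if succeeded else float("inf")

def p95_latency_by_step(calls):
    """Nearest-rank p95 latency in milliseconds for each loop step."""
    by_step = defaultdict(list)
    for c in calls:
        by_step[c.step].append(c.latency_ms)
    result = {}
    for step, values in by_step.items():
        values.sort()
        rank = max(1, math.ceil(0.95 * len(values)))
        result[step] = values[rank - 1]
    return result

# Made-up sample rows, only to show the shape of the data.
calls = [
    LLMCall("t1", "planner", "frontier", 800, 200, 400.0, 0.010, True),
    LLMCall("t1", "tool_select", "small", 300, 50, 150.0, 0.001, True),
    LLMCall("t2", "planner", "frontier", 900, 3500, 2500.0, 0.020, False),
]

print(cost_per_successful_task(calls))
print(p95_latency_by_step(calls))
```

Dividing by successful tasks rather than calls means the spend on failed task `t2` still counts against the one task that worked, which is the point of the metric.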

Most teams have a long tail. The planner call on Monday usually takes 400ms and 800 tokens. But on Thursday morning, when the queue fills up and the provider is under load, the same call takes 2.5 seconds and generates 3,500 tokens because the reasoning path branches. You don't see that without instrumentation.

Run the "shadow eval" pattern: replay last week's traffic through a cheaper model, compare outputs to what your frontier model generated, and measure the error rate. We've found that for 60-80% of calls in a well-designed agent loop, the smaller model outputs are identical or functionally equivalent to the frontier model. The remaining 20-40% differ. Not all differences are failures — some are style. Some are real mistakes.
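A minimal sketch of the comparison step in a shadow eval, assuming you have already replayed the traffic and stored both outputs as strings. The "functionally equivalent" check here is deliberately crude and is our assumption: JSON outputs are compared after parsing (so key order is ignored), and everything else is compared after collapsing whitespace and case. The sample pairs are invented.

```python
import json

def functionally_equal(a: str, b: str) -> bool:
    """Compare as parsed JSON when both parse; otherwise compare normalized text."""
    try:
        return json.loads(a) == json.loads(b)
    except json.JSONDecodeError:
        return " ".join(a.split()).lower() == " ".join(b.split()).lower()

def shadow_eval(pairs):
    """pairs: (frontier_output, small_output) tuples. Returns (agreement_rate, mismatches)."""
    pairs = list(pairs)
    mismatches = [(f, s) for f, s in pairs if not functionally_equal(f, s)]
    agreement = 1 - len(mismatches) / len(pairs) if pairs else 0.0
    return agreement, mismatches

# Invented examples: three agree (one after normalization), one differs.
pairs = [
    ('{"intent": "refund", "priority": 2}', '{"priority": 2, "intent": "refund"}'),
    ('{"tool": "lookup_order"}', '{"tool": "lookup_order"}'),
    ("Escalate to billing.", "escalate to  billing."),
    ('{"intent": "refund"}', '{"intent": "cancel"}'),
]

agreement, mismatches = shadow_eval(pairs)
print(agreement)         # 0.75
print(len(mismatches))   # 1
```

The mismatches list is the part worth reading by hand: it is where you separate style differences from real mistakes.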

Route the hard cases upstream, not every case

The cascade pattern is your friend. Start with a small model. If the output passes your confidence check, use it. If not, escalate to a bigger model.

Confidence signals come in layers. Logprobs: if the model's top token was only 60% likely, you're in uncertain territory. Self-consistency: generate the same call three times on the small model; if they agree, go with it; if they disagree, escalate. Schema validation: if the JSON is malformed, escalate. A judge model verdict: pass the small model's output and the prompt to a small judge model that answers yes/no on "is this decision high-confidence?"; if no, escalate.
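Three of these signals (schema validation, logprobs, self-consistency) can be combined in a few lines. The sketch below assumes you already have the small model's sampled outputs as JSON strings and the per-token log-probabilities as floats. How you obtain logprobs depends on your provider. The function names and the 0.7 probability floor are illustrative assumptions, not recommended values.

```python
import json
import math
from collections import Counter

def schema_valid(raw: str, required: dict) -> bool:
    """True if raw parses as a JSON object with every required key of the expected type."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(
        isinstance(data.get(key), typ) for key, typ in required.items()
    )

def min_token_probability(logprobs) -> float:
    """Convert log-probabilities to probabilities and return the weakest one."""
    return min(math.exp(lp) for lp in logprobs)

def self_consistent(samples, min_agree: float = 1.0) -> bool:
    """True if the most common sample makes up at least min_agree of all samples."""
    top_count = Counter(samples).most_common(1)[0][1]
    return top_count / len(samples) >= min_agree

def should_escalate(samples, logprobs, required, prob_floor: float = 0.7) -> bool:
    """Escalate on malformed JSON, a low-probability token, or disagreeing samples."""
    if not all(schema_valid(s, required) for s in samples):
        return True
    if min_token_probability(logprobs) < prob_floor:
        return True
    # Canonicalize so differing key order does not count as disagreement.
    canonical = [json.dumps(json.loads(s), sort_keys=True) for s in samples]
    return not self_consistent(canonical)

required = {"intent": str}
agreeing = ['{"intent": "refund"}'] * 3
split = ['{"intent": "refund"}', '{"intent": "refund"}', '{"intent": "cancel"}']

print(should_escalate(agreeing, [-0.05, -0.10], required))  # False
print(should_escalate(split, [-0.05, -0.10], required))     # True
```

The checks run cheapest first: schema validation costs nothing, logprobs come free with the response, and self-consistency is the only one that needs extra samples.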

Implement this with Vercel AI SDK's streaming + structured output, LangGraph's routing primitives, or ~200 lines of custom logic in your favorite framework. Add caching with Redis on identical or near-identical calls: if you've already decided "user intent is refund request" for this email, don't re-run the classification when the same email shows up again.

Set hard guardrails. If a call escalates twice and still fails, route it to a dead-letter queue for human review instead of retrying forever. A dead-letter queue with a timeout is cheaper than a customer-facing timeout.
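The cascade plus its guardrail fits in one small function. This is a sketch under stated assumptions: each tier is a plain callable that takes a prompt and returns a string, `is_confident` is whatever confidence check you chose, and the dead-letter queue is an in-memory list standing in for a real queue. All names and the stub models are hypothetical.

```python
def run_cascade(prompt, tiers, is_confident, dead_letter, max_escalations=2):
    """Try tiers cheapest-first; stop at the first confident output.

    tiers: list of (name, call_model) pairs, cheapest first.
    After max_escalations escalations without a confident answer, the prompt
    goes to the dead-letter queue for human review instead of being retried.
    """
    attempts = tiers[: max_escalations + 1]
    for name, call_model in attempts:
        output = call_model(prompt)
        if is_confident(output):
            return name, output
    dead_letter.append(prompt)
    return None, None

# Stub models for illustration only.
def small(prompt):
    return "unsure"

def frontier(prompt):
    return "refund_request"

def is_confident(output):
    return output != "unsure"

review_queue = []

# The small model is not confident, so the call escalates once and succeeds.
print(run_cascade("email A", [("small", small), ("frontier", frontier)],
                  is_confident, review_queue))

# Every tier fails, so the prompt is parked for a human.
print(run_cascade("email B", [("small", small), ("mid", small), ("big", small)],
                  is_confident, review_queue))
print(review_queue)  # ['email B']
```

Capping the attempts in one place keeps the worst-case cost and latency of a single call bounded and easy to reason about.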

Real numbers: aim for 60-80% of calls served by a small model, 20-40% escalated to frontier, and 2-5x total cost reduction. You'll get latency gains too: the typical user-facing wait shrinks when most calls take 200ms instead of 1.5 seconds. Escalated calls pay for both attempts, so track p99 separately.

The hidden tax of overprovisioning models

Seven calls per ticket at two seconds each on Sonnet is fourteen seconds of wall-clock time before your user sees an answer. That's long enough for them to start composing a follow-up email or switch to another tab. It's not just slow — it's slow enough that people work around your system.

The cost math: Sonnet runs $3 per million input tokens and $15 per million output tokens. Haiku runs $0.25 and $1.25. A call with 1,000 input tokens costs roughly $0.003 on Sonnet and $0.00025 on Haiku, before output. Do that seven times and you're at about $0.021 per ticket on Sonnet and $0.00175 on Haiku, a 12x gap that carries over to output tokens because both prices differ by the same factor. Fractions of a cent per ticket look trivial until you multiply by ticket volume. And if Haiku outputs fewer tokens because it's more concise, which it often is on routing tasks, the gap widens further.
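Because prices are quoted per million tokens, the per-call figure comes from dividing by 1,000,000, which is easy to lose track of. A small worked calculation using the list prices quoted above follows. The 200-token output count in the last lines is an illustrative assumption.

```python
# USD per million tokens: (input, output), as quoted in the text.
PRICES = {
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call at per-million-token list prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Seven calls per ticket, 1,000 input tokens each, ignoring output.
sonnet_ticket = 7 * call_cost("sonnet", 1000, 0)
haiku_ticket = 7 * call_cost("haiku", 1000, 0)

print(round(sonnet_ticket, 5))   # 0.021
print(round(haiku_ticket, 5))    # 0.00175

# Adding 200 output tokens per call (an assumed figure) keeps the same 12x ratio.
print(round(call_cost("sonnet", 1000, 200), 5))  # 0.006
print(round(call_cost("haiku", 1000, 200), 5))   # 0.0005
```

At 100,000 tickets a month, the input-only difference is roughly $2,100 versus $175.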

There's also the rate-limit tax. Frontier models are heavily contended. Smaller models have more headroom. When your agent hits a 429 and retries, you lose 2-5 seconds and regenerate tokens. Escape that queue and you're faster and cheaper.

The operations tax is the one nobody budgets for: an agent that's slow enough becomes someone's side project to optimize "when we have time." That time never comes. You ship the optimization in Q3 as an emergency, rewrite half the loop, and lose months of other work. The hidden cost of manual ops compounds when your automation is itself slow enough to block human ops.

A practical audit you can run this week

Step 1: Add per-call logging with model, step, tokens, and latency. This takes two hours. Use a before/after hook on your LLM client.
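One way to get a before/after hook without touching every call site is a decorator around whatever function wraps your LLM client. The sketch below assumes that wrapped function returns a dict with `text`, `input_tokens`, and `output_tokens`. That shape, the `logged` decorator, and the in-memory `CALL_LOG` list are all hypothetical, so adapt them to your client's actual response object and your real log sink.

```python
import functools
import time

CALL_LOG = []  # stand-in for Postgres, ClickHouse, or a trace store

def logged(step: str, model: str):
    """Decorator that records step, model, token counts, and latency for each call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()          # "before" hook
            response = fn(*args, **kwargs)
            latency_ms = (time.perf_counter() - start) * 1000  # "after" hook
            CALL_LOG.append({
                "step": step,
                "model": model,
                "input_tokens": response["input_tokens"],
                "output_tokens": response["output_tokens"],
                "latency_ms": latency_ms,
            })
            return response
        return wrapper
    return decorator

@logged(step="tool_select", model="small-model")
def select_tool(prompt: str) -> dict:
    # Stub in place of a real client call; token counts are faked.
    return {"text": '{"tool": "lookup_order"}',
            "input_tokens": len(prompt.split()),
            "output_tokens": 5}

select_tool("Customer asks where order 1234 is")
print(CALL_LOG[0]["step"], CALL_LOG[0]["input_tokens"])  # tool_select 6
```

Tagging the step at decoration time means every call is labeled with its place in the loop without extra arguments at the call site.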

Step 2: Pick one loop step — the one that fires most often. Replay 100 production traces through Haiku, Gemini Flash, and Llama 3.1 8B in parallel. Store outputs in a separate column.

Step 3: Use your frontier model as a judge. Feed it the prompt, the small model output, and the frontier output, along with a rubric ("is the small model output functionally correct?"). Score the 100 outputs. If >85% are correct, you have a candidate for cascading.

Step 4: Ship the small model behind a feature flag for 10% of traffic. Monitor error rates, latency, and downstream success metrics. If the error rate stays less than one percentage point above your baseline, roll to 50%. Automate rollback on any spike.
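If you don't have a feature-flag service in place, the rollout logic can be sketched in a few lines. This is an illustration, not a replacement for a real flag system: `in_rollout` buckets tasks deterministically by hashing an ID, and `next_rollout` applies the rule above, reading "1%" as one percentage point. The final 100% stage and both function names are our assumptions.

```python
import hashlib

def in_rollout(task_id: str, percent: int) -> bool:
    """Hash the ID into a bucket 0..99 and admit the lowest `percent` buckets.

    The same task_id always lands in the same bucket, so a ticket does not
    flip between models from one call to the next.
    """
    bucket = int(hashlib.sha256(task_id.encode()).hexdigest(), 16) % 100
    return bucket < percent

def next_rollout(current_percent: int, baseline_error: float,
                 canary_error: float, max_delta: float = 0.01) -> int:
    """Return 0 (rollback) on an error spike, otherwise the next stage."""
    if canary_error - baseline_error >= max_delta:
        return 0
    for stage in (10, 50, 100):
        if stage > current_percent:
            return stage
    return 100

# Error rates as fractions: 2% baseline, 2.5% on the small model.
print(next_rollout(10, baseline_error=0.02, canary_error=0.025))  # 50
# A spike to 5% trips the rollback.
print(next_rollout(10, baseline_error=0.02, canary_error=0.05))   # 0
```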

Step 5: Set a quarterly review. Model prices and capabilities shift every 8-12 weeks. What's expensive now might be cheap in six months. What's accurate now might have a new competitor. The audit isn't a one-time project — it's a recurring bill you can control.

Pick one agent in production this week, instrument its token spend by step, and replay 100 traces through a model 5x cheaper before your next sprint planning. If you'd rather run the audit with us, let's talk.
