The 'weeks to hours' headline hides a workflow rewrite
Coverage of Endava's Codex-powered requirements analysis has been breathless: weeks compressed to hours, analysts freed from drudge work, the agentic org finally arriving. But that framing misses the actual engineering story. What Endava shipped wasn't a smarter agent. It was a workflow in the strict sense: an orchestrated, repeatable pattern of tasks with typed inputs and outputs, deterministic retries, and explicit human gates.
The common misread is that agents replaced analysts. The reality is more pedestrian. Endava decomposed "analyze requirements" into a graph of narrow, testable steps: ingest documents, extract entities, cluster requirements, detect conflicts, draft user stories, surface summaries for reviewer approval. Each node has a defined input schema and output schema. Each node can fail independently and get retried. When a step outputs garbage, you know where to look.
The term "agentic org" has become marketing shorthand that obscures orchestration debt. We've shipped enough of these systems to know: the speedup doesn't come from the agent being smarter. It comes from the workflow being simpler—narrower context windows, smaller decision surfaces, deterministic paths with HITL gates placed surgically where output cost-to-undo is high. That's engineering work. That's the work nobody credits in the headline.
Task decomposition is the work nobody credits
If you can describe the work as a flowchart with clear inputs and outputs, you have the shape of a decomposable workflow. Endava's decomposition is straightforward: document ingestion → entity extraction → requirement clustering → conflict detection → user-story drafting → reviewer gate. Each node is small enough that a single LLM call solves it. Each node outputs a typed schema—JSON, not narrative prose that needs parsing downstream.
Why this matters: a single "analyst agent" with one mega-prompt tends to break down around the fifth or sixth hop. The token budget explodes. The model starts hallucinating context from earlier in the conversation. Failures become expensive to diagnose because the entire chain is entangled. Break it into nodes, and you can test each one independently. You can write unit tests. You can build evaluation harnesses per node: measure recall of extracted entities, precision of clustered requirements, false-positive rate in conflict detection. You can't unit-test vibes.
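A per-node evaluation harness can start very small. The sketch below scores an entity-extraction node against a hand-labeled expected set; the function name and the sample entities are illustrative assumptions, not anything taken from Endava's system.

```python
def precision_recall(predicted: set[str], expected: set[str]) -> tuple[float, float]:
    """Score one node's output against a hand-labeled expected set."""
    true_positives = len(predicted & expected)
    # Guard against empty sets so an empty prediction scores 0 instead of crashing.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical labeled example for an entity-extraction node.
predicted_entities = {"user", "invoice", "tenant"}
expected_entities = {"user", "invoice", "audit_log", "role"}
precision, recall = precision_recall(predicted_entities, expected_entities)
```

Run the same function over a fixed set of labeled documents in CI and a regression in one node shows up as a number, not as a vague sense that outputs got worse.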
The rule of thumb we follow: if you can't write a test for the node, the node is too big. Split it. A node that says "analyze this vague input and figure out what to do" is not a node. A node that says "given a list of requirement objects with a known schema, detect logical conflicts and return conflict pairs with confidence scores" is a node.
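To make the "is it a node?" test concrete, here is a minimal sketch of the conflict-detection node's contract. The `Requirement` and `ConflictPair` types are hypothetical, and the body is a deterministic stand-in for the LLM call (same subject, different value counts as a conflict), so the point is the typed boundary, not the detection logic.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Requirement:
    req_id: str
    subject: str  # what the requirement constrains, e.g. "session_timeout"
    value: str    # the demanded value, e.g. "15m"


@dataclass(frozen=True)
class ConflictPair:
    left_id: str
    right_id: str
    confidence: float


def detect_conflicts(requirements: list[Requirement]) -> list[ConflictPair]:
    """Stand-in for the model call: typed list in, typed conflict pairs out."""
    conflicts = []
    for a, b in combinations(requirements, 2):
        if a.subject == b.subject and a.value != b.value:
            # A real node would take this score from the model or a critic.
            conflicts.append(ConflictPair(a.req_id, b.req_id, confidence=1.0))
    return conflicts


requirements = [
    Requirement("R1", "session_timeout", "15m"),
    Requirement("R2", "session_timeout", "30m"),
    Requirement("R3", "export_format", "csv"),
]
conflicts = detect_conflicts(requirements)
```

Because the input and output are plain typed objects, the test for this node is a single assertion on `conflicts`, whatever ends up inside the function.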
Context windows are a budget, not a feature
The temptation is obvious. You have a 200K-token window. Just paste the whole requirements document, the entire codebase, every earlier analysis—let the model figure it out. That's how you end up with 30-second p50 latencies, $0.80 per call, and recall that degrades mysteriously in production.
We target 4–12K input tokens per call, sub-2-second p50 for interactive steps. That forces discipline. A conflict-detection node doesn't need the full PRD. It needs: the requirement pairs being compared, their metadata, and maybe the source documents they came from. Retrieval is scoped per node. You ask: what does this step actually need to succeed? You answer with a small, curated context window.
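One way to enforce that budget is to make it a hard limit in the code that assembles each node's context. The sketch below is an assumption-heavy illustration: `estimate_tokens` uses a crude four-characters-per-token heuristic in place of a real tokenizer, and the relevance scores are presumed to come from whatever retriever the node already uses.

```python
MAX_INPUT_TOKENS = 12_000  # upper end of the 4-12K target described above


def estimate_tokens(text: str) -> int:
    # Crude heuristic (roughly 4 characters per token); use a real tokenizer in practice.
    return max(1, len(text) // 4)


def build_context(snippets: list[tuple[float, str]], budget: int = MAX_INPUT_TOKENS) -> list[str]:
    """Keep the highest-scoring snippets that fit inside the token budget."""
    chosen: list[str] = []
    used = 0
    for _score, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # skip anything that would overflow the budget
        chosen.append(text)
        used += cost
    return chosen


# Hypothetical (relevance score, text) pairs from a retriever.
snippets = [(0.9, "a" * 40), (0.5, "b" * 40), (0.8, "c" * 40)]
context = build_context(snippets, budget=20)
```

With a budget of 20 estimated tokens, only the two most relevant snippets survive; the third is dropped rather than silently inflating the call.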
When you need to carry information forward between nodes, summarize it. Don't carry the raw transcript. A requirement-extraction node outputs not the full document text but a structured list of extracted requirements. The clustering node works from that list, not the original document. Downstream nodes receive the summaries, not the artifacts from ten hops earlier. This is where RAG systems often fail in production—they retrieve too much context too broadly, hoping recall saves them. Scoped retrieval per node is slower to design and faster to run.
Retries, idempotency, and dead-letter queues—the unsexy reliability layer
This is where we've gotten things wrong before, and it's instructive. An LLM workflow that doesn't handle failures as a first-class concern will fail silently or loudly in production, and you'll debug at 2am.
Exponential backoff with jitter on 429 and 5xx from OpenAI or Anthropic APIs is table stakes. But you need idempotency keys per node—a deterministic hash of the step ID and input—so a retry doesn't double-create Jira tickets or duplicate user stories. You need structured outputs: JSON mode or tool calls, not free-text parsing. A parse failure becomes a typed error you can bucket and retry, not a mystery.
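A minimal sketch of both pieces, using only the standard library. `RetryableError` is a hypothetical exception that your own client wrapper would raise on a 429 or 5xx; it is not part of any vendor SDK, and the delay parameters are placeholders.

```python
import hashlib
import json
import random
import time


class RetryableError(Exception):
    """Hypothetical: raised by your client wrapper on 429 or 5xx responses."""


def idempotency_key(step_id: str, payload: dict) -> str:
    """Deterministic hash of step ID plus input, stable across dict key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{step_id}:{canonical}".encode("utf-8")).hexdigest()


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Retry fn on RetryableError with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: sleep a random amount up to the capped exponential delay.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

The key is what you hand to the downstream system (or store in your own table) so that a retried "create ticket" step can detect that the first attempt already succeeded.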
And you need a dead-letter queue (Redis Streams, SQS, or a Postgres table claimed with SELECT ... FOR UPDATE SKIP LOCKED) for poison inputs: documents so malformed that no amount of retrying will fix them. Humans triage the queue. Some inputs get reprocessed once you fix the upstream ingestion. Some get rejected. Without this, poison inputs get retried indefinitely and clog the work queued behind them.
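The mechanics are the same whatever backs the queue. The sketch below uses an in-memory `deque` purely for illustration (a real deployment would use one of the stores named above); the `Job` type, the `drain` function, and the attempt limit are all assumptions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    doc_id: str
    attempts: int = 0


def drain(queue: deque, handler, max_attempts: int = 3) -> list:
    """Process jobs; after max_attempts failures, park the job for human triage."""
    dead_letters = []
    while queue:
        job = queue.popleft()
        try:
            handler(job)
        except Exception as exc:
            job.attempts += 1
            if job.attempts >= max_attempts:
                # Poison input: stop retrying and keep the error for the triager.
                dead_letters.append((job, repr(exc)))
            else:
                queue.append(job)  # requeue for another attempt
    return dead_letters


def handler(job: Job) -> None:
    # Hypothetical handler: one document is malformed beyond repair.
    if job.doc_id == "bad":
        raise ValueError("unparseable document")


queue = deque([Job("good"), Job("bad")])
dead = drain(queue, handler)
```

Storing the error alongside the job matters: the person triaging the queue needs to know why the input was rejected, not only that it was.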
Deterministic step IDs matter too. If a workflow dies after three hops and you replay it, you want to resume from step four, not restart from step one. That requires durable state: something that persists which nodes have run and what they output. Off-the-shelf frameworks like LangGraph ship the graph syntax and, in LangGraph's case, checkpointer hooks for persisting graph state, but running and operating the durable store behind them is still on you. You integrate Temporal, Inngest, or build your own state machine on Postgres.
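A minimal sketch of resume-from-checkpoint, assuming step outputs are JSON-serializable. It uses the standard library's `sqlite3` as a stand-in for Postgres so that it runs anywhere; the table name, `run_workflow`, and the step functions are hypothetical.

```python
import json
import sqlite3


def run_workflow(conn, run_id, steps, initial):
    """Run steps in order, skipping any whose output is already persisted."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS step_results ("
        "run_id TEXT, step_id TEXT, output TEXT, PRIMARY KEY (run_id, step_id))"
    )
    data = initial
    for step_id, fn in steps:
        row = conn.execute(
            "SELECT output FROM step_results WHERE run_id = ? AND step_id = ?",
            (run_id, step_id),
        ).fetchone()
        if row is not None:
            data = json.loads(row[0])  # completed in an earlier attempt: reuse it
            continue
        data = fn(data)
        conn.execute(
            "INSERT INTO step_results VALUES (?, ?, ?)",
            (run_id, step_id, json.dumps(data)),
        )
        conn.commit()  # persist before moving on so a crash cannot lose this step
    return data
```

Replaying the same `run_id` after a crash skips every step that already committed its output and picks up at the first one that did not.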
Human-in-the-loop gates are architecture, not UX
HITL placement is not a feature request. It's a design decision driven by blast radius and confidence thresholds. Place gates where cost-to-undo is high: before you create a Jira ticket that cascades into a sprint, before code gets merged, before a client-facing artifact ships. Don't place gates on low-stakes hops just because they exist.
Confidence-gated approval is the pattern: if self-consistency or a critic model scores the output above a threshold, auto-approve it. Otherwise, queue for human review. The review UI surfaces diffs and evidence—what changed from the previous version, why the model thinks this is correct—not raw chain-of-thought that humans can't act on. Manual operations have a real cost, and every gate you don't place is work you've reclaimed.
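The self-consistency variant can be sketched in a few lines: sample the node several times, take the majority answer, and use the agreement fraction as the score. The function names and the 0.8 threshold are illustrative assumptions; this sketch treats a score equal to the threshold as passing, and in practice you would tune the threshold per node against its override rate.

```python
from collections import Counter


def self_consistency(samples: list[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of samples agreeing with it."""
    answer, votes = Counter(samples).most_common(1)[0]
    return answer, votes / len(samples)


def route(samples: list[str], threshold: float = 0.8) -> tuple[str, str, float]:
    """Auto-approve when agreement meets the threshold, else queue for review."""
    answer, score = self_consistency(samples)
    action = "auto_approve" if score >= threshold else "human_review"
    return action, answer, score


# Hypothetical outputs from running the same node five times.
agreed = route(["conflict"] * 5)
split = route(["conflict", "conflict", "conflict", "no_conflict", "unclear"])
```

A critic-model score slots into the same `route` shape: only the function that produces the score changes.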
Track override rate per node. If humans are rejecting 40% of outputs from the conflict-detection step, look upstream first: the requirement-extraction step may be drifting, the prompt may need attention, or the retrieval index may be stale. Rising overrides are a signal, not a failure of the approach.
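Tracking this needs nothing more than a log of review decisions keyed by node. A minimal sketch, assuming each review is recorded as a (node name, was overridden) pair; the node names and the 40% alert level are illustrative.

```python
from collections import defaultdict


def override_rates(reviews: list[tuple[str, bool]]) -> dict[str, float]:
    """Fraction of human reviews per node that overrode the model's output."""
    totals: dict[str, int] = defaultdict(int)
    overrides: dict[str, int] = defaultdict(int)
    for node, overridden in reviews:
        totals[node] += 1
        overrides[node] += int(overridden)
    return {node: overrides[node] / totals[node] for node in totals}


def nodes_needing_attention(rates: dict[str, float], alert_at: float = 0.4) -> list[str]:
    """Nodes whose override rate has reached the alert level."""
    return sorted(node for node, rate in rates.items() if rate >= alert_at)


# Hypothetical review log: (node, reviewer overrode the output?)
reviews = [
    ("conflict_detection", True),
    ("conflict_detection", False),
    ("conflict_detection", True),
    ("conflict_detection", False),
    ("conflict_detection", False),
    ("story_drafting", False),
    ("story_drafting", False),
]
rates = override_rates(reviews)
flagged = nodes_needing_attention(rates)
```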
Endava's speedup came from removing humans from the low-stakes hops: document ingestion, entity extraction, initial clustering. The gates stayed where cost was real: reviewer approval of the final summaries, before anything shipped to clients. That's not removing humans from workflows. It's routing them where they matter.
Why off-the-shelf agent frameworks don't ship this for you
LangGraph, CrewAI, AutoGen—they solve orchestration syntax. You get a clean DSL to define your DAG. Tool-call plumbing works. Basic memory management is there. None of them ship reliable production workflows out of the box.
What's missing: per-node evaluation harnesses. Drift detection. Cost attribution per step. Durable retries that survive a process restart. When we wire these systems for real, the integration surface looks like Postgres for state, Redis for queues, OpenTelemetry for traces, a custom eval runner in your CI pipeline. That's 80% of the work.
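Cost attribution per step is the easiest of these to start on, because it is only bookkeeping over token counts you already receive with each response. The sketch below assumes each call is logged as (step, input tokens, output tokens); the prices are made-up placeholders, not any vendor's actual rates.

```python
from collections import defaultdict


def cost_by_step(
    calls: list[tuple[str, int, int]],
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> dict[str, float]:
    """Sum the dollar cost per workflow step from logged token counts."""
    totals: dict[str, float] = defaultdict(float)
    for step, input_tokens, output_tokens in calls:
        totals[step] += (
            input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok
        ) / 1_000_000
    return dict(totals)


# Hypothetical call log and placeholder prices (dollars per million tokens).
calls = [
    ("entity_extraction", 8_000, 1_000),
    ("entity_extraction", 4_000, 500),
    ("clustering", 6_000, 2_000),
]
costs = cost_by_step(calls, input_price_per_mtok=1.0, output_price_per_mtok=2.0)
```

Once the per-step totals exist, attaching them to the same traces you already emit makes an expensive node visible next to its latency.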
A Microsoft Copilot or Codex deployment inside an enterprise is maybe 20% model choice, 80% workflow scaffolding. Pick a framework that gets out of your way, one that plays nice with your existing state store and queue, then budget real engineering for the reliability layer. Model routing and reasoning-loop sizing matter, but they're refinements. The foundation is workflow structure.
Before you write a line of orchestrator code, map your workflow on a whiteboard. Define the nodes. Draw the schemas in and out. Mark where humans gate the flow. Count the steps. Ask yourself: can I test this? If the answer is no, redesign. Agents and workflows look similar from the outside—they cost very different amounts to maintain. The difference lives in how ruthlessly you've decomposed the problem and how explicitly you've handled failure. That's where Endava's speedup came from. That's the work that doesn't fit in a headline.