The hidden scaling problem in production AI agents

I keep running into the same problem while building Workpods.

It's not a dramatic failure. The agent doesn't crash. It calls the tools. It gives an answer. Most of the time the answer even sounds reasonable.

Then I open the trace.

That's usually where the problem shows up. The agent read three files it did not need. It trusted an old update because it looked like the right one. It searched across the whole workspace when the question was clearly about one project. Or it spent ten calls circling around something a person would have narrowed down almost immediately.

That trace is the part of agent work that has started to feel more important to me than tool calling itself. Tool calling is table stakes. The harder question is whether the agent knows where to look before it starts acting clever.

Demos hide this problem

Most demos give the agent a tiny world. A few files. A small set of tools. One obvious task. Maybe a clean folder called docs/ or invoices/.

In that setting, the agent looks smarter than it is. There aren't many wrong turns available.

A real workspace is messier. In Workpods, the agent can see organizations, workspaces, projects, milestones, tasks, comments, attachments, owners, status updates, and old decisions buried in threads. That structure helps because it gives the agent something to navigate. It also creates a lot of places to get lost.

Even a small example grows fast. Ten projects, ten milestones per project, ten tasks per milestone. That's already 1,000 possible task paths before comments, files, owners, dependencies, or history show up. Make those numbers larger and brute force stops being funny.
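The arithmetic is just the product of the fan-out at each level. Here is a quick sketch; the second set of numbers is made up purely to show how quickly it grows, and `count_paths` is a name I am inventing for illustration.

```python
from math import prod


def count_paths(fanouts):
    """Count root-to-leaf paths in a tree with the given fan-out per level."""
    return prod(fanouts)


# The example from the text: 10 projects x 10 milestones x 10 tasks.
small = count_paths([10, 10, 10])

# Hypothetical larger workspace (assumed numbers): 50 projects, 20 milestones
# per project, 30 tasks per milestone, 15 comments per task.
larger = count_paths([50, 20, 30, 15])

print(small)   # 1000
print(larger)  # 450000
```

Adding one more level multiplies the total by that level's fan-out, which is why exhaustive search stops scaling long before the workspace feels big.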

You can give the model more context, but more context doesn't automatically mean better context. Sometimes it just means the agent carries more junk into every next step.

The annoying thing is that this failure can look like progress. The agent is busy. It's reading things. It's producing summaries. It's spending tokens. But the answer isn't getting closer.

The real skill is narrowing

The question I keep asking isn't "how do I let the agent search more?"

It's "how do I help it search less without guessing?"

Take a normal project question: "Why is this delayed?"

A weak agent treats the workspace like a pile of text. It searches for "delayed", "blocked", "late", maybe the customer name, and then tries to stitch together whatever comes back.

Sometimes that works. More often it returns the right-looking wrong thing.

The better path is more boring:

Find the project first. If the project is ambiguous, ask. Read the latest project update. Check the milestones that are actually late. Look at the open tasks under those milestones. Only then go into comments or attachments if the evidence is thin.
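The order above can be sketched as a small routing function. This is a minimal illustration over in-memory objects, not how Workpods is implemented: the `Project`, `Milestone`, and `Task` classes, their fields, and `explain_delay` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    title: str
    done: bool = False


@dataclass
class Milestone:
    name: str
    late: bool
    tasks: list = field(default_factory=list)


@dataclass
class Project:
    name: str
    latest_update: str
    milestones: list = field(default_factory=list)


def explain_delay(projects, query):
    """Narrow from project to late milestones to open tasks, in that order."""
    # Step 1: resolve the project. Ask instead of guessing when it is ambiguous.
    matches = [p for p in projects if query.lower() in p.name.lower()]
    if len(matches) != 1:
        return {"action": "ask_user", "candidates": [p.name for p in matches]}
    project = matches[0]

    # Steps 2-4: latest update, then only the late milestones and their open tasks.
    late = [m for m in project.milestones if m.late]
    open_tasks = {m.name: [t.title for t in m.tasks if not t.done] for m in late}

    return {
        "action": "answer",
        "project": project.name,
        "update": project.latest_update,
        "open_tasks": open_tasks,
        # Step 5: go into comments or attachments only if the evidence is thin.
        "dig_deeper": not any(open_tasks.values()),
    }


projects = [
    Project("Acme Install", "Supplier date still unconfirmed", [
        Milestone("Installation", late=True, tasks=[
            Task("Confirm delivery date"),
            Task("Book crew", done=True),
        ]),
        Milestone("Design", late=False, tasks=[Task("Sign off drawings", done=True)]),
    ]),
    Project("Acme Support", "On track"),
]

result = explain_delay(projects, "acme install")
ambiguous = explain_delay(projects, "acme")
```

Here `result` carries one late milestone with one open task, while `ambiguous` comes back as a request to ask the user, because two projects match "acme".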

That sounds almost too simple to write down, but it changes the behavior. The agent starts from the object the user is asking about instead of rummaging through the whole organization.

The caveat is important. A shortcut is only useful if the agent knows when to distrust it. If the latest update is two months old, it shouldn't treat it as truth. If the milestone says one thing and the task comments say another, it should widen the search. If two projects share a customer name, it should ask instead of pretending it knows.
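Those three distrust signals are simple enough to write as explicit checks. The sketch below is an assumption-heavy illustration: the 30-day threshold, the status strings, and the `distrust_reasons` function are all hypothetical, chosen only to make the rules concrete.

```python
from datetime import date, timedelta

# Assumed threshold; the right value depends on how often updates get written.
MAX_UPDATE_AGE = timedelta(days=30)


def distrust_reasons(update_date, today, milestone_status, task_statuses,
                     candidate_projects):
    """Return the reasons the agent should not trust its shortcut."""
    reasons = []
    # The latest update is too old to be treated as truth.
    if today - update_date > MAX_UPDATE_AGE:
        reasons.append("stale_update")
    # The milestone says one thing and the tasks under it say another.
    if milestone_status == "on_track" and "blocked" in task_statuses:
        reasons.append("status_conflict")
    # More than one project matches, so the agent should ask.
    if len(candidate_projects) > 1:
        reasons.append("ambiguous_project")
    return reasons


shaky = distrust_reasons(
    update_date=date(2024, 1, 1),
    today=date(2024, 3, 5),
    milestone_status="on_track",
    task_statuses=["open", "blocked"],
    candidate_projects=["Acme Install", "Acme Support"],
)

solid = distrust_reasons(
    update_date=date(2024, 3, 1),
    today=date(2024, 3, 5),
    milestone_status="late",
    task_statuses=["open"],
    candidate_projects=["Acme Install"],
)
```

An empty list means the narrow path is fine. Anything else is a cue to widen the search or ask the user.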

That's the line I keep trying to design for: narrow early, but don't become stubborn.

Memory can make things worse

Memory sounds like the fix until you build enough of it.

The naive version is easy: save everything the agent sees. Meeting notes, documents, chats, tool results, preferences, decisions, half-finished thoughts. Then later the agent can search memory.

But now you have created another workspace, just less organized.

If memory is only a dump, the agent still has to solve the same search problem. It has to know which memory matters, which one is stale, which one was a temporary guess, and which one came from an actual source.

The version I trust more looks like a maintained wiki. Raw sources still exist, but there is a smaller layer above them: project pages, customer pages, topic notes, indexes, and a short log of what changed. When new information comes in, the agent should update the relevant page instead of only storing the raw input.

For example, if a meeting transcript says procurement is waiting on a supplier date, the agent shouldn't just save the transcript. It should update the project page, mark the supplier dependency, link back to the transcript, and note when that information was added.

Then the next time someone asks for status, the agent has a sane first stop. It can read the project page, check whether it is fresh, and only open the transcript if it needs the detail.
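To make the idea concrete, here is a minimal sketch of a maintained page with provenance and a freshness check. The `ProjectPage` and `Fact` classes, the 14-day freshness window, and the transcript path are all hypothetical, not the actual Workpods data model.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Fact:
    text: str
    source: str      # link back to the raw input, e.g. a transcript
    added_on: date   # when the information was added


@dataclass
class ProjectPage:
    project: str
    facts: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def record(self, text, source, today):
        """Update the page and append a short change-log entry."""
        self.facts.append(Fact(text, source, today))
        self.log.append(f"{today.isoformat()}: added '{text}' from {source}")

    def is_fresh(self, today, max_age=timedelta(days=14)):
        """A page with no facts, or only old ones, is not a safe first stop."""
        if not self.facts:
            return False
        newest = max(f.added_on for f in self.facts)
        return today - newest <= max_age


page = ProjectPage("Acme Install")
page.record(
    "Procurement is waiting on a supplier date",
    source="transcripts/2024-03-01-procurement.txt",  # hypothetical path
    today=date(2024, 3, 1),
)

fresh_next_week = page.is_fresh(date(2024, 3, 8))
fresh_in_two_months = page.is_fresh(date(2024, 5, 1))
```

The raw transcript stays where it is. The page is the first thing the agent reads, and the `source` field tells it where to go when it needs the detail.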

That's why I think "memory" is a slightly misleading word. What I actually want is maintained context. A map. Not a basement full of boxes.

Evals need traces

Another thing I have changed my mind about: evals without traces are too blurry.

If all I see is the final answer, I can tell whether it sounds good. I can't tell whether the agent got lucky. Maybe it cited the right blocker after searching the wrong project first. Maybe it ignored a newer update. Maybe it used one expensive tool call when a cheap lookup would have worked.

The trace is where the useful evidence is.

For a delayed project, I care about questions like:

Did it identify the right project?
Did it use the latest update?
Did it inspect the tasks that actually explain the delay?
Did it stop once it had enough evidence?
Did it say what was known versus assumed?
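Each of those questions can become a check over the trace. The sketch below assumes a made-up trace format (a list of tool calls plus a structured answer). The tool names `find_project`, `read_update`, and `read_task`, and the `check_trace` function itself, are hypothetical.

```python
def check_trace(trace, expected):
    """Turn the five trace questions into boolean checks."""
    calls = trace["calls"]
    projects_touched = [c["project"] for c in calls if "project" in c]
    updates_read = {c["update_id"] for c in calls if c["tool"] == "read_update"}
    tasks_read = {c["task"] for c in calls if c["tool"] == "read_task"}
    answer = trace["answer"]
    return {
        # Right project, without searching a wrong one first.
        "right_project": bool(projects_touched)
        and all(p == expected["project"] for p in projects_touched),
        "latest_update": expected["latest_update_id"] in updates_read,
        "blocking_tasks": expected["blocking_tasks"] <= tasks_read,
        "stopped_in_time": len(calls) <= expected["max_calls"],
        "known_vs_assumed": "known" in answer and "assumed" in answer,
    }


expected = {
    "project": "Acme Install",
    "latest_update_id": "u7",
    "blocking_tasks": {"Confirm delivery date"},
    "max_calls": 5,
}

good_trace = {
    "calls": [
        {"tool": "find_project", "project": "Acme Install"},
        {"tool": "read_update", "project": "Acme Install", "update_id": "u7"},
        {"tool": "read_task", "project": "Acme Install",
         "task": "Confirm delivery date"},
    ],
    "answer": {"known": ["Supplier delivery unconfirmed"], "assumed": []},
}

# Same final answer, but the agent wandered through the wrong project first.
lucky_trace = {
    "calls": [{"tool": "find_project", "project": "Acme Support"}]
    + good_trace["calls"],
    "answer": good_trace["answer"],
}

good = check_trace(good_trace, expected)
lucky = check_trace(lucky_trace, expected)
```

Both traces end in the same answer. Only the checks over the path show that the second one got there by luck.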

The final answer matters, obviously. But the path matters too, because that is what I can improve.

An evolutionary analogy helps here, even though I don't want to make it sound more profound than it is. You try a strategy, watch it fail, keep the parts that worked, and test again. The useful part is selection. Bad search paths become eval cases. Good paths become patterns you preserve.

That's much closer to how real agent work feels than the clean diagrams suggest. Ship something. Watch it wander. Turn the wandering into tests. Tighten the prompt, the tool, the memory page, or the route. Repeat.

Better is not one metric

The fitness function matters because "better" is easy to define badly.

If I optimize only for speed, the agent stops too early. If I optimize only for completion, it may skip approvals or ask fewer clarifying questions than it should. If I optimize only for user satisfaction, I may train it to sound confident when it should be uncertain.

For Workpods, a good answer is usually not the longest answer. It's the one that found the right object, used current evidence, avoided unnecessary tool calls, separated fact from guess, and left the user with a concrete next step.
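One way to keep a single metric from taking over is to score several criteria together and gate on the one that matters most. This is a sketch under my own assumptions: the point values, the call budget, and the `fitness` function are illustrative, not a tuned rubric.

```python
# Assumed point values, out of 10. The split is arbitrary and only illustrative.
CRITERIA = {
    "right_object": 3,      # found the right project, milestone, or task
    "current_evidence": 3,  # used the latest update, not an old one
    "fact_vs_guess": 2,     # separated what is known from what is assumed
    "next_step": 2,         # left the user with a concrete next action
}


def fitness(result, call_budget=6):
    """Score an answer on several criteria, not on speed or length alone."""
    # Hard gate: a fluent answer about the wrong object scores nothing.
    if not result["right_object"]:
        return 0
    points = sum(weight for name, weight in CRITERIA.items() if result[name])
    # One point off for every tool call beyond the budget.
    wasted = max(0, result["tool_calls"] - call_budget)
    return max(0, points - wasted)


vague = {"right_object": True, "current_evidence": False,
         "fact_vs_guess": False, "next_step": False, "tool_calls": 3}
specific = {"right_object": True, "current_evidence": True,
            "fact_vs_guess": True, "next_step": True, "tool_calls": 4}
wandering = dict(specific, tool_calls=10)
fast_but_wrong = dict(specific, right_object=False, tool_calls=1)

scores = [fitness(r) for r in (vague, specific, wandering, fast_but_wrong)]
print(scores)  # [3, 10, 6, 0]
```

The fastest run scores zero because it answered about the wrong object, and the thorough but wandering run loses points for the extra calls.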

Here is a small example.

"The project is delayed because several tasks are overdue" isn't good enough. It may be technically true, but it is not useful.

"The installation milestone is late because supplier delivery is still unconfirmed. The latest update mentions the supplier delay, and the next open task is to confirm the delivery date before rescheduling installation" is better. It connects the symptom, the cause, the evidence, and the next action.

That kind of difference is what an eval has to care about. Otherwise the system rewards fluent mush.

Why I still like the Bayesian framing

There's a clean theoretical version of this: the Bayesian agent.

Start with priors. Observe evidence. Update beliefs. Choose the action with the best expected value.

It's a nice mental model. It's not something I expect to implement exactly. Exact Bayesian reasoning blows up quickly because the agent would have to consider every possible state and every possible action. Even toy worlds become intractable once you try to enumerate everything.
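On a tiny hypothesis set the loop is easy to write out, which is part of why the framing is appealing. Every number below is invented for illustration: the three candidate causes, the priors, the likelihoods, and the utilities are assumptions, and nothing like this runs inside the agent.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a finite set of hypotheses."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


def best_action(belief, utilities):
    """Pick the action with the highest expected utility under the belief."""
    def expected(action):
        return sum(belief[h] * utilities[action][h] for h in belief)
    return max(utilities, key=expected)


# Prior over why the project is delayed (assumed numbers).
prior = {"supplier": 0.5, "staffing": 0.3, "scope": 0.2}

# Observation: the latest update mentions a supplier delay.
# P(observation | cause), also assumed.
likelihood = {"supplier": 0.8, "staffing": 0.1, "scope": 0.1}

# Blaming the supplier is costly when wrong; asking the user is cheap but slow.
utilities = {
    "answer_supplier": {"supplier": 1.0, "staffing": -2.0, "scope": -2.0},
    "ask_user": {"supplier": 0.2, "staffing": 0.2, "scope": 0.2},
}

belief = posterior(prior, likelihood)
before = best_action(prior, utilities)   # "ask_user"
after = best_action(belief, utilities)   # "answer_supplier"

# Why exact enumeration fails: 40 yes/no facts already give about
# 1.1 trillion joint states.
joint_states = 2 ** 40
```

Before the observation, asking is the better action. After it, the belief in the supplier cause rises to about 0.89 and answering wins. The same code over every joint state of a real workspace is where it stops being practical.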

Still, the framing is useful.

In a real agent, the "prior" is whatever the system already knows: memory, project structure, skills, previous traces, tool descriptions. The observations are tool calls, user messages, retrieved files, status updates, and feedback. The utility function is not pure math. It's a messy product judgement: be correct, useful, safe, fast enough, cheap enough, and willing to ask when unsure.

That framing is why context architecture matters so much. Not because it is elegant, but because it gives the agent better priors and cheaper observations.

What I am taking from this

The more I build this, the less I believe in agents that simply "see everything."

Seeing everything is usually the problem.

The useful agent has a good first guess about where to start. It can tell when that guess is weak. It knows which memory is fresh, which source is authoritative, which tool is worth calling, and when it should stop reading and answer.

That's the search problem I think production agents have to solve. More tools make it harder. More documents make it harder. More users, tasks, permissions, and memory all add branches.

So the work is not only model choice or prompt quality. It's maps, indexes, maintained memory, traces, evals, and boring rules about where to look first.

That sounds less magical than most agent demos.

That's also where the product starts to work.