Who tends the garden?
- AI agents are inheriting the challenge of fragmented enterprise knowledge, requiring organizations to clean and curate their internal data.
- Agents face difficulties navigating conflicting context, leading to emerging patterns like confidence thresholds, multi-source arbitration, and human-in-the-loop checkpoints.
- A new mindset is needed, with AI-forward employees acting as archivists and curators to maintain knowledge health and prevent costly misjudgments.
- By combining human judgment with AI capabilities, organizations can create an actively improving knowledge base that provides justified confidence for autonomous agents.
Every company is beginning to grapple with what happens when AI agents start acting on enterprise knowledge rather than simply surfacing it. In previous essays, I explored who owns context when software becomes agentic and what changes when software stops being used and starts acting. Both pieces converge on the same structural problem — judgment is fragmenting across systems, and no layer in the stack can currently own context end-to-end.
But there is a more immediate and more human version of this problem that most enterprises already live with — and that agents are about to inherit.
Most enterprises today maintain multiple systems of record for information, context, and judgment. These can be messy and contradictory when it comes time to decide what information to use.
A human being or an AI agent may encounter different or even conflicting answers across the platforms they use for documentation, chat, email, and presentations, because teams scatter the information, maintain it unevenly, and rarely reconcile it. Humans have learned to navigate this mess through intuition, relationships, and institutional memory. Agents do not have that luxury.
Today, I want to explore just how powerful AI agents and platforms are becoming when it comes to understanding and acting on the context and knowledge in your enterprise — and the new challenges and opportunities, including an entirely new class of jobs, that this may open up for human beings.
A lesson from the world’s coding library
Before the explosion of AI-powered coding agents, the engineers on my team spent a lot of time on Stack Overflow. You solved tricky bugs or learned new things by consulting the world’s largest and most active library of coding knowledge.
The approach was simple — harness the wisdom of the crowd to identify and authenticate the best information. A user would ask a question, others would provide answers, and others would vote, edit, and tag these Q&A couplets. If you found a question with a good score and an accepted answer, chances were good that the information could help you, too.
Since November 2022, the volume of questions asked on Stack Overflow has fallen almost 90%. Developers stopped turning to other humans and simply asked AI instead.
In our internal testing, we are always pushing the boundaries and playing with ideas. I have seen some amazing examples of our system understanding and responding to enterprise context in recent months.
For example, AI systems at WRITER will recognize when a new piece of writing is lacking the latest marketing messaging and suggest how to weave it in. When you ask it to work with a new API for the first time, it will remember successful sessions that used a similar API from other companies and replicate the approaches that worked well — instead of starting from scratch.
The key here is that this context is part of the institutional and organizational memory a system might have for an individual — but it is far more powerful when it applies broadly to teams, departments, and organizations. The individual, as they once did with Stack Overflow, can rely on the wisdom of the crowd, surfaced by their AI assistant.
Where this problem is already solved — and where it isn’t
Where this kind of contextual awareness stumbles, as I mentioned in my introduction, is when the AI encounters conflicting or inaccurate data. How can it make a judgment call about what is best to share?
This data fragmentation is less pronounced in highly regulated industries, which have extremely tight controls around certain information — trading and banking records in finance, or procedure and dosage recommendations and records in medicine. In those domains, systems of record are legally defined, tightly governed, and actively maintained because the cost of error is existential.
But for the average enterprise company, there is a ton of internal knowledge and context that is not always clearly labeled or does not resolve to a single empirical truth. Policy lives in a Google Doc that someone last updated 18 months ago. Onboarding procedures exist in three versions across Notion, Confluence, and someone’s Slack pinned messages. Pricing guidance differs between what the wiki says, what the sales team was told last quarter, and what finance actually approved.
When a human encounters this ambiguity, they do what humans have always done — they ask around, they use judgment, they triangulate. When an agent encounters it, its ability to triangulate with others is more limited. Do we want agents popping into Slack and asking subject matter experts for clarification every time they think it might be worth adding enterprise context to a user’s query? The answer, as agents become increasingly intelligent and their depth of knowledge grows, is likely yes.
How an agent navigates conflicting context
The question is how an AI agent would deal with this issue and sort through various answers to determine what action to take.
At what point would it stop and ask a human for advice, or offer a confidence score before proceeding? At what point might it rely instead on another AI system as an arbiter or judge, or poll several LLMs internally before making a decision?
These are not hypothetical design questions. They are the operational reality that any enterprise deploying agentic systems across its internal knowledge will face. The agent must decide — or rely on human guidance for — how to handle the moment when the information it has access to does not converge on a single answer. There are several patterns emerging as possible solutions here:
- Confidence thresholds, where the agent proceeds only when its certainty exceeds a defined level and escalates to a human otherwise.
- Multi-source arbitration, where the agent weighs information by recency, authority, or provenance before acting.
- Ensemble judgment, where multiple models or agents independently evaluate the same context, and a decision is made only when they converge.
- Human-in-the-loop checkpoints, where certain categories of decisions are never fully delegated, regardless of confidence.
Each of these patterns has tradeoffs. Confidence thresholds are only as good as the agent’s calibration. Arbitration requires metadata that most enterprise content does not carry. Ensemble approaches add latency and cost. Human checkpoints reintroduce the bottleneck that agents were meant to remove.
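To make those tradeoffs concrete, here is a minimal sketch of how the first three patterns might fit together. Every name, threshold, and weight is hypothetical, chosen to show the shape of the decision rather than any particular product’s implementation.

```python
# Minimal sketch of the patterns above. All names, thresholds, and weights
# are hypothetical; real values would come from your own governance policy.
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.8  # below this, the agent escalates to a human

@dataclass
class KnowledgeHit:
    answer: str
    source: str          # e.g. "wiki", "slack", "finance-approved doc"
    recency_days: int    # how old the content is
    authority: float     # 0..1, assigned by curators or system owners
    confidence: float    # the agent's own calibration for this hit

def arbitrate(hits: list[KnowledgeHit]) -> KnowledgeHit:
    """Multi-source arbitration: weigh authority, recency, and confidence."""
    def score(h: KnowledgeHit) -> float:
        freshness = 1.0 / (1.0 + h.recency_days / 30.0)
        return 0.5 * h.authority + 0.3 * freshness + 0.2 * h.confidence
    return max(hits, key=score)

def ensemble_agrees(candidate_answers: list[str]) -> bool:
    """Ensemble judgment: proceed only when independent evaluations converge."""
    return len(set(candidate_answers)) == 1

def decide(hits: list[KnowledgeHit], never_delegate: bool = False):
    """Confidence threshold plus a human-in-the-loop checkpoint."""
    if never_delegate or not hits:
        return "escalate_to_human", None
    best = arbitrate(hits)
    if best.confidence < CONFIDENCE_THRESHOLD:
        return "escalate_to_human", best
    return "act", best
```

The human-in-the-loop checkpoint shows up here as the `never_delegate` flag: for certain categories of decisions, the agent escalates no matter how confident it is.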
Crucially, none of them work well if the underlying knowledge is not maintained. The golden rule in AI remains — garbage in, garbage out.
The case for a new mindset
While the media has written many alarmist stories about AI’s threat to jobs, I see an amazing opportunity for a new practice to take root inside large enterprise companies.
Any organization serious about letting agentic systems act on its systems of record and make judgments will notice that AI-forward employees are already acting as archivists, librarians, data cleaners, and compilers: people who keep an eye on the health of the content and data the AI draws on, and who stay in the loop to help make decisions when messy, incorrect, stale, or conflicting data is pushing the system toward inappropriate or costly judgments.
In a system like this, citations are key. The system’s reasoning is not opaque. It shows its sources, its confidence, and the provenance of the knowledge it relied on. Users can then submit a response highlighting where the context was useful, neutral, or negative, thereby adding a feedback loop that the LLM and the curators can use to update and improve the knowledge base and the way in which the LLM and agents draw on it.
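As a rough illustration, a cited response and the feedback it collects might look like the sketch below. The field names are assumptions made for the sake of the example, not a real schema.

```python
# Hypothetical shape of a cited response and its feedback loop.
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class Citation:
    source_id: str       # identifier of the document in the knowledge base
    title: str
    last_updated: str    # ISO date, so staleness is visible to the reader
    owner: str           # the team or curator responsible for the source

@dataclass
class AgentResponse:
    answer: str
    confidence: float    # surfaced to the user, not hidden
    citations: list[Citation] = field(default_factory=list)

@dataclass
class Feedback:
    response_id: str
    rating: Literal["useful", "neutral", "negative"]
    note: str = ""       # free-text signal routed to the curators

def route_feedback(fb: Feedback, curation_queue: list[Feedback]) -> None:
    """Negative signals jump the queue; everything feeds the knowledge base."""
    if fb.rating == "negative":
        curation_queue.insert(0, fb)
    else:
        curation_queue.append(fb)
```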
As we move towards more autonomous and powerful systems, the stakes are higher. The consumer of that knowledge is no longer a human who can exercise judgment over bad information. It is an agent that may act on it directly. When an AI provides bad information and shares it with the wrong people, a serious security incident can occur.
The ever-evolving organizational brain
So what would it actually look like for humans and AI systems to tend the garden of knowledge together?
When your AI platform recognizes that your request matches a pattern from previous work with a successful output, it can quickly home in on the best path for your latest request. For example, say you ask the agent:
“I need to create a scraper using the Zyte API for these product pages, and I need all the available content in a structured way.” The AI might respond by saying, “I can see you have experience building Zyte API scrapers. Let me create a Naturium scraper following your proven Amazon/Walmart scraper patterns.”
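Under the hood, that kind of recall amounts to matching a new request against previous successful sessions. The sketch below is a deliberately simplified assumption about how the lookup could work; the session store, the similarity function, and the cutoff are stand-ins, and a production system would use embeddings rather than word overlap.

```python
# Illustrative only: recalling prior successful sessions before starting
# from scratch. Names and the similarity measure are assumptions.
from dataclasses import dataclass

@dataclass
class PastSession:
    summary: str      # e.g. "Built an Amazon scraper with the Zyte API"
    approach: str     # the pattern that worked, kept for reuse
    succeeded: bool

def similarity(a: str, b: str) -> float:
    """Placeholder word-overlap score; a real system would use embeddings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def recall_relevant(request: str, history: list[PastSession], cutoff: float = 0.2):
    """Return successful sessions similar enough to the new request to reuse."""
    return [s for s in history
            if s.succeeded and similarity(request, s.summary) >= cutoff]
```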
In the context of a 1:1 workflow, this kind of recall is a terrific way for the AI to support you like a seasoned employee. But what happens when someone from another team asks for the API documentation about how your firm approaches this type of challenge, and there is contradictory advice and information across Slack, Jira, Notion, and email?
When a confidence score is low, the system flags it for a human curator who can discuss it with colleagues. If the right answer is available, the system can deprecate incorrect information. If no clear answer is available, this is the time to escalate it to a team of humans to make a decision and clearly label the data going forward.
Each day, the system makes updates and revisions to its internal knowledge base, noting the differences in a manner similar to a Git version history. New facts surface. Old ones are flagged as potentially stale. Conflicts between sources are identified and queued for review.
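One way to picture that Git-style history is a versioned knowledge entry with an automated staleness check, as in the hypothetical sketch below. The statuses, field names, and review window are assumptions, not a prescribed design.

```python
# Sketch of versioned knowledge entries with a nightly staleness review.
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class Revision:
    changed_on: date
    summary: str         # what changed and why, like a commit message

@dataclass
class KnowledgeEntry:
    fact: str
    last_verified: date
    status: str = "active"                   # active | stale | deprecated
    history: list[Revision] = field(default_factory=list)

def nightly_review(entries: list[KnowledgeEntry], max_age_days: int = 180):
    """Flag old facts as potentially stale and queue them for human review."""
    review_queue = []
    today = date.today()
    for entry in entries:
        age = today - entry.last_verified
        if entry.status == "active" and age > timedelta(days=max_age_days):
            entry.status = "stale"
            entry.history.append(Revision(today, "auto-flagged as potentially stale"))
            review_queue.append(entry)
    return review_queue
```

Curators work the queue, and every change they approve lands in the entry’s history, which is what gives agents the audit trail described next.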
This creates something that none of the patterns described earlier — confidence thresholds, arbitration, ensemble judgment — can achieve on their own: a knowledge base that actively improves through use. The curators provide the institutional judgment. The feedback loop provides signals at scale. The version history provides auditability. And the agents gain something they currently lack — justified confidence that the data they are working with is fresh and accurate.
Over time, we expect these systems to improve at the same rate as LLMs have in the last three years. As this happens, LLMs will increasingly know when they can provide value and when they don’t meet a threshold to interject.
This is not a fantasy architecture. Every component exists today. What is missing is the organizational commitment to treating knowledge health as infrastructure rather than an afterthought — and the recognition that this work requires a change in mindset, among employees and organizations, not just better software.