
The Architect’s Dilemma

13 October 2025 at 07:22

The agentic AI landscape is exploding. Every new framework, demo, and announcement promises to let your AI assistant book flights, query databases, and manage calendars. This rapid advancement of capabilities is thrilling for users, but for the architects and engineers building these systems, it poses a fundamental question: When should a new capability be a simple, predictable tool (exposed via the Model Context Protocol, MCP) and when should it be a sophisticated, collaborative agent (exposed via the Agent2Agent Protocol, A2A)?

The common advice is often circular and unhelpful: “Use MCP for tools and A2A for agents.” This is like telling a traveler that cars use motorways and trains use tracks, without offering any guidance on which is better for a specific journey. This lack of a clear mental model leads to architectural guesswork. Teams build complex conversational interfaces for tasks that demand rigid predictability, or they expose rigid APIs to users who desperately need guidance. The outcome is often the same: a system that looks great in demos but falls apart in the real world.

In this article, I argue that the answer isn’t found by analyzing your service’s internal logic or technology stack. It’s found by looking outward and asking a single, fundamental question: Who is calling your product/service? By reframing the problem this way—as a user experience challenge first and a technical one second—the architect’s dilemma evaporates.

This essay draws a line where it matters for architects: the line between MCP tools and A2A agents. I will introduce a clear framework, built around the “Vending Machine Versus Concierge” model, to help you choose the right interface based on your consumer’s needs. I will also explore failure modes, testing, and the powerful Gatekeeper Pattern that shows how these two interfaces can work together to create systems that are not just clever but truly reliable.

Two Very Different Interfaces

MCP presents tools—named operations with declared inputs and outputs. The caller (a person, program, or agent) must already know what it wants, and provide a complete payload. The tool validates, executes once, and returns a result. If your mental image is a vending machine—insert a well-formed request, get a deterministic response—you’re close enough.

A2A presents agents—goal-first collaborators that converse, plan, and act across turns. The caller expresses an outcome (“book a refundable flight under $450”), not an argument list. The agent asks clarifying questions, calls tools as needed, and holds onto session state until the job is done. If you picture a concierge—interacting, negotiating trade-offs, and occasionally escalating—you’re in the right neighborhood.

Neither interface is “better.” They are optimized for different situations:

  • MCP is fast to reason about, easy to test, and strong on determinism and auditability.
  • A2A is built for ambiguity, long-running processes, and preference capture.

Bringing the Interfaces to Life: A Booking Example

To see the difference in practice, let’s imagine a simple task: booking a specific meeting room in an office.

The MCP “vending machine” expects a perfectly structured, machine-readable request for its book_room_tool. The caller must provide all necessary information in a single, valid payload:

{
  "jsonrpc": "2.0",
  "id": 42,
  "method": "tools/call",
  "params": {
    "name": "book_room_tool",
    "arguments": {
      "room_id": "CR-104B",
      "start_time": "2025-11-05T14:00:00Z",
      "end_time": "2025-11-05T15:00:00Z",
      "organizer": "user@example.com"
    }
  }
}

Any deviation—a missing field or incorrect data type—results in an immediate error. This is the vending machine: You provide the exact code of the item you want (e.g., “D4”) or you get nothing.

The A2A “concierge,” an “office assistant” agent, is approached with a high-level, ambiguous goal. It uses conversation to resolve ambiguity:

User: “Hey, can you book a room for my 1-on-1 with Alex tomorrow afternoon?”
Agent: “Of course. To make sure I get the right one, what time works best, and how long will you need it for?”

The agent’s job is to take the ambiguous goal, gather the necessary details, and then likely call the MCP tool behind the scenes once it has a complete, valid set of arguments.
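
To make that hand-off concrete, here is a minimal sketch of the concierge side in Python. The HTTP endpoint and the send_mcp_request helper are assumptions for illustration; the payload mirrors the book_room_tool call shown above.

import json
import urllib.request

MCP_ENDPOINT = "https://tools.example.com/mcp"  # hypothetical MCP server URL

def send_mcp_request(payload: dict) -> dict:
    """POST a JSON-RPC payload to the (assumed) MCP endpoint and return the parsed response."""
    req = urllib.request.Request(
        MCP_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def book_room(room_id: str, start_time: str, end_time: str, organizer: str) -> dict:
    """Once the ambiguity is resolved, the agent issues a fully specified tools/call."""
    return send_mcp_request({
        "jsonrpc": "2.0",
        "id": 42,
        "method": "tools/call",
        "params": {
            "name": "book_room_tool",
            "arguments": {
                "room_id": room_id,
                "start_time": start_time,
                "end_time": end_time,
                "organizer": organizer,
            },
        },
    })

# After the clarifying dialogue ("2 p.m. tomorrow, for an hour"), the agent has
# everything the vending machine requires:
result = book_room("CR-104B", "2025-11-05T14:00:00Z",
                   "2025-11-05T15:00:00Z", "user@example.com")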

With this clear dichotomy established—the predictable vending machine (MCP) versus the stateful concierge (A2A)—how do we choose? As I argued in the introduction, the answer isn’t found in your tech stack. It’s found by asking the most important architectural question of all: Who is calling your service?

Step 1: Identify your consumer

  1. The machine consumer: A need for predictability
    Is your service going to be called by another automated system, a script, or another agent acting in a purely deterministic capacity? This consumer requires absolute predictability. It needs a rigid, unambiguous contract that can be scripted and relied upon to behave the same way every single time. It cannot handle a clarifying question or an unexpected update; any deviation from the strict contract is a failure. This consumer doesn’t want a conversation; it needs a vending machine. This nonnegotiable requirement for a predictable, stateless, and transactional interface points directly to designing your service as a tool (MCP).
  2. The human (or agentic) consumer: A need for convenience
    Is your service being built for a human end user or for a sophisticated AI that’s trying to fulfill a complex, high-level goal? This consumer values convenience and the offloading of cognitive load. They don’t want to specify every step of a process; they want to delegate ownership of a goal and trust that it will be handled. They’re comfortable with ambiguity because they expect the service—the agent—to resolve it on their behalf. This consumer doesn’t want to follow a rigid script; they need a concierge. This requirement for a stateful, goal-oriented, and conversational interface points directly to designing your service as an agent (A2A).

By starting with the consumer, the architect’s dilemma often evaporates. Before you ever debate statefulness or determinism, you first define the user experience you are obligated to provide. In most cases, identifying your customer will give you your definitive answer.

Step 2: Validate with the four factors

Once you have identified who calls your service, you have a strong hypothesis for your design. A machine consumer points to a tool; a human or agentic consumer points to an agent. The next step is to validate this hypothesis with a technical litmus test. This framework gives you the vocabulary to justify your choice and ensure the underlying architecture matches the user experience you intend to create.

  1. Determinism versus ambiguity
    Does your service require a precise, unambiguous input, or is it designed to interpret and resolve ambiguous goals? A vending machine is deterministic. Its API is rigid: GET /item/D4. Any other request is an error. This is the world of MCP, where a strict schema ensures predictable interactions. A concierge handles ambiguity. “Find me a nice place for dinner” is a valid request that the agent is expected to clarify and execute. This is the world of A2A, where a conversational flow allows for clarification and negotiation.
  2. Simple execution versus complex process
    Is the interaction a single, one-shot execution, or a long-running, multistep process? A vending machine performs a short-lived execution. The entire operation—from payment to dispensing—is an atomic transaction that is over in seconds. This aligns with the synchronous-style, one-shot model of MCP. A concierge manages a process. Booking a full travel itinerary might take hours or even days, with multiple updates along the way. This requires the asynchronous, stateful nature of A2A, which can handle long-running tasks gracefully.
  3. Stateless versus stateful
    Does each request stand alone or does the service need to remember the context of previous interactions? A vending machine is stateless. It doesn’t remember that you bought a candy bar five minutes ago. Each transaction is a blank slate. MCP is designed for these self-contained, stateless calls. A concierge is stateful. It remembers your preferences, the details of your ongoing request, and the history of your conversation. A2A is built for this, using concepts like a session or thread ID to maintain context.
  4. Direct control versus delegated ownership
    Is the consumer orchestrating every step, or are they delegating the entire goal? When using a vending machine, the consumer is in direct control. You are the orchestrator, deciding which button to press and when. With MCP, the calling application retains full control, making a series of precise function calls to achieve its own goal. With a concierge, you delegate ownership. You hand over the high-level goal and trust the agent to manage the details. This is the core model of A2A, where the consumer offloads the cognitive load and trusts the agent to deliver the outcome.
Factor      | Tool (MCP)                         | Agent (A2A)                      | Key question
Determinism | Strict schema; errors on deviation | Clarifies ambiguity via dialogue | Can inputs be fully specified up front?
Process     | One-shot                           | Multistep/long-running           | Is this atomic or a workflow?
State       | Stateless                          | Stateful/sessionful              | Must we remember context/preferences?
Control     | Caller orchestrates                | Ownership delegated              | Who drives: the caller or the callee?

Table 1: Four-question framework

These factors are not independent checkboxes; they are four facets of the same core principle. A service that is deterministic, transactional, stateless, and directly controlled is a tool. A service that handles ambiguity, manages a process, maintains state, and takes ownership is an agent. By using this framework, you can confidently validate that the technical architecture of your service aligns perfectly with the needs of your customer.

No framework, no matter how clear…

…can perfectly capture the messiness of the real world. While the “Vending Machine Versus Concierge” model provides a robust guide, architects will eventually encounter services that seem to blur the lines. The key is to remember the core principle we’ve established: The choice is dictated by the consumer’s experience, not the service’s internal complexity.

Let’s explore two common edge cases.

The complex tool: The iceberg
Consider a service that performs a highly complex, multistep internal process, like a video transcoding API. A consumer sends a video file and a desired output format. This is a simple, predictable request. But internally, this one call might kick off a massive, long-running workflow involving multiple machines, quality checks, and encoding steps. It’s a hugely complex process.

However, from the consumer’s perspective, none of that matters. They made a single, stateless, fire-and-forget call. They don’t need to manage the process; they just need a predictable result. This service is like an iceberg: 90% of its complexity is hidden beneath the surface. But because its external contract is that of a vending machine—a simple, deterministic, one-shot transaction—it is, and should be, implemented as a tool (MCP).

The simple agent: The scripted conversation
Now consider the opposite: a service with very simple internal logic that still requires a conversational interface. Imagine a chatbot for booking a dentist appointment. The internal logic might be a simple state machine: ask for a date, then a time, then a patient name. It’s not “intelligent” or particularly flexible.

However, it must remember the user’s previous answers to complete the booking. It’s an inherently stateful, multiturn interaction. The consumer cannot provide all the required information in a single, prevalidated call. They need to be guided through the process. Despite its internal simplicity, the need for a stateful dialogue makes it a concierge. It must be implemented as an agent (A2A) because its consumer-facing experience is that of a conversation, however scripted.

These gray areas reinforce the framework’s central lesson. Don’t get distracted by what your service does internally. Focus on the experience it provides externally. That contract with your customer is the ultimate arbiter in the architect’s dilemma.

Testing What Matters: Different Strategies for Different Interfaces

A service’s interface doesn’t just dictate its design; it dictates how you validate its correctness. Vending machines and concierges have fundamentally different failure modes and require different testing strategies.

Testing MCP tools (vending machines):

  • Contract testing: Validate that inputs and outputs strictly adhere to the defined schema.
  • Idempotency tests: Ensure that calling the tool multiple times with the same inputs produces the same result without side effects.
  • Deterministic logic tests: Use standard unit and integration tests with fixed inputs and expected outputs.
  • Adversarial fuzzing: Test for security vulnerabilities by providing malformed or unexpected arguments.
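
As a sketch of the first two bullets, here is what a contract test and an idempotency test for the earlier book_room_tool might look like, using pytest and jsonschema. The call_tool fixture and the response shape are assumptions, not part of the MCP specification.

import jsonschema
import pytest

# Assumed input contract for book_room_tool (mirrors the earlier JSON-RPC example).
BOOK_ROOM_SCHEMA = {
    "type": "object",
    "properties": {
        "room_id": {"type": "string"},
        "start_time": {"type": "string"},
        "end_time": {"type": "string"},
        "organizer": {"type": "string"},
    },
    "required": ["room_id", "start_time", "end_time", "organizer"],
    "additionalProperties": False,
}

VALID_ARGS = {
    "room_id": "CR-104B",
    "start_time": "2025-11-05T14:00:00Z",
    "end_time": "2025-11-05T15:00:00Z",
    "organizer": "user@example.com",
}

def test_contract_rejects_missing_field():
    bad_args = {k: v for k, v in VALID_ARGS.items() if k != "organizer"}
    with pytest.raises(jsonschema.ValidationError):
        jsonschema.validate(bad_args, BOOK_ROOM_SCHEMA)  # the vending machine refuses it

def test_idempotent_booking(call_tool):
    # call_tool is a hypothetical fixture that invokes the MCP server under test.
    first = call_tool("book_room_tool", VALID_ARGS)
    second = call_tool("book_room_tool", VALID_ARGS)
    assert first == second  # same inputs, same result, no duplicate side effects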

Testing A2A agents (concierges):

  • Goal completion rate (GCR): Measure the percentage of conversations where the agent successfully achieved the user’s high-level goal.
  • Conversational efficiency: Track the number of turns or clarifications required to complete a task.
  • Tool selection accuracy: For complex agents, verify that the right MCP tool was chosen for a given user request.
  • Conversation replay testing: Use logs of real user interactions as a regression suite to ensure updates don’t break existing conversational flows.
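
A minimal sketch of how goal completion rate, efficiency, and conversation replay might be wired up, assuming a simple ConversationLog record and a hypothetical run_agent callable; a real harness would add a human reviewer or LLM judge for labeling.

from dataclasses import dataclass

@dataclass
class ConversationLog:
    goal: str
    turns: list[str]       # alternating user/agent messages, user first
    goal_achieved: bool    # labeled by a human reviewer or an LLM judge

def goal_completion_rate(logs: list[ConversationLog]) -> float:
    """Share of conversations in which the agent met the user's high-level goal."""
    return sum(log.goal_achieved for log in logs) / max(len(logs), 1)

def average_turns(logs: list[ConversationLog]) -> float:
    """Conversational efficiency: fewer clarifying turns is usually better."""
    return sum(len(log.turns) for log in logs) / max(len(logs), 1)

def replay(logs: list[ConversationLog], run_agent) -> list[ConversationLog]:
    """Regression suite: re-drive the user side of old transcripts through the current build."""
    # run_agent is a hypothetical callable: list of user messages -> ConversationLog
    return [run_agent(log.turns[::2]) for log in logs]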

The Gatekeeper Pattern

Our journey so far has focused on a dichotomy: MCP or A2A, vending machine or concierge. But the most sophisticated and robust agentic systems do not force a choice. Instead, they recognize that these two protocols don’t compete with each other; they complement each other. The ultimate power lies in using them together, with each playing to its strengths.

The most effective way to achieve this is through a powerful architectural choice we can call the Gatekeeper Pattern.

In this pattern, a single, stateful A2A agent acts as the primary, user-facing entry point—the concierge. Behind this gatekeeper sits a collection of discrete, stateless MCP tools—the vending machines. The A2A agent takes on the complex, messy work of understanding a high-level goal, managing the conversation, and maintaining state. It then acts as an intelligent orchestrator, making precise, one-shot calls to the appropriate MCP tools to execute specific tasks.

Consider a travel agent. A user interacts with it via A2A, giving it a high-level goal: “Plan a business trip to London for next week.”

  • The travel agent (A2A) accepts this ambiguous request and starts a conversation to gather details (exact dates, budget, etc.).
  • Once it has the necessary information, it calls a flight_search_tool (MCP) with precise arguments like origin, destination, and date.
  • It then calls a hotel_booking_tool (MCP) with the required city, check_in_date, and room_type.
  • Finally, it might call a currency_converter_tool (MCP) to provide expense estimates.

Each tool is a simple, reliable, and stateless vending machine. The A2A agent is the smart concierge that knows which buttons to press and in what order. This pattern provides several significant architectural benefits:

  • Decoupling: It separates the complex, conversational logic (the “how”) from the simple, reusable business logic (the “what”). The tools can be developed, tested, and maintained independently.
  • Centralized governance: The A2A gatekeeper is the perfect place to implement cross-cutting concerns. It can handle authentication, enforce rate limits, manage user quotas, and log all activity before a single tool is ever invoked.
  • Simplified tool design: Because the tools are just simple MCP functions, they don’t need to worry about state or conversational context. Their job is to do one thing and do it well, making them incredibly robust.
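
A minimal sketch of the travel-agent gatekeeper, assuming a hypothetical ask_user callback for the conversational side and a call_tool helper standing in for a real MCP client:

def plan_trip(goal: str, ask_user, call_tool) -> dict:
    """Gatekeeper: one stateful A2A conversation orchestrating stateless MCP tools.

    ask_user  - hypothetical callback that poses a clarifying question to the user
    call_tool - hypothetical MCP client: (tool_name, arguments) -> result dict
    """
    # 1. Resolve ambiguity conversationally (the concierge's job).
    depart = ask_user("Which day next week do you fly out?")
    nights = ask_user("How many nights will you stay?")

    # 2. Make precise, one-shot calls to the vending machines.
    flights = call_tool("flight_search_tool",
                        {"origin": "JFK", "destination": "LHR", "date": depart})
    hotel = call_tool("hotel_booking_tool",
                      {"city": "London", "check_in_date": depart, "room_type": "single"})
    rates = call_tool("currency_converter_tool", {"from": "USD", "to": "GBP"})

    # 3. Synthesize one outcome for the user, keeping session state along the way.
    return {"goal": goal, "flights": flights, "hotel": hotel, "fx": rates, "nights": nights}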

Making the Gatekeeper Production-Ready

Beyond its design benefits, the Gatekeeper Pattern is the ideal place to implement the operational guardrails required to run a reliable agentic system in production.

  • Observability: Each A2A conversation generates a unique trace ID. This ID must be propagated to every downstream MCP tool call, allowing you to trace a single user request across the entire system. Structured logs for tool inputs and outputs (with PII redacted) are critical for debugging.
  • Guardrails and security: The A2A Gatekeeper acts as a single point of enforcement for critical policies. It handles authentication and authorization for the user, enforces rate limits and usage quotas, and can maintain a list of which tools a particular user or group is allowed to call.
  • Resilience and fallbacks: The Gatekeeper must gracefully manage failure. When it calls an MCP tool, it should implement patterns like timeouts, retries with exponential backoff, and circuit breakers. Critically, it is responsible for the final failure state—escalating to a human in the loop for review or clearly communicating the issue to the end user.
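
As a sketch of the resilience and observability points above, here is a retry wrapper with exponential backoff around a hypothetical call_tool function; the attempt count, delays, and trace-ID propagation shown are illustrative defaults, not prescribed values.

import random
import time
import uuid

def call_with_resilience(call_tool, name: str, args: dict,
                         attempts: int = 3, base_delay: float = 0.5) -> dict:
    """Wrap a (hypothetical) MCP tool call with a trace ID, retries, and backoff."""
    trace_id = str(uuid.uuid4())  # propagated so every downstream call can be correlated
    for attempt in range(attempts):
        try:
            return call_tool(name, {**args, "trace_id": trace_id})
        except Exception as exc:  # in practice, catch specific transport/timeout errors
            if attempt == attempts - 1:
                # Final failure state: escalate or report clearly rather than retrying forever.
                raise RuntimeError(
                    f"{name} failed after {attempts} attempts (trace {trace_id})") from exc
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))  # backoff + jitter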

The Gatekeeper Pattern is the ultimate synthesis of our framework. It uses A2A for what it does best—managing a stateful, goal-oriented process—and MCP for what it was designed for—the reliable, deterministic execution of a task.

Conclusion

We began this journey with a simple but frustrating problem: the architect’s dilemma. Faced with the circular advice that “MCP is for tools and A2A is for agents,” we were left in the same position as a traveler trying to get to Edinburgh—knowing that cars use motorways and trains use tracks but with no intuition on which to choose for our specific journey.

The goal was to build that intuition. We did this not by accepting abstract labels, but by reasoning from first principles. We dissected the protocols themselves, revealing how their core mechanics inevitably lead to two distinct service profiles: the predictable, one-shot “vending machine” and the stateful, conversational “concierge.”

With that foundation, we established a clear, two-step framework for a confident design choice:

  1. Start with your customer. The most critical question is not a technical one but an experiential one. A machine consumer needs the predictability of a vending machine (MCP). A human or agentic consumer needs the convenience of a concierge (A2A).
  2. Validate with the four factors. Use the litmus test of determinism, process, state, and ownership to technically justify and solidify your choice.

Ultimately, the most robust systems will synthesize both, using the Gatekeeper Pattern to combine the strengths of a user-facing A2A agent with a suite of reliable MCP tools.

The choice is no longer a dilemma. By focusing on the consumer’s needs and understanding the fundamental nature of the protocols, architects can move from confusion to confidence, designing agentic ecosystems that are not just functional but also intuitive, scalable, and maintainable.

The Java Developer’s Dilemma: Part 1

30 September 2025 at 07:09
This is the first of a three-part series by Markus Eisele. Stay tuned for the follow-up posts.

AI is everywhere right now. Every conference, keynote, and internal meeting has someone showing a prototype powered by a large language model. It looks impressive. You ask a question, and the system answers in natural language. But if you are an enterprise Java developer, you probably have mixed feelings. You know how hard it is to build reliable systems that scale, comply with regulations, and run for years. You also know that what looks good in a demo often falls apart in production. That’s the dilemma we face. How do we make sense of AI and apply it to our world without giving up the qualities that made Java the standard for enterprise software?

The History of Java in the Enterprise

Java became the backbone of enterprise systems for a reason. It gave us strong typing, memory safety, portability across operating systems, and an ecosystem of frameworks that codified best practices. Whether you used Jakarta EE, Spring, or later, Quarkus and Micronaut, the goal was the same: build systems that are stable, predictable, and maintainable. Enterprises invested heavily because they knew Java applications would still be running years later with minimal surprises.

This history matters when we talk about AI. Java developers are used to deterministic behavior. If a method returns a result, you can rely on that result as long as your inputs are the same. Business processes depend on that predictability. AI does not work like that. Outputs are probabilistic. The same input might give different results. That alone challenges everything we know about enterprise software.

The Prototype Versus Production Gap

Most AI work today starts with prototypes. A team connects to an API, wires up a chat interface, and demonstrates a result. Prototypes are good for exploration. They aren’t good for production. Once you try to run them at scale you discover problems.

Latency is one issue. A call to a remote model may take several seconds. That’s not acceptable in systems where a two-second delay feels like forever. Cost is another issue. Calling hosted models is not free, and repeated calls across thousands of users quickly add up. Security and compliance are even bigger concerns. Enterprises need to know where data goes, how it’s stored, and whether it leaks into a shared model. A quick demo rarely answers those questions.

The result is that many prototypes never make it into production. The gap between a demo and a production system is large, and most teams underestimate the effort required to close it.

Why This Matters for Java Developers

Java developers are often the ones who receive these prototypes and are asked to “make them real.” That means dealing with all the issues left unsolved. How do you handle unpredictable outputs? How do you log and monitor AI behavior? How do you validate responses before they reach downstream systems? These are not trivial questions.

At the same time, business stakeholders expect results. They see the promise of AI and want it integrated into existing platforms. The pressure to deliver is strong. The dilemma is that we cannot ignore AI, but we also cannot adopt it naively. Our responsibility is to bridge the gap between experimentation and production.

Where the Risks Show Up

Let’s make this concrete. Imagine an AI-powered customer support tool. The prototype connects a chat interface to a hosted LLM. It works in a demo with simple questions. Now imagine it deployed in production. A customer asks about account balances. The model hallucinates and invents a number. The system has just broken compliance rules. Or imagine a user submits malicious input and the model responds with something harmful. Suddenly you’re facing a security incident. These are real risks that go beyond “the model sometimes gets it wrong.”

For Java developers, this is the dilemma. We need to preserve the qualities we know matter: correctness, security, and maintainability. But we also need to embrace a new class of technologies that behave very differently from what we’re used to.

The Role of Java Standards and Frameworks

The good news is that the Java ecosystem is already moving to help. Standards and frameworks are emerging that make AI integration less of a wild west. The OpenAI API has become a de facto standard, providing a way to access models in a consistent form regardless of vendor. That means code you write today won’t be locked into a single provider. The Model Context Protocol (MCP) is another step, defining how tools and models can interact in a consistent way.

Frameworks are also evolving. Quarkus has extensions for LangChain4j, making it possible to define AI services as easily as you define REST endpoints. Spring has introduced Spring AI. These projects bring the discipline of dependency injection, configuration management, and testing into the AI space. In other words, they give Java developers familiar tools for unfamiliar problems.

The Standards Versus Speed Dilemma

A common argument against Java and enterprise standards is that they move too slowly. The AI world changes every month, with new models and APIs appearing at a pace that no standards body can match. At first glance, it looks like standards are a barrier to progress. The reality is different. In enterprise software, standards are not the anchors holding us back. They’re the foundation that makes long-term progress possible.

Standards define a shared vocabulary. They ensure that knowledge is transferable across projects and teams. If you hire a developer who knows JDBC, you can expect them to work with any database supported by the driver ecosystem. If you rely on Jakarta REST, you can swap frameworks or vendors without rewriting every service. This is not slow. This is what allows enterprises to move fast without constantly breaking things.

AI will be no different. Proprietary APIs and vendor-specific SDKs can get you started quickly, but they come with hidden costs. You risk locking yourself into one provider, or building a system that only a small set of specialists understands. If those people leave, or if the vendor changes terms, you’re stuck. Standards avoid that trap. They make sure that today’s investment remains useful years from now.

Another advantage is the support horizon. Enterprises don’t think in terms of weeks or hackathon demos. They think in years. Standards bodies and established frameworks commit to supporting APIs and specifications over the long term. That stability is critical for applications that process financial transactions, manage healthcare data, or run supply chains. Without standards, every system becomes a one-off, fragile and dependent on whoever built it.

Java has shown this again and again. Servlets, CDI, JMS, JPA: These standards secured decades of business-critical development. They allowed millions of developers to build applications without reinventing core infrastructure. They also made it possible for vendors and open source projects to compete on quality, not just lock-in. The same will be true for AI. Emerging efforts like LangChain4j and the Java SDKs for the Model Context Protocol and the Agent2Agent Protocol will not slow us down. They’ll enable enterprises to adopt AI at scale, safely and sustainably.

In the end, speed without standards leads to short-lived prototypes. Standards with speed lead to systems that survive and evolve. Java developers should not see standards as a constraint. They should see them as the mechanism that allows us to bring AI into production, where it actually matters.

Performance and Numerics: Java’s Catching Up

One more part of the dilemma is performance. Python became the default language for AI not because of its syntax, but because of its libraries. NumPy, SciPy, PyTorch, and TensorFlow all rely on highly optimized C and C++ code. Python is mostly a frontend wrapper around these math kernels. Java, by contrast, has never had numerics libraries of the same adoption or depth. JNI made calling native code possible, but it was awkward and unsafe.

That is changing. The Foreign Function & Memory (FFM) API (JEP 454) makes it possible to call native libraries directly from Java without the boilerplate of JNI. It’s safer, faster, and easier to use. This opens the door for Java applications to integrate with the same optimized math libraries that power Python. Alongside FFM, the Vector API (JEP 508) introduces explicit support for SIMD operations on modern CPUs. It allows developers to write vectorized algorithms in Java that run efficiently across hardware platforms. Together, these features bring Java much closer to the performance profile needed for AI and machine learning workloads.

For enterprise architects, this matters because it changes the role of Java in AI systems. Java is no longer just an orchestration layer that calls external services. With projects like Jlama, models can run inside the JVM. With FFM and the Vector API, Java can take advantage of native math libraries and hardware acceleration. That means AI inference can move closer to where the data lives, whether in the data center or at the edge, while still benefiting from the standards and discipline of the Java ecosystem.

The Testing Dimension

Another part of the dilemma is testing. Enterprise systems are only trusted when they’re tested. Java has a long tradition of unit testing and integration testing, supported by standards and frameworks that every developer knows: JUnit, TestNG, Testcontainers, Jakarta EE testing harnesses, and more recently, Quarkus Dev Services for spinning up dependencies in integration tests. These practices are a core reason Java applications are considered production-grade.

Hamel Husain’s work on evaluation frameworks is directly relevant here. He describes three levels of evaluation: unit tests, model/human evaluation, and production-facing A/B tests. For Java developers treating models as black boxes, the first two levels map neatly onto our existing practice: unit tests for deterministic components and black-box evaluations with curated prompts for system behavior.

AI-infused applications bring new challenges. How do you write a unit test for a model that gives slightly different answers each time? How do you validate that an AI component works correctly when the definition of “correct” is fuzzy? The answer is not to give up testing but to extend it.

At the unit level, you still test deterministic components around the AI service: context builders, data retrieval pipelines, validation, and guardrail logic. These remain classic unit test targets. For the AI service itself, you can use schema validation tests, golden datasets, and bounded assertions. For example, you may assert that the model returns valid JSON, contains required fields, or produces a result within an acceptable range. The exact words may differ, but the structure and boundaries must hold.
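
As a sketch of those bounded assertions (shown in Python for brevity; the required fields and ranges here are invented for illustration, and the same checks translate directly into JUnit assertions in a Java stack):

import json

REQUIRED_FIELDS = {"account_id", "balance", "currency"}  # assumed response contract

def validate_llm_response(raw: str) -> dict:
    """Bounded assertions: the wording may vary, but structure and ranges must hold."""
    data = json.loads(raw)                    # the output must be valid JSON at all
    missing = REQUIRED_FIELDS - data.keys()
    assert not missing, f"missing required fields: {missing}"
    assert isinstance(data["balance"], (int, float))
    assert 0 <= data["balance"] < 10_000_000  # illustrative acceptable range
    assert data["currency"] in {"EUR", "USD", "GBP"}
    return data

# Golden-dataset style check: replay a curated prompt's stored output through the validator.
validate_llm_response('{"account_id": "A-17", "balance": 1250.0, "currency": "EUR"}')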

At the integration level, you can bring AI into the picture. Dev Services can spin up a local Ollama container or mock inference API for repeatable test runs. Testcontainers can manage vector databases like PostgreSQL with pgvector or Elasticsearch. Property-based testing libraries such as jqwik can generate varied inputs to expose edge cases in AI pipelines. These tools are already familiar to Java developers; they simply need to be applied to new targets.

The key insight is that AI testing must complement, not replace, the testing discipline we already have. Enterprises cannot put untested AI into production and hope for the best. By extending unit and integration testing practices to AI-infused components, we give stakeholders the confidence that these systems behave within defined boundaries, even when individual model outputs are probabilistic.

This is where Java’s culture of testing becomes an advantage. Teams already expect comprehensive test coverage before deploying. Extending that mindset to AI ensures that these applications meet enterprise standards, not just demo requirements. Over time, testing patterns for AI outputs will mature into the same kind of de facto standards that JUnit brought to unit tests and Arquillian brought to integration tests. We should expect evaluation frameworks for AI-infused applications to become as normal as JUnit in the enterprise stack.

A Path Forward

So what should we do? The first step is to acknowledge that AI is not going away. Enterprises will demand it, and customers will expect it. The second step is to be realistic. Not every prototype deserves to become a product. We need to evaluate use cases carefully, ask whether AI adds real value, and design with risks in mind.

From there, the path forward looks familiar. Use standards to avoid lock-in. Use frameworks to manage complexity. Apply the same discipline you already use for transactions, messaging, and observability. The difference is that now you also need to handle probabilistic behavior. That means adding validation layers, monitoring AI outputs, and designing systems that fail gracefully when the model is wrong.

The Java developer’s dilemma is not about choosing whether to use AI. It’s about how to use it responsibly. We cannot treat AI like a library we drop into an application and forget about. We need to integrate it with the same care we apply to any critical system. The Java ecosystem is giving us the tools to do that. Our challenge is to learn quickly, apply those tools, and keep the qualities that made Java the enterprise standard in the first place.

This is the beginning of a larger conversation. In the next article we will look at new types of applications that emerge when AI is treated as a core part of the architecture, not just an add-on. That’s where the real transformation happens.

Working with Contexts

28 August 2025 at 06:02

The following article comes from two blog posts by Drew Breunig: “How Long Contexts Fail” and “How to Fix Your Contexts.”

Managing Your Context Is the Key to Successful Agents

As frontier model context windows continue to grow,1 with many supporting up to 1 million tokens, I see many excited discussions about how long-context windows will unlock the agents of our dreams. After all, with a large enough window, you can simply throw everything into a prompt you might need—tools, documents, instructions, and more—and let the model take care of the rest.

Long contexts kneecapped RAG enthusiasm (no need to find the best doc when you can fit it all in the prompt!), enabled MCP hype (connect to every tool and models can do any job!), and fueled enthusiasm for agents.2

But in reality, longer contexts do not generate better responses. Overloading your context can cause your agents and applications to fail in surprising ways. Contexts can become poisoned, distracting, confusing, or conflicting. This is especially problematic for agents, which rely on context to gather information, synthesize findings, and coordinate actions.

Let’s run through the ways contexts can get out of hand, then review methods to mitigate or entirely avoid context fails.

Context Poisoning

Context poisoning is when a hallucination or other error makes it into the context, where it is repeatedly referenced.

The DeepMind team called out context poisoning in the Gemini 2.5 technical report, which we broke down previously. When playing Pokémon, the Gemini agent would occasionally hallucinate, poisoning its context:

An especially egregious form of this issue can take place with “context poisoning”—where many parts of the context (goals, summary) are “poisoned” with misinformation about the game state, which can often take a very long time to undo. As a result, the model can become fixated on achieving impossible or irrelevant goals.

If the “goals” section of its context was poisoned, the agent would develop nonsensical strategies and repeat behaviors in pursuit of a goal that cannot be met.

Context Distraction

Context distraction is when a context grows so long that the model over-focuses on the context, neglecting what it learned during training.

As context grows during an agentic workflow—as the model gathers more information and builds up history—this accumulated context can become distracting rather than helpful. The Pokémon-playing Gemini agent demonstrated this problem clearly:

While Gemini 2.5 Pro supports 1M+ token context, making effective use of it for agents presents a new research frontier. In this agentic setup, it was observed that as the context grew significantly beyond 100k tokens, the agent showed a tendency toward favoring repeating actions from its vast history rather than synthesizing novel plans. This phenomenon, albeit anecdotal, highlights an important distinction between long-context for retrieval and long-context for multistep, generative reasoning.

Instead of using its training to develop new strategies, the agent became fixated on repeating past actions from its extensive context history.

For smaller models, the distraction ceiling is much lower. A Databricks study found that model correctness began to fall around 32k for Llama 3.1-405b and earlier for smaller models.

If models start to misbehave long before their context windows are filled, what’s the point of super large context windows? In a nutshell: summarization3 and fact retrieval. If you’re not doing either of those, be wary of your chosen model’s distraction ceiling.

Context Confusion

Context confusion is when superfluous content in the context is used by the model to generate a low-quality response.

For a minute there, it really seemed like everyone was going to ship an MCP. The dream of a powerful model, connected to all your services and stuff, doing all your mundane tasks felt within reach. Just throw all the tool descriptions into the prompt and hit go. Claude’s system prompt showed us the way, as it’s mostly tool definitions or instructions for using tools.

But even if consolidation and competition don’t slow MCPs, context confusion will. It turns out there can be such a thing as too many tools.

The Berkeley Function-Calling Leaderboard is a tool-use benchmark that evaluates the ability of models to effectively use tools to respond to prompts. Now on its third version, the leaderboard shows that every model performs worse when provided with more than one tool.4 Further, the Berkeley team “designed scenarios where none of the provided functions are relevant…we expect the model’s output to be no function call.” Yet all models will occasionally call tools that aren’t relevant.

Browsing the function-calling leaderboard, you can see the problem get worse as the models get smaller:

Tool-calling irrelevance score for Gemma models (chart from dbreunig.com, source: Berkeley Function-Calling Leaderboard; created with Datawrapper)

A striking example of context confusion can be seen in a recent paper that evaluated small model performance on the GeoEngine benchmark, a trial that features 46 different tools. When the team gave a quantized (compressed) Llama 3.1 8b a query with all 46 tools, it failed, even though the context was well within the 16k context window. But when they only gave the model 19 tools, it succeeded.

The problem is, if you put something in the context, the model has to pay attention to it. It may be irrelevant information or needless tool definitions, but the model will take it into account. Large models, especially reasoning models, are getting better at ignoring or discarding superfluous context, but we continually see worthless information trip up agents. Longer contexts let us stuff in more info, but this ability comes with downsides.

Context Clash

Context clash is when you accrue new information and tools in your context that conflicts with other information in the context.

This is a more problematic version of context confusion. The bad context here isn’t irrelevant, it directly conflicts with other information in the prompt.

A Microsoft and Salesforce team documented this brilliantly in a recent paper. The team took prompts from multiple benchmarks and “sharded” their information across multiple prompts. Think of it this way: Sometimes, you might sit down and type paragraphs into ChatGPT or Claude before you hit enter, considering every necessary detail. Other times, you might start with a simple prompt, then add further details when the chatbot’s answer isn’t satisfactory. The Microsoft/Salesforce team modified benchmark prompts to look like these multistep exchanges:

Microsoft/Salesforce team benchmark prompts

All the information from the prompt on the left side is contained within the several messages on the right side, which would be played out in multiple chat rounds.

The sharded prompts yielded dramatically worse results, with an average drop of 39%. And the team tested a range of models—OpenAI’s vaunted o3’s score dropped from 98.1 to 64.1.

What’s going on? Why are models performing worse if information is gathered in stages rather than all at once?

The answer is context clash: The assembled context, containing the entirety of the chat exchange, includes early attempts by the model to answer the challenge before it has all the information. These incorrect answers remain present in the context and influence the model when it generates its final answer. The team writes:

We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.

This does not bode well for agent builders. Agents assemble context from documents, tool calls, and other models tasked with subproblems. All of this context, pulled from diverse sources, has the potential to disagree with itself. Further, when you connect to MCP tools you didn’t create, there’s a greater chance their descriptions and instructions clash with the rest of your prompt.

Learnings

The arrival of million-token context windows felt transformative. The ability to throw everything an agent might need into the prompt inspired visions of superintelligent assistants that could access any document, connect to every tool, and maintain perfect memory.

But, as we’ve seen, bigger contexts create new failure modes. Context poisoning embeds errors that compound over time. Context distraction causes agents to lean heavily on their context and repeat past actions rather than push forward. Context confusion leads to irrelevant tool or document usage. Context clash creates internal contradictions that derail reasoning.

These failures hit agents hardest because agents operate in exactly the scenarios where contexts balloon: gathering information from multiple sources, making sequential tool calls, engaging in multi-turn reasoning, and accumulating extensive histories.

Fortunately, there are solutions!

Mitigating and Avoiding Context Failures

Let’s run through the ways we can mitigate or avoid context failures entirely.

Everything is about information management. Everything in the context influences the response. We’re back to the old programming adage of “garbage in, garbage out.” Thankfully, there are plenty of options for dealing with the issues above.

RAG

Retrieval-augmented generation (RAG) is the act of selectively adding relevant information to help the LLM generate a better response.

Because so much has been written about RAG, we’re not going to cover it here beyond saying: It’s very much alive.

Every time a model ups the context window ante, a new “RAG is dead” debate is born. The last significant event was when Llama 4 Scout landed with a 10 million token window. At that size, it’s really tempting to think, “Screw it, throw it all in,” and call it a day.

But, as we’ve already covered, if you treat your context like a junk drawer, the junk will influence your response. If you want to learn more, here’s a new course that looks great.

Tool Loadout

Tool loadout is the act of selecting only relevant tool definitions to add to your context.

The term “loadout” is a gaming term that refers to the specific combination of abilities, weapons, and equipment you select before a level, match, or round. Usually, your loadout is tailored to the context—the character, the level, the rest of your team’s makeup, and your own skill set. Here, we’re borrowing the term to describe selecting the most relevant tools for a given task.

Perhaps the simplest way to select tools is to apply RAG to your tool descriptions. This is exactly what Tiantian Gan and Qiyao Sun did, which they detail in their paper “RAG MCP.” By storing their tool descriptions in a vector database, they’re able to select the most relevant tools given an input prompt.

When prompting DeepSeek-v3, the team found that selecting the right tools becomes critical when you have more than 30 tools. Above 30, the descriptions of the tools begin to overlap, creating confusion. Beyond 100 tools, the model was virtually guaranteed to fail their test. Using RAG techniques to select fewer than 30 tools yielded dramatically shorter prompts and resulted in as much as 3x better tool selection accuracy.
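
The general idea is easy to sketch. Assuming an embed function that stands in for an embedding model plus a vector database (this shows the shape of the approach, not the paper’s actual code):

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_loadout(query: str, tools: dict[str, str], embed, k: int = 5) -> list[str]:
    """Return the names of the k tools whose descriptions best match the query.

    tools - mapping of tool name -> description
    embed - hypothetical embedding function: str -> np.ndarray
            (in practice, an embedding model plus a vector database)
    """
    query_vec = embed(query)
    scored = [(cosine(embed(desc), query_vec), name) for name, desc in tools.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

# Only the selected loadout's definitions go into the prompt; every other tool
# description stays out of the context entirely.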

For smaller models, the problems begin long before we hit 30 tools. One paper we touched on previously, “Less is More,” demonstrated that Llama 3.1 8b fails a benchmark when given 46 tools, but succeeds when given only 19 tools. The issue is context confusion, not context window limitations.

To address this issue, the team behind “Less is More” developed a way to dynamically select tools using an LLM-powered tool recommender. The LLM was prompted to reason about “number and type of tools it ‘believes’ it requires to answer the user’s query.” This output was then semantically searched (tool RAG, again) to determine the final loadout. They tested this method with the Berkeley Function-Calling Leaderboard, finding Llama 3.1 8b performance improved by 44%.

The “Less is More” paper notes two other benefits to smaller contexts—reduced power consumption and speed—crucial metrics when operating at the edge (meaning, running an LLM on your phone or PC, not on a specialized server). Even when their dynamic tool selection method failed to improve a model’s result, the power savings and speed gains were worth the effort, yielding savings of 18% and 77%, respectively.

Thankfully, most agents have smaller surface areas that only require a few hand-curated tools. But if the breadth of functions or the amount of integrations needs to expand, always consider your loadout.

Context Quarantine

Context quarantine is the act of isolating contexts in their own dedicated threads, each used separately by one or more LLMs.

We see better results when our contexts aren’t too long and don’t sport irrelevant content. One way to achieve this is to break our tasks up into smaller, isolated jobs—each with its own context.

There are many examples of this tactic, but an accessible write-up of this strategy is Anthropic’s blog post detailing its multi-agent research system. They write:

The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent. Each subagent also provides separation of concerns—distinct tools, prompts, and exploration trajectories—which reduces path dependency and enables thorough, independent investigations.

Research lends itself to this design pattern. When given a question, multiple agents can identify and separately prompt several subquestions or areas of exploration. This not only speeds up the information gathering and distillation (if there’s compute available), but it keeps each context from accruing too much information or information not relevant to a given prompt, delivering higher quality results:

Our internal evaluations show that multi-agent research systems excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously. We found that a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval. For example, when asked to identify all the board members of the companies in the Information Technology S&P 500, the multi-agent system found the correct answers by decomposing this into tasks for subagents, while the single-agent system failed to find the answer with slow, sequential searches.

This approach also helps with tool loadouts, as the agent designer can create several agent archetypes with their own dedicated loadout and instructions for how to utilize each tool.

The challenge for agent builders, then, is to find opportunities for isolated tasks to spin out onto separate threads. Problems that require context-sharing among multiple agents aren’t particularly suited to this tactic.
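
A minimal sketch of the tactic, assuming a generic llm(messages) chat call: each subquestion gets a fresh message list, and only the condensed findings ever reach the lead agent’s context.

def quarantined_research(question: str, subquestions: list[str], llm) -> str:
    """Each subquestion runs in its own isolated context; only summaries are merged.

    llm - hypothetical chat function: list of {"role", "content"} dicts -> str
    """
    findings = []
    for sub in subquestions:
        # Fresh context per subtask: no shared history, no accumulated cruft.
        findings.append(llm([
            {"role": "system", "content": "Answer concisely and cite your sources."},
            {"role": "user", "content": sub},
        ]))

    # The lead agent sees only the distilled findings, never the subagents' full threads.
    return llm([
        {"role": "system", "content": "Synthesize the findings into one answer."},
        {"role": "user", "content": question + "\n\nFindings:\n" + "\n".join(findings)},
    ])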

If your agent’s domain is at all suited to parallelization, be sure to read the whole Anthropic write-up. It’s excellent.

Context Pruning

Context pruning is the act of removing irrelevant or otherwise unneeded information from the context.

Agents accrue context as they fire off tools and assemble documents. At times, it’s worth pausing to assess what’s been assembled and remove the cruft. This could be something you task your main LLM with, or you could design a separate LLM-powered tool to review and edit the context. Or you could choose something more tailored to the pruning task.

Context pruning has a (relatively) long history, as context lengths were a more problematic bottleneck in the natural language processing (NLP) field prior to ChatGPT. Building on this history, a current pruning method is Provence, “an efficient and robust context pruner for question answering.”

Provence is fast, accurate, simple to use, and relatively small—only 1.75 GB. You can call it in a few lines, like so:

from transformers import AutoModel

provence = AutoModel.from_pretrained("naver/provence-reranker-debertav3-v1", trust_remote_code=True)

# Read in a markdown version of the Wikipedia entry for Alameda, CA
with open('alameda_wiki.md', 'r', encoding='utf-8') as f:
    alameda_wiki = f.read()

# Prune the article, given a question
question = 'What are my options for leaving Alameda?'
provence_output = provence.process(question, alameda_wiki)

Provence edited the article, cutting 95% of the content, leaving me with only this relevant subset. It nailed it.

One could employ Provence or a similar function to cull documents or the entire context. Further, this pattern is a strong argument for maintaining a structured5 version of your context in a dictionary or other form, from which you assemble a compiled string prior to every LLM call. This structure would come in handy when pruning, allowing you to ensure the main instructions and goals are preserved while the document or history sections can be pruned or summarized.
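
Here is a minimal sketch of that structured-context idea; the section names are just an example. The prompt string is compiled fresh before every call, so pruning only ever touches the expendable sections.

context = {
    "instructions": "You are a travel-planning agent...",
    "goal": "Book a refundable flight to London under $450.",
    "documents": [],   # retrieved docs accumulate here; fair game for pruning
    "history": [],     # past messages and tool results; prune or summarize as needed
}

def compile_prompt(ctx: dict) -> str:
    """Assemble the prompt string from the structured sections before every LLM call."""
    parts = [ctx["instructions"], "Goal: " + ctx["goal"]]
    if ctx["documents"]:
        parts.append("Documents:\n" + "\n---\n".join(ctx["documents"]))
    if ctx["history"]:
        parts.append("History:\n" + "\n".join(ctx["history"]))
    return "\n\n".join(parts)

# Pruning then only edits the expendable sections (e.g., running each document
# through Provence), while the instructions and goal are always preserved.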

Context Summarization

Context summarization is the act of boiling down an accrued context into a condensed summary.

Context summarization first appeared as a tool for dealing with smaller context windows. As your chat session came close to exceeding the maximum context length, a summary would be generated and a new thread would begin. Chatbot users did this manually in ChatGPT or Claude, asking the bot to generate a short recap that would then be pasted into a new session.

However, as context windows increased, agent builders discovered there are benefits to summarization besides staying within the total context limit. As we’ve seen, beyond 100,000 tokens the context becomes distracting and causes the agent to rely on its accumulated history rather than training. Summarization can help it “start over” and avoid repeating context-based actions.

Summarizing your context is easy to do, but hard to perfect for any given agent. Knowing what information should be preserved and detailing that to an LLM-powered compression step is critical for agent builders. It’s worth breaking out this function as its own LLM-powered stage or app, which allows you to collect evaluation data that can inform and optimize this task directly.
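
A minimal sketch of such a summarization stage, assuming a generic llm(prompt) completion call and the structured context from the previous section; deciding what must be preserved is the part you would tune per agent.

def summarize_history(ctx: dict, llm, keep_last: int = 5) -> dict:
    """Collapse older history into a summary while preserving goals and recent turns.

    llm - hypothetical completion function: str -> str
    """
    old, recent = ctx["history"][:-keep_last], ctx["history"][-keep_last:]
    if not old:
        return ctx
    summary = llm(
        "Summarize this agent history in under 200 words. Preserve open tasks, "
        "decisions made, and any stated user preferences:\n\n" + "\n".join(old)
    )
    # Running this as its own stage makes it easy to log inputs and outputs
    # and to build an evaluation set for the summarizer itself.
    return {**ctx, "history": ["(summary of earlier turns) " + summary] + recent}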

Context Offloading

Context offloading is the act of storing information outside the LLM’s context, usually via a tool that stores and manages the data.

This might be my favorite tactic, if only because it’s so simple you don’t believe it will work.

Again, Anthropic has a good write-up of the technique, which details their “think” tool, which is basically a scratchpad:

With the “think” tool, we’re giving Claude the ability to include an additional thinking step—complete with its own designated space—as part of getting to its final answer… This is particularly helpful when performing long chains of tool calls or in long multi-step conversations with the user.

I really appreciate the research and other writing Anthropic publishes, but I’m not a fan of this tool’s name. If this tool were called scratchpad, you’d know its function immediately. It’s a place for the model to write down notes that don’t cloud its context and are available for later reference. The name “think” clashes with “extended thinking” and needlessly anthropomorphizes the model… but I digress.

Having a space to log notes and progress works. Anthropic shows that pairing the “think” tool with a domain-specific prompt (which you’d do anyway in an agent) yields significant gains: up to a 54% improvement against a benchmark for specialized agents.

Anthropic identified three scenarios where the context offloading pattern is useful:

  1. Tool output analysis. When Claude needs to carefully process the output of previous tool calls before acting and might need to backtrack in its approach;
  2. Policy-heavy environments. When Claude needs to follow detailed guidelines and verify compliance; and
  3. Sequential decision making. When each action builds on previous ones and mistakes are costly (often found in multi-step domains).
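
A scratchpad tool really can be this small. Here is a sketch with a generic tool definition (not Anthropic’s exact schema) and an in-memory list standing in for wherever you would actually persist the notes:

scratchpad: list[str] = []

SCRATCHPAD_TOOL = {
    "name": "scratchpad",
    "description": "Write down intermediate notes, plans, or analysis of tool output "
                   "for later reference. Notes are stored outside the conversation.",
    "input_schema": {
        "type": "object",
        "properties": {"note": {"type": "string"}},
        "required": ["note"],
    },
}

def handle_scratchpad(note: str) -> str:
    """Store the note outside the prompt; hand notes back only when the agent asks."""
    scratchpad.append(note)
    return f"Saved. You now have {len(scratchpad)} notes."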

Takeaways

Context management is usually the hardest part of building an agent. The job of the agent designer is, as Karpathy says, to “pack the context windows just right”: smartly deploying tools and information, and keeping up regular context maintenance.

The key insight across all the above tactics is that context is not free. Every token in the context influences the model’s behavior, for better or worse. The massive context windows of modern LLMs are a powerful capability, but they’re not an excuse to be sloppy with information management.

As you build your next agent or optimize an existing one, ask yourself: Is everything in this context earning its keep? If not, you now have six ways to fix it.


Footnotes

  1. Gemini 2.5 and GPT-4.1 have 1 million token context windows, large enough to throw Infinite Jest in there with plenty of room to spare.
  2. The “Long form text” section in the Gemini docs sums up this optimism nicely.
  3. In fact, in the Databricks study cited above, a frequent way models would fail when given long contexts is they’d return summarizations of the provided context while ignoring any instructions contained within the prompt.
  4. If you’re on the leaderboard, pay attention to the “Live (AST)” columns. These metrics use real-world tool definitions contributed to the product by enterprises, “avoiding the drawbacks of dataset contamination and biased benchmarks.”
  5. Hell, this entire list of tactics is a strong argument for why you should program your contexts.

People Work in Teams, AI Assistants in Silos

15 August 2025 at 07:37

As I was waiting to start a recent episode of Live with Tim O’Reilly, I was talking with attendees in the live chat. Someone asked, “Where do you get your up-to-date information about what’s going on in AI?” I thought about the various newsletters and publications I follow but quickly realized that the right answer was “some chat groups that I am a part of.” Several are on WhatsApp, and another on Discord. For other topics, there are some Signal group chats. Yes, the chats include links to various media sources, but they are curated by the intelligence of the people in those groups, and the discussion often matters more than the links themselves.

Later that day, I asked my 16-year-old grandson how he kept in touch with his friends. “I used to use Discord a lot,” he said, “but my friend group has now mostly migrated to WhatsApp. I have two groups, one with about 8 good friends, and a second one with a bigger group of about 20.” The way “friend group” has become part of the language for younger people is a tell. Groups matter.

A WhatsApp group is also how I keep in touch with my extended family. (Actually, there are several overlapping family groups, each with a slightly different focus and set of active members.) And there’s a Facebook group that my wife and I use to keep in touch with neighbors in the remote town in the Sierra Nevada where we spend our summers.

I’m old enough to remember the proto-internet of the mid-1980s, when Usenet groups were how people shared information, formed remote friendships, and built communities of interest. Email, which grew up as a sibling of Usenet, also developed some group-forming capabilities. Listservs (mailing list managers) were and still are a thing, but they were a sideshow compared to the fecundity of Usenet. Google Groups remains as a 25-year-old relic of that era, underinvested in and underused.

Later on, I used Twitter to follow the people I cared about and those whose work and ideas I wanted to keep up with. After Twitter made it difficult to see the feed of people I wanted to follow, replacing it by default with a timeline of suggested posts, I pretty much stopped using it. I still used Instagram to follow my friends and family; it used to be the first thing I checked every morning when my grandchildren were little and far away. But now, the people I want to follow are hard to find there too, buried by algorithmic suggestions, and so I visit the site only intermittently. Social software (the original name that Clay Shirky gave to applications like FriendFeed and systems like RSS that allow a user to curate a list of “feeds” to follow) gave way to social media. A multiplexed feed of content from the people I have chosen is social software, group-forming and empowering to individuals; an algorithmically curated feed of content that someone else thinks I will like is social media, divisive and disempowering.

For technology to do its best work for people, it has to provide support for groups. They are a fundamental part of the human social experience. But serving groups is hard. Consumer technology companies discover this opportunity, then abandon it with regularity, only for someone else to discover it again. We’ve all had this experience, I think. I am reminded of a marvelous passage from Wallace Stevens’s poem “Esthétique du Mal”:

The tragedy, however, may have begun, 
Again, in the imagination’s new beginning, 
In the yes of the realist spoken because he must 
Say yes, spoken because under every no 
Lay a passion for yes that had never been broken.

There is a passion for groups that has never been broken. We’re going to keep reinventing them until every platform owner realizes that they are an essential part of the landscape and sticks with them. They are not just a way to attract users before abandoning them as part of the cycle of enshittification.

There is still a chance to get this right for AI. The imagination’s new beginning is cropping up at all levels, from LLMs themselves, where the advantages of hyperscaling seem to be slowing, reducing the likelihood of a winner-takes-all outcome, to protocols like MCP and A2A, to AI applications for teams.

AI Tooling for Teams?

In the enterprise world, there have long been products explicitly serving the needs of teams (i.e., groups), from Lotus Notes through SharePoint, Slack, and Microsoft Teams. Twenty years ago, Google Docs kicked off a revolution that turned document creation into a powerful kind of group collaboration tool. Git and GitHub are also a powerful form of groupware, one so fundamental that software development as we know it could not operate without it. But so far, AI model and application developers largely seem to have ignored the needs of groups, despite their obvious importance. As Claire Vo put it to me in one recent conversation, “AI coding is still largely a single-player game.”

It is possible to share the output of AI, but most AI applications are still woefully lacking in the ability to collaborate during the act of creation. As one attendee asked on my recent Live with Tim O’Reilly episode with Marily Nika, “What are some tips on dealing with the fact that we are currently working in teams, but in silos of individual AI assistants?” We are mostly limited to sharing our chats or the outputs of our AI work with each other by email or link. Where is the shared context? The shared workflows? Claire’s ChatPRD (AI for product management) apparently has an interface designed to support teams, and I have been told that Devin has some useful collaborative features, but as of yet, there is no full-on reinvention of AI interfaces for multiplayer interactions. We are still leaning on external environments like GitHub or Google Docs to make up for the lack of native collaboration in AI workflows.

We need to reinvent sharing for AI in the same way that Sam Schillace, Steve Newman, and Claudia Carpenter turned the office productivity world on its head back in 2005 with the development of Writely, which became Google Docs. It’s easy to forget (or for younger people never to know) how painful collaborative editing of documents used to be, and just how much the original Google Docs team got right. Not only did they make user control of sharing central to the experience; they also made version control largely invisible. Multiple collaborators could work on a document simultaneously and magically see each other’s work reflected in real time. Document history and the ability to revert to earlier versions are likewise seamless.

On August 26, I’ll be chatting with Sam Schillace, Steve Newman, and Claudia Carpenter on Live with Tim O’Reilly. We’ll be celebrating the 20th anniversary of Writely/Google Docs and talking about how they developed its seamless sharing, and what that might look like today for AI.

What we really need is the ability to share context among a group. And that means not just a shared set of source documents but also a shared history of everyone’s interactions with the common project, and visibility into the channels by which the group communicates with each other about it. As Steve Newman wrote to me, “If I’m sharing that particular AI instance with a group, it should have access to the data that’s relevant to the group.”

In this article, I’m going to revisit some past attempts at designing for the needs of groups and make a few stabs at thinking out loud about them as provocations for AI developers.

Lessons from the Unix Filesystem

Maybe I’m showing my age, but so many of the ideas I keep going back to come from the design of the Unix operating system (later Linux). But I’m not the only one. Back in 2007, the ever-insightful Marc Hedlund wrote:

One of my favorite business model suggestions for entrepreneurs is, find an old UNIX command that hasn’t yet been implemented on the web, and fix that. talk and finger became ICQ, LISTSERV became Yahoo! Groups, ls became (the original) Yahoo!, find and grep became Google, rn became Bloglines, pine became Gmail, mount is becoming S3, and bash is becoming Yahoo! Pipes. I didn’t get until tonight that Twitter is wall for the web. I love that.

I have a similar suggestion for AI entrepreneurs. Yes, rethink everything for AI, but figure out what to keep as well as what to let go. History can teach us a lot about what patterns are worth keeping. This is especially important as we explore how to make AI more participatory and less monolithic.

The Unix filesystem, which persists through Linux and is thus an integral part of the underlying architecture of the technological world as we know it, had a way of thinking about file permissions that is still relevant in the world of AI. (The following brief description is for those who are unfamiliar with the Unix/Linux filesystem. Feel free to skip ahead.)

Every file is created with a default set of permissions that control its access and use. There are separate permissions specified for user, group, and world: A file can be private, so that only the person who created it can read and/or write to it, or, if it is an executable file such as a program, run it. A file can belong to a group, identified by a unique numeric group ID in a system file that names the group, gives it that ID and an optional encrypted group password, and lists the members entitled to exercise whatever read, write, or execute permissions the group has been granted on files belonging to it. Or a file can have “world” access, in which anyone can read and potentially write to it or run it. Every file thus has not only an associated owner (usually but not always the creator) but potentially also an associated group, whose membership determines who gets that middle level of access.

This explicit framing of three levels of access seems important, rather than leaving group access as something that is sometimes available and sometimes not. I also like that Unix had a “little language” (umask and chmod) for compactly viewing or modifying the read/write/execute permissions for each level of access.
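
To make the three-level model concrete, here is a minimal Python sketch using the standard os and stat modules (the filename is hypothetical, and the file is created so the sketch runs standalone); the octal value at the end is the same compact notation chmod uses.

import os
import stat

path = "notes.txt"  # hypothetical file, created here so the sketch runs standalone
open(path, "w").close()

# Read back the user/group/world permission bits.
mode = os.stat(path).st_mode
print("owner can write:", bool(mode & stat.S_IWUSR))
print("group can write:", bool(mode & stat.S_IWGRP))
print("world can read: ", bool(mode & stat.S_IROTH))

# The compact octal form of the "little language": 0o640 means
# owner read/write, group read-only, world no access.
os.chmod(path, 0o640)
print(stat.filemode(os.stat(path).st_mode))  # on a POSIX system this prints -rw-r-----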

A file that is user readable and writable versus one that is, say, world readable but not writable is an easily understood distinction. But there’s this whole underexplored middle in what permissions can be given to members of associated groups. The chief function, as far as I remember it, was to allow for certain files to be editable or runnable only by members of a group with administrative access. But this is really only the tip of the iceberg of possibilities, as we shall see.

One of the drawbacks of the original Unix filesystem is that the members of groups had to be explicitly defined, and a file can only be assigned to one primary group at a time. While a user can belong to multiple groups, a file itself is associated with a single owning group. More modern versions of the system, like Linux, work around this limitation by providing Access Control Lists (ACLs), which make it possible to define specific permissions for multiple users and multiple groups on a single file or directory. Groups in systems like WhatsApp and Signal and Discord and Google Groups also use an ACL-type approach. Access rights are usually controlled by an administrator. This draws hard boundaries around groups and makes ad hoc group-forming more difficult.
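
A rough way to picture the difference, as a toy in-memory model rather than any real filesystem API: the classic scheme hangs one owner and one owning group off each file, while an ACL is an open-ended list of principals, each with their own permissions.

from dataclasses import dataclass, field

@dataclass
class ClassicUnixFile:
    owner: str
    group: str            # exactly one owning group
    mode: int             # e.g. 0o640

@dataclass
class AclFile:
    owner: str
    # Any number of users *and* groups, each with their own permissions.
    acl: list = field(default_factory=list)   # entries like ("group:editors", "rw")

classic = ClassicUnixFile(owner="tim", group="staff", mode=0o640)

doc = AclFile(owner="tim")
doc.acl.append(("group:editors", "rw"))
doc.acl.append(("group:reviewers", "r"))
doc.acl.append(("user:alex", "rw"))
print(doc.acl)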

Lessons from Open Source Software

People think that free and open source software depends on a specific kind of license. I have always believed that while licenses are important, the essential foundation of open source software is the ability of groups to collaborate on shared projects. There are countless stories of software developed by collaborative communities—notably Unix itself—that came about despite proprietary licenses. Yes, the open source Linux took over from proprietary versions of Unix, but let’s not forget that the original development was done not just at Bell Labs but at the University of California, Berkeley, and other universities and companies around the world. This happened despite AT&T’s proprietary license and long before Richard Stallman wrote the GNU Manifesto or Linus Torvalds wrote the Linux kernel.

There were two essential innovations that enabled distributed collaboration on shared software projects outside the boundaries of individual organizations.

The first is what I have called “the architecture of participation.” Software products that are made up of small cooperating units rather than monoliths are easier for teams to work on. When we were interviewing Linus Torvalds for our 1999 essay collection Open Sources, he said something like “I couldn’t have written a new kernel for Windows even if I had access to the source code. The architecture just wouldn’t support it.” That is, Windows was monolithic, while Unix was modular.

We have to ask the question: What is the architecture of participation for AI?

Years ago, I wrote the first version of the Wikipedia page about Kernighan and Pike’s book The Unix Programming Environment because that book so fundamentally shaped my view of the programming world and seemed like it had such profound lessons for all of us. Kernighan and Pike wrote:

Even though the UNIX system introduces a number of innovative programs and techniques, no single program or idea makes it work well. Instead, what makes it effective is the approach to programming, a philosophy of using the computer. Although that philosophy can’t be written down in a single sentence, at its heart is the idea that the power of a system comes more from the relationships among programs than from the programs themselves. Many UNIX programs do quite trivial things in isolation, but, combined with other programs, become general and useful tools.

What allowed that combination was the convention that every program produced its output as ASCII text, which could then be consumed and transformed by other programs in a pipeline or, if necessary, redirected into a file for storage. The behavior of the programs in the pipeline could be modified by a series of command-line flags, but the most powerful features came from the transformations made to the data by a connected sequence of small utility programs with distinct powers.
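
As a loose analogy in Python rather than the shell itself, each little "tool" below does one trivial transformation on a stream of text lines, and the interesting behavior only appears when they are connected end to end:

import re

# Each "tool" does one trivial thing to a stream of text lines.
def grep(lines, pattern):
    return (line for line in lines if re.search(pattern, line))

def to_upper(lines):
    return (line.upper() for line in lines)

def numbered(lines):
    return (f"{i}: {line}" for i, line in enumerate(lines, 1))

text = ["alpha", "beta", "gamma", "beta again"]   # stand-in for a text stream

# Composed end to end, in the spirit of: grep beta | tr a-z A-Z | nl
for line in numbered(to_upper(grep(text, "beta"))):
    print(line)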

Unix was the first operating system designed by a company that was, at its heart, a networking company. Unix was all about the connections between things, the space between. The small pieces loosely joined, end-to-end model became the paradigm for the internet as well and shaped the modern world. It was easy to participate in the collaborative development of Unix. New tools could be added without permission because the rules for cooperating applications were already defined.

MCP is a fresh start on creating an architecture of participation for AI at the macro level. The way I see it, pre-MCP the model for applications built with AI was hub-and-spoke. That is, we were in a capital-fueled race for the leading AI model to become the centralized platform on which most AI applications would be built, much like Windows was the default platform in the PC era. The agentic vision of MCP is a networked vision, much like Unix, in which small, specialized tools can be combined in a variety of ways to accomplish complex tasks.

(Even pre-MCP, we saw this pattern at work in AI. What is RAG but a pipeline of cooperating programs?)
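
A minimal sketch of that pipeline shape, with stub functions standing in for a real retriever and a real model (all names and data here are hypothetical):

def retrieve(query, documents, k=2):
    # Toy retriever: rank documents by crude keyword overlap with the query.
    words = set(query.lower().split())
    return sorted(documents, key=lambda d: -len(words & set(d.lower().split())))[:k]

def assemble_prompt(query, passages):
    context = "\n\n".join(passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def generate(prompt):
    # Stand-in for a call to an actual LLM.
    return f"[model answer grounded in {len(prompt)} characters of context]"

docs = ["Unix pipes connect small programs.", "MCP exposes tools to agents."]
query = "How do Unix pipes work?"
print(generate(assemble_prompt(query, retrieve(query, docs))))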

Given the slowdown in progress in LLMs, with most leading models clustering around similar benchmarks, including many open source/open weight models that can be customized and run by corporations or even individual users, we are clearly moving toward a distributed AI future. MCP provides a first step toward the communications infrastructure of this multipolar world of cooperating AIs. But we haven’t thought deeply enough about a world without gatekeepers, where the permissions are fluid, and group-forming is easy and under user control.

AI Codecon, September 9, 2025
The future of cooperating agents is the subject of the second of our free AI Codecon conferences about the future of programming, Coding for the Future Agentic World, to be held September 9. Addy Osmani and I are cohosting, and we’ve got an amazing lineup of speakers. We’ll be exploring agentic interfaces beyond chat UX; how to chain agents across environments to complete complex tasks; asynchronous, autonomous code generation in production; and the infrastructure enabling the agentic web, including MCP and agent protocols.

There was a second essential foundation for the collaborative development of Unix and other open source software, and that was version control. Marc Rochkind’s 1972 SCCS (Source Code Control System), which he originally wrote for the IBM System/370 operating system but quickly ported to Unix, was arguably the first version control system. It pioneered the innovation (for the time) of storing only the differences between two files, not a complete new copy. It wasn’t released publicly till 1977, and was succeeded by a number of improved source code control systems over the years. Git, developed by Linux creator Linus Torvalds in 2005, has been the de facto standard for the last 20 years.
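
Python’s standard difflib shows the core idea in a few lines: keep one full version plus a compact delta, and reconstruct the other version on demand (the file contents here are invented for illustration).

import difflib

v1 = ["def greet():\n", "    print('hello')\n"]
v2 = ["def greet(name):\n", "    print(f'hello {name}')\n"]

# Store only the differences between the two versions...
delta = list(difflib.ndiff(v1, v2))
print("".join(delta))

# ...and reconstruct the newer version from the delta when needed.
restored = list(difflib.restore(delta, 2))
assert restored == v2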

The earliest source code repositories were local, and change files were sent around by email or Usenet. (Do you remember patch?) Git was a creature of the internet era, where everything could be found online, and so it soon became the basis of one of the web’s great assemblages of collective intelligence. GitHub, created in 2008 by Tom Preston-Werner, Chris Wanstrath, P. J. Hyett, and Scott Chacon, turned the output of the entire software industry into a shared resource, segmented by an inbuilt architecture of user, group, and world. There are repositories that represent the work of one author, and there are others that are the work of a community of developers.

Explicit check-ins, forks, and branches are the stuff of everyday life for the learned priesthood of software developers. And increasingly, they are the stuff of everyday life for the agents that are part of modern AI-enabled developer tools. It’s easy to forget just how much GitHub is the substrate of the software development workflow, as important in many ways as the internet itself.

But clearly there is work to be done. How might version control come to a new flowering in AI? What features would make it easier for a group, not just an individual, to have a shared conversation with an AI? How might a group collaborate in developing a large software project or other complex intellectual work? This means figuring out a lot about memory, how versions of the past are not consistent, how some versions are more canonical than others, and what a gift it is for users to be able to roll back to an earlier state and go forward from there.

Lessons from Google Docs

Google Docs and similar applications are another great example of version control at work, and there’s a lot to learn from them. Given that the promise of AI is that everyone, not just the learned few, may soon be able to develop complex bespoke software, version control for AI will need to have the simplicity of Google Docs and other office productivity tools inspired by it as well as the more powerful mechanisms provided by formal version control systems like Git.

One important distinction between the kind of version control and group forming that is enabled by GitHub versus Google Docs is that GitHub provides a kind of exoskeleton for collaboration, while Google Docs internalizes it. Each Google Docs file carries within it the knowledge of who can access it and what actions they can take. Group forming is natural and instantaneous. I apologize for subjecting you to yet another line from my favorite poet Wallace Stevens, but in Google Docs and its siblings, access permissions and version control are “a part of the [thing] itself and not about it.”

Much like in the Unix filesystem, a Google doc may be private, open to a predefined group (e.g., all employees with oreilly.com addresses), or open to anyone. But it also provides a radical simplification of group formation. Inviting someone to collaborate on a Google doc—to edit, comment, or merely read it—creates an ad hoc group centered on that document.

Image: a Google Docs ad hoc group
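
As a thought experiment (a toy model, not Google’s actual API), here is what it looks like for a document to carry its own access rules, where a single invitation is what creates the ad hoc group:

from dataclasses import dataclass, field
from enum import Enum

class Visibility(Enum):
    PRIVATE = "private"
    DOMAIN = "domain"                  # e.g., anyone with an oreilly.com address
    ANYONE_WITH_LINK = "anyone_with_link"

@dataclass
class SharedDoc:
    owner: str
    visibility: Visibility = Visibility.PRIVATE
    # The ad hoc group: everyone ever invited, with their role.
    invites: dict = field(default_factory=dict)   # email -> "edit" | "comment" | "view"

    def invite(self, email, role="edit"):
        self.invites[email] = role     # inviting someone *is* forming the group

doc = SharedDoc(owner="tim@oreilly.com")
doc.invite("steve@example.com", "comment")
doc.invite("claudia@example.com", "edit")
print(sorted(doc.invites))             # the ad hoc group centered on this document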

My aspiration for groups in AI is that they have the seamless ad hoc quality of the community of contributors to a Google doc. How might our interactions with AI be different if we were no longer sharing a fixed output but the opportunity for cocreation? How might an ad hoc group of collaborators include not only humans but their AI assistants? What is the best way for changes to be tracked when those changes include not just explicit human edits to AI output but revised instructions to recreate the AI contribution?

Maybe Google already has a start on a shared AI environment for groups. NotebookLM is built on the substrate of Google Drive, which inherited its simple but robust permissions architecture from Google Docs. I’d love to see the team there spend more time thinking through how to apply the lessons of Google Docs to NotebookLM and other AI interfaces. Unfortunately, the NotebookLM team seems to be focused on making it into an aggregator of Notebooks rather than providing it as an extension of the collaborative infrastructure of Google Workspace. This is a missed opportunity.

Core Versus Boundary

A group with enumerated members—say, the employees of a company—has a boundary. You are in or out. So do groups like citizens of a nation, the registered users of a site or service, members of a club or church, or professors at a university as distinct from students, who may themselves be divided into undergraduates and grad students and postdocs. But many social groups have no boundary. Instead, they have a kind of gravitational core, like a solar system whose gravity extends outward from its dense core, attenuating but never quite ending.

Image of a gravitational core, generated by Google Imagen via Gemini 2.5

I know this is a fanciful metaphor, but it is useful.

The fact that ACLs work by drawing boundaries around groups is a serious limitation. It’s important to make space for groups organized around a gravitational core. A public Google group, a public Google doc open to anyone with the link, or a Signal group with shareable invite links (versus the targeted invitations to a WhatsApp group) draws in new users by the social equivalent of the way a dense body deforms the space around it, pulling them into its orbit.

I’m not entirely sure what I’m asking for here. But I am suggesting that any AI system focused on enabling collaboration take the Core versus Boundary pattern into account. Design systems that can have a gravitational core (i.e., public access with opt-in membership), not just mechanisms for creating group boundaries with defined membership.
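
One way to read that suggestion, sketched as two toy group types (the names and fields are mine, purely illustrative): a boundary group enumerates its members and only admins can add to it, while a core group lets anyone pulled into orbit by a shareable link opt in.

from dataclasses import dataclass, field
import secrets

@dataclass
class BoundaryGroup:
    # Membership is enumerated and admin-controlled (the ACL pattern).
    admins: set
    members: set = field(default_factory=set)

    def add(self, admin, new_member):
        if admin not in self.admins:
            raise PermissionError("only admins can add members")
        self.members.add(new_member)

@dataclass
class CoreGroup:
    # Membership is opt-in: anyone who finds the link can join.
    invite_link: str = field(default_factory=lambda: secrets.token_urlsafe(8))
    members: set = field(default_factory=set)

    def join(self, person, link):
        if link == self.invite_link:
            self.members.add(person)

team = BoundaryGroup(admins={"tim"})
team.add("tim", "claire")

orbit = CoreGroup()
orbit.join("anyone-who-found-the-link", orbit.invite_link)
print(team.members, orbit.members)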

The Tragedy Begins Again?

The notion of the follow, which originally came from RSS and was later widely adopted in the timelines of Twitter, Facebook, and other social media apps, provides an instructive take on the Core pattern.

“Following” inverts the membership in a group by taking output that is world-readable and curating it into a user-selected group. We take this for granted, but the idea that there can be billions of people posting to Facebook, and that each of them can have an individual algorithmically curated feed of content from a small subset of the other billions of users, only those whom they chose, is truly astonishing. This is a group that is user specified but with the actual content dynamically collected by the platform on behalf of the user trillions of times a day. “@mentions” even allow users to invite people into their orbit, turning any given post into the kind of ad hoc group that we see with Google Docs. Hashtags allow them to invite others in by specifying a core of shared interests.
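
Stripped to its essentials, the inversion looks something like this toy sketch: every post is world-readable, and each user’s “group” is just the follow list the platform uses to assemble their feed (all data here is invented).

# World-readable posts, curated per user by their own follow list.
posts = [
    {"author": "alice", "text": "New essay up."},
    {"author": "bob", "text": "Shipping day!"},
    {"author": "carol", "text": "A hot take about AI."},
]

follows = {"tim": {"alice", "carol"}}   # each user defines their own "group"

def timeline(user):
    # The platform assembles the group's output on the user's behalf.
    return [p for p in posts if p["author"] in follows.get(user, set())]

print(timeline("tim"))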

And of course, in social media, you can also see the tragedy that Wallace Stevens spoke of. The users, each at the bottom of their personal gravity well, had postings from the friends they chose drawn to them by the algorithmic curvature of space, so to speak, when suddenly, a great black hole of suggested content came in and disrupted the dance of their chosen planets.

A group can be defined either by its creator (boundary) or collectively by its members (core). If those who control internet applications forget that groups don’t belong to them but to their creators, the users are forced to migrate elsewhere to recreate the community that they had built but have now lost.

I suspect that there is a real opportunity for AI to recreate the power of this kind of group forming, displacing those who have put their own commercial preferences ahead of those of their users. But that opportunity can’t be taken for granted. The drive to load all the world’s content into massive models in the race for superintelligence started out with homogenization on a massive scale, dwarfing even the algorithmically shaped feeds of social media. Once advertising enters the mix, there will be strong incentives for AI platforms, too, to place their own preferences ahead of those of their users. Given the enormous capital required to win the AI race, the call to the dark side will be strong. So we should fear a centralized AI future.

Fortunately, the fevered dreams of the hyperscalers are beginning to abate as progress slows (though the hype still continues apace). Far from being a huge leap forward, GPT-5 appears to have made the case that progress is leveling off. It appears that AI may be a “normal technology” after all, not a singularity. That means that we can expect continued competition.

The best defense against this bleak future is to build the infrastructure and capabilities for a distributed AI alternative. How can we bring that into the world? It can be informed by these past advances in group collaboration, but it will need to find new pathways as well. We are starting a long process that (channeling Wallace Stevens again) “searches the possible for its possibleness.” I’d love to hear from developers who are at the forefront of that search, and I’m sure others would as well.

Thanks to Alex Komoroske, Claire Vo, Eran Sandler, Ilan Strauss, Mike Loukides, Rohit Krishnan, and Steve Newman for helpful comments during the development of this piece.
