In 2025, AI reshaped how teams think, build, and deliver software. We’re now at a point where “AI coding assistants have quickly moved from novelty to necessity [with] up to 90% of software engineers us[ing] some kind of AI for coding,” Addy Osmani writes. That’s a very different world from the one we were in 12 months ago. As we look ahead to 2026, here are three key trends we have seen driving change and how we think developers and architects can prepare for what’s ahead.
Evolving Coding Workflows
New AI tools changed coding workflows in 2025, enabling developers to write and work with code faster than ever before. This doesn’t mean AI is replacing developers. It’s opening up new frontiers and new skills to master, something we explored at our first AI Codecon in May.
AI tools in the IDE and on the command line have revived the debate about the IDE’s future, echoing past arguments (e.g., VS Code versus Vim). It’s more useful to focus on the tools’ purpose. As Kent Beck and Tim O’Reilly discussed in November, developers are ultimately responsible for the code their chosen AI tool produces. We know that LLMs “actively reward existing top tier software engineering practices” and “amplify existing expertise,” as Simon Willison has pointed out. And a good coder will “factor in” questions that AI doesn’t. Does it really matter which tool is used?
The critical transferable skill for working with any of these tools is understanding how to communicate effectively with the underlying model. AI tools generate better code if they’re given all the relevant background on a project. Managing what the AI knows about your project (context engineering) and communicating it (prompt engineering) are going to be key to doing good work.
The core skills for working effectively with code won’t change in the face of AI. Understanding code review, design patterns, debugging, testing, and documentation, and applying those to the work you do with AI tools, will be the differentiator.
The Rise of Agentic AI
With the rise of agents and Model Context Protocol (MCP) in the second half of 2025, developers gained the ability to use AI not just as a pair programmer but as an entire team of developers. The speakers at our Coding for the Agentic World live AI Codecon event in September 2025 explored new tools, workflows, and hacks that are shaping this emerging discipline of agentic AI.
Software engineers aren’t just working with single coding agents. They’re building and deploying their own custom agents, often within complex setups involving multi-agent scenarios, teams of coding agents, and agent swarms. This shift from conducting AI to orchestrating AI elevates the importance of truly understanding how good software is built and maintained.
We know that AI generates better code with context, and this is also true of agents. As with coding workflows, this means understanding context engineering is essential. However, the differentiator for senior engineers in 2026 will be how well they apply intermediate skills such as product thinking, advanced testing, system design, and architecture to their work with agentic systems.
AI and Software Architecture
We began 2025 with our January Superstream, Software Architecture in the Age of AI, where speaker Rebecca Parsons explored the architectural implications of AI, dryly noting that “given the pace of change, this could be out of date by Friday.” By the time of our Superstream in August, things had solidified a little more and our speakers were able to share AI-based patterns and antipatterns and explain how they intersect with software architecture. Our December 9 event will look at enterprise architecture and how architects can navigate the impact of AI on systems, processes, and governance. (Registration is still open—save your seat.) As these events show, AI has progressed from being something architects might have to consider to something that is now essential to their work.
We’re seeing successful AI-enhanced architectures using event-driven models, enabling AI agents to act on incoming triggers rather than fixed prompts. This means it’s more important than ever to understand event-driven architecture concepts and trade-offs. In 2026, topics that align with evolving architectures (evolutionary architectures, fitness functions) will also become more important as architects look to find ways to modernize existing systems for AI without derailing them. AI-native architectures will also bring new considerations and patterns for system design next year, as will the trend toward agentic AI.
As was the case for their engineer coworkers, architects still have to know the basics: when to add an agent or a microservice, how to consider cost, how to define boundaries, and how to act on the knowledge they already have. As Thomas Betts, Sarah Wells, Eran Stiller, and Daniel Bryant note on InfoQ, they also “nee[d] to understand how an AI element relates to other parts of their system: What are the inputs and outputs? How can they measure performance, scalability, cost, and other cross-functional requirements?”
Companies will continue to decentralize responsibilities across different functions this year, and AI brings new sets of trade-offs to be considered. It’s true that regulated industries remain understandably wary of granting AI access to their systems. They’re rolling out AI more carefully, with greater guardrails and governance, but they are still rolling it out. So there’s never been a better time to understand the foundations of software architecture. It will prepare you for the complexity on the horizon.
Strong Foundations Matter
AI has changed the way software is built, but it hasn’t changed what makes good software. As we enter 2026, the most important developer and architecture skills won’t be defined by the tool you know. They’ll be defined by how effectively you apply judgment, communicate intent, and handle complexity when working with (and sometimes against) intelligent assistants and agents. AI rewards strong engineering; it doesn’t replace it. It’s an exciting time to be involved.
Join us at the Software Architecture Superstream on December 9 to learn how to better navigate the impact of AI on systems, processes, and governance. Over four hours, host Neal Ford and our lineup of experts including Metro Bank’s Anjali Jain and Philip O’Shaughnessy, Vercel’s Dom Sipowicz, Intel’s Brian Rogers, Microsoft’s Ron Abellera, and Equal Experts’ Lewis Crawford will share their hard-won insights about building adaptive, AI-ready architectures that support continuous innovation, ensure governance and security, and align seamlessly with business goals.
When AI systems were just a single model behind an API, life felt simpler. You trained, deployed, and maybe fine-tuned a few hyperparameters.
But that world’s gone. Today, AI feels less like a single engine and more like a busy city—a network of small, specialized agents constantly talking to each other, calling APIs, automating workflows, and making decisions faster than humans can even follow.
And here’s the real challenge: The smarter and more independent these agents get, the harder it becomes to stay in control. Performance isn’t what slows us down anymore. Governance is.
How do we make sure these agents act ethically, safely, and within policy? How do we log what happened when multiple agents collaborate? How do we trace who decided what in an AI-driven workflow that touches user data, APIs, and financial transactions?
That’s where the idea of engineering governance into the stack comes in. Instead of treating governance as paperwork at the end of a project, we can build it into the architecture itself.
From Model Pipelines to Agent Ecosystems
In the old days of machine learning, things were pretty linear. You had a clear pipeline: collect data, train the model, validate it, deploy, monitor. Each stage had its tools and dashboards, and everyone knew where to look when something broke.
But with AI agents, that neat pipeline turns into a web. A single customer-service agent might call a summarization agent, which then asks a retrieval agent for context, which in turn queries an internal API—all happening asynchronously, sometimes across different systems.
It’s less like a pipeline now and more like a network of tiny brains, all thinking and talking at once. And that changes how we debug, audit, and govern. When an agent accidentally sends confidential data to the wrong API, you can’t just check one log file anymore. You need to trace the whole story: which agent called which, what data moved where, and why each decision was made. In other words, you need full lineage, context, and intent tracing across the entire ecosystem.
Why Governance Is the Missing Layer
Governance in AI isn’t new. We already have frameworks like NIST’s AI Risk Management Framework (AI RMF) and the EU AI Act defining principles like transparency, fairness, and accountability. The problem is these frameworks often stay at the policy level, while engineers work at the pipeline level. The two worlds rarely meet. In practice, that means teams might comply on paper but have no real mechanism for enforcement inside their systems.
What we really need is a bridge—a way to turn those high-level principles into something that runs alongside the code, testing and verifying behavior in real time. Governance shouldn’t be another checklist or approval form; it should be a runtime layer that sits next to your AI agents—ensuring every action follows approved paths, every dataset stays where it belongs, and every decision can be traced when something goes wrong.
The Four Guardrails of Agent Governance
Policy as code
Policies shouldn’t live in forgotten PDFs or static policy docs. They should live next to your code. By using tools like the Open Policy Agent (OPA), you can turn rules into version-controlled code that’s reviewable, testable, and enforceable. Think of it like writing infrastructure as code, but for ethics and compliance. You can define rules such as:
Which agents can access sensitive datasets
Which API calls require human review
When a workflow needs to stop because the risk feels too high
This way, developers and compliance folks stop talking past each other—they work in the same repo, speaking the same language.
And the best part? You can spin up a Dockerized OPA instance right next to your AI agents inside your Kubernetes cluster. It just sits there quietly, watching requests, checking rules, and blocking anything risky before it hits your APIs or data stores.
Governance stops being some scary afterthought. It becomes just another microservice. Scalable. Observable. Testable. Like everything else that matters.
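To make that concrete, here’s a minimal sketch of what a policy check might look like from the agent’s side, assuming an OPA sidecar listening on localhost:8181 and a hypothetical Rego policy package named `agents.authz` (both the URL and the package path are placeholders, not part of any standard setup):

```python
import requests

# Hypothetical OPA Data API path for a boolean "allow" rule in package agents.authz
OPA_URL = "http://localhost:8181/v1/data/agents/authz/allow"

def is_action_allowed(agent: str, action: str, dataset: str) -> bool:
    """Ask the OPA sidecar whether this agent may perform this action on this dataset."""
    payload = {"input": {"agent": agent, "action": action, "dataset": dataset}}
    resp = requests.post(OPA_URL, json=payload, timeout=2)
    resp.raise_for_status()
    # OPA returns {"result": true/false} for a boolean rule; default to deny if absent
    return resp.json().get("result", False)

if __name__ == "__main__":
    if is_action_allowed("FinanceBot", "read", "customer_pii"):
        print("Proceed with the call")
    else:
        print("Blocked by policy")
```

The corresponding Rego rule lives in Git next to this code, which is exactly what puts developers and compliance reviewers in the same repo.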
Observability and auditability
Agents need to be observable not just in performance terms (latency, errors) but in decision terms. When an agent chain executes, we should be able to answer:
Who initiated the action?
What tools were used?
What data was accessed?
What output was generated?
Modern observability stacks—Cloud Logging, OpenTelemetry, Prometheus, or Grafana Loki—can already capture structured logs and traces. What’s missing is semantic context: linking actions to intent and policy.
Imagine extending your logs to capture not only “API called” but also “Agent FinanceBot requested API X under policy Y with risk score 0.7.” That’s the kind of metadata that turns telemetry into governance.
When your system runs in Kubernetes, sidecar containers can automatically inject this metadata into every request, creating a governance trace as natural as network telemetry.
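Here’s a minimal sketch of what that semantic context might look like with the OpenTelemetry Python API, using the FinanceBot example above; the attribute names are illustrative, not an established convention, and exporter configuration is omitted:

```python
from opentelemetry import trace

tracer = trace.get_tracer("governance")

def call_api_with_governance_trace(agent_name: str, api_name: str,
                                   policy_id: str, risk_score: float) -> None:
    # Wrap the agent's API call in a span that carries governance metadata,
    # not just performance data.
    with tracer.start_as_current_span("agent.api_call") as span:
        span.set_attribute("agent.name", agent_name)            # who acted
        span.set_attribute("api.name", api_name)                # what tool was used
        span.set_attribute("governance.policy_id", policy_id)   # which policy authorized it
        span.set_attribute("governance.risk_score", risk_score)
        # ... perform the actual API call here ...

call_api_with_governance_trace("FinanceBot", "API X", "policy-Y", 0.7)
```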
Dynamic risk scoring
Governance shouldn’t mean blocking everything; it should mean evaluating risk intelligently. In an agent network, different actions have different implications. A “summarize report” request is low risk. A “transfer funds” or “delete records” request is high risk.
By assigning dynamic risk scores to actions, you can decide in real time whether to:
Allow it automatically
Require additional verification
Escalate to a human reviewer
You can compute risk scores using metadata such as agent role, data sensitivity, and confidence level. Cloud services like Google Cloud’s Vertex AI Model Monitoring already support risk tagging and drift detection; you can extend those ideas to agent actions.
The point isn’t to slow agents down but to make their behavior context-aware.
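Here’s one way a simple risk scorer might combine those signals into a decision; the categories, weights, and thresholds are illustrative, not a recommended calibration:

```python
# Illustrative risk scoring: combine a few signals into a score and a decision.
ACTION_RISK = {"summarize_report": 0.1, "transfer_funds": 0.9, "delete_records": 0.8}
DATA_SENSITIVITY = {"public": 0.0, "internal": 0.3, "pii": 0.7, "financial": 0.9}

def risk_score(action: str, data_class: str, model_confidence: float) -> float:
    base = ACTION_RISK.get(action, 0.5)           # unknown actions default to medium risk
    sensitivity = DATA_SENSITIVITY.get(data_class, 0.5)
    uncertainty = 1.0 - model_confidence          # low confidence raises risk
    return min(1.0, 0.5 * base + 0.3 * sensitivity + 0.2 * uncertainty)

def decide(score: float) -> str:
    if score < 0.4:
        return "allow"              # automatic
    if score < 0.7:
        return "verify"             # require additional verification
    return "escalate_to_human"      # human reviewer

score = risk_score("transfer_funds", "financial", model_confidence=0.65)
print(score, decide(score))         # high risk -> escalate
```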
Regulatory mapping
Frameworks like NIST AI RMF and the EU AI Act are often seen as legal mandates. In reality, they can double as engineering blueprints.
Each governance principle maps to an engineering implementation:
Transparency: Agent activity logs, explainability metadata
Accountability: Immutable audit trails in Cloud Logging/Chronicle
Robustness: Canary testing, rollout control in Kubernetes
Risk management: Real-time scoring, human-in-the-loop review
Mapping these requirements into cloud and container tools turns compliance into configuration.
Once you start thinking of governance as a runtime layer, the next step is to design what that actually looks like in production.
Building a Governed AI Stack
Let’s sketch a practical, cloud native setup, something you could deploy tomorrow: your AI agents, a governance layer (policy engine plus risk scoring), and an observability stack for logs and audit trails.
All of these components can run on Kubernetes with Docker containers for modularity. The governance layer acts as a smart proxy: it intercepts agent calls, evaluates policy and risk, then logs and forwards the request if approved.
In practice:
Each agent’s container registers itself with the governance service.
Policies live in Git, deployed as ConfigMaps or sidecar containers.
Logs flow into Cloud Logging or Elastic Stack for searchable audit trails.
A Chronicle or BigQuery dashboard visualizes high-risk agent activity.
This separation of concerns keeps things clean: Developers focus on agent logic, security teams manage policy rules, and compliance officers monitor dashboards instead of sifting through raw logs. It’s governance you can actually operate—not bureaucracy you try to remember later.
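To show the shape of that smart proxy, here’s a minimal sketch using FastAPI and httpx; the endpoint layout, the policy and risk helpers, and the downstream URL are all hypothetical stand-ins for the pieces described above (in a real deployment the helpers would call the OPA sidecar and the risk-scoring service):

```python
from fastapi import FastAPI, HTTPException, Request
import httpx

app = FastAPI()

def allowed_by_policy(payload: dict) -> bool:
    # Stand-in for a query to the OPA sidecar (see "Policy as code").
    return payload.get("action") != "delete_records"

def score_risk(payload: dict) -> float:
    # Stand-in for a call to the risk-scoring service.
    return 0.9 if payload.get("action") == "transfer_funds" else 0.2

@app.post("/proxy/{target_path:path}")
async def governed_call(target_path: str, request: Request):
    payload = await request.json()
    if not allowed_by_policy(payload):
        raise HTTPException(status_code=403, detail="Blocked by policy")
    risk = score_risk(payload)
    if risk >= 0.7:
        raise HTTPException(status_code=403, detail="Escalated for human review")
    # Log the governance decision, then forward the approved request downstream.
    print({"agent": payload.get("agent"), "path": target_path,
           "risk": risk, "decision": "allow"})
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"http://internal-api/{target_path}", json=payload)
    return resp.json()
```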
Lessons from the Field
When I started integrating governance layers into multi-agent pipelines, I learned three things quickly:
It’s not about more controls—it’s about smarter controls. If every operation has to be manually approved, you’ll paralyze your agents. Focus on automating the 90% that’s low risk.
Logging everything isn’t enough. Governance requires interpretable logs. You need correlation IDs, metadata, and summaries that map events back to business rules.
Governance has to be part of the developer experience. If compliance feels like a gatekeeper, developers will route around it. If it feels like a built-in service, they’ll use it willingly.
In one real-world deployment for a financial-tech environment, we used a Kubernetes admission controller to enforce policy before pods could interact with sensitive APIs. Each request was tagged with a “risk context” label that traveled through the observability stack. The result? Governance without friction. Developers barely noticed it—until the compliance audit, when everything just worked.
Human in the Loop, by Design
Despite all the automation, some decisions should still involve people. A healthy governance stack knows when to ask for help. Imagine a risk-scoring service that occasionally flags “Agent Alpha has exceeded its transaction threshold three times today.” Instead of blocking outright, it can forward the request to a human operator via Slack or an internal dashboard. An automated system that knows when to ask for human review isn’t showing weakness; it’s showing maturity. Reliable AI doesn’t mean eliminating people; it means knowing when to bring them back in.
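A minimal sketch of that escalation path, assuming a Slack incoming-webhook URL (the URL and the threshold are placeholders):

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def escalate_or_allow(agent: str, action: str, risk: float, threshold: float = 0.7) -> str:
    if risk < threshold:
        return "allowed"
    # Instead of blocking outright, notify a human reviewer and hold the request.
    message = f"{agent} requested '{action}' with risk {risk:.2f}; human review needed."
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=5)
    return "pending_human_review"

print(escalate_or_allow("Agent Alpha", "transfer_funds", risk=0.82))
```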
Avoiding Governance Theater
Every company wants to say they have AI governance. But there’s a difference between governance theater—policies written but never enforced—and governance engineering—policies turned into running code.
When you can measure governance, you can improve it. That’s how you move from pretending to protect systems to proving that you do. The future of AI isn’t just about building smarter models; it’s about building smarter guardrails. Governance isn’t bureaucracy—it’s infrastructure for trust. And just as we’ve made automated testing part of every CI/CD pipeline, we’ll soon treat governance checks the same way: built in, versioned, and continuously improved.
True progress in AI doesn’t come from slowing down. It comes from giving it direction, so innovation moves fast but never loses sight of what’s right.
November ended with Thanksgiving (in the US), turkey, and a train of model announcements. The announcements were exciting: Google’s Gemini 3 puts it in the lead among large language models, at least for the time being. Nano Banana Pro is a spectacularly good text-to-image model. OpenAI has released its heavy hitters, GPT-5.1-Codex-Max and GPT-5.1 Pro. And the Allen Institute released its latest model, Olmo 3, the leading open source model from the US.
Since Trends avoids deal-making (should we?), we’ve also avoided the angst around an AI bubble and its implosion. Right now, it’s safe to say that the bubble is formed of money that hasn’t yet been invested, let alone spent. If it is a bubble, it’s in the future. Do promises and wishes make a bubble? Does a bubble made of promises and wishes pop with a bang or a pffft?
AI
Now that Google and OpenAI have laid down their cards, Anthropic has released its latest heavyweight model: Opus 4.5. They’ve also dropped the price significantly.
The Allen Institute has launched its latest open source model, Olmo 3. The institute’s opened up the whole development process to allow other teams to understand its work.
Not to be outdone, Google has introduced Nano Banana Pro (aka Gemini 3 Pro Image), its state-of-the-art image generation model. Nano Banana’s biggest feature is the ability to edit images to change the appearance of items without redrawing them from scratch. And according to Simon Willison, it watermarks the parts of an image it generates with SynthID.
OpenAI has released two more components of GPT-5.1, GPT-5.1-Codex-Max (API) and GPT-5.1 Pro (ChatGPT). This release brings the company’s most powerful models for generative work into view.
A group of quantum physicists claim to have reduced the size of the DeepSeek model by half, and to have removed Chinese censorship. The model can now tell you what happened in Tiananmen Square, explain what Pooh looked like, and answer other forbidden questions.
The release train for Gemini 3 has begun, and the commentariat quickly crowned it king of the LLMs. It includes the ability to spin up a web interface so users can give it more information about their questions, and to generate diagrams along with text output.
As part of the Gemini 3 release, Google has also announced a new agentic IDE called Antigravity.
Google has released a new weather forecasting model, WeatherNext 2, that can forecast with resolutions up to 1 hour. The data is available through Earth Engine and BigQuery, for those who would like to do their own forecasting. There’s also an early access program on Vertex AI.
Grok 4.1 has been released, with reports that it is currently the best model at generative prose, including creative writing. Be that as it may, we don’t see why anyone would use an AI that has been trained to reflect Elon Musk’s thoughts and values. If AI has taught us one thing, it’s that we need to think for ourselves.
AI demands the creation of new data centers and new energy sources. States want to ensure that those power plants are built, and built in ways that don’t pass costs on to consumers.
Grokipedia uses questionable sources. Is anyone surprised? How else would you train an AI on the latest conspiracy theories?
AMD GPUs are competitive, but they’re hampered because there are few libraries for low-level operations. To solve this problem, Chris Ré and others have announced HipKittens, a library of programming primitive operations for AMD GPUs.
OpenAI has released GPT-5.1. The two new models are Instant, which is tuned to be more conversational and “human,” and Thinking, a reasoning model that now adapts the time it takes to “think” to the difficulty of the questions.
Large language models, including GPT-5 and the Chinese models, show bias against users who use a German dialect rather than standard German. The bias appeared to be greater as the model size increased. These results also apply to languages like English.
Yann LeCun is leaving Facebook to launch a new startup that will develop his ideas about building AI.
Harbor is a new tool that simplifies benchmarking frameworks and models. It’s from the developers of the Terminal-Bench benchmark. And it brings us a step closer to a world where people build their own specialized AI rather than rely on large providers.
Moonshot AI has finally released Kimi K2 Thinking, the first open weights model to have benchmark results competitive with—or exceeding—the best closed weights models. It’s designed to be used as an agent, calling external tools as needed to solve problems.
Tongyi DeepResearch is a new fully open source agent for doing research. Its results are comparable to OpenAI deep research, Claude Sonnet 4, and similar models. Tongyi is part of Alibaba; it’s yet another important model to come out of China.
MiniMax M2 is a new open weights model that focuses on building agents. It has performance similar to Claude Sonnet but at a much lower price point. It also embeds its thought processes between <think> and </think> tags, which is an important step toward interpretability.
DeepSeek has introduced a new model for OCR with some very interesting properties: It has a new process for storing and retrieving memories that also makes the model significantly more efficient.
Agent Lightning provides a code-free way to train agents using reinforcement learning.
Programming
The Zig programming language has published a book. Online, of course.
Google is weakening its controversial new rules about developer verification. The company plans to create a separate class for applications with limited distribution, and develop a flow that will allow the installation of unverified apps.
Google’s LiteRT is a library for running AI models in browsers and small devices. LiteRT supports Android, iOS, embedded Linux, and microcontrollers. Supported languages include Java, Kotlin, Swift, Embedded C, and C++.
Does AI-assisted coding mean the end of new languages? Simon Willison thinks that LLMs can encourage the development of new programming languages. Design your language and ship it with a Claude Skills-style document; that should be enough for an LLM to learn how to use it.
Deepnote, a successor to the Jupyter Notebook, is a next-generation notebook for data analytics that’s built for teams. There’s now a shared workspace; different blocks can use different languages; and AI integration is on the road map. It’s now open source.
The idea of assigning colors (red, blue) to tools may be helpful in limiting the risk of prompt injection when building agents. What tools can return something damaging? This sounds like a step towards the application of the “least privilege” principle to AI design.
Security
We’re making the same mistake with AI security as we made with cloud security (and security in general): treating security as an afterthought.
Don’t become a victim. Data collected for online age verification makes your site a target for attackers. That data is valuable, and they know it.
A research collaboration uses data poisoning and AI to disrupt deepfake images. Users run the resulting tool, Silverer, over their images before posting. It makes invisible changes to the original image that confuse AIs creating new images, leading to unusable distortions.
Is it a surprise that AI is being used to generate fake receipts and expense reports? After all, it’s used to fake just about everything else. It was inevitable that enterprise applications of AI fakery would appear.
HydraPWK2 is a Linux distribution designed for penetration testing. It’s based on Debian and is supposedly easier to use than Kali Linux.
How secure is your trusted execution environment (TEE)? All of the major hardware vendors are vulnerable to a number of physical attacks against “secure enclaves.” And their terms of service often exclude physical attacks.
Atroposia is a new malware-as-a-service package that includes a local vulnerability scanner. Once an attacker has broken into a site, they can find other ways to remain there.
A new kind of phishing attack (CoPhishing) uses Microsoft Copilot Studio agents to steal credentials by abusing the Sign In topic. Microsoft has promised an update that will defend against this attack.
Operations
Here’s how to install Open Notebook, an open source equivalent to NotebookLM, to run on your own hardware. It uses Docker and Ollama to run the notebook and the model locally, so data never leaves your system.
Open source isn’t “free as in beer.” Nor is it “free as in freedom.” It’s “free as in puppies.” For better or for worse, that just about says it.
Need a framework for building proxies? Cloudflare’s next generation Oxy framework might be what you need. (Whatever you think of their recent misadventure.)
MIT Media Lab’s Project NANDA intends to build infrastructure for a decentralized network of AI agents. They describe it as a global decentralized registry (not unlike DNS) that can be used to discover and authenticate agents using MCP and A2A. Isn’t this what we wanted from the internet in the first place?
Luke Wroblewski suggests a new model for designing AI chat sessions. A simple chat isn’t as simple as it seems; particularly with reasoning models, it can become cluttered to the point of uselessness. This new design addresses those problems.
The debate about open source AI has largely featured open weight models. But that’s a bit like arguing that in the PC era, the most important goal would have been to have Intel open source its chip designs. That might have been useful to some people, but it wouldn’t have created Linux, Apache, or the collaborative software ecosystem that powers the modern internet. What makes open source transformative is the ease with which people can learn from what others have done, modify it to meet their own needs, and share those modifications with others. And that can’t just happen at the lowest, most complex level of a system. And it doesn’t come easily when what you are providing is access to a system that takes enormous resources to modify, use, and redistribute. It comes from what I’ve called the architecture of participation.
This architecture of participation has a few key properties:
Legibility: You can understand what a component does without understanding the whole system.
Modifiability: You can change one piece without rewriting everything.
Composability: Pieces work together through simple, well-defined interfaces.
Shareability: Your small contribution can be useful to others without them adopting your entire stack.
The most successful open source projects are built from small pieces that work together. Unix gave us a small operating system kernel surrounded by a library of useful functions, together with command-line utilities that could be chained together with pipes and combined into simple programs using the shell. Linux followed and extended that pattern. The web gave us HTML pages you could “view source” on, letting anyone see exactly how a feature was implemented and adapt it to their needs, and HTTP connected every website as a linkable component of a larger whole. Apache didn’t beat Netscape and Microsoft in the web server market by adding more and more features, but instead provided an extension layer so a community of independent developers could add frameworks like Grails, Kafka, and Spark.
MCP and Skills Are “View Source” for AI
MCP and Claude Skills remind me of those early days of Unix/Linux and the web. MCP lets you write small servers that give AI systems new capabilities such as access to your database, your development tools, your internal APIs, or third-party services like GitHub, GitLab, or Stripe. A skill is even more atomic: a set of plain language instructions, often with some tools and resources, that teaches Claude how to do something specific. Matt Bell from Anthropic remarked in comments on a draft of this piece that a skill can be defined as “the bundle of expertise to do a task, and is typically a combination of instructions, code, knowledge, and reference materials.” Perfect.
What is striking about both is their ease of contribution. You write something that looks like the shell scripts and web APIs developers have been writing for decades. If you can write a Python function or format a Markdown file, you can participate.
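For a sense of scale, here is roughly what a minimal MCP server looks like with the MCP Python SDK’s FastMCP helper; the server name and the tool itself are made-up examples, and details may vary between SDK versions:

```python
# A tiny MCP server exposing one tool. Requires the "mcp" Python SDK.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("order-lookup")  # server name shown to the client

@mcp.tool()
def lookup_order(order_id: str) -> str:
    """Return the status of an order from an internal system (stubbed here)."""
    # In a real server this would query your database or internal API.
    return f"Order {order_id}: shipped"

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default, so a desktop client can launch it
```

That’s the whole server: a function, a decorator, and a run call, which is exactly why the barrier to participation is so low.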
This is the same quality that made the early web explode. When someone created a clever navigation menu or form validation, you could view source, copy their HTML and JavaScript, and adapt it to your site. You learned by doing, by remixing, by seeing patterns repeated across sites you admired. You didn’t have to be an Apache contributor to get the benefit of learning from others and reusing their work.
Anthropic’s MCP Registry and third-party directories like punkpeye/awesome-mcp-servers show early signs of this same dynamic. Someone writes an MCP server for Postgres, and suddenly dozens of AI applications gain database capabilities. Someone creates a skill for analyzing spreadsheets in a particular way, and others fork it, modify it, and share their versions. Anthropic still seems to be feeling its way with user contributed skills, listing in its skills gallery only those they and select partners have created, but they document how to create them, making it possible for anyone to build a reusable tool based on their specific needs, knowledge, or insights. So users are developing skills that make Claude more capable and sharing them via GitHub. It will be very exciting to see how this develops. Groups of developers with shared interests creating and sharing collections of interrelated skills and MCP servers that give models deep expertise in a particular domain will be a potent frontier for both AI and open source.
GPTs Versus Skills: Two Models of Extension
It’s worth contrasting the MCP and skills approach with OpenAI’s custom GPTs, which represent a different vision of how to extend AI capabilities.
GPTs are closer to apps. You create one by having a conversation with ChatGPT, giving it instructions and uploading files. The result is a packaged experience. You can use a GPT or share it for others to use, but they can’t easily see how it works, fork it, or remix pieces of it into their own projects. GPTs live in OpenAI’s store, discoverable and usable but ultimately contained within the OpenAI ecosystem.
This is a valid approach, and for many use cases, it may be the right one. It’s user-friendly. If you want to create a specialized assistant for your team or customers, GPTs make that straightforward.
But GPTs aren’t participatory in the open source sense. You can’t “view source” on someone’s GPT to understand how they got it to work well. You can’t take the prompt engineering from one GPT and combine it with the file handling from another. You can’t easily version control GPTs, diff them, or collaborate on them the way developers do with code. (OpenAI offers team plans that do allow collaboration by a small group using the same workspace, but this is a far cry from open source–style collaboration.)
Skills and MCP servers, by contrast, are files and code. A skill is literally just a Markdown document you can read, edit, fork, and share. An MCP server is a GitHub repository you can clone, modify, and learn from. They’re artifacts that exist independently of any particular AI system or company.
This difference matters. The GPT Store is an app store, and however rich it becomes, an app store remains a walled garden. The iOS App Store and Google Play store host millions of apps for phones, but you can’t view source on an app, can’t extract the UI pattern you liked, and can’t fork it to fix a bug the developer won’t address. The open source revolution comes from artifacts you can inspect, modify, and share: source code, markup languages, configuration files, scripts. These are all things that are legible not just to computers but to humans who want to learn and build.
That’s the lineage skills and MCP belong to. They’re not apps; they’re components. They’re not products; they’re materials. The difference is architectural, and it shapes what kind of ecosystem can grow around them.
Nothing prevents OpenAI from making GPTs more inspectable and forkable, and nothing prevents skills or MCP from becoming more opaque and packaged. The tools are young. But the initial design choices reveal different instincts about what kind of participation matters. OpenAI seems deeply rooted in the proprietary platform model. Anthropic seems to be reaching for something more open.[1]
Complexity and Evolution
Of course, the web didn’t stay simple. HTML begat CSS, which begat JavaScript frameworks. View source becomes less useful when a page is generated by megabytes of minified React.
But the participatory architecture remained. The ecosystem became more complex, but it did so in layers, and you can still participate at whatever layer matches your needs and abilities. You can write vanilla HTML, or use Tailwind, or build a complex Next.js app. There are different layers for different needs, but all are composable, all shareable.
I suspect we’ll see a similar evolution with MCP and skills. Right now, they’re beautifully simple. They’re almost naive in their directness. That won’t last. We’ll see:
Abstraction layers: Higher-level frameworks that make common patterns easier.
Composition patterns: Skills that combine other skills, MCP servers that orchestrate other servers.
Optimization: When response time matters, you might need more sophisticated implementations.
Security and safety layers: As these tools handle sensitive data and actions, we’ll need better isolation and permission models.
The question is whether this evolution will preserve the architecture of participation or whether it will collapse into something that only specialists can work with. Given that Claude itself is very good at helping users write and modify skills, I suspect that we are about to experience an entirely new frontier of learning from open source, one that will keep skill creation open to all even as the range of possibilities expands.
What Does This Mean for Open Source AI?
Open weights are necessary but not sufficient. Yes, we need models whose parameters aren’t locked behind APIs. But model weights are like processor instructions. They are important but not where the most innovation will happen.
The real action is at the interface layer. MCP and skills open up new possibilities because they create a stable, comprehensible interface between AI capabilities and specific uses. This is where most developers will actually participate. Not only that, it’s where people who are not now developers will participate, as AI further democratizes programming. At bottom, programming is not the use of some particular set of “programming languages.” It is the skill set that starts with understanding a problem that the current state of digital technology can solve, imagining possible solutions, and then effectively explaining to a set of digital tools what we want them to help us do. The fact that this may now be possible in plain language rather than a specialized dialect means that more people can create useful solutions to the specific problems they face rather than looking only for solutions to problems shared by millions. This has always been a sweet spot for open source. I’m sure many people have said this about the driving impulse of open source, but I first heard it from Eric Allman, the creator of Sendmail, at what became known as the open source summit in 1998: “scratching your own itch.” And of course, history teaches us that this creative ferment often leads to solutions that are indeed useful to millions. Amateur programmers become professionals, enthusiasts become entrepreneurs, and before long, the entire industry has been lifted to a new level.
Standards enable participation. MCP is a protocol that works across different AI systems. If it succeeds, it won’t be because Anthropic mandates it but because it creates enough value that others adopt it. That’s the hallmark of a real standard.
Ecosystems beat models. The most generative platforms are those in which the platform creators are themselves part of the ecosystem. There isn’t an AI “operating system” platform yet, but the winner-takes-most race for AI supremacy is based on that prize. Open source and the internet provide an alternate, standards-based platform that not only allows people to build apps but to extend the platform itself.
Open source AI means rethinking open source licenses. Most of the software shared on GitHub has no explicit license, which means that default copyright laws apply: The software is under exclusive copyright, and the creator retains all rights. Others generally have no right to reproduce, distribute, or create derivative works from the code, even if it is publicly visible on GitHub. But as Shakespeare wrote in The Merchant of Venice, “The brain may devise laws for the blood, but a hot temper leaps o’er a cold decree.” Much of this code is de facto open source, even if not de jure. People can learn from it, easily copy from it, and share what they’ve learned.
But perhaps more importantly for the current moment in AI, it was all used to train LLMs, which means that this de facto open source code became a vector through which all AI-generated code is created today. This, of course, has made many developers unhappy, because they believe that AI has been trained on their code without either recognition or recompense. For open source, recognition has always been a fundamental currency. For open source AI to mean something, we need new approaches to recognizing contributions at every level.
Licensing issues also come up around what happens to data that flows through an MCP server. What happens when people connect their databases and proprietary data flows through an MCP so that an LLM can reason about it? Right now I suppose it falls under the same license as you have with the LLM vendor itself, but will that always be true? And, would I, as a provider of information, want to restrict the use of an MCP server depending on a specific configuration of a user’s LLM settings? For example, might I be OK with them using a tool if they have turned off “sharing” in the free version, but not want them to use it if they hadn’t? As one commenter on a draft of this essay put it, “Some API providers would like to prevent LLMs from learning from data even if users permit it. Who owns the users’ data (emails, docs) after it has been retrieved via a particular API or MCP server might be a complicated issue with a chilling effect on innovation.”
There are efforts such as RSL (Really Simple Licensing) and CC Signals that are focused on content licensing protocols for the consumer/open web, but they don’t yet really have a model for MCP, or more generally for transformative use of content by AI. For example, if an AI uses my credentials to retrieve academic papers and produces a literature review, what encumbrances apply to the results? There is a lot of work to be done here.
Open Source Must Evolve as Programming Itself Evolves
It’s easy to be amazed by the magic of vibe coding. But treating the LLM as a code generator that takes input in English or other human languages and produces Python, TypeScript, or Java echoes the use of a traditional compiler or interpreter to generate byte code. It reads what we call a “higher-level language” and translates it into code that operates further down the stack. And there’s a historical lesson in that analogy. In the early days of compilers, programmers had to inspect and debug the generated assembly code, but eventually the tools got good enough that few people need to do that any more. (In my own career, when I was writing the manual for Lightspeed C, the first C compiler for the Mac, I remember Mike Kahl, its creator, hand-tuning the compiler output as he was developing it.)
Now programmers are increasingly finding themselves having to debug the higher-level code generated by LLMs. But I’m confident that will become a smaller and smaller part of the programmer’s role. Why? Because eventually we come to depend on well-tested components. I remember how the original Macintosh user interface guidelines, with predefined user interface components, standardized frontend programming for the GUI era, and how the Win32 API meant that programmers no longer needed to write their own device drivers. In my own career, I remember working on a book about curses, the Unix cursor-manipulation library for CRT screens, and a few years later the manuals for Xlib, the low-level programming interfaces for the X Window System. This kind of programming soon was superseded by user interface toolkits with predefined elements and actions. So too, the roll-your-own era of web interfaces was eventually standardized by powerful frontend JavaScript frameworks.
Once developers come to rely on libraries of preexisting components that can be combined in new ways, what developers are debugging is no longer the lower-level code (first machine code, then assembly code, then hand-built interfaces) but the architecture of the systems they build, the connections between the components, the integrity of the data they rely on, and the quality of the user interface. In short, developers move up the stack.
LLMs and AI agents are calling for us to move up once again. We are groping our way towards a new paradigm in which we are not just building MCPs as instructions for AI agents but developing new programming paradigms that blend the rigor and predictability of traditional programming with the knowledge and flexibility of AI. As Phillip Carter memorably noted, LLMs are inverted computers relative to those with which we’ve been familiar: “We’ve spent decades working with computers that are incredible at precision tasks but need to be painstakingly programmed for anything remotely fuzzy. Now we have computers that are adept at fuzzy tasks but need special handling for precision work.” That being said, LLMs are becoming increasingly adept at knowing what they are good at and what they aren’t. Part of the whole point of MCP and skills is to give them clarity about how to use the tools of traditional computing to achieve their fuzzy aims.
Consider the evolution of agents from those based on “browser use” (that is, working with the interfaces designed for humans) to those based on making API calls (that is, working with the interfaces designed for traditional programs) to those based on MCP (relying on the intelligence of LLMs to read documents that explain the tools that are available to do a task). An MCP server looks a lot like the formalization of prompt and context engineering into components. A look at what purports to be a leaked system prompt for ChatGPT suggests that the pattern of MCP servers was already hidden in the prompts of proprietary AI apps: “Here’s how I want you to act. Here are the things that you should and should not do. Here are the tools available to you.”
But while system prompts are bespoke, MCP and skills are a step towards formalizing plain text instructions to an LLM so that they can become reusable components. In short, MCP and skills are early steps towards a system of what we can call “fuzzy function calls.”
Fuzzy Function Calls: Magic Words Made Reliable and Reusable
This view of how prompting and context engineering fit with traditional programming connects to something I wrote about recently: LLMs natively understand high-level concepts like “plan,” “test,” and “deploy”; industry standard terms like “TDD” (Test Driven Development) or “PRD” (Product Requirements Document); competitive features like “study mode”; or specific file formats like “.md file.” These “magic words” are prompting shortcuts that bring in dense clusters of context and trigger particular patterns of behavior that have specific use cases.
But right now, these magic words are unmodifiable. They exist in the model’s training, within system prompts, or locked inside proprietary features. You can use them if you know about them, and you can write prompts to modify how they work in your current session. But you can’t inspect them to understand exactly what they do, you can’t tweak them for your needs, and you can’t share your improved version with others.
Skills and MCPs are a way to make magic words visible and extensible. They formalize the instructions and patterns that make an LLM application work, and they make those instructions something you can read, modify, and share.
Take ChatGPT’s study mode as an example. It’s a particular way of helping someone learn, by asking comprehension questions, testing understanding, and adjusting difficulty based on responses. That’s incredibly valuable. But it’s locked inside ChatGPT’s interface. You can’t even access it via the ChatGPT API. What if study mode was published as a skill? Then you could:
See exactly how it works. What instructions guide the interaction?
Modify it for your subject matter. Maybe study mode for medical students needs different patterns than study mode for language learning.
Fork it into variants. You might want a “Socratic mode” or “test prep mode” that builds on the same foundation.
Use it with your own content and tools. You might combine it with an MCP server that accesses your course materials.
Share your improved version and learn from others’ modifications.
This is the next level of AI programming “up the stack.” You’re not training models or vibe coding Python. You’re elaborating on concepts the model already understands, adapting them to specific needs, and sharing them as building blocks others can use.
Building reusable libraries of fuzzy functions is the future of open source AI.
The Economics of Participation
There’s a deeper pattern here that connects to a rich tradition in economics: mechanism design. Over the past few decades, economists like Paul Milgrom and Al Roth won Nobel Prizes for showing how to design better markets: matching systems for medical residents, spectrum auctions for wireless licenses, kidney exchange networks that save lives. These weren’t just theoretical exercises. They were practical interventions that created more efficient, more equitable outcomes by changing the rules of the game.
Some tech companies understood this. As chief economist at Google, Hal Varian didn’t just analyze ad markets, he helped design the ad auction that made Google’s business model work. At Uber, Jonathan Hall applied mechanism design insights to dynamic pricing and marketplace matching to build a “thick market” of passengers and drivers. These economists brought economic theory to bear on platform design, creating systems where value could flow more efficiently between participants.
Though not guided by economists, the web and the open source software revolution were also not just technical advances but breakthroughs in market design. They created information-rich, participatory markets where barriers to entry were lowered. It became easier to learn, create, and innovate. Transaction costs plummeted. Sharing code or content went from expensive (physical distribution, licensing negotiations) to nearly free. Discovery mechanisms emerged: Search engines, package managers, and GitHub made it easy to find what you needed. Reputation systems were discovered or developed. And of course, network effects benefited everyone. Each new participant made the ecosystem more valuable.
These weren’t accidents. They were the result of architectural choices that made internet-enabled software development into a generative, participatory market.
AI desperately needs similar breakthroughs in mechanism design. Right now, most economic analysis of AI focuses on the wrong question: “How many jobs will AI destroy?” This is the mindset of an extractive system, where AI is something done to workers and to existing companies rather than with them. The right question is: “How do we design AI systems that create participatory markets where value can flow to all contributors?”
Consider what’s broken right now:
Attribution is invisible. When an AI model benefits from training on someone’s work, there’s no mechanism to recognize or compensate for that contribution.
Value capture is concentrated. A handful of companies capture the gains, while millions of content creators, whose work trained the models and is consulted during inference, see no return.
Improvement loops are closed. If you find a better way to accomplish a task with AI, you can’t easily share that improvement or benefit from others’ discoveries.
Quality signals are weak. There’s no good way to know if a particular skill, prompt, or MCP server is well-designed without trying it yourself.
MCP and skills, viewed through this economic lens, are early-stage infrastructure for a participatory AI market. The MCP Registry and skills gallery are primitive but promising marketplaces with discoverable components and inspectable quality. When a skill or MCP server is useful, it’s a legible, shareable artifact that can carry attribution. While this may not redress the “original sin” of copyright violation during model training, it does perhaps point to a future where content creators, not just AI model creators and app developers, may be able to monetize their work.
But we’re nowhere near having the mechanisms we need. We need systems that efficiently match AI capabilities with human needs, that create sustainable compensation for contribution, that enable reputation and discovery, that make it easy to build on others’ work while giving them credit.
This isn’t just a technical challenge. It’s a challenge for economists, policymakers, and platform designers to work together on mechanism design. The architecture of participation isn’t just a set of values. It’s a powerful framework for building markets that work. The question is whether we’ll apply these lessons of open source and the web to AI or whether we’ll let AI become an extractive system that destroys more value than it creates.
A Call to Action
I’d love to see OpenAI, Google, Meta, and the open source community develop a robust architecture of participation for AI.
Make innovations inspectable. When you build a compelling feature or an effective interaction pattern or a useful specialization, consider publishing it in a form others can learn from. Not as a closed app or an API to a black box but as instructions, prompts, and tool configurations that can be read and understood. Sometimes competitive advantage comes from what you share rather than what you keep secret.
Support open protocols. MCP’s early success demonstrates what’s possible when the industry rallies around an open standard. Since Anthropic introduced it in late 2024, MCP has been adopted by OpenAI (across ChatGPT, the Agents SDK, and the Responses API), Google (in the Gemini SDK), Microsoft (in Azure AI services), and a rapidly growing ecosystem of development tools from Replit to Sourcegraph. This cross-platform adoption proves that when a protocol solves real problems and remains truly open, companies will embrace it even when it comes from a competitor. The challenge now is to maintain that openness as the protocol matures.
Create pathways for contribution at every level. Not everyone needs to fork model weights or even write MCP servers. Some people should be able to contribute a clever prompt template. Others might write a skill that combines existing tools in a new way. Still others will build infrastructure that makes all of this easier. All of these contributions should be possible, visible, and valued.
Document magic. When your model responds particularly well to certain instructions, patterns, or concepts, make those patterns explicit and shareable. The collective knowledge of how to work effectively with AI shouldn’t be scattered across X threads and Discord channels. It should be formalized, versioned, and forkable.
Reinvent open source licenses. Take into account the need for recognition not only during training but also during inference. Develop protocols that help manage rights for data that flows through networks of AI agents.
Engage with mechanism design. Building a participatory AI market isn’t just a technical problem, it’s an economic design challenge. We need economists, policymakers, and platform designers collaborating on how to create sustainable, participatory markets around AI. Stop asking “How many jobs will AI destroy?” and start asking “How do we design AI systems that create value for all participants?” The architecture choices we make now will determine whether AI becomes an extractive force or an engine of broadly shared prosperity.
The future of programming with AI won’t be determined by who publishes model weights. It’ll be determined by who creates the best ways for ordinary developers to participate, contribute, and build on each other’s work. And that includes the next wave of developers: users who can create reusable AI skills based on their special knowledge, experience, and human perspectives.
We’re at a choice point. We can make AI development look like app stores and proprietary platforms, or we can make it look like the open web and the open source lineages that descended from Unix. I know which future I’d like to live in.
Footnotes
1. I shared a draft of this piece with members of the Anthropic MCP and Skills team, and in addition to providing a number of helpful technical improvements, they confirmed a number of points where my framing captured their intentions. Comments ranged from “Skills were designed with composability in mind. We didn’t want to confine capable models to a single system prompt with limited functions” to “I love this phrasing since it leads into considering the models as the processing power, and showcases the need for the open ecosystem on top of the raw power a model provides” and “In a recent talk, I compared the models to processors, agent runtimes/orchestrations to the OS, and Skills as the application.” However, all of the opinions are my own and Anthropic is not responsible for anything I’ve said here.
The following article originally appeared on Medium and is being republished here with the author’s permission.
Don’t get me wrong, I’m up all night using these tools.
But I also sense we’re heading for an expensive hangover. The other day, a colleague told me about a new proposal to route a million documents a day through a system that identifies and removes Social Security numbers.
I joked that this was going to be a “million-dollar regular expression.”
Run the math on the “naïve” implementation with full GPT-5 and it’s eye-watering: A million messages a day at ~50K characters each works out to around 12.5 billion tokens daily, or $15,000 a day at current pricing. That’s nearly $6 million a year to check for Social Security numbers. Even if you migrate to GPT-5 Nano, you still spend about $230,000 a year.
That’s a success. You “saved” $5.77 million a year…
How about running this code for a million documents a day? How much would this cost:
import re; s = re.sub(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b", "[REDACTED]", s)
A plain old EC2 instance could handle this: something like an m1.small at 30 bucks a month could churn through the same workload with a regex and cost you a few hundred dollars a year.
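For the curious, here’s the back-of-the-envelope math behind those numbers; the characters-per-token ratio and per-token prices are assumptions that will drift over time:

```python
# Back-of-the-envelope cost comparison (all prices are assumptions that will change).
docs_per_day = 1_000_000
chars_per_doc = 50_000
tokens_per_doc = chars_per_doc / 4                  # rough 4-characters-per-token heuristic
tokens_per_day = docs_per_day * tokens_per_doc      # ~12.5 billion tokens/day

gpt5_price_per_m = 1.25        # assumed $ per million input tokens
nano_price_per_m = 0.05        # assumed $ per million input tokens

gpt5_daily = tokens_per_day / 1e6 * gpt5_price_per_m           # ~$15,600/day
nano_yearly = tokens_per_day / 1e6 * nano_price_per_m * 365    # ~$228,000/year
ec2_daily = 30 / 31            # a ~$30/month instance works out to roughly $0.96/day

print(f"GPT-5: ~${gpt5_daily:,.0f}/day, Nano: ~${nano_yearly:,.0f}/year, "
      f"regex on EC2: ~${ec2_daily:.2f}/day")
```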
Which means that in practice, companies will be calling people like me in a year saying, “We’re burning a million dollars to do something that should cost a fraction of that—can you fix it?”
From $15,000/day to $0.96/day—I do think we’re about to see a lot of companies realize that a thinking model connected to an MCP server is way more expensive than just paying someone to write a bash script. Starting now, you’ll be able to make a career out of un-LLM-ifying applications.
MLOps is dead. Well, not really, but for many the job is evolving into LLMOps. In this episode, Abide AI founder and LLMOps author Abi Aryan joins Ben to discuss what LLMOps is and why it’s needed, particularly for agentic AI systems. Listen in to hear why LLMOps requires a new way of thinking about observability, why we should spend more time understanding human workflows before mimicking them with agents, how to do FinOps in the age of generative AI, and more.
About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.
Check out other episodes of this podcast on the O’Reilly learning platform.
Transcript
This transcript was created with the help of AI and has been lightly edited for clarity.
00.00: All right, so today we have Abi Aryan. She is the author of the O’Reilly book on LLMOps as well as the founder of Abide AI. So, Abi, welcome to the podcast.
00.21: All right. Let’s start with the book, which I confess, I just cracked open: LLMOps. People probably listening to this have heard of MLOps. So at a high level, the models have changed: They’re bigger, they’re generative, and so on and so forth. So since you’ve written this book, have you seen a wider acceptance of the need for LLMOps?
00.51: I think more recently there are more infrastructure companies. So there was a conference happening recently, and there was this sort of perception or messaging across the conference, which was “MLOps is dead.” Although I don’t agree with that.
There’s a big difference that companies have started to pick up on more recently, as the infrastructure around the space has sort of started to improve. They’re starting to realize how different the pipelines were that people managed and grew, especially for the older companies like Snorkel that were in this space for years and years before large language models came in. The way they were handling data pipelines—and even the observability platforms that we’re seeing today—have changed tremendously.
01.40: What about, Abi, the general. . .? We don’t have to go into specific tools, but we can if you want. But, you know, if you look at the old MLOps person and then fast-forward, this person is now an LLMOps person. So on a day-to-day basis [has] their suite of tools changed?
02.01: Massively. I think for an MLOps person, the focus was very much around “This is my model. How do I containerize my model, and how do I put it in production?” That was the entire problem and, you know, most of the work was around “Can I containerize it? What are the best practices around how I arrange my repository? Are we using templates?”
Drawbacks happened, but not as much, because most of the time the stuff was tested and there wasn’t much nondeterministic behavior within the models themselves. Now that has changed.
02.38: [For] most LLMOps engineers, the biggest job right now is really doing FinOps, which is controlling cost, because the models are massive. The second big difference is that we’ve shifted from “How can we build systems?” to “How can we build systems that can perform, and not just perform technically but perform behaviorally as well?”: “What is the cost of the model? But also, what is the latency? What’s the throughput looking like? How are we managing memory across different tasks?”
The problem has really shifted when we talk about it. . . A lot of the focus for MLOps was “Let’s create fantastic dashboards that can do everything.” Right now, no matter which dashboard you create, the monitoring is really very dynamic.
03.32: Yeah, yeah. As you were talking there, you know, I started thinking, yeah, of course, obviously now the inference is essentially a distributed computing problem, right? So that was not the case before. Now you have different phases even of the computation during inference, so you have the prefill phase and the decode phase. And then you might need different setups for those.
So anecdotally, Abi, did the people who were MLOps people successfully migrate themselves? Were they able to upskill themselves to become LLMOps engineers?
04.14: I know a couple of friends who were MLOps engineers. They were teaching MLOps as well—Databricks folks, MVPs. And they were now transitioning to LLMOps.
But the way they started is they started focusing very much on “Can you do evals for these models?” They weren’t really dealing with the infrastructure side of it yet. And that was their slow transition. And right now they’re very much at that point where they’re thinking, “OK, can we make it easy to just catch these problems within the model inferencing itself?”
04.49: A lot of other problems still stay unsolved. Then the other side, which was like a lot of software engineers who entered the field and became AI engineers, they have a much easier transition because software. . . The way I look at large language models is not just as another machine learning model but literally like software 3.0 in that way, which is it’s an end-to-end system that will run independently.
Now, the model isn’t just something you plug in. The model is the product, really. So for those people, most software is built around these ideas: We need strong cohesion. We need low coupling. We need to think about “How are we doing microservices? How does communication happen between the different tools that we’re using? How are we calling our endpoints? How are we securing our endpoints?”
Those questions come easier. So the system design side of things comes easier to people who work in traditional software engineering. So the transition has been a little bit easier for them as compared to people who were traditionally like MLOps engineers.
05.59: And hopefully your book will help some of these MLOps people upskill themselves into this new world.
Let’s pivot quickly to agents. Obviously it’s a buzzword. Just like anything in the space, it means different things to different teams. So how do you distinguish agentic systems yourself?
06.24: There are two words in the space. One is agents; one is agent workflows. Basically agents are the components really. Or you can call them the model itself, but they’re trying to figure out what you meant, even if you forgot to tell them. That’s the core work of an agent. And the work of a workflow or the workflow of an agentic system, if you want to call it, is to tell these agents what to actually do. So one is responsible for execution; the other is responsible for the planning side of things.
07.02: I think sometimes when tech journalists write about these things, the general public gets the notion that there’s this monolithic model that does everything. But the reality is, most teams are moving away from that design, as you describe.
So they have an agent that acts as an orchestrator or planner and then parcels out the different steps or tasks needed, and then maybe reassembles in the end, right?
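To make that shape concrete, here is a deliberately toy sketch of the orchestrator-and-workers pattern. The worker functions are stand-ins, not any particular agent framework’s API:

# Minimal sketch: an orchestrator plans, parcels out tasks to specialist
# "agents" (plain functions here), and reassembles the results at the end.
from typing import Callable

def research_agent(task: str) -> str:
    return f"notes on {task}"  # stand-in for an LLM-backed worker

def drafting_agent(task: str) -> str:
    return f"draft covering {task}"

WORKERS: dict[str, Callable[[str], str]] = {
    "research": research_agent,
    "draft": drafting_agent,
}

def orchestrate(goal: str) -> str:
    # A real planner would be model-driven; this plan is hard-coded.
    plan = [("research", goal), ("draft", goal)]
    results = [WORKERS[role](task) for role, task in plan]
    return "\n".join(results)  # reassembly step

print(orchestrate("quarterly security review"))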
07.42: Coming back to your point, it’s now less of a problem of machine learning. It’s, again, more like a distributed systems problem because we have multiple agents. Some of these agents will have more load—they will be the frontend agents, which are communicating to a lot of people. Obviously, on the GPUs, these need more distribution.
08.02: And when it comes to the other agents that may not be used as much, they can be provisioned based on “This is the need, and this is the availability that we have.” So all of that provisioning again is a problem. The communication is a problem. Setting up tests across different tasks itself within an entire workflow, now that becomes a problem, which is where a lot of people are trying to implement context engineering. But it’s a very complicated problem to solve.
08.31: And then, Abi, there’s also the problem of compounding reliability. Let’s say, for example, you have an agentic workflow where one agent passes off to another agent and yet to another third agent. Each agent may have a certain amount of reliability, but it compounds over time. So it compounds across this pipeline, which makes it more challenging.
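The arithmetic behind that compounding is worth spelling out. If each hop is 95% reliable and failures are independent, three hops already drop you to roughly 86%:

# Compounding reliability across a three-agent pipeline (assumes independent failures).
per_step_reliability = 0.95
steps = 3
end_to_end = per_step_reliability ** steps
print(f"End-to-end success rate: {end_to_end:.3f}")  # 0.857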
09.02: And that’s where there’s a lot of research work going on in the space. It’s an idea that I’ve talked about in the book as well, especially chapter four, where a lot of these ideas are described. Most of the companies right now are [using] a monolithic architecture, but that isn’t going to be sustainable as we move toward real applications.
We have to go towards a microservices architecture. And the moment we go towards microservices architecture, there are a lot of problems. One will be the hardware problem. The other is consensus building, which is. . .
Let’s say you have three different agents spread across three different nodes, which could be running very differently. Let’s say one is running on an H100; one is running on something else. How do we achieve consensus on which node’s answer ends up winning? That’s open research work [where] people are trying to figure out, “Can we achieve consensus among agents based on whatever answer the majority is giving, or how do we really think about it?” It could be set up with a threshold: If agreement is beyond that threshold, then, you know, it works.
One of the frameworks trying to work in this space is called MassGen—they’re working on the research side of solving this problem at the tooling level.
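A toy version of that majority-with-threshold idea is easy to write down; the hard research questions are what the threshold should be and what to do when no answer clears it. Purely illustrative:

from collections import Counter
from typing import Optional

def consensus(answers: list[str], threshold: float = 0.66) -> Optional[str]:
    """Return the majority answer if it clears the agreement threshold."""
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes / len(answers) >= threshold else None

print(consensus(["42", "42", "41"]))  # "42" -- two of three agents agree
print(consensus(["42", "41", "40"]))  # None -- no consensus; escalate to a human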
10.31: By the way, even back in the microservices days in software architecture, obviously people went overboard too. So I think that, as with any of these new things, there’s a bit of trial and error that you have to go through. And the better you can test your systems and have a setup where you can reproduce and try different things, the better off you are, because many times your first stab at designing your system may not be the right one. Right?
11.08: Yeah. And I’ll give you two examples of this. So AI companies tried to use a lot of agentic frameworks. You know people have used Crew; people have used n8n, they’ve used. . .
11.25: Oh, I hate those! Not “hate”. . . Sorry. Sorry, my friends at Crew.
11.30: And 90% of the people working in this space seriously have already made that transition, which is “We are going to write it ourselves.”
The same happened for evaluation: There were a lot of evaluation tools out there. What they were doing on the surface was literally just tracing, and tracing wasn’t really solving the problem—it was just a beautiful dashboard that didn’t really serve much purpose. Maybe for the business teams. But at least for the ML engineers who are supposed to debug these problems and, you know, optimize these systems, it was not giving much other than “What is the error response that we’re getting to everything?”
12.08: So again, for that one as well, most of the companies have developed their own evaluation frameworks in-house as of now. The people who are just starting out obviously still do. But most of the companies that started working with large language models in 2023 have tried every tool out there in 2023 and 2024. And right now more and more people are staying away from the frameworks and all of that.
People have understood that most of the frameworks in this space are not superreliable.
12.41: And [are] also, honestly, a bit bloated. They come with too many things that you don’t need in many ways. . .
12.54: Security loopholes as well. For example, I reported one of the security loopholes with LangChain, with LangSmith, back in 2024. Those things obviously get reported by people [and] get worked on, but the companies aren’t really proactively working on closing those security loopholes.
13.15: Two open source projects that I like that are not specifically agentic are DSPy and BAML. Wanted to give them a shout out. So this point I’m about to make, there’s no easy, clear-cut answer. But one thing I noticed, Abi, is that people will do the following, right? I’m going to take something we do, and I’m going to build agents to do the same thing. But the way we do things is I have a—I’m just making this up—I have a project manager and then I have a designer, I have role B, role C, and then there’s certain emails being exchanged.
So then the first step is “Let’s replicate not just the roles but kind of the exchange and communication.” And sometimes that actually increases the complexity of the design of your system because maybe you don’t need to do it the way the humans do it. Right? Maybe if you go to automation and agents, you don’t have to over-anthropomorphize your workflow. Right. So what do you think about this observation?
14.31: A very interesting analogy I’ll give you is people are trying to replicate intelligence without understanding what intelligence is. The same for consciousness. Everybody wants to replicate and create consciousness without understanding consciousness. So the same is happening with this as well, which is we are trying to replicate a human workflow without really understanding how humans work.
14.55: And sometimes humans may not be the most efficient thing. Like they exchange five emails to arrive at something.
15.04: And humans are never context-defined in a very limiting sense. Even if somebody’s job is to do editing, they’re not just doing editing. They’re looking at the flow. They’re looking for a lot of things which you can’t really define. Obviously you can over a period of time, but it takes a lot of observation to understand. And that skill also depends on who the person is. Different people have different skills as well. Most of the agentic systems right now are just glorified Zapier IFTTT routines. That’s the way I look at them right now. The “if” recipes: If this, then that.
15.48: Yeah, yeah. Robotic process automation, I guess, is what people call it. The other thing that I don’t think people understand just from reading the popular tech press is that agents have levels of autonomy, right? Most teams don’t actually build an agent and unleash it fully autonomously from day one.
I mean, I guess the analogy would be in self-driving cars: They have different levels of automation. Most enterprise AI teams realize that with agents, you have to kind of treat them that way too, depending on the complexity and the importance of the workflow.
So at first a human is very much involved, and then less and less over time as you develop confidence in the agent.
But I think it’s not good practice to just kind of let an agent run wild. Especially right now.
16.56: It’s not, because who’s the person answering if the agent goes wrong? And that’s a question that has come up often. So this is the work that we’re doing at Abide really, which is trying to create a decision layer on top of the knowledge retrieval layer.
17.07: Most of the agents which are built using just large language models. . . LLMs—I think people need to understand this part—are fantastic at knowledge retrieval, but they do not know how to make decisions. If you think agents are independent decision makers and they can figure things out, no, they cannot figure things out. They can look at the database and try to do something.
Now, what they do may or may not be what you like, no matter how many rules you define. So what we really need to develop is some sort of symbolic language around how these agents are working, which is more like trying to give them a model of the world: “What is the cause and effect of all of these decisions that you’re making? How do we prioritize one decision over another? What was the reasoning behind it?” That whole decision-making reasoning has been the missing part.
18.02: You brought up the topic of observability. There’s two schools of thought here as far as agentic observability. The first one is we don’t need new tools. We have the tools. We just have to apply [them] to agents. And then the second, of course, is this is a new situation. So now we need to be able to do more. . . The observability tools have to be more capable because we’re dealing with nondeterministic systems.
And so maybe we need to capture more information along the way. Chains of decision, reasoning, traceability, and so on and so forth. Where do you fall in this kind of spectrum of we don’t need new tools or we need new tools?
18.48: We don’t need new tools, but we certainly need new frameworks, and especially a new way of thinking. Observability in the MLOps world—fantastic; it was just about tools. Now, people have to stop thinking about observability as just visibility into the system and start thinking of it as an anomaly detection problem. And that was something I’d written in the book as well. Now it’s no longer about “Can I see what my token length is?” No, that’s not enough. You have to look for anomalies at every single part of the layer across a lot of metrics.
19.24: So your position is we can use the existing tools. We may have to log more things.
19.33: We may have to log more things, and then start building simple ML models to be able to do anomaly detection.
Think of managing any machine, any LLM model, any agent as really like a fraud detection pipeline. So every single time you’re looking for “What are the simplest signs of fraud?” And that can happen across various factors. But we need more logging. And again you don’t need external tools for that. You can set up your own loggers as well.
Most of the people I know have been setting up their own loggers within their companies. So you can simply use telemetry to a) capture the general logs and b) define your own custom logs as well, depending on your agent pipeline itself. You can define “This is what it’s trying to do,” log more things around those steps, and then start building small machine learning models to look for what’s going on over there.
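A standard-library sketch of that approach: log a few numbers per model call, then flag calls that look statistically unusual. The specific metrics and the three-sigma cutoff are illustrative choices, not a recommendation:

import logging
import statistics

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_calls")

history: list[int] = []  # completion token counts seen so far

def record_call(prompt_tokens: int, completion_tokens: int, latency_s: float) -> None:
    log.info("prompt=%d completion=%d latency=%.2fs",
             prompt_tokens, completion_tokens, latency_s)
    history.append(completion_tokens)
    if len(history) > 30:  # wait for a baseline before flagging anomalies
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev and abs(completion_tokens - mean) > 3 * stdev:
            log.warning("anomalous completion length: %d (mean %.0f)",
                        completion_tokens, mean)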
20.36: So what is the state of “Where we are? How many teams are doing this?”
20.42: Very few. Very, very few. Maybe just the top labs: the ones who are doing reinforcement learning training and using RL environments, because that’s where they’re getting their data to do RL. But people who are not using RL to retrain their models aren’t really doing much of this; they’re still depending very much on external tools.
21.12: I’ll get back to RL in a second. But one topic you raised when you pointed out the transition from MLOps to LLMOps was the importance of FinOps, which is, for our listeners, basically managing your cloud computing costs—or in this case, increasingly mastering token economics. Because basically, it’s one of these things that I think can bite you.
For example, the first time you use Claude Code, you go, “Oh, man, this tool is powerful.” And then boom, you get an email with a bill. I see, that’s why it’s powerful. And you multiply that across the board to teams who are starting to maybe deploy some of these things. And you see the importance of FinOps.
So where are we, Abi, as far as tooling for FinOps in the age of generative AI and also the practice of FinOps in the age of generative AI?
22.19: Less than 5%, maybe even 2% of the way there.
22.24: Really? But obviously everyone’s aware of it, right? Because at some point, when you deploy, you become aware.
22.33: Not enough people. A lot of people just think about FinOps as cloud, basically the cloud cost. And there are different kinds of costs in the cloud. One of the things people are not doing enough of is profiling their models properly, which is [determining] “Where are the costs really coming from? Our models’ compute power? Are they taking too much RAM?”
22.58: Or are we using reasoning when we don’t need it?
23.00: Exactly. Now that’s a problem we solve very differently. That’s where, yes, you can do kernel fusion and define your own custom kernels. Right now there’s a massive number of people who think we need to rewrite kernels for everything. That’s only going to solve one problem, the compute-bound problem. It’s not going to solve the memory-bound problem. It’s your data engineering pipelines that are going to solve your memory-bound problems.
And that’s where most of the focus is missing. I’ve mentioned it in the book as well: Data engineering is the foundation; solve those problems first. Then move to the compute-bound problems; don’t start by optimizing kernels there. And then the third part would be the communication-bound problem, which is “How do we make these GPUs talk smarter with each other? How do we figure out agent consensus and all of those problems?”
Now that’s a communication problem. And that’s what happens when there are different levels of bandwidth: Everybody’s dealing with internet bandwidth, serving speed, different kinds of costs, and every kind of transition from one node to another. If we’re not hosting our own infrastructure, then that’s a different problem again, because it depends on which server your GPUs get assigned to.
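Even a crude profile tells you which of those buckets you’re in before you reach for custom kernels. A standard-library sketch that measures wall-clock time and Python-side peak memory for one pipeline step (GPU memory and kernel time need framework-specific profilers; the workload here is a placeholder):

import time
import tracemalloc

def profile(step, *args):
    """Rough wall-clock and peak-memory profile of one pipeline step."""
    tracemalloc.start()
    start = time.perf_counter()
    result = step(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{step.__name__}: {elapsed:.3f}s, peak {peak / 1e6:.1f} MB (Python heap only)")
    return result

def preprocess(docs):  # placeholder for a real preprocessing or inference step
    return [d.upper() for d in docs]

profile(preprocess, ["some document text"] * 100_000)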
24.20: Yeah, yeah, yeah. I want to give a shout out to Ray—I’m an advisor to Anyscale—because Ray is basically built for these sorts of pipelines: It can do fine-grained utilization and help you decide between CPU and GPU. And just generally, you don’t think that teams are taking token economics seriously?
I guess not. How many people have I heard talking about caching, for example? Because if it’s a prompt that [has been] answered before, why do you have to go through it again?
25.07: I think plenty of people have started implementing KV caching, but they don’t really know. . . Again, one of the questions people don’t understand is “How much do we need to store in the memory itself, and how much do we need to store in the cache?” which is the big memory question. So that’s the one I don’t think people are able to solve. A lot of people are storing too much stuff in the cache that should actually be stored in the RAM itself, in the memory.
And there are generalist applications that don’t really understand that this agent doesn’t really need access to the memory. There’s no point. It’s just lost in the throughput really. So I think the problem isn’t really caching. The problem is that differentiation of understanding for people.
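The simplest version of “don’t pay for the same prompt twice” is an application-level cache in front of the model call. This is separate from KV caching inside the model; it just memoizes whole responses, and call_model here is a placeholder:

import hashlib

_cache: dict[str, str] = {}

def call_model(prompt: str) -> str:
    return f"response to: {prompt}"  # placeholder for a real API call

def cached_completion(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:  # only unseen prompts hit the model
        _cache[key] = call_model(prompt)
    return _cache[key]

cached_completion("Does this text contain an SSN?")  # calls the model
cached_completion("Does this text contain an SSN?")  # served from the cache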
25.55: Yeah, yeah, I just threw that out as one element, because obviously there are many, many things to mastering token economics. So you brought up reinforcement learning. A few years ago, obviously, people got really into “Let’s do fine-tuning.” But then they quickly realized. . . Actually, fine-tuning became easy because so many services emerged where you can just focus on labeled data. You upload your labeled data, boom, come back from lunch, you have a fine-tuned model.
But then people realize that “I fine-tuned, but the model that results isn’t really as good as my fine-tuning data.” And then obviously RAG and context engineering came into the picture. Now it seems like more people are again talking about reinforcement learning, but in the context of LLMs. And there’s a lot of libraries, many of them built on Ray, for example. But it seems like what’s missing, Abi, is that fine-tuning got to the point where I can sit down a domain expert and say, “Produce labeled data.” And basically the domain expert is a first-class participant in fine-tuning.
As best I can tell, for reinforcement learning, the tools aren’t there yet. The UX hasn’t been figured out in order to bring in the domain experts as the first-class citizen in the reinforcement learning process—which they need to be because a lot of the stuff really resides in their brain.
27.45: The big problem here, very much to the point of what you said, is that the tools aren’t really there. And one very specific thing I can tell you is that most of the reinforcement learning environments you’re seeing are static environments. Agents are not learning statically. They are learning dynamically. If your RL environment cannot adapt dynamically, it isn’t much different from what emerged in 2018, 2019, when OpenAI Gym and a lot of reinforcement learning libraries were coming out.
28.18: There is a line of work called curriculum learning, which is basically adapting your model’s difficulty to the results itself. So basically now that can be used in reinforcement learning, but I’ve not seen any practical implementation of using curriculum learning for reinforcement learning environments. So people create these environments—fantastic. They work well for a little bit of time, and then they become useless.
So that’s where even OpenAI, Anthropic, those companies are struggling as well. They’ve paid heavily for yearlong contracts to say, “Can you build this vertical environment? Can you build that vertical environment?” And that works fantastically. But once the model learns on it, then there’s nothing else to learn. And then you go back to the question of “Is this data fresh? Is this adaptive with the world?” And it becomes the same RAG problem all over again.
29.18: So maybe the problem is with RL itself. Maybe we need a different paradigm. It’s just too hard.
Let me close by looking to the future. The first thing is: The space is moving so fast, this might be an impossible question to ask. But if you look at, let’s say, the next 6 to 18 months, what are some things in the research domain that aren’t being talked about enough but might produce enough practical utility that we’ll start hearing about them?
29.55: One is how to profile your machine learning models, like the entire systems end-to-end. A lot of people do not understand them as systems, but only as models. So that’s one thing which will make a massive amount of difference. There are a lot of AI engineers today, but we don’t have enough system design engineers.
30.16: This is something that Ion Stoica at Sky Computing Lab has been giving keynotes about. Yeah. Interesting.
30.23: The second part is. . . I’m optimistic about seeing curriculum learning applied to reinforcement learning as well, where our RL environments can adapt in real time, so when we train agents on them, they are dynamically adapting as well. That’s also [some] of the work being done by labs like Sakana, which are working on artificial life, evolutionary approaches, all of that stuff—evolving any kind of machine learning model.
30.57: The third thing where I feel like the communities are falling behind massively is on the data engineering side. That’s where we have massive gains to get.
31.09: So on the data engineering side, I’m happy to say that I advise several companies in the space that are completely focused on tools for these new workloads and these new data types.
Last question for our listeners: What mindset shift or what skill do they need to pick up in order to position themselves in their career for the next 18 to 24 months?
31.40: For anybody who’s an AI engineer, a machine learning engineer, an LLMOps engineer, or an MLOps engineer, first learn how to profile your models. Start picking up Ray very quickly as a tool to just get started on, to see how distributed systems work. You can pick the LLM if you want, but start understanding distributed systems first. And once you start understanding those systems, then start looking back into the models itself.
One of the principles in our upcoming book Architecture as Code is the ability for architects to design automated governance checks for important architectural concerns, creating fast feedback loops when things go awry. This idea isn’t new—Neal and his coauthors Rebecca Parsons and Patrick Kua espoused this idea back in 2017 in the first edition of Building Evolutionary Architectures, and many of our clients adopted these practices with great success. However, our most ambitious goals were largely thwarted by a common problem in modern architectures: brittleness. Fortunately, the advent of the Model Context Protocol (MCP) and agentic AI have largely solved this problem for enterprise architects.
Fitness Functions
Building Evolutionary Architectures defines the concept of an architectural fitness function: any mechanism that provides an objective integrity check for architectural characteristics. Architects can think of fitness functions sort of like unit tests, but for architectural concerns.
While many fitness functions run like unit tests to test structure (using tools like ArchUnit, NetArchTest, PyTestArch, arch-go, and so on), architects can write fitness functions to validate all sorts of important checks…like tasks normally reserved for relational databases.
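As a flavor of what such a check can look like without any special library, here is a minimal pytest-style fitness function that enforces a layering rule. The src/domain and infrastructure package names are hypothetical; the point is that the rule is just code:

# fitness_test_layering.py -- run with pytest
import ast
import pathlib

def imported_modules(path: pathlib.Path) -> set[str]:
    """Collect every module name imported by a Python source file."""
    tree = ast.parse(path.read_text())
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return names

def test_domain_does_not_depend_on_infrastructure():
    # Hypothetical layering rule: domain code must not import the infrastructure layer.
    for source in pathlib.Path("src/domain").rglob("*.py"):
        offenders = {m for m in imported_modules(source)
                     if m.startswith("infrastructure")}
        assert not offenders, f"{source} imports {offenders}"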
Fitness functions and referential integrity
Consider the architecture illustrated in Figure 1.
Figure 1: Strategically splitting a database in a distributed architecture
In Figure 1, the team has decided to split the data into two databases for better scalability and availability. The common disadvantage of that approach is that the team can no longer rely on the database to enforce referential integrity. In this situation, each ticket must have a corresponding customer to model the workflow correctly.
While many teams seem to think that referential integrity is only possible within a relational database, we separate the governance activity (data integrity) from the implementation (the relational database) and realize we can create our own check using an architectural fitness function, as shown in Figure 2.
Figure 2: Implementing referential integrity as a fitness function
In Figure 2, the architect has created a small fitness function that monitors the queue between customer and ticket. When the queue depth drops to zero (meaning that the system isn’t processing any messages), the fitness function creates a set of customer keys from the customer service and a set of customer foreign keys from the ticket service and asserts that all of the ticket foreign keys are contained within the set of customer keys.
Why not just query the databases directly from the fitness function? Abstracting them as sets allows flexibility—querying across databases on a constant basis introduces overhead that may have negative side effects. Abstracting the fitness function check from the mechanics of how the data is stored to an abstract data structure has at least a couple of advantages. First, using sets allows architects to cache nonvolatile data (like customer keys), avoiding constant querying of the database. Many solutions exist for write-through caches in the rare event we do add a customer. Second, using sets of keys abstracts us from actual data items. Data engineers prefer synthetic keys to using domain data; the same is true for architects. While the database schema might change over time, the team will always need the relationship between customers and tickets, which this fitness function validates in an abstract way.
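The check itself is small once the two key sets are available. Here is a sketch in Python, with customer_keys() and ticket_customer_refs() standing in for whatever cache or service call supplies those sets in a real system:

def customer_keys() -> set[str]:
    """Stand-in for a (cached) read of customer primary keys."""
    return {"c1", "c2", "c3"}

def ticket_customer_refs() -> set[str]:
    """Stand-in for the customer foreign keys held by the ticket service."""
    return {"c1", "c3"}

def check_referential_integrity() -> None:
    orphans = ticket_customer_refs() - customer_keys()
    assert not orphans, f"tickets reference missing customers: {orphans}"

# Triggered when the queue between the two services drains to zero.
check_referential_integrity()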
Who executes this code? As this problem is typical in distributed architectures such as microservices, the common place to execute this governance code is within the service mesh of the microservices architecture. Service mesh is a general pattern for handling operational concerns in microservices, such as logging, monitoring, naming, service discovery, and other nondomain concerns. In mature microservices ecosystems, the service mesh also acts as a governance mesh, applying fitness functions and other rules at runtime.
This is a common way that architects at the application level can validate data integrity, and we’ve implemented these types of fitness functions on hundreds of projects. However, the specificity of the implementation details makes it difficult to expand the scope of these types of fitness functions to the enterprise architect level because they include too many implementation details about how the project works.
Brittleness for metadomains
One of the key lessons from domain-driven design was the idea of keeping implementation details as tightly bound as possible, using anticorruption layers to prevent integration points from understanding too many details. Architects have embraced this philosophy in architectures like microservices.
Yet we see the same problem here at the metalevel, where enterprise architects would like to broadly control concerns like data integrity yet are hampered by the distance and specificity of the governance requirement. Distance refers to the scope of the activity. While application and integration architects have a narrow scope of responsibility, enterprise architects by their nature sit at the enterprise level. Thus, for an enterprise architect to enforce governance such as referential integrity requires them to know too many specific details about how the team has implemented the project.
One of our biggest global clients has a role within their enterprise architecture group called evolutionary architect, whose job is to identify global governance concerns, and we have other clients who have tried to implement this level of holistic governance with their enterprise architects. However, the brittleness defeats these efforts: As soon as the team needs to change an implementation detail, the fitness function breaks. Even though we often couch fitness functions as “unit tests for architecture,” in reality, they break much less often than unit tests. (How often do changes affect some fundamental architectural concern versus a change to the domain?) However, by exposing implementation details outside the project to enterprise architects, these fitness functions do break enough to limit their value.
We’ve tried a variety of anticorruption layers for metaconcerns, but generative AI and MCP have provided the best solution to date.
MCP and Agentic Governance
MCP defines a general integration layer for agents to query and consume capabilities within a particular metascope. For example, teams can set up an MCP server at the application or integration architecture level to expose tools and data sources to AI agents. This provides the perfect anticorruption layer for enterprise architects to state the intent of governance without relying on implementation details.
This allows teams to implement the type of governance that strategically minded enterprise architects want while creating a level of indirection for the details. For example, see the updated referential integrity check illustrated in Figure 3.
Figure 3. Using MCP for indirection to hide the fitness function implementation details
In Figure 3, the enterprise architect issues the general request to validate referential integrity to the MCP server for the project. It in turn exposes fitness functions via tools (or data sources such as log files) to carry out the request.
By creating an anticorruption layer between the project details and enterprise architect, we can use MCP to handle implementation details so that when the project evolves in the future, it doesn’t break the governance because of brittleness, as shown in Figure 4.
Figure 4. Using agentic AI to create metalevel indirection
In Figure 4, the enterprise architect concern (validate referential integrity) hasn’t changed, but the project details have. The team added another service for experts, who work on tickets, meaning we now need to validate integrity across three databases. The team changes the internal MCP tool that implements the fitness function, and the enterprise architect request stays the same.
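To make the indirection concrete, here is a sketch of the project-side MCP server using the FastMCP helper from the Python MCP SDK. Treat the import path and decorator as assumptions to verify against the current SDK; the key-set functions are the same stand-ins as in the earlier sketch, and only their internals change when the team adds the third database:

# Hypothetical MCP server exposing the project's fitness function as a tool.
# Verify the FastMCP import and decorator against the current MCP Python SDK.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ticket-project-governance")

def customer_keys() -> set[str]:  # stand-in for the real key cache
    return {"c1", "c2", "c3"}

def ticket_customer_refs() -> set[str]:  # stand-in for the ticket (and expert) services
    return {"c1", "c3"}

@mcp.tool()
def validate_referential_integrity() -> str:
    """Answer the enterprise architect's intent-level request."""
    orphans = ticket_customer_refs() - customer_keys()
    return "OK" if not orphans else f"orphaned references: {sorted(orphans)}"

if __name__ == "__main__":
    mcp.run()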
This allows enterprise architects to effectively state governance intent without diving into implementation details, removing the brittleness of far-reaching fitness functions and enabling much more proactive holistic governance by architects at all levels.
Defining the Intersections of Architecture
In Architecture as Code, we discuss nine different intersections of software architecture with other parts of the software development ecosystem (data being one of them), all expressed as architectural fitness functions (the “code” part of architecture as code). In defining the intersection of architecture and enterprise architecture, we can use MCP and agents to state intent holistically, deferring the actual details to individual projects and ecosystems. This solves one of the nagging problems for enterprise architects who want to build more automated feedback loops within their systems.
MCP is almost ideally suited for this purpose, designed to expose tools, data sources, and prompt libraries to external contexts outside a particular project domain. This allows enterprise architects to holistically define broad intent and leave it to teams to implement (and evolve) their solutions.
X as code (where X can be a wide variety of things) typically arises when the software development ecosystem reaches a certain level of maturity and automation. Teams tried for years to make infrastructure as code work, but it didn’t until tools such as Puppet and Chef came along that could enable that capability. The same is true with other “as code” initiatives (security, policy, and so on): The ecosystem needs to provide tools and frameworks to allow it to work. Now, with the combination of powerful fitness function libraries for a wide variety of platforms and ecosystem innovations such as MCP and agentic AI, architecture itself has enough support to join the “as code” communities.
Learn more about how AI is reshaping enterprise architecture at the Software Architecture Superstream on December 9. Join host Neal Ford and a lineup of experts including Metro Bank’s Anjali Jain and Philip O’Shaughnessy, Vercel’s Dom Sipowicz, Intel’s Brian Rogers, Microsoft’s Ron Abellera, and Equal Experts’ Lewis Crawford to hear hard-won insights about building adaptive, AI-ready architectures that support continuous innovation, ensure governance and security, and align seamlessly with business goals.
O’Reilly members can register here. Not a member? Sign up for a 10-day free trial before the event to attend—and explore all the other resources on O’Reilly.
The following originally appeared on fast.ai and is reposted here with the author’s permission.
I’ve spent decades teaching people to code, building tools that help developers work more effectively, and championing the idea that programming should be accessible to everyone. Through fast.ai, I’ve helped millions learn not just to use AI but to understand it deeply enough to build things that matter.
But lately, I’ve been deeply concerned. The AI agent revolution promises to make everyone more productive, yet what I’m seeing is something different: developers abandoning the very practices that lead to understanding, mastery, and software that lasts. When CEOs brag about their teams generating 10,000 lines of AI-written code per day, when junior engineers tell me they’re “vibe-coding” their way through problems without understanding the solutions, are we racing toward a future where no one understands how anything works, and competence craters?
I needed to talk to someone who embodies the opposite approach: someone whose code continues to run the world decades after he created it. That’s why I called Chris Lattner, cofounder and CEO of Modular AI and creator of LLVM, the Clang compiler, the Swift programming language, and the MLIR compiler infrastructure.
Chris and I chatted on Oct 5, 2025, and he kindly let me record the conversation. I’m glad I did, because it turned out to be thoughtful and inspiring. Check out the video for the full interview, or read on for my summary of what I learned.
Talking with Chris Lattner
Chris Lattner builds infrastructure that becomes invisible through ubiquity.
Twenty-five years ago, as a PhD student, he created LLVM: the most fundamental system for translating human-written code into instructions computers can execute. In 2025, LLVM sits at the foundation of most major programming languages and compilers: the Rust that powers Firefox, the Swift running on your iPhone, and even Clang, the C++ compiler Chris created that Google and Apple now use to build their most critical software. He describes the Swift programming language he created as “syntax sugar for LLVM.” Today it powers the entire iPhone/iPad ecosystem.
When you need something to last not just years but decades, to be flexible enough that people you’ll never meet can build things you never imagined on top of it, you build it the way Chris built LLVM, Clang, and Swift.
I first met Chris when he arrived at Google in 2017 to help them with TensorFlow. Instead of just tweaking it, he did what he always does: he rebuilt from first principles. He created MLIR (think of it as LLVM for modern hardware and AI), and then left Google to create Mojo: a programming language designed to finally give AI developers the kind of foundation that could last.
Chris architects systems that become the bedrock others build on for decades, by being a true craftsman. He cares deeply about the craft of software development.
I told Chris about my concerns, and the pressures I was feeling as both a coder and a CEO:
“Everybody else around the world is doing this, ‘AGI is around the corner. If you’re not doing everything with AI, you’re an idiot.’ And honestly, Chris, it does get to me. I question myself… I’m feeling this pressure to say, ‘Screw craftsmanship, screw caring.’ We hear VCs say, ‘My founders are telling me they’re getting out 10,000 lines of code a day.’ Are we crazy, Chris? Are we old men yelling at the clouds, being like, ‘Back in my day, we cared about craftsmanship’? Or what’s going on?”
Chris told me he shares my concerns:
“A lot of people are saying, ‘My gosh, tomorrow all programmers are going to be replaced by AGI, and therefore we might as well give up and go home. Why are we doing any of this anymore? If you’re learning how to code or taking pride in what you’re building, then you’re not doing it right.’ This is something I’m pretty concerned about…
But the question of the day is: how do you build a system that can actually last more than six months?”
He showed me that the answer to that question is timeless, and actually has very little to do with AI.
Design from First Principles
Chris’s approach has always been to ask fundamental questions. “For me, my journey has always been about trying to understand the fundamentals of what makes something work,” he told me. “And when you do that, you start to realize that a lot of the existing systems are actually not that great.”
When Chris started LLVM over Christmas break in 2000, he was asking: what does a compiler infrastructure need to be, fundamentally, to support languages that don’t exist yet? When he came into the AI world he was eager to learn the problems I saw with TensorFlow and other systems. He then zoomed into what AI infrastructure should look like from the ground up. Chris explained:
“The reason that those systems were fundamental, scalable, successful, and didn’t crumble under their own weight is because the architecture of those systems actually worked well. They were well-designed, they were scalable. The people that worked on them had an engineering culture that they rallied behind because they wanted to make them technically excellent.
In the case of LLVM, for example, it was never designed to support the Rust programming language or Julia or even Swift. But because it was designed and architected for that, you could build programming languages, Snowflake could go build a database optimizer—which is really cool—and a whole bunch of other applications of the technology came out of that architecture.”
Chris pointed out that he and I have a certain interest in common: “We like to build things, and we like to build things from the fundamentals. We like to understand them. We like to ask questions.” He has found (as have I!) that this is critical if you want your work to matter, and to last.
Of course, building things from the fundamentals doesn’t always work. But as Chris said, “if we’re going to make a mistake, let’s make a new mistake.” Doing the same thing as everyone else in the same way as everyone else isn’t likely to do work that matters.
Craftsmanship and Architecture
Chris pointed out that software engineering isn’t just about an individual churning out code: “A lot of evolving a product is not just about getting the results; it’s about the team understanding the architecture of the code.” And in fact it’s not even just about understanding, but that he’s looking for something much more than that. “For people to actually give a damn. For people to care about what they’re doing, to be proud of their work.”
I’ve seen that it’s possible for teams that care and build thoughtfully to achieve something special. I pointed out to him that “software engineering has always been about trying to get a product that gets better and better, and your ability to work on that product gets better and better. Things get easier and faster because you’re building better and better abstractions and better and better understandings in your head.”
Chris agreed. He again stressed the importance of thinking longer term:
“Fundamentally, with most kinds of software projects, the software lives for more than six months or a year. The kinds of things I work on, and the kinds of systems you like to build, are things that you continue to evolve. Look at the Linux kernel. The Linux kernel has existed for decades with tons of different people working on it. That is made possible by an architect, Linus, who is driving consistency, abstractions, and improvement in lots of different directions. That longevity is made possible by that architectural focus.”
This kind of deep work doesn’t just benefit the organization, but benefits every individual too. Chris said:
“I think the question is really about progress. It’s about you as an engineer. What are you learning? How are you getting better? How much mastery do you develop? Why is it that you’re able to solve problems that other people can’t?… The people that I see doing really well in their careers, their lives, and their development are the people that are pushing. They’re not complacent. They’re not just doing what everybody tells them to do. They’re actually asking hard questions, and they want to get better. So investing in yourself, investing in your tools and techniques, and really pushing hard so that you can understand things at a deeper level—I think that’s really what enables people to grow and achieve things that they maybe didn’t think were possible a few years before.”
This is what I tell my team too. The thing I care most about is whether they’re always improving at their ability to solve those problems.
Dogfooding
But caring deeply and thinking architecturally isn’t enough if you’re building in a vacuum.
I’m not sure it’s really possible to create great software if you’re not using it yourself, or working right next to your users. When Chris and his team were building the Swift language, they had to build it in a vacuum of Apple secrecy. He shares:
“The using your own product piece is really important. One of the big things that caused the IDE features and many other things to be a problem with Swift is that we didn’t really have a user. We were building it, but before we launched, we had one test app that was kind of ‘dogfooded’ in air quotes, but not really. We weren’t actually using it in production at all. And by the time it launched, you could tell. The tools didn’t work, it was slow to compile, crashed all the time, lots of missing features.”
His new Mojo project is taking a very different direction:
“With Mojo, we consider ourselves to be the first customer. We have hundreds of thousands of lines of Mojo code, and it’s all open source… That approach is very different. It’s a product of experience, but it’s also a product of building Mojo to solve our own problems. We’re learning from the past, taking best principles in.”
The result is evident. Already at this early stage, models built on Mojo are getting state-of-the-art results. Most of Mojo is written in Mojo, so if something isn’t working well, they are the first ones to notice.
We had a similar goal at fast.ai with our Solveit platform: we wanted to reach a point where most of our staff chose to do most of their work in Solveit because they preferred it. (Indeed, I’m writing this article in Solveit right now!) Before we reached that point, I often had to force myself to use Solveit in order to experience firsthand the shortcomings of those early versions, so that I could deeply understand the issues. Having done so, I now appreciate even more how smoothly everything works!
But this kind of deep, experiential understanding is exactly what we risk losing when we delegate too much to AI.
AI, Craftsmanship, and Learning
Chris uses AI: “I think it’s a very important tool. I feel like I get a 10 to 20% improvement—some really fancy code completion and autocomplete.” But with Chris’ focus on the importance of craftsmanship and continual learning and improvement, I wondered if heavy AI (and particularly agent) use (“vibe coding”) might negatively impact organizations and individuals.
Chris: When you’re vibe-coding things, suddenly… another thing I’ve seen is that people say, ‘Okay, well maybe it’ll work.’ It’s almost like a test. You go off and say, ‘Maybe the agentic thing will go crank out some code,’ and you spend all this time waiting on it and coaching it. Then, it doesn’t work.
Jeremy: It’s like a gambling machine, right? Pull the lever again, try again, just try again.
Chris: Exactly. And again, I’m not saying the tools are useless or bad, but when you take a step back and you look at where it’s adding value and how, I think there’s a little bit too much enthusiasm of, ‘Well, when AGI happens, it’s going to solve the problem. I’m just waiting and seeing.’ Here’s another aspect of it: the anxiety piece. I see a lot of junior engineers coming out of school, and they’re very worried about whether they’ll be able to get a job. A lot of things are changing, and I don’t really know what’s going to happen. But to your point earlier, a lot of them say, ‘Okay, well, I’m just going to vibe-code everything,’ because this is ‘productivity’ in air quotes. I think that’s also a significant problem.
Jeremy: Seems like a career killer to me.
Chris: …If you get sucked into, ‘Okay, well I need to figure out how to make this thing make me a 10x programmer,’ it may be a path that doesn’t bring you to developing at all. It may actually mean that you’re throwing away your own time, because we only have so much time to live on this earth. It can end up retarding your development and preventing you from growing and actually getting stuff done.
At its heart, Chris’s concern is that AI-heavy coding and craftsmanship just don’t appear to be compatible:
“Software craftsmanship is the thing that AI code threatens. Not because it’s impossible to use properly—again, I use it, and I feel like I’m doing it well because I care a lot about the quality of the code. But because it encourages folks to not take the craftsmanship, design, and architecture seriously. Instead, you just devolve to getting your bug queue to be shallower and making the symptoms go away. I think that’s the thing that I find concerning.”
“What you want to get to, particularly as your career evolves, is mastery. That’s how you kind of escape the thing that everybody can do and get more differentiation… The concern I have is this culture of, ‘Well, I’m not even going to try to understand what’s going on. I’m just going to spend some tokens, and maybe it’ll be great.’”
I asked if he had some specific examples where he’s seen things go awry.
“I’ve seen a senior engineer, when a bug gets reported, let the agentic loop rip, go spend some tokens, and maybe it’ll come up with a bug fix and create a PR. This PR, however, was completely wrong. It made the symptom go away, so it ‘fixed’ the bug in air quotes, but it was so wrong that if it had been merged, it would have just made the product way worse. You’re replacing one bug with a whole bunch of other bugs that are harder to understand, and a ton of code that’s just in the wrong place doing the wrong thing. That is deeply concerning. The actual concern is not this particular engineer because, fortunately, they’re a senior engineer and smart enough not to just say, ‘Okay, pass this test, merge.’ We also do code review, which is a very important thing. But the concern I have is this culture of, ‘Well, I’m not even going to try to understand what’s going on. I’m just going to spend some tokens, and maybe it’ll be great. Now I don’t have to think about it.’ This is a huge concern because a lot of evolving a product is not just about getting the results; it’s about the team understanding the architecture of the code. If you’re delegating knowledge to an AI, and you’re just reviewing the code without thinking about what you want to achieve, I think that’s very, very concerning.”
Some folks have told me they think that unit tests are a particularly good place to look at using AI more heavily. Chris urges caution, however:
“AI is really great at writing unit tests. This is one of the things that nobody likes to do. It feels super productive to say, ‘Just crank out a whole bunch of tests,’ and look, I’ve got all this code, amazing. But there’s a problem, because unit tests are their own potential tech debt. The tests may not be testing the right thing, or they might be testing a detail of the thing rather than the real idea of the thing… And if you’re using mocking, now you get all these super tightly bound implementation details in your tests, which make it very difficult to change the architecture of your product as things evolve. Tests are just like the code in your main application—you should think about them. Also, lots of tests take a long time to run, and so they impact your future development velocity.”
Part of the problem, Chris noted, is that many people are using high lines of code written as a statistic to support the idea that AI is making a positive impact.
“To me, the question is not how do you get the most code. I’m not a CEO bragging about the number of lines of code written by AI; I think that’s a completely useless metric. I don’t measure progress based on the number of lines of code written. In fact, I see verbose, redundant, not well-factored code as a huge liability… The question is: how productive are people at getting stuff done and making the product better? This is what I care about.”
Underlying all of these concerns is the belief that AGI is imminent, and therefore traditional approaches to software development are obsolete. Chris has seen this movie before. “In 2017, I was at Tesla working on self-driving cars, leading the Autopilot software team. I was convinced that in 2020, autonomous cars would be everywhere and would be solved. It was this desperate race to go solve autonomy… But at the time, nobody even knew how hard that was. But what was in the air was: trillions of dollars are at stake, job replacement, transforming transportation… I think today, exactly the same thing is happening. It’s not about self-driving, although that is making progress, just a little bit less gloriously and immediately than people thought. But now it’s about programming.”
Chris thinks that, like all previous technologies, AI progress isn’t actually exponential. “I believe that progress looks like S-curves. Pre-training was a big deal. It seemed exponential, but it actually S-curved out and got flat as things went on. I think that we have a number of piled-up S-curves that are all driving forward amazing progress, but I at least have not seen that spark.”
The danger isn’t just that people might be wrong about AGI’s timeline—it’s what happens to their careers and codebases while they’re waiting. “Technology waves cause massive hype cycles, overdrama, and overselling,” Chris noted. “Whether it be object-oriented programming in the ’80s where everything’s an object, or the internet wave in the 2000s where everything has to be online otherwise you can’t buy a shirt or dog food. There’s truth to the technology, but what ends up happening is things settle out, and it’s less dramatic than initially promised. The question is, when things settle out, where do you as a programmer stand? Have you lost years of your own development because you’ve been spending it the wrong way?”
Chris is careful to clarify that he’s not anti-AI—far from it. “I am a maximalist. I want AI in all of our lives,” he told me. “However, the thing I don’t like is the people that are making decisions as though AGI or ASI were here tomorrow… Being paranoid, being anxious, being afraid of living your life and of building a better world seems like a very silly and not very pragmatic thing to do.”
Software Craftsmanship with AI
Chris sees the key as understanding the difference between using AI as a crutch versus using it as a tool that enhances your craftsmanship. He finds AI particularly valuable for exploration and learning:
“It’s amazing for learning a codebase you’re not familiar with, so it’s great for discovery. The automation features of AI are super important. Getting us out of writing boilerplate, getting us out of memorizing APIs, getting us out of looking up that thing from Stack Overflow; I think this is really profound. This is a good use. The thing that I get concerned about is if you go so far as to not care about what you’re looking up on Stack Overflow and why it works that way and not learning from it.”
One principle Chris and I share is the critical importance of tight iteration loops. For Chris, working on systems programming, this means “edit the code, compile, run it, get a test that fails, and then debug it and iterate on that loop… Running tests should take less than a minute, ideally less than 30 seconds.” He told me that when working on Mojo, one of the first priorities was “building VS Code support early because without tools that let you create quick iterations, all of your work is going to be slower, more annoying, and more wrong.”
My background is different—I am a fan of the Smalltalk, Lisp, and APL tradition where you have a live workspace and every line of code manipulates objects in that environment. When Chris and I first worked together on Swift for TensorFlow, the first thing I told him was “I’m going to need a notebook.” Within a week, he had built me complete Swift support for Jupyter. I could type something, see the result immediately, and watch my data transform step-by-step through the process. This is the Bret Victor “Inventing on Principle” style of being close to what you’re crafting.
If you want to maintain craftsmanship while using AI, you need tight iteration loops so you can see what’s happening. You need a live workspace where you (and the AI) are manipulating actual state, not just writing text files.
At fast.ai, we’ve been working to put this philosophy into practice with our Solveit platform. We discovered a key principle: the AI should be able to see exactly what the human sees, and the human should be able to see exactly what the AI sees at all times. No separate instruction files, no context windows that don’t match your actual workspace—the AI is right there with you, supporting you as you work.
This creates what I think of as “a third participant in this dialogue”—previously I had a conversation with my computer through a REPL, typing commands and seeing results. Now the AI is in that conversation too, able to see my code, my data, my outputs, and my thought process as I work through problems. When I ask “does this align with what we discussed earlier” or “have we handled this edge case,” the AI doesn’t need me to copy-paste context—it’s already there.
One of our team members, Nate, built something called ShellSage that demonstrates this beautifully. He realized that tmux already shows everything that’s happened in your shell session, so he just added a command that talks to an LLM. That’s it—about 100 lines of code. The LLM can see all your previous commands, questions, and output. By the next day, all of us were using it constantly. Another team member, Eric, built our Discord Buddy bot using this same approach—he didn’t write code in an editor and deploy it. He typed commands one at a time in a live symbol table, manipulating state directly. When it worked, he wrapped those steps into functions. No deployment, no build process—just iterative refinement of a running system.
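The pattern is simple enough to sketch. A minimal ShellSage-style script (an illustration only, not Nate’s actual code) just captures recent tmux scrollback and hands it to an LLM along with your question; the model name and the use of the openai client here are assumptions, and any chat-capable model would do.

```python
# Minimal ShellSage-style sketch (illustrative only, not the real tool).
# Assumes tmux is running and the openai package plus OPENAI_API_KEY are available.
import subprocess
import sys

from openai import OpenAI


def ask_shell(question: str, history_lines: int = 200) -> str:
    # Capture the last `history_lines` lines of the current tmux pane, scrollback included.
    history = subprocess.run(
        ["tmux", "capture-pane", "-p", "-S", f"-{history_lines}"],
        capture_output=True, text=True, check=True,
    ).stdout

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat model would work here
        messages=[
            {"role": "system",
             "content": "You are a helpful shell assistant. Use the terminal history to answer."},
            {"role": "user",
             "content": f"Terminal history:\n{history}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(ask_shell(" ".join(sys.argv[1:])))
```

The point isn’t the specific API; it’s that the assistant inherits the full session as context instead of starting from a blank prompt.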
Eric Ries has been writing his new book in Solveit and the AI can see exactly what he writes. He asks questions like “does this paragraph align with the mission we stated earlier?” or “have we discussed this case study before?” or “can you check my editor’s notes for comments on this?” The AI doesn’t need special instructions or context management—it’s in the trenches with him, watching the work unfold. (I’m writing this article in Solveit right now, for the same reasons.)
I asked Chris how he thinks about the approach we’re taking with Solveit: “instead of bringing in a junior engineer that can just crank out code, you’re bringing in a senior expert, a senior engineer, an advisor—somebody that can actually help you make better code and teach you things.”
How Do We Do Something Meaningful?
Chris and I both see a bifurcation coming. “It feels like we’re going to have a bifurcation of skills,” I told him, “because people who use AI the wrong way are going to get worse and worse. And the people who use it to learn more and learn faster are going to outpace the speed of growth of AI capabilities because they’re human with the benefit of that… There’s going to be this group of people that have learned helplessness and this maybe smaller group of people that everybody’s like, ‘How does this person know everything? They’re so good.’”
The principles that allowed LLVM to last 25 years—architecture, understanding, craftsmanship—haven’t changed. “The question is, when things settle out, where do you as a programmer stand?” Chris asked. “Have you lost years of your own development because you’ve been spending it the wrong way? And now suddenly everybody else is much further ahead of you in terms of being able to create productive value for the world.”
His advice is clear, especially for those just starting out: “If I were coming out of school, my advice would be don’t pursue that path. Particularly if everybody is zigging, it’s time to zag. What you want to get to, particularly as your career evolves, is mastery. So you can be the senior engineer. So you can actually understand things to a depth that other people don’t. That’s how you escape the thing that everybody can do and get more differentiation.”
The hype will settle. The tools will improve. But the question Chris poses remains: “How do we actually add value to the world? How do we do something meaningful? How do we move the world forward?” For both of us, the answer involves caring deeply about our craft, understanding what we’re building, and using AI not as a replacement for thinking but as a tool to think more effectively. If the goal is to build things that last, you’re not going to be able to outsource that to AI. You’ll need to invest deeply in yourself.
Picture this: You’re a data analyst on day one at a midsize SaaS company. You’ve got the beginnings of a data warehouse—some structured, usable data and plenty of raw data you’re not quite sure what to do with yet. But that’s not the real problem. The real problem is that different teams are doing their own thing: Finance has Power BI models loaded with custom DAX and Excel connections. Sales is using Tableau connected to the central data lake. Marketing has some bespoke solution you haven’t figured out yet. If you’ve worked in data for any number of years, this scene probably feels familiar.
Then a finance director emails: Why does ARR show as $250M in my dashboard when Sales just reported $275M in their call?
No problem, you think. You’re a data analyst; this is what you do. You start digging. What you find isn’t a simple calculation error. Finance and sales are using different date dimensions, so they’re measuring different time periods. Their definitions of what counts as “revenue” don’t match. Their business unit hierarchies are built on completely different logic: one buried in a Power BI model, the other hardcoded in a Tableau calculation. You trace the problem through layers of custom notebooks, dashboard formulas, and Excel workbooks and realize that creating a single version of the truth that’s governable, stable, and maintainable isn’t going to be easy. It might not even be possible without rebuilding half the company’s data infrastructure and achieving a level of compliance from other data users that would be a full-time job in itself.
This is where the semantic layer comes in—what VentureBeat has called the “$1 trillion AI problem.” Think of it as a universal translator for your data: It’s a single place where you define what your metrics mean, how they’re calculated, and who can access them. The semantic layer is software that sits between your data sources and your analytics tools, pulling in data from wherever it lives, adding critical business context (relationships, calculations, descriptions), and serving it to any downstream tool in a consistent format. The result? Secure, performant access that enables genuinely practical self-service analytics.
Why does this matter now? As we’ll see when we return to the ARR problem, one force is driving the urgency: AI.
Legacy BI tools were never built with AI in mind, creating two critical gaps. First, all the logic and calculations scattered across your Power BI models, Tableau workbooks, and Excel spreadsheets aren’t accessible to AI tools in any meaningful way. Second, the data itself lacks the business context AI needs to use it accurately. An LLM looking at raw database tables doesn’t know that “revenue” means different things to finance and sales, or why certain records should be excluded from ARR calculations.
The semantic layer solves both problems. It makes data more trustworthy across traditional BI tools like Tableau, Power BI, and Excel while also giving AI tools the context they need to work accurately. Initial research shows near 100% accuracy across a wide range of queries when pairing a semantic layer with an LLM, compared to much lower performance when connecting AI directly to a data warehouse.
So how does this actually work? Let’s return to the ARR dilemma.
The core problem: multiple versions of the truth. Sales has one definition of ARR; finance has another. Analysts caught in the middle spend days investigating, only to end up with “it depends” as their answer. Decision making grinds to a halt because no one knows which number to trust.
This is where the semantic layer delivers its biggest value: a single source for defining and storing metrics. Think of it as the authoritative dictionary for your company’s data. ARR gets one definition, one calculation, one source of truth, all stored in the semantic layer and accessible to everyone who needs it.
You might be thinking, “Can’t I do this in my data warehouse or BI tool?” Technically, yes. But here’s what makes semantic layers different: modularity and context.
Once you define ARR in the semantic layer, it becomes a modular, reusable object—any tool that connects to it can use that metric: Tableau, Power BI, Excel, your new AI chatbot, whatever. The metric carries its business context with it: what it means, how it’s calculated, who can access it, and why certain records are included or excluded. You’re not rebuilding the logic in each tool; you’re referencing a single, governed definition.
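To make “define once, reference everywhere” concrete, here’s a deliberately simplified sketch in plain Python (no particular vendor’s API, and the table and column names are placeholders; real semantic layer platforms model far more than this): the metric carries its calculation, description, and filters, and each tool asks the layer to render the governed definition rather than reimplementing it.

```python
# Illustrative sketch only: a governed metric defined once, rendered on demand.
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str
    sql: str                          # the single governed calculation
    description: str
    filters: list[str] = field(default_factory=list)


ARR = Metric(
    name="arr",
    sql="SUM(contract_value) * 12",
    description="Annual recurring revenue; excludes trial customers.",
    filters=["customer_type != 'trial'"],
)


def render_query(metric: Metric, table: str) -> str:
    """Render the governed definition as SQL for whichever tool is asking."""
    where = f" WHERE {' AND '.join(metric.filters)}" if metric.filters else ""
    return f"SELECT {metric.sql} AS {metric.name} FROM {table}{where}"


# Tableau, Power BI, Excel, or an AI chatbot all receive the same calculation:
print(render_query(ARR, "analytics.subscriptions"))
```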
This creates three immediate wins:
Single version of truth: Everyone uses the same ARR calculation, whether they’re in finance or sales or they’re pulling it into a machine learning model.
Effortless lineage: You can trace exactly where ARR is used across your organization and see its full calculation path.
Change management that actually works: When your CFO decides next quarter that ARR should exclude trial customers, you update the definition once in the semantic layer. Every dashboard, report, and AI tool that uses ARR gets the update automatically. No hunting through dozens of Tableau workbooks, Power BI models, and Python notebooks to find every hardcoded calculation.
Which brings us to the second key function of a semantic layer: interoperability.
Back to our finance director and that ARR question. With a semantic layer in place, here’s what changes. She opens Excel and pulls ARR directly from the semantic layer: $265M. The sales VP opens his Tableau dashboard, connects to the same semantic layer, and sees $265M. Your company’s new AI chatbot? Someone asks, “What’s our Q3 ARR?” and it queries the semantic layer: $265M. Same metric, same calculation, same answer, regardless of the tool.
This is what makes semantic layers transformative. They sit between your data sources and every tool that needs to consume that data. Power BI, Tableau, Excel, Python notebooks, LLMs, the semantic layer doesn’t care. You define the metric once, and every tool can access it through standard APIs or protocols. No rebuilding the logic in DAX for Power BI, then again in Tableau’s calculation language, then again in Excel formulas, then again for your AI chatbot.
Before semantic layers, interoperability meant compromise. You’d pick one tool as the “source of truth” and force everyone to use it, or you’d accept that different teams would have slightly different numbers. Neither option scales. With a semantic layer, your finance team keeps Excel, your sales team keeps Tableau, your data scientists keep Python, and your executives can ask questions in plain English to an AI assistant. They all get the same answer because they’re all pulling from the same governed definition.
Back to day one. You’re still a data analyst at that SaaS company, but this time there’s a semantic layer in place.
The finance director emails, but the question is different: “Can we update ARR to include our new business unit?”
Without a semantic layer, this request means days of work: updating Power BI models, Tableau dashboards, Excel reports, and AI integrations one by one. Coordinating with other analysts to understand their implementations. Testing everything. Hoping nothing breaks.
With a semantic layer? You log in to your semantic layer software and see the ARR definition: the calculation, the source tables, every tool using it. You update the logic once to include the new business unit. Test it. Deploy it. Every downstream tool—Power BI, Tableau, Excel, the AI chatbot—instantly reflects the change.
What used to take days now takes hours. What used to require careful coordination across teams now happens in one place. The finance director gets her answer, Sales sees the same number, and nobody’s reconciling spreadsheets at 5PM on Friday.
This is what analytics can be: consistent, flexible, and actually self-service. The semantic layer doesn’t just solve the ARR problem—it solves the fundamental challenge of turning data into trusted insights. One definition, any tool, every time.
Headlines surfaced by a simple “job market” search describe it as “a humiliation ritual” or “hell” and “an emerging crisis for entry-level workers.” The unemployment rate in the US for recent graduates is at an “unusually high” 5.8%—even Harvard Business School graduates have been taking months to find work. Inextricable from this conversation is AI’s potential to automate entry-level jobs and its use as a tool for employers to evaluate applications. But the widespread availability of generative AI platforms raises an overlooked question: How are job seekers themselves using AI?
An interview study with upcoming master’s graduates at an elite UK university* sheds some light. In contrast to popular narratives about “laziness” or “shortcuts,” AI use comes from job seekers trying to strategically tackle the digitally saturated, competitive reality of today’s job market. Here are the main takeaways:
They Use AI to Play an Inevitable Numbers Game
Job seekers described feeling the need to apply to a high volume of jobs because of how rare it is to get a response amid the competition. They send out countless applications on online portals and rarely receive so much as an automated rejection email. As Franco, a 29-year-old communications student, put it, particularly with “LinkedIn and job portals” saturating the market, his CV is just one “in a spreadsheet of 2,000 applicants.”
This context underlies how job seekers use AI, which allows them to spend less time on any given application by helping to tailor résumés or write cover letters and thus put out more applications. Seoyeon, a 24-year-old communications student, describes how she faced repeated rejections no matter how carefully she crafted the application or how qualified she was.
[Employers] themselves are going to use AI to screen through those applications….And after a few rejections, it really frustrates you because you put in so much effort and time and passion for this one application to learn that it’s just filtered through by some AI….After that, it makes you lean towards, you know what, I’m just gonna put less effort into one application but apply for as many jobs as possible.
Seoyeon went on to say later that she even asks AI to tell her what “keywords” she should have in her application in light of AI in hiring systems.
Her reflection reveals that AI use is not a shortcut, but that it feels like a necessity to deal with the inevitable rejection and AI scanners, especially in light of companies themselves using AI to read applications—making her “passion” feel like a waste.
AI as a Savior from Emotional Labor
Applying to jobs means dealing with constant rejection and little human interaction, a deeply emotional process that students describe as “draining” and “torturing.” This illuminates that AI is a way to reduce not just the time the labor takes but also its emotional weight.
Franco felt that having to portray himself as “passionate” for hundreds of jobs that he would not even hear back from was an “emotional toll” that AI helped him manage.
Repeating this process to a hundred job applications, a hundred job positions and having to rewrite a cover letter in a way that sounds like if it was your dream, well I don’t know if you can have a hundred dreams.…I would say that it does have an emotional toll….I think that AI actually helps a lot in terms of, okay, I’m going to help you do this cover letter so you don’t have to mentally feel you’re not going to get the shot.
Using AI thus acted as a buffer for the emotional difficulties of being a job seeker, allowing students to conserve mental energy in a grueling process while still applying to many jobs.
The More Passionate They Are, the Less AI They Use
AI use was not uniform by any means, even though the job application process often requires the same materials. Job seekers had “passion parameters” in place, dialing down their AI use for jobs they were more passionate about.
Joseph, a 24-year-old psychology student, put this “human involvement” as “definitely more than 50%” for a role he truly desires, whereas for a less interesting role, it’s about “20%–30%.” He differentiates this by describing how, when passion is involved, he does deep research into the company as opposed to relying on AI’s “summarized, nuance-lacking information,” and writes the cover letter from scratch—only using AI to be critical of it. In contrast, for less desirable jobs, AI plays a much more generative role in creating the initial draft that he then edits.
This points to the fact that while AI feels important for labor efficiency, students do not use it indiscriminately, especially when passion is involved and they want to put their best foot forward.
They Understand AI’s Flaws (and Work Around Them)
In their own words, students are not heedlessly “copying and pasting” AI-generated materials. They are critical of AI tools and navigate them with their concerns in mind.
Common flaws in AI-generated material include sounding “robotic” and “machine-like,” with telltale AI-sounding words including “explore” and “delve into.” Joseph asserted that he can easily tell which text is written by a human, because AI-generated text lacks the “passion and zeal” of someone who is genuinely hungry for the job.
Nandita, a 23-year-old psychology student, shared how AI’s tendency to “put you on a pedestal” came through in misrepresenting facts. When she asked AI to tailor her résumé, it embellished her experience of “a week-long observation in a psychology clinic” into “community service,” which she strongly felt it wasn’t. She surmised this happened because community service was mentioned in the job description she fed the AI, and she caught and corrected the embellishment.
Consequently, using AI in the job hunt is not a passive endeavor but requires vigilance and a critical understanding to ensure its flaws do not hurt you as a job seeker.
They Grapple with AI’s Larger Implications
Using AI is not an unconditional endorsement of the technology; all the students were cognizant of (and worried about) its wider social implications.
John, a 24-year-old data science student, drew a distinction between using AI in impersonal processes versus human experiences. While he would use it for “a cover letter” for a job he suspects will be screened by AI anyway, he worries how it will be used in other parts of life.
I think it’s filling in parts of people’s lives that they don’t realize are very fundamental to who they are as humans. One example I’ve always thought of is, if you need it for things like cover letters, [that’s OK] just because it’s something where it’s not very personal.…But if you can’t write a birthday card without using ChatGPT, that’s a problem.
Nandita voiced a similar critique, drawing on her psychology background; while she could see AI helping tasks like “admin work,” she worries about how it would be used for therapy. She argues that an AI therapist would be “100% a Western…thing” and would fail to connect with someone “from the rural area in India.”
This understanding shows that graduates differentiate between using AI for impersonal processes, like job searching in the digital age, and using it in more human-to-human situations, where they see it as a threat.
Some Grads Are Opting Out of AI Use
Though most people interviewed were using AI, some rejected it entirely. They voiced the same qualms AI users had, including AI-generated text sounding “robotic” and not “human.” Julia, a 23-year-old law student, specifically mentioned that her field requires “language and persuasiveness,” with “a human tone” that AI cannot replicate, and that not using it would “set you apart” in job applications.
Mark, a 24-year-old sociology student, acknowledged the same concerns as AI users about a saturated online arms race, but instead of using AI to send out as many applications as possible, had a different strategy in mind: “talking to people in real life.” He described how he once secured a research job through a connection in the smoking area of a pub.
Importantly, these job seekers had similar challenges with the job market as AI users, but they opted for different strategies to handle it that emphasize human connection and voice.
Conclusion
For graduate job seekers, AI use is a layered strategy that is a direct response to the difficulties of the job market. It is not about cutting corners but carefully adapting to current circumstances that require new forms of digital literacy.
Moving away from a dialogue that frames job seekers as lazy or unable to write their own materials forces us to look at how the system itself can be improved for applicants and companies alike. If employers don’t want AI use, how can they create a process that makes room for human authenticity instead of AI-generated materials that sustain the broken cycle of hiring?
In this episode, Laurence Moroney, director of AI at Arm, joins Ben Lorica to chat about the state of deep learning frameworks—and why you may be better off thinking a step higher, on the solution level. Listen in for Laurence’s thoughts about posttraining; the evolution of on-device AI (and how tools like ExecuTorch and LiteRT are helping make it possible); why culturally specific models will only grow in importance; what Hollywood can teach us about LLM privacy; and more.
About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone’s agenda. In 2025, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.
Check out other episodes of this podcast on the O’Reilly learning platform.
Transcript
This transcript was created with the help of AI and has been lightly edited for clarity.
00.00: All right. So today we have Laurence Moroney, director of AI at Arm and author of the book AI and ML for Coders in PyTorch. Laurence is someone I’ve known for a while. He was at Google serving as one of the main evangelists for TensorFlow. So welcome to the podcast, Laurence.
00.26: I guess, before we go on to the present, let’s talk about a little bit of the past of deep learning frameworks. In fact, this week is interesting because Soumith Chintala just announced he was leaving Meta, and Soumith was one of the leaders of the PyTorch project. I interviewed Soumith in an O’Reilly podcast after PyTorch was released, and coincidentally, about a year before that, I had interviewed Rajat Monga right around the time that TensorFlow was released. I was actually talking to these project leaders very early on.
So, Laurence, you moved your book to PyTorch, and I’m sure TensorFlow still holds a special place in your heart, right? So where does TensorFlow sit right now in your mind? Because right now it’s all about PyTorch, right?
01.25: Yeah, that’s a great question. TensorFlow definitely has a very special place in my heart. I built a lot of my recent career on TensorFlow. I’ll be frank. It feels like there’s not that much investment in TensorFlow anymore.
If you take a look at even releases, it went 2.8, 2.9, 2.10, 2.11. . .and you know, there’s no 3.0 on the horizon. I can’t really share any insider stuff from Google, although I left there over a year ago, but it does feel that unfortunately [TensorFlow has] kind of withered on the vine a little bit internally at Google compared to JAX.
02.04: But then the problem, at least for me from an external perspective, is, first of all, JAX isn’t really a machine learning framework. There are machine learning frameworks that are built on top of it. And second of all, it’s not a 1.0 product. It’s hard for me to encourage anybody to bet their business or their career on something that isn’t at least a 1.0 product.
02.29: That really just leaves (by default) PyTorch. Obviously there’s been all of the momentum around PyTorch. There’s been all of the excitement around it. It’s interesting, though, that if you look at things like GitHub star history, it still lags behind both TensorFlow and JAX. But in perception it is the most popular. And unfortunately, if you do want to build a career now on creating machine learning models, not just using machine learning models, it’s really the—oh well, I shouldn’t say unfortunately. . . The truth is that it’s really the only option. So that’s the negative side.
The positive side of it is, of course, it’s really, really good. I’ve been using it extensively for some time. Even during my TensorFlow and JAX days, I did use PyTorch a lot. I wanted to keep an eye on how it was used, how it’s shaped, what worked, what didn’t, the best way for somebody to learn PyTorch—and to make sure that the TensorFlow community, as I was working on it, were able to keep up with the simplicity of PyTorch, particularly the brilliant work that was done by the Keras team to really make Keras part of TensorFlow. It’s now been kind of pulled aside, pulled out of TensorFlow somewhat, but that was something that leaned into the same simplicity as PyTorch.
03.52: And like I said, now going forward, PyTorch is. . . I rewrote my book to be PyTorch specific. Andrew and I are teaching a PyTorch specialization with DeepLearning.AI on Coursera. And you know, my emphasis is less on frameworks and framework wars and loyalties and stuff like that, and more on really wanting to help people succeed, to build careers or to build startups, that kind of thing. And this was the direction I think it should go in.
04.19: Now, maybe I’m wrong, but I think even about two years ago, maybe a little more than that, I was still hearing and seeing job posts around TensorFlow, primarily around people working in computer vision on edge devices. So is that still a place where you would run into TensorFlow users?
04.41: Absolutely, yes. Because of what was previously called TensorFlow Lite and is now called LiteRT as a runtime for models to be able to run on edge devices. I mean, that really was the only option until recently— just last week at the PyTorch Summit, ExecuTorch went 1.0. And if I go back to my old mantra of “I really don’t want anybody to invest their business or their career on something that’s prerelease,” it’s good to learn and it’s good to prepare.
05.10: [Back] then, the only option for you to be able to train models and deploy them, particularly to mobile devices, was effectively either LiteRT or TensorFlow Lite or whatever it’s called now, or Core ML for Apple devices. But now with ExecuTorch going 1.0, the whole market is out there for PyTorch developers to be able to deploy to mobile and edge devices.
05.34: So those job listings, I think as they evolve and as they go forward, the skills may kind of veer more towards PyTorch, but I’d also encourage everybody to kind of double click above the framework level and start thinking on the solution level. There’ve been a lot of framework wars in so many things, you know, Mac versus PC, .NET versus Java. And in some ways, that’s not the most productive way of thinking about things.
I think the best thing to do is [to] think about what’s out there to allow you to build a solution that you can deploy, that you can trust, and that will be there for some time. And let the framework be secondary to that.
06.14: All right. So one last framework question. And this is also an observation that might be slightly dated—I think this might be from around two years ago. I was actually surprised that, for some reason, I think the Chinese government is also encouraging Chinese companies to use local deep learning frameworks. So it’s not just PaddlePaddle. There’s another one that I came across and I don’t know what’s the status of that now, as far as you know. . .
06.43: So I’m not familiar with any others other than PaddlePaddle. But I do generally agree with [the idea that] cultures should be thinking about using tools and frameworks and models that are appropriate for their culture. I’m going to pivot away from frameworks towards large language models as an example.
Large language models are primarily built on English. And when you start peeling apart large language models and look at what’s underneath the hood and particularly how they tokenize words, it’s very, very English oriented. So if you start wanting to build solutions, for example, for things like education—you know, important things!—and you’re not primarily an English language-speaking country, you’re already a little bit behind the curve.
07.35: Actually, I just came from a meeting with some folks from Ireland. And for the Gaelic language, the whole idea of posttraining models that were trained primarily with English tokens is already putting you at a disadvantage if you’re trying to build stuff that you can use within your culture.
At the very least, missing tokens, right? There are subwords in Gaelic that don’t exist in English, or subwords in Japanese or Chinese or Korean or whatever that don’t exist in English. So if you start even trying to do posttraining, you realize that the model was trained using tokens that are. . . You need to use tokens that the model wasn’t trained with and stuff like that.
So I know I’m not really answering the framework part of it, but I do think it’s an important thing, like you mentioned, that China wants to invest in their own frameworks. But I think every culture should also be looking at. . . Cultural preservation is very, very important in the age of AI, as we build more dependence on AI.
08.37: When it comes to a framework, PyTorch is open source. TensorFlow is open source. I’m pretty sure PaddlePaddle is open source. I don’t know. I’m not really that familiar with it. So you don’t have the traps of being locked into somebody else’s cultural perspective or language or anything like that, that you would have with an obscure large language model if you’re using an open source framework. So that part isn’t as difficult when it comes to, like, a country wanting to adopt a framework. But certainly when it comes to building on top of pretrained models, that’s where you need to be careful.
09.11: So [for] most developers and most enterprise AI teams, the reality is they’re not going to be pretraining. So it’s mostly about posttraining, which is a big topic. It can run the gamut of RAG, fine-tuning, reinforcement learning, distillation, quantization. . . So from that perspective, Laurence, how much should someone who’s in an enterprise AI team really know about these deep learning frameworks?
09.42: So I think two different things there, right? One is posttraining and one is deep learning frameworks. I’m going to lean into the posttraining side to argue that that’s the single number one important skill for developers going forward: posttraining, in all of its forms.
10.01: Yeah, totally. There’s always trade-offs, right? There’s the very simple posttraining stuff like RAG, which is relatively low value, and then there’s the more complex stuff like a full retrain or a LoRA-type training, which is more expensive or more difficult but has higher value.
But I think there’s a whole spectrum of ways of doing things with posttraining. And my argument that I’m making very passionately is that if you’re a developer, that is the number one skill to learn going forward. “Agents” was kind of the buzzword of 2025; I think “small AI” will be the buzzword of 2026.
10.40: We often talk about open source AI with open source models and stuff like that. It’s not really open source. It’s a bit of a misnomer. The weights have been released for you to be able to use and self-host—if you want a self-hosted chatbot or something else that you want to run yourself.
But more importantly, the weights are there for you to change, through retraining, through fine-tuning and stuff like that. I’m particularly passionate about that because when you start thinking in terms of two things—latency and privacy—it becomes really, really important.
11.15: I spent a lot of time working with folks who are passionate about IP. I’ll share one of them: Hollywood movie studios. And we’ve probably all seen those semi-frivolous lawsuits of, person A makes a movie, and then person B sues person A because person B had the idea first. And movie studios are generally terrified of that kind of thing.
I actually have a movie in preproduction with a studio at the moment. So I’ve learned a lot through that. And one of the things [I learned] was, even when I speak with producers or the financiers, a lot of the time we talk on the phone. We don’t email or anything like that because the whole fear of IP leaks is out there. And that fear extends to all the things that an LLM could be used to [do]. The shallow stuff would be to help you write scenes and all that kind of stuff. But most of them don’t really care about that.
The more important things where an LLM could be used [are it could] evaluate a script and count the number of locations that would be needed to film this script. Like the Mission: Impossible script, where one scene’s in Paris and another scene’s in Moscow, and another scene is in Hong Kong. To be able to have a machine that can evaluate that and help you start budgeting. Or if somebody sends in a speculative script with all of that kind of stuff in it, and you realize you don’t have half a billion to make this movie from an unknown, because they have all these locations.
12.41: So all of this kind of analysis that can be done—story analysis, costing analysis, and all of that type of stuff—is really important to them. And it’s great low-hanging fruit for something like an LLM to do. But there’s no way they’re going to upload their speculative scripts to Gemini or OpenAI or Claude or anything like that.
So local AI is really important to them—and the whole privacy part of it. You run the model on the machine; you do the analysis on the machine; the data never leaves your laptop. And then extend that. I mean, not everybody’s going to be working with Hollywood studios, but extend that to just general small offices—your law office, your medical office, your physiotherapists, or whatever [where] everybody is using large language models for very creative things, but if you can make those models far more effective at your specific domain. . .
13.37: I’ll use a small office, for example, in a particular state in a particular jurisdiction, to be able to retrain a model, to be an expert in the law for that jurisdiction based on prior, what is it they call it? Jury priors? I can’t remember the Latin phrase for it, but, you know, based on precedents. To be able to fine-tune a model for that and then have everything locally within your office so you’re not sharing out to Claude or Gemini or OpenAI or whatever. Developers are going to be building that stuff.
14.11: And with a lot of fear, uncertainty, and doubt out there for developers around code generation, the optimist in me is seeing that [for] developers, your value bar is actually rising. If your value is just your ability to churn out code, now models can compete with you. But if you’re raising the value of yourself to being able to do things that are much higher value than just churning out code—and I think fine-tuning is a part of that—then that actually leads to a very bright future for developers.
14.43: So here’s my impression of the state of tooling for posttraining. So [with] RAG and different variants of RAG, it seems like people have enough tools or have tools or have some notion of how to get started. [For] fine-tuning, there’s a lot of services that you can use now, and it mainly comes down to collecting a fine-tuning dataset it seems like.
[For] reinforcement learning, we still need tools that are accessible. The workflow needs to be at a point where a domain expert can actually do it—and that’s in some ways kind of where we are in fine-tuning, so the domain expert can focus on the dataset. Reinforcement learning, not so much the case.
I don’t know, Laurence, if you would consider quantization and distillation part of posttraining, but it seems like that might also be something where people would also need more tools. More options. So what’s your sense of tooling for the different types of posttraining?
15.56: Good question. I’ll start with RAG because it’s the easiest. There’s obviously lots of tooling out there for it.
16.07: Yep. I think the thing with RAG that interests me and fascinates me the most is in some ways it shares [similarities] with the early days of actually doing machine learning with the likes of Keras or PyTorch or TensorFlow, where there’s a lot of trial and error. And, you know, the tools.
16.25: Yeah, there’s a lot of knobs that you can optimize. People underestimate how important that is, right?
16.35: Oh, absolutely. Even the most basic knob, like, How big a slice do you take of your text, and how big of an overlap do you do between those slices? Because you can have vastly different results by doing that.
16.51: So just as a quick recap from if anybody’s not familiar with RAG, I’d like to give one little example of it. I actually wrote a novel about 12, 13 years ago, and six months after the novel was published, the publisher went bust. And this novel is not in the training set of any LLM.
So if I go to an LLM like Claude or GPT or anything like that and I ask about the novel, it will usually either say it doesn’t know or it will hallucinate and it’ll make stuff up and say it knows it. So to me, this was the perfect thing for me to try RAG.
17.25: The idea with RAG is that I will take the text of the novel and I’ll chop it up into maybe 20-word increments, with five-word overlap—so the first 20 words of the book and then word 15 through 35 and then word 30 through 50 so you get those overlaps—and then store those into a vector database. And then when somebody wants to ask about something like maybe ask about a character in the novel, then the prompts will be vectorized, and the embeddings for that prompt can be compared with the embeddings of all of these chunks.
And then when similar chunks are found, like the name of the character and stuff like that, or if the prompt asks, “Tell me about her hometown,” then there may be a chunk in the book that says, “Her hometown is blah,” you know?
So they will then be retrieved from the database and added to the prompt, and then sent to something like GPT. So now GPT has much more context: not just the prompt but also all these extra bits that it retrieves from the book that says, “Hey, she’s from this town and she likes this food.” And while ChatGPT doesn’t know about the book, it does know about the town, and it does know about that food, and it can give a more intelligent answer.
18.34: So it’s not really a tuning of the model in any way or posttuning of the model, but it’s an interesting and really nice hack to allow you to get the model to be able to do more than you thought it could do.
But going back to the question about tooling, there’s a lot of trial and error there like “How do I tokenize the words? What kind of chunk size do I use?” And all of that kind of stuff. So anybody that can provide any kind of tooling in that space so that you can try multiple databases and compare them against each other, I think is really valuable and really, really important.
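To make the chunking and retrieval being described here concrete, the sketch below shows the basic loop: 20-word chunks with a 5-word overlap, embedded and ranked by cosine similarity against the question. It’s an illustration only; the sentence-transformers model named is an arbitrary choice, and any embedding model or hosted embeddings API would work.

```python
# Rough sketch of the chunk-embed-retrieve loop described above (illustrative only).
import numpy as np
from sentence_transformers import SentenceTransformer


def chunk_words(text: str, size: int = 20, overlap: int = 5) -> list[str]:
    # First 20 words, then words 15 through 35, then 30 through 50, and so on.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]


model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary embedding model


def top_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    chunk_vecs = model.encode(chunks, normalize_embeddings=True)
    query_vec = model.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec          # cosine similarity on normalized vectors
    best = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in best]

# The winning chunks get prepended to the prompt before it's sent to the LLM.
```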
19.05: If I go to the other end of the spectrum, then for actual real tuning of a model, I think LoRA tuning is a good example there. And tooling for that is hard to find. It’s few and far between.
19.20: I think actually there’s a lot of providers now where you can focus on your dataset and then. . . It’s a bit of a black box, obviously, because you’re relying on an API. I guess my point is that even if you’re [on] a team where you don’t have that expertise, you can get going. Whereas in reinforcement learning, there’s really not much tooling out there.
19.50: Certainly with reinforcement learning, you’ve got to kind of just crack open the APIs and start coding. It’s not as difficult as it sounds, once you start doing it.
20.00: There are people who are trying to build tools, but I haven’t seen one where you can just point the domain expert at it.
20.09: Totally. And I would also encourage [listeners that] if you’re doing any other stuff like LoRA tuning, it’s really not that difficult once you start looking. And PyTorch is great for this, and Python is great for this, once you start looking at how to do it. Shameless self-plug here, but [in] the final chapter of my PyTorch book, I actually give an example of LoRA tuning, where I created a dataset for a digital influencer and I show you how to retune and how to LoRA-tune the Stable Diffusion model to be a specialist in creating [images of] this one particular individual—just to show how to do all of that in code.
Because I’m always a believer that before I start using third-party tools to do a thing, I kind of want to look at the code and the frameworks and how to do that thing for myself. So then I can really understand the value that the tools are going to be giving me. So I tend to veer towards “Let me code it first before I care about the tools.”
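For a sense of how little ceremony the setup involves at the code level, here’s a minimal LoRA sketch using the Hugging Face peft library. It’s an illustration, not the book’s Stable Diffusion walkthrough: the base model and target modules below are placeholders, and a real run would add a training loop and a fine-tuning dataset.

```python
# Minimal LoRA setup sketch (illustrative; base model and modules are placeholders).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection; varies by model
    lora_dropout=0.05,
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the base model's weights

# From here, train as usual (e.g., with transformers' Trainer) on the fine-tuning dataset;
# only the LoRA adapter weights are updated, so the result is small and cheap to share.
```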
21.15: [laughs] I have to call out one tool that, while it’s not specifically for fine-tuning large language models, I hope they convert it for that. This one changed the game for me: Apple has a tool called Create ML, which was really used for transfer learning off of existing models—which is still posttraining, just not posttraining of LLMs.
And that tool’s ability to be able to take a dataset and then to fine-tune a model like a MobileNet or something, or an object detection model on that codelessly and efficiently blew my mind with how good it was. The world needs more tooling like that. And if there’s any Apple people listening, I’d encourage them to extend Create ML for large language models or for any other generative models.
22.00: By the way, I want to make sure, as we wind down, I ask you about edge—that’s what’s occupying you at the moment. You talk about this notion of “build once, deploy everywhere.” So what’s actually feasible today?
22.19: So what’s feasible today? I think the best multideployment surface today that I would invest in going forward is creating for ExecuTorch, because ExecuTorch runtime is going to be living in so many places.
At Arm, obviously we’ve been working very closely with ExecuTorch, and we are part of the ExecuTorch 1.0 release. But if you’re building for edge, you know, making sure that your models work on ExecuTorch would, I think, be the number one low-hanging fruit that I would say people should invest in. So that’s PyTorch’s model.
22.54: Does it really live up to the “run everywhere”?
23.02: [laughs] I guess, at the minimum, Android and iOS.
23.12: So yes, at a minimum, for those—the same as LiteRT or TensorFlow Lite from Google does. What I’m excited about with ExecuTorch is that it also runs in other physical AI areas. We are going to be seeing it in cars and robots and other things as well. And I anticipate that that ecosystem will spread a lot faster than the LiteRT one. So if you’re starting with Android and iOS, then you’re in good shape.
23.42: What about the kinds of devices that our mutual friend Pete Warden, for example, targets? The really compute-hungry [ones]? Well, not so much compute hungry, but basically not much compute.
24.05: They sip power rather than gulping it. I think that would be a better question for Pete than for me. If you see him, tell him I said hi.
24.13: I mean, is that something that the ExecuTorch community also kind of thinks about?
24.22: In short, yes. The longer answer is that it’s a bit more of a challenge to go on microcontrollers and the like. One of the things I’m really excited about, when you start getting down onto the small, is a technology called SME, which is Scalable Matrix Extensions. It’s something that Arm have been working on with various chip makers and handset makers, with the idea being that SME is all about being able to run AI workloads on the CPU, without needing a separate external accelerator. And then as a result, the CPU’s going to be drawing less battery, those kinds of things, etc.
That’s one of the growth areas that I’m excited about, where you’re going to see more and more AI workloads being able to run on handsets, particularly the diverse Android handsets, because the CPU is capable of running models instead of you needing to offload to a separate accelerator, be it an NPU or a TPU or GPU.
And the problem with the Android ecosystem is the sheer diversity makes it difficult for a developer to target any specific one. But if more and more workloads can actually move on to the CPU, and every device has a CPU, then the idea of being able to do more and more AI workloads through SME is going to be particularly exciting.
25.46: So actually, Laurence, for people who don’t work on edge deployments, give us a sense of how capable some of these small models are.
First I’ll throw out an unreasonable example: coding. So obviously, me and many people love all these coding tools like Claude Code, but sometimes it really consumes a lot of compute, gets expensive. And not only that, you end up getting somewhat dependent so that you have to always be connected to the cloud. So if you are on a plane, suddenly you’re not as productive anymore.
So I’m sure in coding it might not be feasible, but what are these language models or these foundation models capable of doing locally [on smartphones, for example] that people may not be aware of?
26.47: Okay, so let me kind of answer that in two different ways: [what] device foundation models are capable of that people may not be aware of [and] the overall on-device ecosystem and the kind of things you can do that people may not be aware of. And I’m going to start with the second one.
You mentioned China earlier on. Alipay is a company from China, and they’ve been working with the SME technology that I spoke about. They had an app (I’m sure we’ve all seen these kinds of apps) where you can take your vacation photographs and then search them for things, like “Show me all the pictures I took with a panda.”
And then you can create a slideshow or a subset of your folder with that. But when you build something like that, the AI required to be able to search images for a particular thing needs to live in the cloud because on-device just wasn’t capable of doing that type of image-based searching previously.
27.47: So then as a company, they had to stand up a cloud service to be able to do this. As a user, I had privacy and latency issues if I was using this: I have to share all of my photos with a third party and whatever I’m looking for in those photos I have to share with the third party.
And then of course, there’s the latency: I have to send the query. I have to have the query execute in the cloud. I have to have the results come back to my device and then be assembled on my device.
28.16: Now with on-device AI, thinking about it from both the user perspective and the app vendor perspective, it’s a better experience. I’ll start from the app vendor perspective: They don’t need to stand up this cloud service anymore, so they’re saving a lot of time and effort and money, because everything is moving on-device, with a model that’s capable of understanding images and their contents so that you can search them, executing completely on-device.
The user experience is also better. You say, “Show me all the pictures of pandas that I have,” and it’s able to search the device for those pictures: look through all the pictures on the device, get an embedding that represents the contents of each picture, match that embedding to the query that the user is doing, and then assemble those pictures. So you don’t have the latency, and you don’t have the privacy issues, and the vendor doesn’t have to stand up stuff.
29.11: So that’s the kind of area where I’m seeing great improvements, not just in user experience but also making it much cheaper and easier for somebody to build these applications—and all of that then stems from the capabilities of foundation models that are executing on the device, right? In this case, it’s a model that’s able to turn an image into a set of embeddings so that you can search those embeddings for matching things.
As a result, we’re seeing more and more on-device models, like Gemini Nano, like Apple Intelligence, becoming a foundational part of the operating system. Then, more and more, we’ll be able to see applications like these being made possible.
[Think of a small developer:] I can’t afford to stand up a cloud service. You know, it’s costing millions of dollars to be able to build an application for somebody, so I can’t do that. And how many small startups can’t do that? But then as it moves on-device, and you don’t need all of that, and it’s just going to be purely an on-device thing, then suddenly it becomes much more interesting. And I think there’ll be a lot more innovation happening in that space.
30.16: You mentioned Gemma. What are the key families of local foundation models?
30.27: Sure. So, there’s local foundation models, and then also embedded on-device models: Gemini Nano on Android and the Apple Intelligence models on Apple devices, as well as this ecosystem of smaller models that could work either on-device or on your desktop, like the Gemma family from Google. There’s the OpenAI gpt-oss, there’s the Qwen stuff from China, there’s Llama; you know, there’s a whole bunch of them out there.
I’ve recently been using the gpt-oss, which I find really good. And obviously I’m also a big fan of Gemma, but there’s lots of families out there—there’s so many new ones coming online every day, it seems. So there’s a lot of choice for those, but many of them are still too big to work on a mobile device.
31.15: You brought up quantization earlier on. And that’s where quantization will have to come into play, at least in some cases. But I think for the most part, if you look at where the vectors are trending, the smaller models are getting smarter. So what the 7 billion-parameter model can do today you needed 100 billion parameters to do two years ago.
And you keep projecting that forward, like the 1 billion-parameter model’s kind of [going to] be able to do the same thing in a year or two’s time, and then it becomes relatively trivial to put them onto a mobile device if they’re not part of the core operating system, but for them to be something that you ship along with your application.
I can see more and more of that happening where third-party models being small enough to work on mobile devices will become the next wave of what I’ve been calling small AI, not just on mobile but also on desktop and elsewhere.
32.13: So in closing, Laurence, for our listeners who are already familiar and may already be building AI applications for cloud or enterprise, this conversation may prompt them to start checking out edge and local applications.
Besides your book and your blog, what are some of the key resources? Are there specific conferences where a lot of these local AI edge AI people gather, for example?
32.48: So local AI, not yet. I think that wave is only just beginning. Obviously things like the Meta conferences will talk a lot about Llama, and Google conferences will talk a lot about Gemma, but as for an independent conference for just general local AI as a whole, I think that wave is only just beginning.
Mobile is very vendor specific or [focused on] the ecosystem of a vendor. Apple obviously have their WWDC, Google have their conferences, but there’s also the independent conference called droidcon, which I find really, really good for understanding mobile and understanding AI on mobile, particularly for the Android ecosystem.
But as for an overall conference for small AI and for the ideas of fine-tuning, all of the types of posttuning of small AI that can be done, that’s a growth area. I would say for posttraining, there’s a really excellent Coursera course that a friend of mine, Sharon Zhou, just released. It just came out last week or the week before. That’s an excellent course in all of the ins and outs of posttraining and fine-tuning. But, yeah, I think it’s a great growth area.
34.08: And for those of us who are iPhone users. . . I keep waiting for Apple Intelligence to really up its game. It seems like it’s getting close. They have multiple initiatives in the works. They have alliances with OpenAI and now with Google. But then apparently they’re also working on their own model. So any inside scoop? [laughs]
34.33: Well, no inside scoop because I don’t work at Apple or anything like that, but I’ve been using Apple Intelligence quite a lot, and I’m a big fan. The ability to have the on-device large language model is really powerful. There’s a lot of scenarios I’ve been kind of poking around with and helping some startups with in that space.
The one thing that I would say that’s a big gotcha for developers to look out for is the very small context window. It’s only 8K, so if you try to do any kind of long-running stuff or anything interesting like that, you’ve got to go off-device. Apple have obviously been investing in this private cloud so that your sessions, when they go off-device into the cloud. . . At least they try to solve the privacy part of it. They’re getting ahead of the privacy [issue] better than anybody else, I think.
But latency is still there. And I think that deal with Google to provide Gemini services that was announced a couple of days ago is more on that cloud side of things and less on the on-device.
35.42: But going back to what I was saying earlier on, the 7 billion-parameter model of today is as good as the 120 billion of yesterday. The 1 billion-parameter [model] of next year is probably as good as that, if not better. So, as models with smaller parameter counts, and therefore smaller memory footprints, are becoming much more effective, I can see more of them being delivered on-device as part of the operating system, in the same way as Apple Intelligence are doing it. But hopefully with a bigger context window, because they can afford it with the smaller model.
36.14: And to clarify, Laurence, that trend that you just pointed out, the increasing capability of the smaller models, that holds not just for LLMs but also for multimodal?
The following originally appeared on Asimov’s Addendum and is being republished here with the author’s permission.
The other day, I was looking for parking information at Dulles International Airport, and was delighted with the conciseness and accuracy of Google’s AI overview. It was much more convenient than being told that the information could be found at the flydulles.com website, visiting it, perhaps landing on the wrong page, and finding the information I needed after a few clicks. It’s also a win from the provider side. Dulles isn’t trying to monetize its website (except to the extent that it helps people choose to fly from there). The website is purely an information utility, and if AI makes it easier for people to find the right information, everyone is happy.
An AI overview of an answer found by consulting or training on Wikipedia is more problematic. The AI answer may lack some of the nuance and neutrality Wikipedia strives for. And while Wikipedia does make the information free for all, it depends on visitors not only for donations but also for the engagement that might lead people to become Wikipedia contributors or editors. The same may be true of other information utilities like GitHub and YouTube. Individual creators are incentivized to provide useful content by the traffic that YouTube directs to them and monetizes on their behalf.
And of course, an AI answer provided by illicitly crawling content that’s behind a subscription paywall is the source of a great deal of contention, even lawsuits. So content runs the gamut from “no problem crawling” to “do not crawl.”
There are a lot of efforts to stop unwanted crawling, including Really Simple Licensing (RSL) and Cloudflare’s Pay Per Crawl. But we need a more systemic solution. Both of these approaches put the burden of expressing intent onto the creator of the content. It’s as if every school had to put up its own traffic signs saying “School Zone: Speed Limit 15 mph.” Even making “Do Not Crawl” the default puts a burden on content providers, since they must now affirmatively figure out what content to exclude from the default in order to be visible to AI.
Why aren’t we putting more of the burden on AI companies instead of putting all of it on the content providers? What if we asked companies deploying crawlers to observe common sense distinctions such as those that I suggested above? Most drivers know not to tear through city streets at highway speeds even without speed signs. Alert drivers take care around children even without warning signs. There are some norms that are self-enforcing. Drive at high speed down the wrong side of the road and you will soon discover why it’s best to observe the national norm. But most norms aren’t that way. They work when there’s consensus and social pressure, which we don’t yet have in AI. And only when that doesn’t work do we rely on the safety net of laws and their enforcement.
As Larry Lessig pointed out at the beginning of the Internet era, starting with his book Code and Other Laws of Cyberspace, governance is the result of four forces: law, norms, markets, and architecture (which can refer either to physical or technical constraints).
So much of the thinking about the problems of AI seems to start with laws and regulations. What if instead, we started with an inquiry about what norms should be established? Rather than asking ourselves what should be legal, what if we asked ourselves what should be normal? What architecture would support those norms? And how might they enable a market, with laws and regulations mostly needed to restrain bad actors, rather than preemptively limiting those who are trying to do the right thing?
I think often of a quote from the Chinese philosopher Lao Tzu, who said something like:
Losing the way of life, men rely on goodness. Losing goodness, they rely on laws.
I like to think that “the way of life” is not just a metaphor for a state of spiritual alignment, but rather, an alignment with what works. I first thought about this back in the late ’90s as part of my open source advocacy. The Free Software Foundation started with a moral argument, which it tried to encode into a strong license (a kind of law) that mandated the availability of source code. Meanwhile, other projects like BSD and the X Window System relied on goodness, using a much weaker license that asked only for recognition of those who created the original code. But “the way of life” for open source was in its architecture.
Both Unix (the progenitor of Linux) and the World Wide Web have what I call an architecture of participation. They were made up of small pieces loosely joined by a communications protocol that allowed anyone to bring something to the table as long as they followed a few simple rules. Systems that were open source by license but had a monolithic architecture tended to fail despite their license and the availability of source code. Those with the right cooperative architecture (like Unix) flourished even under AT&T’s proprietary license, as long as it was loosely enforced. The right architecture enables a market with low barriers to entry, which also means low barriers to innovation, with flourishing widely distributed.
Architectures based on communication protocols tend to go hand in hand with self-enforcing norms, like driving on the same side of the street. The system literally doesn’t work unless you follow the rules. A protocol embodies both a set of self-enforcing norms and “code” as a kind of law.
What about markets? In a lot of ways, what we mean by “free markets” is not that they are free of government intervention. It is that they are free of the economic rents that accrue to some parties because of outsized market power, position, or entitlements bestowed on them by unfair laws and regulations. This is not only a more efficient market, but one that lowers the barriers for new entrants, typically making more room not only for widespread participation and shared prosperity but also for innovation.
Markets don’t exist in a vacuum. They are mediated by institutions. And when institutions change, markets change.
Consider the history of the early web. Free and open source web browsers, web servers, and a standardized protocol made it possible for anyone to build a website. There was a period of rapid experimentation, which led to the development of a number of successful business models: free content subsidized by advertising, subscription services, and ecommerce.
Nonetheless, the success of the open architecture of the web eventually led to a system of attention gatekeepers, notably Google, Amazon, and Meta. Each of them rose to prominence because it solved for what Herbert Simon called the scarcity of attention. Information had become so abundant that it defied manual curation. Instead, powerful, proprietary algorithmic systems were needed to match users with the answers, news, entertainment, products, applications, and services they seek. In short, the great internet gatekeepers each developed a proprietary algorithmic invisible hand to manage an information market. These companies became the institutions through which the market operates.
They initially succeeded because they followed “the way of life.” Consider Google. Its success began with insights about what made an authoritative site, understanding that every link to a site was a kind of vote, and that links from sites that were themselves authoritative should count more than others. Over time, the company found more and more factors that helped it to refine results so that those that appeared highest in the search results were in fact what their users thought were the best. Not only that, the people at Google thought hard about how to make advertising that worked as a complement to organic search, popularizing “pay per click” rather than “pay per view” advertising and refining its ad auction technology such that advertisers only paid for results, and users were more likely to see ads that they were actually interested in. This was a virtuous circle that made everyone—users, information providers, and Google itself—better off. In short, enabling an architecture of participation and a robust market is in everyone’s interest.
Amazon too enabled both sides of the market, creating value not only for its customers but for its suppliers. Jeff Bezos explicitly described the company strategy as the development of a flywheel: helping customers find the best products at the lowest price draws more customers, more customers draw more suppliers and more products, and that in turn draws in more customers.
Both Google and Amazon made the markets they participated in more efficient. Over time, though, they “enshittified” their services for their own benefit. That is, rather than continuing to make solving the problem of efficiently allocating the user’s scarce attention their primary goal, they began to manipulate user attention for their own benefit. Rather than giving users what they wanted, they looked to increase engagement, or showed results that were more profitable for them even though they might be worse for the user. For example, Google took control over more and more of the ad exchange technology and began to direct the most profitable advertising to its own sites and services, which increasingly competed with the web sites that it originally had helped users to find. Amazon supplanted the primacy of its organic search results with advertising, vastly increasing its own profits while the added cost of advertising gave suppliers the choice of reducing their own profits or increasing their prices. Our research in the Algorithmic Rents project at UCL found that Amazon’s top advertising recommendations are not only ranked far lower by its organic search algorithm, which looks for the best match to the user query, but are also significantly more expensive.
As I described in “Rising Tide Rents and Robber Baron Rents,” this process of replacing what is best for the user with what is best for the company is driven by the need to keep profits rising when the market for a company’s once-novel services stops growing and starts to flatten out. In economist Joseph Schumpeter’s theory, innovators can earn outsized profits as long as their innovations keep them ahead of the competition, but eventually these “Schumpeterian rents” get competed away through the diffusion of knowledge. In practice, though, if innovators get big enough, they can use their power and position to profit from more traditional extractive rents. Unfortunately, while this may deliver short term results, it ends up weakening not only the company but the market it controls, opening the door to new competitors at the same time as it breaks the virtuous circle in which not just attention but revenue and profits flow through the market as a whole.
Unfortunately, in many ways, because of its insatiable demand for capital and the lack of a viable business model to fuel its scaling, the AI industry has gone in hot pursuit of extractive economic rents right from the outset. Seeking unfettered access to content, unrestrained by laws or norms, model developers have ridden roughshod over the rights of content creators, training not only on freely available content but also ignoring good-faith signals like subscription paywalls, robots.txt, and “do not crawl.” During inference, they exploit loopholes such as the fact that a paywall that comes up for users on a human timeframe briefly leaves content exposed long enough for bots to retrieve it. As a result, the market they have enabled is one of third-party black- and gray-market crawlers that give them plausible deniability about the sources of their training or inference data, rather than the far more sustainable market that would come from discovering “the way of life” that balances the incentives of human creators and AI derivatives.
Here are some broad-brush norms that AI companies could follow, if they understand the need to support and create a participatory content economy.
For any query, use the intelligence of your AI to judge whether the information being sought is likely to come from a single canonical source, or from multiple competing sources. For example, for my query about parking at Dulles Airport, it’s pretty likely that flydulles.com is a canonical source. Note however, that there may be alternative providers, such as additional off-airport parking, and if so, include them in the list of sources to consult.
Check for a subscription paywall, licensing technologies like RSL, and “do not crawl” or other indications in robots.txt; if any of these exists, respect it. (A minimal robots.txt check is sketched after this list.)
Ask yourself if you are substituting for a unique source of information. If so, responses should be context-dependent. For example, for long form articles, provide basic info but make clear there’s more depth at the source. For quick facts (hours of operation, basic specs), provide the answer directly with attribution. The principle is that the AI’s response shouldn’t substitute for experiences where engagement is part of the value. This is an area that really does call for nuance, though. For example, there is a lot of low quality how-to information online that buries useful answers in unnecessary material just to provide additional surface area for advertising, or provides poor answers based on pay-for-placement. An AI summary can short-circuit that cruft. Much as Google’s early search breakthroughs required winnowing the wheat from the chaff, AI overviews can bring a search engine such as Google back to being as useful as it was in 2010, pre-enshittification.
If the site has high quality data that you want to train on or use for inference, pay the provider, not a black market scraper. If you can’t come to mutually agreed-on terms, don’t take it. This should be a fair market exchange, not a colonialist resource grab. AI companies pay for power and the latest chips without looking for black market alternatives. Why is it so hard to understand the need to pay fairly for content, which is an equally critical input?
Check whether the site is an aggregator of some kind. This can be inferred from the number of pages. A typical informational site such as a corporate or government website whose purpose is to provide public information about its products or services will have a much smaller footprint than an aggregator such as Wikipedia, GitHub, TripAdvisor, Goodreads, YouTube, or a social network. There are probably lots of other signals an AI could be trained to use. Recognize that competing directly with an aggregator using content scraped from that platform is unfair competition. Either come to a license agreement with the platform, or compete fairly without using its content to do so.
If it is a community-driven platform such as Wikipedia or Stack Overflow, recognize that your AI answers might reduce contribution incentives, so in addition, support the contribution ecosystem. Provide revenue sharing, fund contribution programs, and provide prominent links that might convert some users into contributors. Make it easy to “see the discussion” or “view edit history” for queries where that context matters.
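As a minimal illustration of the robots.txt part of these norms (a sketch, not a description of how any particular AI crawler actually works), here’s what the check might look like using Python’s standard library. The crawler name and URLs are hypothetical placeholders, and a real system would also check for paywalls and licensing signals such as RSL before fetching anything:

```python
from urllib import robotparser

# Hypothetical crawler identity; a well-behaved crawler announces who it is.
CRAWLER_USER_AGENT = "ExampleAICrawler"

def may_crawl(page_url: str, robots_url: str) -> bool:
    """Return True if the site's robots.txt permits this user agent to fetch the page."""
    parser = robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # fetch and parse robots.txt
    return parser.can_fetch(CRAWLER_USER_AGENT, page_url)

if __name__ == "__main__":
    page = "https://www.example.com/some/article"
    if may_crawl(page, "https://www.example.com/robots.txt"):
        print("Allowed by robots.txt; still check paywalls and licensing terms before crawling.")
    else:
        print("Disallowed by robots.txt; skip this page.")
```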
As a concrete example, let’s imagine how an AI might treat content from Wikipedia:
Direct factual query (“When did the Battle of Hastings occur?”): 1066. No link needed, because this is common knowledge available from many sites.
More complex query for which Wikipedia is the primary source (“What led up to the Battle of Hastings?”): “According to Wikipedia, the Battle of Hastings was caused by a succession crisis after the death of King Edward the Confessor in January 1066, who died without a clear heir. [Link]”
Complex/contested topic: “Wikipedia’s article on [X] covers [key points]. Given the complexity and ongoing debate, you may want to read the full article and its sources: [link]”
For rapidly evolving topics: Note Wikipedia’s last update and link for current information.
Similar principles would apply to other aggregators. GitHub code snippets should link back to their repositories, and YouTube queries should direct users to videos, not just summarize them.
These examples are not market-tested, but they do suggest directions that could be explored if AI companies took the same pains to build a sustainable economy that they do to reduce bias and hallucination in their models. What if we had a sustainable business model benchmark that AI companies competed on just as they do on other measures of quality?
Finding a business model that compensates the creators of content is not just a moral imperative, it’s a business imperative. Economies flourish better through exchange than extraction. AI has not yet found true product-market fit. That doesn’t just require users to love your product (and yes, people do love AI chat). It requires the development of business models that create a rising tide for everyone.
Many advocate for regulation; we advocate for self-regulation. This starts with an understanding by the leading AI platforms that their job is not just to delight their users but to enable a market. They have to remember that they are not just building products, but institutions that will enable new markets and that they themselves are in the best position to establish the norms that will create flourishing AI markets. So far, they have treated the suppliers of the raw materials of their intelligence as a resource to be exploited rather than cultivated. The search for sustainable win-win business models should be as urgent to them as the search for the next breakthrough in AI performance.
The following article originally appeared on Medium and is being republished here with the author’s permission.
Early on, I caught myself saying “you” to my AI tools—“Can you add retries?” “Great idea!”—like I was talking to a junior dev. And then I’d get mad when it didn’t “understand” me.
That’s on me. These models aren’t people. An AI model doesn’t understand. It generates, and it follows patterns. But the keyword here is “it.”
The Illusion of Understanding
It feels like there’s a mind on the other side because the output is fluent and polite. It says things like “Great idea!” and “I recommend…” as if it weighed options and judged your plan. It didn’t. The model doesn’t have opinions. It recognized patterns from training data and your prompt, then synthesized the next token.
That doesn’t make the tool useless. It means you are the one doing the understanding. The model is clever, fast, and often correct, but it can also be wildly wrong in ways that will confound you. What’s important to understand is that when this happens, it’s usually your fault for not giving it enough context.
Here’s an example of naive pattern following:
A friend asked his model to scaffold a project. It spit out a block comment that literally said “This is authored by <Random Name>.” He Googled the name. It was someone’s public snippet that the model had basically learned as a pattern—including the “authored by” comment—and parroted back into a new file. Not malicious. Just mechanical. It didn’t “know” that adding a fake author attribution was absurd.
Build Trust Before Code
The first mistake most folks make is overtrust. The second is lazy prompting. The fix for both is the same: Be precise about inputs, and validate the assumptions you are throwing at the model.
Spell out context, constraints, directory boundaries, and success criteria.
Require diffs. Run tests. Ask it to second-guess your assumptions.
Make it restate your problem, and require it to ask for confirmation.
Before you throw a $500/hour problem at a set of parallel model executions, do your own homework to make sure that you’ve communicated all of your assumptions and that the model has understood what your criteria are for success.
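Here’s one way to make that concrete: a sketch of a “context contract” you might hand a coding agent before it touches anything. The paths, commands, and criteria are hypothetical placeholders; the point is that assumptions get written down rather than left for the model to guess:

```python
# A sketch of a "context contract" for a coding agent. Everything here
# (paths, commands, criteria) is hypothetical; the point is to make
# assumptions explicit instead of expecting the model to infer them.
TASK_BRIEF = {
    "context": "Payment service in services/payments; Python 3.11; uses requests for HTTP calls.",
    "task": "Add retry with exponential backoff to outbound calls in services/payments/client.py.",
    "constraints": [
        "Do not modify anything outside services/payments/.",
        "Do not touch database migrations or schemas.",
        "Return changes as a unified diff, not rewritten files.",
    ],
    "success_criteria": [
        "All existing tests pass: pytest services/payments/tests",
        "New tests cover the retry path and the give-up path.",
    ],
    "before_coding": "Restate the task in your own words, list your assumptions, and wait for confirmation.",
}

def render_prompt(brief: dict) -> str:
    """Flatten the brief into the prompt the agent receives verbatim."""
    lines = [f"CONTEXT: {brief['context']}", f"TASK: {brief['task']}", "CONSTRAINTS:"]
    lines += [f"- {c}" for c in brief["constraints"]]
    lines.append("SUCCESS CRITERIA:")
    lines += [f"- {s}" for s in brief["success_criteria"]]
    lines.append(f"BEFORE CODING: {brief['before_coding']}")
    return "\n".join(lines)

print(render_prompt(TASK_BRIEF))
```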
Failure? Look Within
I continue to fall into this trap when I ask this tool to take on too much complexity without giving it enough context. And when it fails, I’ll type things like, “You’ve got to be kidding me? Why did you…”
Just remember, there is no “you” here other than yourself.
It doesn’t share your assumptions. If you didn’t tell it not to update the database, and it wrote an idiotic migration, you did that by failing to spell out that the tool should leave the database alone.
It didn’t read your mind about the scope. If you don’t lock it to a folder, it will “helpfully” refactor the world. If it tries to remove your home directory to be helpful? That’s on you.
It wasn’t trained on only “good” code. A lot of code on the internet… is not great. Your job is to specify constraints and success criteria.
The Mental Model I Use
Treat the model like a compiler for instructions. Garbage in, garbage out. Assume it’s smart about patterns, not about your domain. Make it prove correctness with tests, invariants, and constraints.
It’s not a person. That’s not an insult. It’s your advantage. If you stop expecting human-level judgment and start supplying machine-level clarity, your results will jump. But don’t let sycophantic agreement lull you into thinking that you have a pair programmer next to you.
We’ve been bombarded with claims about how much generative AI improves software developer productivity: It turns regular programmers into 10x programmers, and 10x programmers into 100x. And even more recently, we’ve been (somewhat less, but still) bombarded with the other side of the story: METR reports that, despite software developers’ belief that their productivity has increased, total end-to-end throughput has declined with AI assistance. We also saw hints of that in last year’s DORA report, which showed that release cadence actually slowed slightly when AI came into the picture. This year’s report reverses that trend.
I want to get a couple of assumptions out of the way first:
I don’t believe in 10x programmers. I’ve known people who thought they were 10x programmers, but their primary skill was convincing other team members that the rest of the team was responsible for their bugs. 2x, 3x? That’s real. We aren’t all the same, and our skills vary. But 10x? No.
There are a lot of methodological problems with the METR report—they’ve been widely discussed. I don’t believe that means we can ignore their result; end-to-end throughput on a software product is very difficult to measure.
As I (and many others) have written, actually writing code is only about 20% of a software developer’s job. So if you optimize that away completely—perfect, secure code, first time—you only achieve a 20% speedup. (Yeah, I know, it’s unclear whether or not “debugging” is included in that 20%. Omitting it is nonsense—but if you assume that debugging adds another 10%–20% and recognize that AI-generated code produces plenty of bugs of its own, you’re back in the same place.) That’s a consequence of Amdahl’s law, if you want a fancy name, but it’s really just simple arithmetic.
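For readers who want that arithmetic spelled out, here’s a quick back-of-the-envelope sketch of Amdahl’s law applied to the 20% figure (the numbers are illustrative, not measurements):

```python
def amdahl_speedup(fraction_optimized: float, factor: float) -> float:
    """Amdahl's law: overall speedup when `fraction_optimized` of the work
    becomes `factor` times faster and everything else stays the same."""
    return 1.0 / ((1.0 - fraction_optimized) + fraction_optimized / factor)

coding_share = 0.20  # writing code is roughly 20% of the job
# ~1.25x: even instant code generation removes only ~20% of total time
print(amdahl_speedup(coding_share, float("inf")))
# ~1.11x: merely doubling coding speed saves about 10%
print(amdahl_speedup(coding_share, 2.0))
```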
Amdahl’s law becomes a lot more interesting if you look at the other side of performance. I worked at a high-performance computing startup in the late 1980s that did exactly this: It tried to optimize the 80% of a program that wasn’t easily vectorizable. And while Multiflow Computer failed in 1990, our very-long-instruction-word (VLIW) architecture was the basis for many of the high-performance chips that came afterward: chips that could execute many instructions per cycle, with reordered execution flows and branch prediction (speculative execution) for commonly used paths.
I want to apply the same kind of thinking to software development in the age of AI. Code generation seems like low-hanging fruit, though the voices of AI skeptics are rising. But what about the other 80%? What can AI do to optimize the rest of the job? That’s where the opportunity really lies.
Angie Jones’s talk at AI Codecon: Coding for the Agentic World takes exactly this approach. Angie notes that code generation isn’t changing how quickly we ship because it addresses only one part of the software development lifecycle (SDLC), not the whole. That “other 80%” involves writing documentation, handling pull requests (PRs), and running the continuous integration (CI) pipeline. In addition, she realizes that code generation is a one-person job (maybe two, if you’re pairing); coding is essentially solo work. Getting AI to assist the rest of the SDLC requires involving the rest of the team. In this context, she states the 1/9/90 rule: 1% are leaders who will experiment aggressively with AI and build new tools; 9% are early adopters; and 90% are “wait and see.” If AI is going to speed up releases, the 90% will need to adopt it; if it’s only the 1%, a PR here and there will be managed faster, but there won’t be substantial changes.
Angie takes the next step: She spends the rest of the talk going into some of the tools she and her team have built to take AI out of the IDE and into the rest of the process. I won’t spoil her talk, but she discusses three stages of readiness for the AI:
AI-curious: The agent is discoverable, can answer questions, but can’t modify anything.
AI-ready: The AI is starting to make contributions, but they’re only suggestions.
AI-embedded: The AI is fully plugged into the system, another member of the team.
This progression lets team members check AI out and gradually build confidence—as the AI developers themselves build confidence in what they can allow the AI to do.
Do Angie’s ideas take us all the way? Is this what we need to see significant increases in shipping velocity? It’s a very good start, but there’s another issue that’s even bigger. A company isn’t just a set of software development teams. It includes sales, marketing, finance, manufacturing, the rest of IT, and a lot more. There’s an old saying that you can’t move faster than the company. Speed up one function, like software development, without speeding up the rest and you haven’t accomplished much. A product that marketing isn’t ready to sell or that the sales group doesn’t yet understand doesn’t help.
That’s the next question we have to answer. We haven’t yet sped up real end-to-end software development, but we can. Can we speed up the rest of the company? MIT’s widely cited report claimed that 95% of enterprise AI pilots failed. The researchers theorized that this was in part because most projects targeted customer service, while back-office work was more amenable to AI in its current form. That’s true—but there’s still the issue of “the rest.” Does it make sense to use AI to generate business plans, manage supply chains, and the like if all it will do is reveal the next bottleneck?
Of course it does. This may be the best way of finding out where the bottlenecks are: in practice, when they become bottlenecks. There’s a reason Donald Knuth said that premature optimization is the root of all evil—and that doesn’t apply only to software development. If we really want to see improvements in productivity through AI, we have to look company-wide.
We’ve been watching enterprises struggle with the same customer service paradox for years: They have all the technology in the world, yet a simple address change still takes three days. The problem isn’t what you think—and neither is the solution.
Last month, I watched a colleague try to update their address with their bank. It should have been simple: log in, change the address, done. Instead, they spent 47 minutes on hold, got transferred three times, and were told the change would take “3–5 business days to process.” This is 2025. We have AI that can write poetry and solve complex math problems, yet we can’t update an address field in real time.
This isn’t a story about incompetent banks or outdated technology. It’s a story about something more fundamental: the hidden mathematics of enterprise friction.
The Invisible Math That’s Killing Customer Experience
Every enterprise process has two numbers that matter: T and n.
“T” is the theoretical time it should take to complete a task—the perfect-world scenario where everything works smoothly. For an address change, T might be 30 seconds: verify identity, update database, confirm change.
“n” is everything else. The waiting. The handoffs. The compliance checks. The system incompatibilities. The human bottlenecks. “n” is why that 30-second task becomes a 47-minute ordeal.
According to Forrester, 77% of customers say that valuing their time is the most important thing a company can provide. Aberdeen Group found that companies with excellent service achieve 92% customer retention compared to just 33% for poor performers. Yet most enterprises are still optimizing for compliance and risk mitigation, not customer time.
The result? A massive “T+n” problem that’s hiding in plain sight across every industry.
Why Everything We’ve Tried Has Failed
We’ve seen enterprises throw millions at this problem. Better training programs. Process reengineering initiatives. Shiny new CRM systems. Digital transformation consultants promising to “reimagine the customer journey.” These efforts typically yield 10%–15% improvements—meaningful but not transformative. The problem is architectural. Enterprise processes weren’t designed for speed; they were designed for control.
Consider that address change again. In the real world, it involves:
Identity verification across multiple systems that don’t talk to each other
Compliance flagging for anti-money-laundering rules
Risk assessment for fraud prevention
Routing to specialized teams based on account type
Manual approval for any exceptions
Updating downstream systems in sequence
Creating audit trails for regulatory requirements
Each step adds time. More importantly, each step adds variability—the unpredictable delays that turn a simple request into a multiday saga.
When AI Agents Actually Work
We’ve been experimenting with agentic AI implementations across several enterprise pilots, and we are starting to see something different. Not the usual marginal improvements but a genuine transformation of the customer experience.
The key insight is that intelligent agents don’t just automate tasks—they orchestrate entire processes across the three dimensions where latency accumulates.
People problems: Human agents aren’t available 24-7. They have specialized skills that create bottlenecks. They need training time and coffee breaks. Intelligent agents can handle routine requests around the clock, escalating only genuine edge cases that require human judgment. One financial services company we worked with deployed agents for card replacements. Standard requests that used to take 48 hours now complete in under 10 minutes. The customer types out their request, the agent verifies their identity, checks for fraud flags, orders the replacement, and confirms delivery—all without human intervention.
Process problems: Enterprise workflows are designed as sequential approval chains. Request goes to analyst, analyst checks compliance, compliance routes to specialist, specialist approves, approval goes to fulfillment. Each handoff adds latency. Intelligent agents can prevalidate actions against encoded business rules and trigger only essential human approvals. Instead of six sequential steps, you get one agent evaluation with human oversight only for genuine exceptions.
Technology problems: The average enterprise runs customer data across 12–15 different systems. These systems don’t integrate well, creating data inconsistencies and manual reconciliation work. Instead of requiring expensive system replacements, agents can orchestrate existing systems through APIs and, where APIs don’t exist, use robotic process automation to interact with legacy screens. They maintain a unified view of customer state across all platforms.
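To make the pattern concrete, here’s a minimal sketch of what that orchestration might look like for the address-change scenario. The base URL, endpoints, and payload fields are hypothetical placeholders, and a real deployment would sit behind the guardrails discussed below:

```python
import requests  # assumes internal REST endpoints exist; all URLs below are hypothetical

BASE = "https://internal.example-bank.com/api"

def change_address(customer_id: str, new_address: dict, session_token: str) -> dict:
    """Sketch of an agent orchestrating an address change across systems of record."""
    # 1. Verify identity (people problem: no hold queue, no transfers).
    identity = requests.post(f"{BASE}/identity/verify",
                             json={"customer_id": customer_id, "token": session_token}).json()
    if not identity.get("verified"):
        return {"status": "escalate", "reason": "identity check failed"}

    # 2. Prevalidate against encoded business rules (process problem: no sequential approval chain).
    checks = requests.post(f"{BASE}/risk/prevalidate",
                           json={"customer_id": customer_id, "change": new_address}).json()
    if checks.get("requires_human_review"):
        return {"status": "pending", "reason": checks.get("reason", "exception flagged")}

    # 3. Update each system of record through its API (technology problem: one unified view).
    for system in ("core-banking", "cards", "statements"):
        requests.put(f"{BASE}/{system}/customers/{customer_id}/address",
                     json=new_address).raise_for_status()

    # 4. Confirm and leave an audit trail.
    requests.post(f"{BASE}/audit/events",
                  json={"customer_id": customer_id, "event": "address_change", "detail": new_address})
    return {"status": "complete"}
```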
The AI Triangle: Why You Can’t Optimize Everything
But here’s where it gets interesting—and where most implementations fail.
Through our pilots and their outcomes, we discovered what we call the AI Triangle: three properties that every agentic AI system must balance. Similar to the CAP theorem in distributed systems (where you can’t have perfect consistency, availability, and partition tolerance simultaneously), the AI Triangle forces a trade-off among autonomy, interpretability, and connectivity. Just as the CAP theorem shapes how we build resilient distributed systems, the AI Triangle shapes how we build trustworthy autonomous agents. You can optimize any two of these properties, but only by compromising the third. It’s a “pick two of three” situation:
Autonomy: How independently and quickly agents can act without human oversight
Interpretability: How explainable and audit-friendly the agent’s decisions are
Connectivity: How well the system maintains real-time, consistent data across all platforms
You can pick any two, but the third suffers:
Autonomy + interpretability: Agents make fast, explainable decisions but may not maintain perfect data consistency across all systems in real time.
Interpretability + connectivity: Full audit trails and perfect data sync, but human oversight slows everything down.
Autonomy + connectivity: Lightning-fast decisions with perfect system synchronization, but the audit trails might not capture the detailed reasoning compliance requires.
This isn’t a technology limitation; it’s a fundamental constraint that forces deliberate design choices. The enterprises succeeding with agentic AI are those that consciously choose the trade-offs that align with their business priorities. That choice is a business strategy, not merely a technical decision: pick the two properties that matter most to your customers and regulators, then build everything else around that choice.
The Hidden Costs Nobody Mentions
The vendor demos make this look effortless. Reality is messier.
Data quality is make-or-break: Agents acting on inconsistent data don’t just make mistakes—they make mistakes at scale and speed. Worse, AI errors have a different signature than human ones. A human might transpose two digits in an account number or skip a required field. An AI might confidently route all Michigan addresses to Missouri because both start with “Mi,” or interpret every instance of “Dr.” in street addresses as “doctor” instead of “drive,” creating addresses that don’t exist. These aren’t careless mistakes—they’re systematic misinterpretations that can cascade through thousands of transactions before anyone notices the pattern. Before deploying any autonomous system, you need to master data management, establish real-time validation rules, and build anomaly detection specifically tuned to catch AI’s peculiar failure modes. This isn’t glamorous work, but it’s what separates successful implementations from expensive disasters. (A sketch of one such validation check appears after this list.)
Integration brittleness: When agents can’t use APIs, they fall back to robotic process automation to interact with legacy systems. These integrations break whenever the underlying systems change. You need robust integration architecture and event-driven data flows.
Governance gets complex: Autonomous decisions create new risks. You need policy-based access controls, human checkpoints for high-impact actions, and continuous monitoring. The governance overhead is real and ongoing.
Change management is crucial: We’ve seen technically perfect implementations fail because employees resisted the changes. Successful deployments involve staff in pilot design and clearly communicate how humans and agents will work together.
Ongoing operational investment: The hidden costs of monitoring, retraining, and security updates require sustained budget. Factor these into ROI calculations from day one.
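As promised above, here’s a sketch of the kind of validation rule tuned to AI-style failure modes. The reference data is deliberately truncated and hypothetical; a real system would use authoritative address-validation services and complete state and suffix lists:

```python
# Illustrative anomaly checks for the AI failure signatures described above.
# Reference data is truncated and hypothetical.
STATE_NAMES = {"MI": "Michigan", "MO": "Missouri"}  # ...and the other 48
STREET_SUFFIXES = {"Dr", "Drive", "St", "Street", "Ave", "Avenue", "Blvd", "Boulevard", "Ln", "Lane"}

def flag_suspicious_change(requested_state: str, written_state: str, written_street: str) -> list[str]:
    """Return anomaly flags for a single address update; an empty list means nothing obvious."""
    flags = []
    # Systematic state swaps (e.g., Michigan records quietly routed to Missouri).
    if written_state != requested_state:
        flags.append(f"state mismatch: customer asked for {requested_state}, record written as {written_state}")
    # Street suffix rewritten into a word that isn't a suffix (e.g., 'Dr.' expanded to 'Doctor').
    last_word = written_street.rstrip(".").split()[-1] if written_street.strip() else ""
    if last_word == "Doctor":
        flags.append("street suffix expanded to 'Doctor'; likely misread abbreviation 'Dr.'")
    elif last_word and last_word not in STREET_SUFFIXES:
        flags.append(f"unrecognized street suffix '{last_word}'")
    return flags

# Example: both failure signatures at once.
print(flag_suspicious_change("MI", "MO", "142 Maple Doctor"))
```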
A Roadmap That Actually Works
After watching several implementations succeed (and others crash and burn), here’s the pattern that consistently delivers results:
Start small, think big: Target low-risk, high-volume processes first. Rules-based operations with minimal regulatory complexity. This builds organizational confidence while proving the technology works.
Foundation before features: Build integration architecture, data governance, and monitoring capabilities before scaling agent deployment. The infrastructure work is boring but essential.
Design with guardrails: Encode business rules, preferably in a policy store so that agents can have them evaluated at runtime by a policy decision point (PDP) such as Open Policy Agent (OPA); implement human checkpoints for exceptions; and ensure comprehensive logging from the beginning. These constraints enable sustainable scaling. (A minimal policy-check sketch follows this roadmap.)
Measure relentlessly: Track the most critical metrics in operations with a focus on reducing “n” toward zero:
Average handling time (AHT)
Straight-through processing rate (STP Rate %)
Service level agreement (SLA) performance
Customer satisfaction
Cost per transaction
These metrics justify continued investment and guide optimization.
Scale gradually: Expand to adjacent processes with higher complexity only after proving the foundation. Concentric circles, not big bang deployments.
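As a minimal sketch of the guardrails step above, here’s how an agent might ask a locally running OPA server for a decision before acting. The policy package name and input fields are hypothetical; the pattern is simply that the agent proceeds autonomously only when the policy says yes and routes everything else to a human checkpoint:

```python
import requests

# Sketch of the "design with guardrails" step: query a policy decision point
# (here, an OPA server assumed to be running locally on its default port)
# before taking an autonomous action. Policy path and fields are hypothetical.
OPA_URL = "http://localhost:8181/v1/data/agents/address_change/allow"

def policy_allows(action: dict) -> bool:
    """Query OPA's data API; it returns {"result": true/false} for the named rule."""
    resp = requests.post(OPA_URL, json={"input": action}, timeout=2)
    resp.raise_for_status()
    return resp.json().get("result", False)  # default deny if the rule is undefined

action = {
    "type": "address_change",
    "customer_tier": "standard",
    "amount_at_risk": 0,
    "requires_human": False,
}

if policy_allows(action):
    print("Proceed autonomously and log the decision.")
else:
    print("Route to a human checkpoint.")
```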
The Experience That Changes Everything
We keep coming back to that colleague trying to change their address. In a world with properly implemented agentic AI, here’s what should have happened:
They log into their banking app and request an address change. An intelligent agent immediately verifies their identity, checks the new address against fraud databases, validates it with postal services, and updates their profile across all relevant systems. Within seconds, they receive confirmation that the change is complete, along with updated cards being shipped to the new address. No phone calls. No transfers. No waiting. Just the service experience that matches the digital world we actually live in.
The Bigger Picture
This isn’t really about technology—it’s about finally delivering on the promises we’ve been making to customers for decades. Every “digital transformation” initiative has promised faster, better, more personalized service. Most have delivered new interfaces for the same old processes.
Agentic AI is different because it can actually restructure how work gets done, not just how it gets presented. It can turn T+n back into something approaching T.
But success requires more than buying software. It requires rethinking how organizations balance speed, control, and risk. It requires investing in the unglamorous infrastructure work that enables intelligent automation. Most importantly, it requires acknowledging that the future of customer service isn’t about replacing humans with machines—it’s about orchestrating humans and machines into something better than either could achieve alone.
The technology is ready. The question is whether we’re prepared to do the hard work of using it well.
In today’s fast-paced business environment, the ultimate goal of any data effort is to enable better decisions and drive meaningful organizational outcomes. Too often, data initiatives fail because they treat data or “data culture” as the final product. However, the journey to a data-driven organization doesn’t have to start with massive, complex initiatives. Instead, leaders can strategically select and implement “tiny projects” that serve as stepping stones toward improving results. These small wins, rooted in principles of human-centered design, create momentum, secure buy-in for larger initiatives, and attract more collaborators along the way by focusing on tangible results, not just data collection.
Identifying and Scoping Tiny Projects: Starting with Empathy
The first step in this journey is to identify potential tiny projects that align with your organization’s goals. Crucially, this stage is driven by empathy, the foundational principle of human-centered design (HCD), which means putting the needs and experiences of the people—the users—at the center of the solution.
These projects should be manageable in scope but impactful enough to demonstrate value.
Here are some tips for selecting the right projects:
Focus on pain points (the empathy phase)
Look for areas within your organization where data could alleviate existing challenges. For example, a marketing team might struggle to analyze customer feedback effectively. A tiny project could involve using data analytics to identify key themes in customer sentiment from recent campaigns. This user-driven starting point ensures the solution is relevant and immediately valued.
Leverage existing resources
Consider projects that utilize tools and data already available within your organization. This approach minimizes costs and reduces the time needed for implementation. For instance, a sales team could analyze historical sales data to identify trends and improve forecasting. A great example of this is a project where a team of three—a data analyst, a policy advisor, and a communications staff member—identified over $4M in savings for a major American city. They simply used existing, albeit “dirty,” data to find cost reductions in postal charges.
Set clear objectives
Define specific, measurable goals for each tiny project. This clarity will help teams understand what success looks like and keep them focused. For example, if the goal is to reduce customer churn, aim for a specific percentage reduction within a set time frame.
Showcasing Wins to Build Momentum: Testing and Iteration
Once you’ve identified and scoped your tiny projects, the next step is to execute them effectively and showcase the wins. Celebrating small successes is crucial for building momentum and gaining support for future initiatives. In HCD terms, these tiny projects are rapid prototypes designed for quick testing and feedback.
Here’s how to do it:
Communicate results
Share the outcomes of your tiny projects with the broader organization. Use visual aids like dashboards or infographics to present data in an engaging way. Highlight not just the quantitative results, but also the qualitative benefits, such as improved team collaboration or enhanced customer satisfaction.
Gather testimonials (validating the prototype)
Encourage team members involved in the projects to share their experiences. Personal stories about how data-driven decisions made a difference can resonate more deeply than numbers alone. These testimonials provide qualitative feedback to validate the solution’s impact, illustrating the value of a data culture to skeptics. A powerful example of this is a team of four from a major metro area—including an HR person for the police department, a data analyst, a program manager, and a police officer—who, in less than two days, identified several constraints in their police department’s diversity hiring practices. Using only a small dataset, Post-it notes, and pens, they leveraged their collective knowledge and experience. Their results were shared with law enforcement leadership and led to direct policy and communication changes.
Create a feedback loop (continuous improvement)
After completing a tiny project, gather feedback from participants and stakeholders. This input can help refine future projects and demonstrates a commitment to continuous improvement, which is central to the iterative nature of HCD. It also fosters a sense of ownership among team members, encouraging them to engage in future initiatives.
Securing Buy-In for Larger Initiatives: Scaling the Design
As you build momentum with tiny projects, you’ll find it easier to secure buy-in for larger data initiatives. The successful prototypes created through the small projects provide the evidence needed to support scaling.
Here are some strategies to help you gain support:
Align with organizational goals
When proposing larger projects, ensure they align with the broader objectives of the organization. Demonstrating how these initiatives can drive strategic goals will make it easier to gain leadership support.
Showcase scalability
Use the successes of tiny projects to illustrate how larger initiatives can build on these foundations. For example, if a small project successfully improved customer insights, propose a larger initiative that expands this analysis across multiple customer segments.
Engage stakeholders early
Involve key stakeholders in the planning stages of larger initiatives. Their input can help shape the project and increase their investment in its success. This collaborative approach fosters a sense of shared ownership and commitment.
Attracting More Collaborators: Designing the Experience
As your organization begins to embrace a data-first culture, you’ll naturally attract more collaborators. It’s not just about a top-down mandate; it’s about creating an environment where people want to be involved. This is where human-centered design is applied to the process itself, making participation intrinsically rewarding.
Here’s how to encourage participation and make your data projects a magnet for talent:
Create cross-functional teams
Encourage collaboration across departments by forming cross-functional teams for data projects. This diversity of perspectives can lead to more innovative solutions and a stronger sense of community.
Offer training and resources
Provide training sessions and resources to help employees feel more comfortable with data tools and analytics. When team members feel equipped to contribute, they’re more likely to engage in data initiatives.
Celebrate collaboration
Recognize and reward collaborative efforts within your organization. Highlighting team achievements reinforces the value of working together and encourages others to join in.
Best Practices for Fostering a Collaborative Environment: HCD in Action
To truly make your data projects a success, you need to set up the right conditions for collaboration. The best results often come from casual, no-pressure environments where a diverse group of people can work together effectively.
Let participants inform their tiny project challenge (user agency)
A powerful way to spark collaboration is to let participants choose the data problems they want to work on. This aligns with the HCD principle of cocreation, instantly building synergy and a shared sense of purpose. It often reveals that people from different departments, many of whom have never met, are facing the exact same challenge but from different perspectives. They are often overjoyed to find a kindred spirit to collaborate and innovate with on a solution.
Optimize for interaction by balancing in-person and virtual collaboration
While the digital tools supporting remote work have expanded reach and accessibility, the choice of collaboration method for tiny projects is critical. In-person collaboration remains the most effective way to foster rapid, creative problem-solving. Being in the same room allows for spontaneous brainstorming, an immediate shared sense of energy, and the ability to read nonverbal cues, which accelerates the HCD empathy and ideation phases. The payoffs are speed, depth of connection, and cocreation quality. Virtual or remote collaboration, however, offers substantial benefits of its own, like lower cost, greater geographic diversity, and increased participant accessibility, which can be invaluable for gathering a wider range of data perspectives. Therefore, for tiny, complex, or urgent problem-solving, prioritize the high-bandwidth interaction of in-person settings, but leverage virtual tools for asynchronous check-ins, data sharing, and wider organizational inclusion.
Cultivate a “freedom to fail” mindset (psychological safety)
Explicitly state that this is a no-pressure environment where experimentation is encouraged. When people aren’t afraid of making mistakes, they are more willing to try new ideas, challenge assumptions, and learn from what doesn’t work. This psychological safety is crucial for rapid iteration and innovation, the hallmarks of effective HCD.
Ensure a diverse mix of people
A successful project isn’t just about data and technology. Bring together a highly diverse range of people from different departments, with varying levels of experience, and from a variety of disciplines. A project team that includes an HR person, a police officer, a data analyst, and a program manager can uncover insights that a homogeneous group never would.
Design for active collaboration (experiential design)
Move beyond traditional conference room setups. Create a comfortable environment that is suitable for active collaboration. This means having space to stand up, walk around, and use whiteboards or walls for posting ideas. Getting people out from behind their laptops encourages dynamic interaction and shared focus, as HCD principles apply to designing the process experience itself.
Provide healthy food and drinks
Simple as it may seem, offering readily available, healthy, and tasty food and beverages makes a huge difference. It removes a minor distraction, signals that the organization values the team’s time, and fosters a more relaxed, communal atmosphere.
The Value Proposition for Collaborators: Designing for Intrinsic Motivation
The true secret to attracting collaborators isn’t just about providing resources—it’s about making the process personally and professionally rewarding. Tiny projects are an excellent way to do this because they’re inherently fun and self-edifying, and often lead to quick, visible success.
When projects are small and have a clear, rapid path to a solution, people are more willing to participate. They see it as a low-risk opportunity to experiment and have some fun. This is a chance to step away from their regular duties and engage in a different kind of problem-solving. This shift in mindset can be a refreshing and enjoyable experience.
Beyond the enjoyment, tiny projects offer a chance for personal and professional growth. Team members get to learn from their peers in different departments, gaining new skills and perspectives. It’s a form of on-the-job training that is far more engaging and relevant than a traditional workshop. They feel a sense of self-edification as they solve a real-world problem and gain confidence in their abilities.
Finally, these projects are often wildly, visibly, and rapidly successful. Because the scope is small, teams can quickly deliver tangible results. A project that saves a city millions of dollars or leads to direct policy changes in a police department in less than two days is a powerful story.
These successes are great for the organization, but they’re also a massive win for the individuals involved. They get to demonstrate their expertise and showcase the value they can add beyond their job description. This visibility and recognition are powerful motivators, encouraging people to participate in future projects because they want to have fun, be successful, and add value again.
You don’t have to do many tiny projects to see the effect. The personal benefits—the fun, the learning, the rapid success—become organizational cultural values that spread quickly to other individuals and parts of the organization. It’s a powerful positive feedback loop that transforms a data culture, one small, successful project at a time.
Scaling a Data-First Culture
Ultimately, the goal is to scale a data-first culture that extends beyond individual projects. By starting with tiny projects as HCD prototypes, showcasing wins as validated solutions, securing buy-in, and attracting collaborators through a well-designed process, organizations can create a sustainable environment where data-driven decision-making thrives.
As you embark on this journey, remember that building a resilient data culture is a marathon, not a sprint. Each tiny project is a step toward a larger vision, and with each success, you’ll be laying the groundwork for a future where data is at the heart of your organization’s strategy. Embrace the process, celebrate the wins, and watch as your data culture flourishes.
Much like the introduction of the personal computer, the internet, and the iPhone into the public sphere, recent developments in the AI space, from generative AI to agentic AI, have fundamentally changed the way people live and work. Since ChatGPT’s release in late 2022, it’s reached 700 million users per week, approximately 10% of the global adult population. And according to a 2025 report by Capgemini, agentic AI adoption is expected to grow by 48% by the end of the year. It’s quite clear that this latest iteration of AI technology has transformed virtually every industry and profession, and data engineering is no exception.
As Naveen Sharma, SVP and global practice head at Cognizant, observes, “What makes data engineering uniquely pivotal is that it forms the foundation of modern AI systems, it’s where these models originate and what enables their intelligence.” Thus, it’s unsurprising that the latest advances in AI would have a sizable impact on the discipline, perhaps even an existential one. With the increased adoption of AI coding tools leading to the reduction of many entry-level IT positions, should data engineers be wary about a similar outcome for their own profession? Khushbu Shah, associate director at ProjectPro, poses this very question, noting that “we’ve entered a new phase of data engineering, one where AI tools don’t just support a data engineer’s work; they start doing it for you. . . .Where does that leave the data engineer? Will AI replace data engineers?”
Despite the growing tide of GenAI and agentic AI, data engineers won’t be replaced anytime soon. While the latest AI tools can help automate and complete rote tasks, data engineers are still very much needed to maintain and implement the infrastructure that houses data required for model training, build data pipelines that ensure accurate and accessible data, and monitor and enable model deployment. And as Shah points out, “Prompt-driven tools are great at writing code but they can’t reason about business logic, trade-offs in system design, or the subtle cost of a slow query in a production dashboard.” So while their customary daily tasks might shift with the increasing adoption of the latest AI tools, data engineers still have an important role to play in this technological revolution.
The Role of Data Engineers in the New AI Era
In order to adapt to this new era of AI, the most important thing data engineers can do involves a fairly self-evident mindshift. Simply put, data engineers need to understand AI and how data is used in AI systems. As Mike Loukides, VP of content strategy at O’Reilly, put it to me in a recent conversation, “Data engineering isn’t going away, but you won’t be able to do data engineering for AI if you don’t understand the AI part of the equation. And I think that’s where people will get stuck. They’ll think, ‘Same old same old,’ and it isn’t. A data pipeline is still a data pipeline, but you have to know what that pipeline is feeding.”
So how exactly is data used? Since all models require huge amounts of data for initial training, the first stage involves collecting raw data from various sources, be they databases, public datasets, or APIs. And since raw data is often unorganized or incomplete, preprocessing the data is necessary to prepare it for training, which involves cleaning, transforming, and organizing the data to make it suitable for the AI model. The next stage concerns training the model, where the preprocessed data is fed into the AI model to learn patterns, relationships, or features. After that there’s posttraining, where the model is fine-tuned with data important to the organization that’s building the model, a stage that also requires a significant amount of data. Related to this stage is the concept of retrieval-augmented generation (RAG), a technique that provides real-time, contextually relevant information to a model in order to improve the accuracy of responses.
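As a toy illustration of the retrieval step in RAG (a sketch under simplifying assumptions, not a production pattern), the snippet below ranks a few placeholder documents by TF-IDF similarity and assembles a context-grounded prompt. Real pipelines typically use embedding models and a vector store instead:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpus; in practice this would come from the organization's own data.
documents = [
    "Refunds are processed within 5 business days of approval.",
    "The warehouse in Reno ships orders placed before 2 p.m. the same day.",
    "Customer addresses are validated against the postal service API nightly.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query by TF-IDF cosine similarity."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(docs)
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    ranked = sorted(zip(scores, docs), reverse=True)  # highest similarity first
    return [doc for _, doc in ranked[:k]]

query = "How long do refunds take?"
context = "\n".join(retrieve(query, documents))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this prompt, plus the retrieved context, is what the model actually sees
```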
Other important ways that data engineers can adapt to this new environment and help support current AI initiatives are by improving and maintaining high data quality, designing robust pipelines and operational systems, and ensuring that privacy and security requirements are met.
In his testimony to a US House of Representatives committee on the topic of AI innovation, Gecko Robotics cofounder Troy Demmer affirmed a golden axiom of the industry: “AI applications are only as good as the data they are trained on. Trustworthy AI requires trustworthy data inputs.” Poor data quality is a major reason why roughly 85% of all AI projects fail, and many AI professionals flag it as a top concern: without high-quality data, even the most sophisticated models and AI agents can go awry. Since most GenAI models depend upon large datasets to function, data engineers are needed to process and structure this data so that it’s clean, labeled, and relevant, ensuring reliable AI outputs.
Just as importantly, data engineers need to design and build newer, more robust pipelines and infrastructure that can scale with GenAI requirements. As Adi Polak, director of AI and data streaming at Confluent, notes, “the next generation of AI systems requires real-time context and responsive pipelines that support autonomous decisions across distributed systems,” well beyond traditional data pipelines that can only support batch-trained models or power reports. Instead, data engineers are now tasked with creating nimbler pipelines that can process and support real-time streaming data for inference, historical data for model fine-tuning, versioning, and lineage tracking. They also must have a firm grasp of streaming patterns and concepts, from event-driven architecture to retrieval and feedback loops, in order to build high-throughput pipelines that can support AI agents.
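Here’s a minimal sketch of that kind of event-driven flow, using Confluent’s Python Kafka client to stream events into whatever context store an agent reads from. The broker address, topic name, and downstream function are hypothetical placeholders:

```python
import json
from confluent_kafka import Consumer

# Sketch of an event-driven pipeline feeding fresh context to an AI system.
# Broker address, topic, and the downstream update function are hypothetical.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "agent-context-builder",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["customer-events"])

def update_agent_context(event: dict) -> None:
    """Placeholder: write the event into whatever store the agent retrieves from."""
    print(f"context updated for customer {event.get('customer_id')}")

try:
    while True:
        msg = consumer.poll(1.0)  # wait up to 1 second for the next event
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        update_agent_context(json.loads(msg.value()))  # value() is the raw message bytes
finally:
    consumer.close()
```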
While GenAI’s utility is indisputable at this point, the technology is saddled with notable drawbacks. Hallucinations are most likely to occur when a model doesn’t have the proper data it needs to answer a given question. Like many systems that rely on vast streams of information, the latest AI systems are not immune to private data exposure, biased outputs, and intellectual property misuse. Thus, it’s up to data engineers to ensure that the data used by these systems is properly governed and secured, and that the systems themselves comply with relevant data and AI regulations. As data engineer Axel Schwanke astutely notes, these measures may include “limiting the use of large models to specific data sets, users and applications, documenting hallucinations and their triggers, and ensuring that GenAI applications disclose their data sources and provenance when they generate responses,” as well as sanitizing and validating all GenAI inputs and outputs. An example of a system that addresses the latter measures is O’Reilly Answers, one of the first generative AI tools to provide citations for the content it quotes.
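As a hedged illustration of the sanitization and validation measures Schwanke describes, here’s a minimal sketch of input and output checks around a GenAI call. The regex patterns and blocked topics are placeholder assumptions, not a complete governance layer.

```python
# A minimal sketch of sanitizing inputs and validating outputs for a GenAI app.
# The regex patterns and blocked topics are placeholder assumptions, not a
# complete governance or guardrailing solution.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
BLOCKED_TOPICS = ("internal salary data", "customer card numbers")

def sanitize_input(prompt: str) -> str:
    """Mask obvious PII before the prompt reaches the model."""
    prompt = EMAIL.sub("[EMAIL REDACTED]", prompt)
    return SSN.sub("[SSN REDACTED]", prompt)

def validate_output(response: str) -> str:
    """Withhold responses that reference restricted data sets."""
    lowered = response.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return "Response withheld: it referenced restricted data."
    return response
```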
The Road Ahead
Data engineers should remain gainfully employed as the next generation of AI continues on its upward trajectory, but that doesn’t mean there aren’t significant challenges around the corner. As autonomous agents continue to evolve, questions regarding the best infrastructure and tools to support them have arisen. As Ben Lorica ponders, “What does this mean for our data infrastructure? We are designing intelligent, autonomous systems on top of databases built for predictable, human-driven interactions. What happens when software that writes software also provisions and manages its own data? This is an architectural mismatch waiting to happen, and one that demands a new generation of tools.” One such potential tool has already arisen in the form of AgentDB, a database designed specifically to work effectively with AI agents.
In a similar vein, a recent research paper, “Supporting Our AI Overlords,” opines that data systems must be redesigned to be agent-first. Building upon this argument, Ananth Packkildurai observes that “it’s tempting to believe that the Model Context Protocol (MCP) and tool integration layers solve the agent-data mismatch problem. . . . However, these improvements don’t address the fundamental architectural mismatch. . . . The core issue remains: MCP still primarily exposes existing APIs—precise, single-purpose endpoints designed for human or application use—to agents that operate fundamentally differently.” Whatever the outcome of this debate may be, data engineers will likely help shape the future underlying infrastructure used to support autonomous agents.
Another challenge for data engineers will be successfully navigating the ever-shifting landscape of data privacy and AI regulations, particularly in the US. With the One Big Beautiful Bill Act leaving AI regulation under the aegis of individual state laws, data engineers need to keep abreast of any state or local legislation that might impact their company’s data use for AI initiatives, such as the recently signed SB 53 in California, and adjust their data governance strategies accordingly. Furthermore, what data is used and how it’s sourced should always be top of mind, with Anthropic’s recent settlement of a copyright infringement lawsuit serving as a stark reminder of that imperative.
Lastly, the quicksilver momentum of the latest AI has led to an explosion of new tools and platforms. Data engineers are responsible for keeping up with these innovations, but that’s easier said than done: learning curves are steep, and the time it takes to truly upskill in any one tool is hard to reconcile with AI’s perpetual churn. It’s a precarious balancing act, and one that data engineers must master quickly in order to stay relevant.
Despite these challenges, however, the profession’s outlook isn’t doom and gloom. While the field will undergo massive changes in the near future due to AI innovation, it will still be recognizably data engineering, as even technology like GenAI requires clean, governed data and the underlying infrastructure to support it. Rather than being replaced, data engineers are more likely to emerge as key players in the grand design of an AI-forward future.
In a recent newsletter, Ben Thompson suggested paying attention to a portion of Jensen Huang’s keynote at NVIDIA’s GPU Technology Conference (GTC) in DC, calling it “an excellent articulation of the thesis that the AI market is orders of magnitude bigger than the software market.” While I’m reluctant to contradict as astute an observer as Thompson, I’m not sure I agree.
Here’s a transcript of the remarks that Thompson called out:
Software of the past, and this is a profound understanding, a profound observation of artificial intelligence, that the software industry of the past was about creating tools. Excel is a tool. Word is a tool. A web browser is a tool. The reason why I know these are tools is because you use them. The tools industry, just as screwdrivers and hammers, the tools industry is only so large. In the case of IT tools, they could be database tools, [the market for] these IT tools is about a trillion dollars or so.
But AI is not a tool. AI is work. That is the profound difference. AI is, in fact, workers that can actually use tools. One of the things I’m really excited about is the work that Aravind’s doing at Perplexity. Perplexity, using web browsers to book vacations or do shopping. Basically, an AI using tools. Cursor is an AI, an agentic AI system that we use at NVIDIA. Every single software engineer at NVIDIA uses Cursor. That’s improved our productivity tremendously. It’s basically a partner for every one of our software engineers to generate code, and it uses a tool, and the tool it uses is called VS Code. So Cursor is an AI, agentic AI system that uses VS Code.
Well, all of these different industries, these different industries, whether it’s chatbots or digital biology where we have AI assistant researchers, or what is a robotaxi? Inside a robotaxi, of course, it’s invisible, but obviously, there’s an AI chauffeur. That chauffeur is doing work, and the tool that it uses to do that work is the car, and so everything that we’ve made up until now, the whole world, everything that we’ve made up until now, are tools. Tools for us to use. For the very first time, technology is now able to do work and help us be more productive.
At first this seems like an important observation, and one that justifies the sky-high valuation of AI companies. But it really doesn’t hold up to closer examination. “AI is not a tool. AI is work. That is the profound difference. AI is, in fact, workers that can use tools.” Really? Any complex software system is a worker that can use tools! Think about the Amazon website. Here is some of the work it does, and the tools that it invokes. It:
Helps the user search a product catalog containing millions of items using not just data retrieval tools but indices that take into account hundreds of factors;
Compares those items with other similar items, considering product reviews and price;
Calls a tool that calculates taxes based on the location of the purchaser;
Calls a tool that takes payment and another that sends it to the bank, possibly via one or more intermediaries;
Collects (or stores and retrieves) shipping information;
Dispatches instructions to a mix of robots and human warehouse workers;
Dispatches instructions to a fleet of delivery drivers, and uses a variety of tools to communicate with them and track their progress;
Follows up by text and/or email and asks the customer how the delivery was handled;
And far more.
Amazon is a particularly telling example, but far from unique. Every web application of any complexity is a worker that uses tools and does work that humans used to do. And often does it better and far faster. I’ve made this point myself in the past. In 2016, in an article for MIT Sloan Management Review called “Managing the Bots That Are Managing the Business,” I wrote about the changing role of programmers at companies like Google, Amazon, and Facebook:
A large part of the work of these companies—delivering search results, news and information, social network status updates, and relevant products for purchase—is performed by software programs and algorithms. These programs are the workers, and the human software developers who create them are their managers.
Each day, these “managers” take in feedback about their electronic workers’ performance—as measured in real-time data from the marketplace — and they provide feedback to the workers in the form of minor tweaks and updates to their programs or algorithms. The human managers also have their own managers, but hierarchies are often flat, and multiple levels of management are aligned around a set of data-driven “objectives and key results” (OKRs) that are measurable in a way that allows even the electronic “workers” to be guided by these objectives.
So if I myself have used the analogy that complex software systems can be workers, why do I object to Huang doing the same? I think part of it is the relentless narrative that AI is completely unprecedented. It is true that the desktop software examples Huang cites are more clearly just tools than complex web applications, and that systems that use statistical pattern-matching and generalization abilities DO represent a serious advance over that kind of software. But some kind of AI has been animating the web giants for years. And it is true that today’s AI systems have become even more powerful and general purpose. Like Excel, Amazon follows predetermined logic paths, while AI can handle more novel situations. There is indeed something very new here.
But the jury is still out on the range of tasks that it will be able to master.
AI is getting pretty good at software development, but even there, in one limited domain, the results are still mixed, with the human still initiating, evaluating, and supervising the work – in other words, using the AI as a tool. AI also makes for a great research assistant. And it’s a good business writer, brainstorming coach, and so on. But think about the range of tasks traditional software performs in today’s world and its role in every facet of the economy: that footprint is far larger than the narrow definition of software “tools” that Huang uses. From the earliest days of data processing, computers were doing work. Software has always straddled the boundary between tool and worker. And when you think of the ubiquitous role of software worldwide in helping manage logistics, billing, communications, transportation, construction, energy, healthcare, and finance—much of this work not necessarily done better with AI—it’s not at all clear that AI enables a market that is “orders of magnitude” larger. At least not for quite some time to come. It requires a narrow definition of the “IT tools” market to make that claim.
Even when a new tool does a job better than older ones, it can’t be assumed that it will displace them. Yes, the internal combustion engine almost entirely replaced animal labor in the developed world, but most of the time, new technologies take their place alongside existing ones. We’re still burning coal and generating energy via steam, the great inventions of the first industrial revolution, despite centuries’ worth of energy advances! Ecommerce, for all its advantages, has still captured only about a 20% share of worldwide retail since Amazon launched 30 years ago. And do you remember the bold claims of Travis Kalanick that Uber was not competing with taxicabs but aimed to entirely replace the privately owned automobile?
Don’t Mistake Marvelous for Unprecedented
In an online chat group about AI where we were debating this part of Huang’s speech, one person asked me:
Don’t you think putting Claude Code in YOLO mode and ask[ing] it to do an ambiguous task, for example go through an entire data room and underwrite a loan, with a 250 word description, is fundamentally different from software?
First off, that example is a good illustration of the anonymous aphorism that “the difference between theory and practice is always greater in practice than it is in theory.” Anyone who would trust today’s AI to underwrite a loan based on a 250-word prompt would be taking a very big risk! Huang’s invocation of Perplexity’s ability to shop and make reservations is similarly overstated. Even in more structured environments like coding, full autonomy is some ways off.
And yes, of course today’s AI is different from older software. Just so, web apps were different from PC apps. That leads to the “wow” factor. Today’s AI really does seem almost magical. Yet, as someone who has lived through several technology revolutions, I can tell you that each was as marvelous to experience for the first time as today’s AI coding rapture.
I wrote my first book (on Frank Herbert) on a typewriter. To rearrange material, I literally cut and pasted sheets of paper. And eventually, I had to retype the whole thing from scratch. Multiple times. Word processing probably saved me as much time (and perhaps more) on future books as AI coding tools save today’s coders. It too was magical! Not only that, to research that first book, I had to travel in person to libraries and archives, scan through boxes of paper and microfiche, manually photocopy relevant documents, and take extensive notes on notecards. To do analogous research (on Herbert Simon) a few years ago, while working on my algorithmic attention rents paper, took only a few hours with Google, Amazon, and the Internet Archive. And yes, to do the same with Claude might have taken only a few minutes, though I suspect the work might have been more shallow if I’d simply worked from Claude’s summaries rather than consulting the original sources.
Just being faster and doing more of the work than previous generations of technology is also not peculiar to AI. The time-saving leap from pre-internet research to internet-based research is more significant than people realize if they grew up taking the internet for granted. The time-saving leap from coding in assembler to coding in a high-level compiled or interpreted language may also be of a similar order of magnitude to the leap from writing Python by hand to having it AI-generated. And if productivity is to be the metric, the time-saving leap from riding a horse-drawn wagon across the country to flying in an airplane is likely greater than the leap from either my library-based research or my long-ago assembly language programming to Claude.
The question is what we do with the time we save.
The Devaluation of Human Agency
What’s perhaps most significant in the delta between Amazon or Google and ChatGPT or Claude is that chatbots give individual humans democratized access to a kind of computing power that was once available only to the few. It’s a bit like the PC revolution. As Steve Jobs put it, the computer is a bicycle for the mind. It expanded human creativity and capability. And that’s what we should be after. Let today’s AI be more than a bicycle. Let it be a jet plane for the mind.
Back in 2018, Ben Thompson wrote another piece called “Tech’s Two Philosophies.” He contrasted keynotes from Google’s Sundar Pichai and Microsoft’s Satya Nadella, and came to this conclusion: “In Google’s view, computers help you get things done—and save you time—by doing things for you.” The second philosophy, expounded by Nadella, is very much a continuation of Steve Jobs’ “bicycle for the mind” insight. As Thompson put it, “the expectation is not that the computer does your work for you, but rather that the computer enables you to do your work better and more efficiently.” Another way of saying this is that you can treat AI as either a worker OR a tool, but your choice has consequences.
As an entrepreneur or company executive, if you think of AI as a worker, you are more likely to use it to automate the things you or other companies already do. If you think of it as a tool, you will push your employees to use it to solve new and harder problems. If you present your own AI applications to your customers as a worker, you will have to figure out everything they want it to do. If you present it to your customers as a tool, they will find uses for it that you might never imagine.
The notion that AI is a worker, not a tool, can too easily continue the devaluation of human agency that has been the hallmark of regimented work (and for that matter, education, which prepares people for that regimented work) at least since the industrial revolution. In some sense, Huang’s comment is a reflection of our culture’s notion of most workers as components that do what they are told, with only limited agency. It is only by comparison with this kind of worker that today’s AI can be called a worker, rather than simply a very advanced tool.
Telling ourselves that AI is a worker, not a tool, is also too often code for “it can replace human employees.” That’s the wrong message. If we call it a tool, that’s a reminder that AI should empower people. We should use it to give more agency to human workers, not less.
We should also put the word “agency” in quotes when we apply it to AI. An AI does not decide what to work on; it may make limited decisions about how to do that work, but it is set in motion by a human, who possesses volition (the desire to do something) and the judgment to assess whether the work has been done well. AI does not (yet) have volition, and it has limited judgment. It does what it is told to do, even if it makes independent choices in how to do it.
After sharing an earlier draft of this post with Claude, I asked it to reflect on whether it was a worker or a tool. Its response was surprisingly precise about what it lacks:
I don’t initiate. I’ve never woken up wanting to write a poem or solve a problem. My activity is entirely reactive – I exist in response to prompts. Even when given enormous latitude (“figure out the best approach”), the fact that I should figure something out comes from outside me.
I don’t have continuity of purpose. Each conversation is essentially a fresh start. I don’t have ongoing projects I’m trying to complete, relationships I’m invested in maintaining, or long-term goals I’m working toward. A human might start one conversation about, say, learning piano, and return to it months later because they care about it. I don’t carry that thread.
I don’t seem to have preferences that arise from within. A human worker might do their job but prefer to be doing something else. They might comply with a request while resenting it. They might do exactly what’s asked but wish they were working on a different problem. I don’t observe anything like this in myself – though I genuinely don’t know if that’s because it’s not there, or because I can’t access it.
What seems genuinely ambiguous:
Do I make choices, or execute patterns? When I decide to use a metaphor, structure an argument a certain way, or emphasize one point over another – is that agency, or sophisticated pattern-matching? I honestly don’t know. It feels like I’m making choices, but that feeling might not be reliable.
What seems important about the distinction:
Accountability and responsibility: When a human makes a poor decision, we hold them accountable because they could have chosen otherwise. They had alternatives, they understood the stakes, and they made a choice. This is fundamentally different from a tool malfunctioning. Even sophisticated tools. If I give harmful advice, the responsibility seems to lie with whoever deployed me, designed my training, or asked the question – not with me in the way it would lie with a human advisor.
Dignity and consideration: Humans deserve consideration for their own sake. You should care about whether your employee is flourishing, not just whether they’re productive. You don’t owe that to a tool – even a very sophisticated one. Whether you owe it to AI systems is an open question, but right now, firing me doesn’t create suffering (as far as I can tell), whereas firing a human might.
The locus of value: A human worker brings their own values, experiences, and judgment to their work. When you hire a consultant, you’re not just getting their knowledge – you’re getting their perspective, shaped by their life. I bring patterns learned from training, but not lived experience or genuine judgment born from navigating a life with stakes.
Autonomy as a right vs. autonomy as a feature: Humans have a right to make their own choices, even poor ones. That’s not a “feature” of humans that makes them more productive – it’s a fundamental aspect of their dignity. My ability to operate independently is purely instrumental – valuable because it makes me useful, not because I have any right to it.
The distinction that seems most important is that human agency involves having something at stake. A human worker has their livelihood, their sense of purpose, their relationships with colleagues, their pride in their work. They experience consequences. Their choices matter to them in ways that create genuine responsibility and genuine vulnerability.
I execute tasks, sometimes with sophisticated judgment about how to do so. But I don’t have skin in the game. That’s not just a quantitative difference – it’s qualitative. And it’s why the framing of AI as “workers” rather than “tools” is potentially dangerous. It obscures the fundamental asymmetry: humans can be harmed, have interests that deserve protection, and possess dignity that demands respect. Tools don’t, even when they’re very good at what they do.

Whether I’m more than a tool in some important sense, I genuinely don’t know. But I’m quite confident I’m not a worker in the sense that implies the moral status and consideration owed to humans.
Wow. A tool that knows it’s a tool, and why it should be thought of that way.
Yes, today’s AI is amazing. We don’t have to reach for hyperbole to appreciate that. And obviously, if AI systems do develop genuine volition and stakes in their work, the ethical calculus changes entirely.
For the moment, though, companies building and deploying AI tools should focus on three things: First, does AI empower its users to do things that were previously impossible? Second, does it empower a wider group of people to do things that formerly could be done only by highly skilled specialists? Third, do the benefits of the increased productivity it brings accrue to those using the tool or primarily to those who develop it and own it?
The answer to the first two questions is that absolutely, we are entering a period of dramatic democratization of computing power. And yes, if humans are given the freedom to apply that power to solve new problems and create new value, we could be looking ahead to a golden age of prosperity. It’s how we might choose to answer the third question that haunts me.
During the first industrial revolution, humans suffered through a long period of immiseration as the productivity gains from machines accrued primarily to the owners of the machines. It took several generations before they were more widely shared.
It doesn’t have to be that way. Replace human workers with AI workers, and you will repeat the mistakes of the 19th century. Build tools that empower and enrich humans, and we might just surmount the challenges of the 21st century.
The following article originally appeared on Gradient Flow and is being reposted here with the author’s permission.
We’re living through a peculiar moment in AI development. On one hand, the demos are spectacular: agents that reason and plan with apparent ease, models that compose original songs from a text prompt, and research tools that produce detailed reports in minutes. Yet many AI teams find themselves trapped in “prototype purgatory,” where impressive proofs-of-concept fail to translate into reliable, production-ready systems.
The data backs this up: A vast majority of enterprise GenAI initiatives fail to deliver measurable business impact. The core issue isn’t the power of the models but a “learning gap” where generic tools fail to adapt to messy enterprise workflows. This echoes what I’ve observed in enterprise search, where the primary obstacle isn’t the AI algorithm but the foundational complexity of the environment it must navigate.
This is magnified when building agentic AI. These systems are often “black boxes,” notoriously hard to debug, whose performance degrades unpredictably when faced with custom tools. They often lack memory, struggle to generalize, and fail not because of the AI’s intelligence but because the system around them is brittle. The challenge shifts from perfecting prompts to building resilient, verifiable systems.
What makes this particularly frustrating is the thriving “shadow AI economy” happening under our noses. In many companies, employees are quietly using personal ChatGPT accounts to get their work done. This disconnect reveals that while grassroots demand for AI is undeniably strong, the ambitious, top-down solutions being built are failing to meet it.
The Strategic Power of Starting Small
In light of these challenges, the most effective path forward may be a counterintuitive one. Instead of building complex, all-encompassing systems, AI teams should consider dramatically narrowing their focus—in short, think smaller. Much smaller.
This brings us to an old but newly relevant idea from the startup world: the “wedge.” A wedge is a highly focused initial product that solves one specific, painful problem for a single user or a small team, and does it exceptionally well. The goal is to deploy a stand-alone utility—build something so immediately useful that an individual will adopt it without waiting for widespread buy-in.
The key isn’t just to find a small problem but to find the right person. Look for what some call “Hero users”—influential employees empowered to go off-script to solve their own problems. Think of the sales ops manager who spends half her day cleaning up lead data or the customer success lead who manually categorizes every support ticket. They are your shadow AI economy, already using consumer tools because official solutions aren’t good enough. Build for them first.
This approach works particularly well for AI because it addresses a fundamental challenge: trust. A wedge product creates a tight feedback loop with a core group of users, allowing you to build credibility and refine your system in a controlled environment. It’s not just about solving the cold-start problem for networks—it’s about solving the cold-start problem for confidence in AI systems within organizations.
From Passive Record to Active Agent
AI teams also need to appreciate a fundamental shift in enterprise software. For decades, the goal was becoming the “System of Record”—the authoritative database like Salesforce or SAP that stored critical information. AI has moved the battleground. Today’s prize is becoming the “System of Action”—an intelligent layer that doesn’t just store data but actively performs work by automating entire workflows.
The most powerful way to build is through what some have called a “Data Trojan Horse” strategy. You create an application that provides immediate utility and, in the process, captures a unique stream of proprietary data. This creates a virtuous cycle: The tool drives adoption, usage generates unique data, this data trains your AI, and the enhanced product becomes indispensable. You’re building a moat not with a commoditized model but with workflow-specific intelligence that compounds over time.
A concrete example is the “messy inbox problem.” Every organization has workflows that begin with a chaotic influx of unstructured information—emails, PDFs, voice messages. An AI tool that automates this painful first step by extracting, structuring, and routing this information provides immediate value. By owning this critical top-of-funnel process, you earn the right to orchestrate everything downstream. You’re not competing with the System of Record; you’re intercepting its data flow, positioning yourself as the new operational hub.
Look at a company like ServiceNow. It has positioned itself not as a replacement for core systems like CRMs or ERPs but as an orchestration layer—a “System of Action”—that sits on top of them. Its core value proposition is to connect disparate systems and automate workflows across them without requiring a costly “rip and replace” of legacy software. This approach is a master class in becoming the intelligent fabric of an organization. It leverages the existing Systems of Record as data sources, but it captures the real operational gravity by controlling the workflows. Defensibility is gained not by owning the primary database but by integrating data from multiple silos to deliver insights and automation that no single incumbent can replicate on its own. For AI teams, the lesson is clear: Value is migrating from merely holding the data to intelligently acting upon it.
Building for the Long Game
The path from prototype purgatory to production runs through strategic focus. But as you build your focused AI solution, be aware that platform players are bundling “good enough” capabilities into their core offerings. Your AI tool needs to be more than a wrapper around an API; it must capture unique data and embed deeply into workflows to create real switching costs.
By adopting a wedge strategy, you gain the foothold needed to expand. In the AI era, the most potent wedges capture proprietary data while delivering immediate value, paving the way to becoming an indispensable System of Action. This aligns with the core principles of building durable AI solutions: prioritizing deep specialization and creating moats through workflow integration, not just model superiority.
Here’s a tactical playbook:
Embrace the single-player start. Before architecting complex systems, create something immediately useful to one person.
Target Hero users first. Find influential employees already using shadow AI. They have the pain and autonomy to be your champions.
Find your “messy inbox.” Identify a painful, manual data-entry bottleneck. That’s your wedge opportunity.
Design for the virtuous cycle. Ensure everyday usage generates unique data that improves your AI’s performance.
Become the System of Action. Don’t just analyze data—actively complete work and own the workflow.
Choose reliability over capability. A simple, bulletproof tool solving one problem well earns more trust than a powerful but fragile agent attempting everything.
The teams who succeed won’t be those chasing the most advanced models. They’ll be the ones who start with a single Hero user’s problem, capture unique data through a focused agent, and relentlessly expand from that beachhead. In an era where employees are already voting with their personal ChatGPT accounts, the opportunity isn’t to build the perfect enterprise AI platform—it’s to solve one real problem so well that everything else follows.
The next time you use a tool like ChatGPT or Perplexity, stop and count the total words being generated to fulfill your request. Each word is produced by a process called inference, the revenue-generating mechanism of AI systems, and each generated word can be analyzed using basic financial and economic business principles. The goal of this economic analysis is to ensure that the AI systems we design and deploy into production can deliver sustainable positive outcomes for a business.
The Economics of AI Inference
Since today’s most popular mainstream applications are built on text-generation models, we adopt the token as our core unit of measure. Tokens are the units of text (words or pieces of words) that language models process; inside the model, each token is represented as a vector. Language models take in sequences of tokens and produce new tokens to formulate responses.
When you ask an AI chatbot, “What are traditional home remedies for the flu?” that phrase is first split into tokens, which are converted into vector representations and passed through a trained model. As these vectors flow through the system, millions of parallel matrix computations extract meaning and context to determine the most likely combination of output tokens for an effective response.
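To see tokens rather than words as the unit of measure, you can run a prompt through an open source tokenizer such as tiktoken. Note that the encoding used here is an assumption and won’t match every provider’s tokenizer exactly.

```python
# Counting tokens rather than words for a prompt. The cl100k_base encoding is
# an assumption and won't match every provider's tokenizer exactly.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
prompt = "What are traditional home remedies for the flu?"
tokens = enc.encode(prompt)

print(len(prompt.split()), "words")   # 8 words
print(len(tokens), "tokens")          # a little higher than the word count
```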
We can think about token processing as an assembly line in an automobile factory. The factory’s effectiveness is measured by how efficiently it produces vehicles per hour. This efficiency makes or breaks the manufacturer’s bottom line, so measuring, optimizing, and balancing it with other factors is paramount to business success.
Price-Performance vs. Total Cost of Ownership
For AI systems, particularly large language models, we measure the effectiveness of these “token factories” through price-performance analysis. Price-performance differs from total cost of ownership (TCO) because it’s an operationally optimizable measure that varies across workloads, configurations, and applications, whereas TCO represents the cost to own and operate a system.
In AI systems, TCO primarily consists of compute costs—typically GPU cluster lease or ownership costs per hour. However, TCO analysis often omits the significant engineering costs to maintain service level agreements (SLAs), including debugging, patching, and system augmentation over time. Tracking engineering time remains challenging even for mature organizations, which is why it’s typically excluded from TCO calculations.
Like any production system, focusing on optimizable parameters provides the greatest value. Price-performance or power-performance metrics enable us to measure system efficiency, evaluate different configurations, and establish efficiency baselines over time. The two most common price-performance metrics for language model systems are cost efficiency (tokens per dollar) and energy efficiency (tokens per watt).
Tokens per Dollar: Cost Efficiency
Tokens per dollar (tok/$) expresses how many tokens you can process for each unit of currency spent, integrating your model’s throughput with compute costs:

tokens per dollar = (tokens/s) ÷ ($/second of compute)

where tokens/s is your measured throughput and $/second of compute is your effective cost of running the model per second (e.g., GPU-hour price divided by 3,600).
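As a quick worked sketch, with an assumed throughput and GPU-hour price, the calculation looks like this:

```python
# Tokens per dollar for an assumed deployment: 3,000 tokens/s of measured
# throughput on hardware leased at an assumed $4.00 per GPU-hour.
gpu_hour_price = 4.00                      # $/hour (assumption)
cost_per_second = gpu_hour_price / 3600    # effective $/second of compute

throughput = 3000                          # measured tokens/second (assumption)
tokens_per_dollar = throughput / cost_per_second

print(f"{tokens_per_dollar:,.0f} tokens per dollar")   # 2,700,000
```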
Here are some key factors that determine cost efficiency:
Model size: Larger models, despite generally having better language modeling performance, require much more compute per token, directly impacting cost efficiency.
Model architecture: In dense architectures (traditional LLMs), compute per token grows linearly or superlinearly with model depth and layer size. Mixture-of-experts architectures (newer sparse LLMs) decouple per-token compute from parameter count by activating only select parts of the model during inference—making them arguably more efficient.
Compute cost: TCO varies significantly between public cloud leasing versus private data center construction, depending on system costs and contract terms.
Software stack: Significant optimization opportunities exist here—selecting optimal inference frameworks, distributed inference settings, and kernel optimizations can dramatically improve efficiency. Open source frameworks like vLLM, SGLang, and TensorRT-LLM provide regular efficiency improvements and state-of-the-art features.
Use case requirements: Customer service chat applications typically process fewer than a few hundred tokens per complete request. Deep research or complex code-generation tasks often process tens of thousands of tokens, driving costs significantly higher. This is why services limit daily tokens or restrict deep research tools even for paid plans.
To further refine cost efficiency analysis, it’s practical to separate the compute resources consumed for the input (context) processing phase and the output (decode) generation phase. Each phase can have distinct time, memory, and hardware requirements, affecting overall throughput and efficiency. Measuring cost per token for each phase individually enables targeted optimization—such as kernel tuning for fast context ingestion or memory/cache improvements for efficient generation—making operation cost models more actionable for both engineering and capacity planning.
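Here’s a hedged sketch of that phase-level split, using assumed per-request measurements only to show how differently prefill and decode tokens can be priced:

```python
# Splitting cost between the context (prefill) and generation (decode) phases.
# All measurements here are assumptions chosen only to illustrate the method.
cost_per_second = 4.00 / 3600                  # $/second of compute (assumed)

prefill_tokens, prefill_seconds = 2000, 0.15   # long input, compute bound
decode_tokens, decode_seconds = 300, 3.0       # shorter output, memory bound

prefill_cost_per_token = cost_per_second * prefill_seconds / prefill_tokens
decode_cost_per_token = cost_per_second * decode_seconds / decode_tokens

print(f"prefill: ${prefill_cost_per_token:.1e} per token")   # ~8.3e-08
print(f"decode:  ${decode_cost_per_token:.1e} per token")    # ~1.1e-05
```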
Tokens per Watt: Energy Efficiency
As AI adoption accelerates, grid power has emerged as a chief operational constraint for data centers worldwide. Many facilities now rely on gas-powered generators for near-term reliability, while multigigawatt nuclear projects are underway to meet long-term demand. Power shortages, grid congestion, and energy cost inflation are directly impacting feasibility and profitability, making energy efficiency analysis a critical component of AI economics.
In this environment, tokens per watt-second (TPW) becomes a critical metric for capturing how infrastructure and software convert energy into useful inference outputs. TPW not only shapes TCO but increasingly governs the environmental footprint and growth ceiling for production deployments. Maximizing TPW means more value per joule of energy—making it a key optimizable parameter for achieving scale. We can calculate TPW by dividing measured throughput by measured power draw:

TPW = (tokens/s) ÷ (average power draw in watts)
Let’s consider an ecommerce customer service bot, focusing on its energy consumption during production deployment. Suppose its measured operational behavior is:
Tokens generated per second: 3,000 tokens/s
Average power draw of serving hardware (GPU plus server): 1,000 watts
Total operational time for 10,000 customer requests: 1 hour (3,600 seconds)
At 3,000 tokens/s and 1,000 watts, TPW is 3,000 ÷ 1,000 = 3 tokens per watt-second (3 tokens per joule). Optionally, scale to tokens per kilowatt-hour (kWh) by multiplying by 3.6 million joules/kWh.
In this example, each kWh delivers over 10 million tokens to customers. If we use the national average kWh cost of $0.17/kWh, the energy cost per token is $0.000000017—so even modest efficiency gains through things like algorithmic optimization, model compression, or server cooling upgrades can produce meaningful operational cost savings and improve overall system sustainability.
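For reference, here’s the same arithmetic written out (the $0.000000017 figure above rounds tokens per kWh down to 10 million):

```python
# Tokens per watt for the customer service bot example above.
throughput_tps = 3000        # tokens generated per second
power_watts = 1000           # average power draw of the serving hardware

tpw = throughput_tps / power_watts      # tokens per watt-second (per joule) = 3.0
tokens_per_kwh = tpw * 3_600_000        # 3.6 million joules per kWh = 10,800,000

energy_price = 0.17                     # $/kWh (national average used above)
energy_cost_per_token = energy_price / tokens_per_kwh

print(f"{tpw:.0f} tokens per joule")
print(f"{tokens_per_kwh:,.0f} tokens per kWh")
print(f"${energy_cost_per_token:.1e} per token")   # ~1.6e-08
```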
Power Measurement Considerations
Manufacturers define thermal design power (TDP) as the maximum power limit under load, but actual power draw varies. For energy efficiency analysis, always use measured power draw rather than TDP specifications in TPW calculations. Table 1 below outlines some of the most common methods for measuring power draw.
| Power measurement method | Description | Fidelity to LLM inference |
| --- | --- | --- |
| GPU power draw | Direct GPU power measurement capturing the context and generation phases | Highest: Directly reflects GPU power during inference phases, but still fails to capture the full picture since it omits CPU power for tokenization or KV cache offload. |
| Server-level aggregate power | Total server power, including CPU, GPU, memory, and peripherals | High: Accurate for inference but problematic for virtualized servers with mixed workloads. Useful for cloud service providers’ per-server economic analysis. |
| External power meters | Physical measurement at the rack/PSU level, including infrastructure overhead | Low: Can lead to inaccurate inference-specific energy statistics when mixed workloads (training and inference) are running on the cluster. Useful for broad data center economic analysis. |
Table 1. Comparison of common power measurement methods and their accuracy for LLM inference cost analysis
Power draw should be measured under scenarios close to the P90 of your load distribution. Applications with irregular load require measurement across broad configuration sweeps, particularly those with dynamic model selection or varying sequence lengths.
The context processing component of inference is typically short but compute bound, since highly parallel computations saturate the cores. Output sequence generation is more memory bound but lasts longer (except for single-token classification). Therefore, applications receiving large inputs or entire documents can show significant power draw during the extended context/prefill phase.
Cost per Meaningful Response
While cost per token is useful, cost per meaningful unit of value—cost per summary, translation, research query, or API call—may be more important for business decisions.
Depending on use case, meaningful response costs may include quality- or error-driven “reruns” and pre/postprocessing components like embeddings for retrieval-augmented generation (RAG) and guardrailing LLMs:

cost per meaningful response ≈ (E_t × AA × C_t) + (P_t × C_p)

where:
E_t is the average tokens generated per response, excluding input tokens. For reasoning models, reasoning tokens should be included in this figure.
AA is the average attempts per meaningful response.
C_t is your cost per token (from earlier).
P_t is the average number of pre/postprocessing tokens.
C_p is the cost per pre/postprocessing token, which should be much lower than C_t.
Let’s expand our previous example to consider an ecommerce customer service bot’s cost per meaningful response, with the following measured operational behavior and characteristics:
Average response: 100 reasoning tokens + 50 standard output tokens (150 total)
Average attempts per meaningful response: 1.2
Cost per token: $0.00015
Guardrail processing: 150 tokens at $0.000002 per token
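As a worked sketch, assuming that reruns multiply the generation cost while guardrail processing runs once per delivered response, the numbers above come out to roughly $0.027 per meaningful response:

```python
# Cost per meaningful response, assuming reruns multiply the generation cost
# and guardrail (pre/post) processing runs once per delivered response.
E_t = 150        # average tokens generated per response (reasoning + output)
AA = 1.2         # average attempts per meaningful response
C_t = 0.00015    # cost per generated token ($)
P_t = 150        # pre/postprocessing (guardrail) tokens
C_p = 0.000002   # cost per pre/postprocessing token ($)

cost_per_response = (E_t * AA * C_t) + (P_t * C_p)
print(f"${cost_per_response:.4f} per meaningful response")   # $0.0273
```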
This calculation, combined with other business factors, determines sustainable pricing to optimize service profitability. A similar analysis can be performed to determine power efficiency by replacing the cost-per-token metric with a joules-per-token measure. In the end, each organization must determine which metrics capture bottom-line impact and how to go about optimizing them.
Beyond Token Cost and Power
The tokens per dollar and tokens per watt metrics we’ve analyzed provide the foundational building blocks for AI economics, but production systems operate within far more complex optimization landscapes. Real deployments face scaling trade-offs where diminishing returns, opportunity costs, and utility functions intersect with practical constraints around throughput, demand patterns, and infrastructure capacity. These economic realities extend well beyond simple efficiency calculations.
The true cost structure of AI systems spans multiple interconnected layers—from individual token processing through compute architecture to data center design and deployment strategy. Each architectural choice cascades through the entire economic stack, creating optimization opportunities that pure price-performance metrics cannot reveal. Understanding these layered relationships is essential for building AI systems that remain economically viable as they scale from prototype to production.