Reading view

There are new articles available, click to refresh the page.

Software 2.0 Means Verifiable AI

By: Mike Loukides

9 December 2025 at 07:23

Quantum computing (QC) and AI have one thing in common: They make mistakes.

There are two keys to handling mistakes in QC: We’ve made tremendous progress in error correction in the last year. And QC focuses on problems where generating a solution is extremely difficult, but verifying it is easy. Think about factoring 2048-bit prime numbers (around 600 decimal digits). That’s a problem that would take years on a classical computer, but a quantum computer can solve it quickly—with a significant chance of an incorrect answer. So you have to test the result by multiplying the factors to see if you get the original number. Multiply two 1024-bit numbers? Easy, very easy for a modern classical computer. And if the answer’s wrong, the quantum computer tries again.

One of the problems with AI is that we often shoehorn it into applications where verification is difficult. Tim Bray recently read his AI-generated biography on Grokipedia. There were some big errors, but there were also many subtle errors that no one but him would detect. We’ve all done the same, with one chat service or another, and all had similar results. Worse, some of the sources referenced in the biography purporting to verify claims actually “entirely fail to support the text,”—a well-known problem with LLMs.

Andrej Karpathy recently proposed a definition for Software 2.0 (AI) that places verification at the center. He writes: “In this new programming paradigm then, the new most predictive feature to look at is verifiability. If a task/job is verifiable, then it is optimizable directly or via reinforcement learning, and a neural net can be trained to work extremely well.” This formulation is conceptually similar to quantum computing, though in most cases verification for AI will be much more difficult than verification for quantum computers. The minor facts of Tim Bray’s life are verifiable, but what does that mean? That a verification system has to contact Tim to verify the details before authorizing a bio? Or does it mean that this kind of work should not be done by AI? Although the European Union’s AI Act has laid a foundation for what AI applications should and shouldn’t do, we’ve never had anything that’s easily, well, “computable.” Furthermore: In quantum computing it’s clear that if a machine fails to produce correct output, it’s OK to try again. The same will be true for AI; we already know that all interesting models produce different output if you ask the question again. We shouldn’t underestimate the difficulty of verification, which might prove to be more difficult than training LLMs.

Regardless of the difficulty of verification, Karpathy’s focus on verifiability is a huge step forward. Again from Karpathy: “The more a task/job is verifiable, the more amenable it is to automation…. This is what’s driving the ‘jagged’ frontier of progress in LLMs.”

What differentiates this from Software 1.0 is simple:

Software 1.0 easily automates what you can specify.
Software 2.0 easily automates what you can verify.

That’s the challenge Karpathy lays down for AI developers: determine what is verifiable and how to verify it. Quantum computing gets off easily because we only have a small number of algorithms that solve straightforward problems, like factoring large numbers. Verification for AI won’t be easy, but it will be necessary as we move into the future.

What If? AI in 2026 and Beyond

Oreilly

By: Tim O’Reilly and Mike Loukides

8 December 2025 at 12:58

The market is betting that AI is an unprecedented technology breakthrough, valuing Sam Altman and Jensen Huang like demigods already astride the world. The slow progress of enterprise AI adoption from pilot to production, however, still suggests at least the possibility of a less earthshaking future. Which is right?

At O’Reilly, we don’t believe in predicting the future. But we do believe you can see signs of the future in the present. Every day, news items land, and if you read them with a kind of soft focus, they slowly add up. Trends are vectors with both a magnitude and a direction, and by watching a series of data points light up those vectors, you can see possible futures taking shape.

This is how we’ve always identified topics to cover in our publishing program, our online learning platform, and our conferences. We watch what we call “the alpha geeks“: paying attention to hackers and other early adopters of technology with the conviction that, as William Gibson put it, “The future is here, it’s just not evenly distributed yet.” As a great example of this today, note how the industry hangs on every word from AI pioneer Andrej Karpathy, hacker Simon Willison, and AI-for-business guru Ethan Mollick.

We are also fans of a discipline called scenario planning, which we learned decades ago during a workshop with Lawrence Wilkinson about possible futures for what is now the O’Reilly learning platform. The point of scenario planning is not to predict any future but rather to stretch your imagination in the direction of radically different futures and then to identify “robust strategies” that can survive either outcome. Scenario planners also use a version of our “watching the alpha geeks” methodology. They call it “news from the future.”

Is AI an Economic Singularity or a Normal Technology?

For AI in 2026 and beyond, we see two fundamentally different scenarios that have been competing for attention. Nearly every debate about AI, whether about jobs, about investment, about regulation, or about the shape of the economy to come, is really an argument about which of these scenarios is correct.

Scenario one: AGI is an economic singularity. AI boosters are already backing away from predictions of imminent superintelligent AI leading to a complete break with all human history, but they still envision a fast takeoff of systems capable enough to perform most cognitive work that humans do today. Not perfectly, perhaps, and not in every domain immediately, but well enough, and improving fast enough, that the economic and social consequences will be transformative within this decade. We might call this the economic singularity (to distinguish it from the more complete singularity envisioned by thinkers from John von Neumann, I. J. Good, and Vernor Vinge to Ray Kurzweil).

In this possible future, we aren’t experiencing an ordinary technology cycle. We are experiencing the start of a civilization-level discontinuity. The nature of work changes fundamentally. The question is not which jobs AI will take but which jobs it won’t. Capital’s share of economic output rises dramatically; labor’s share falls. The companies and countries that master this technology first will gain advantages that compound rapidly.

If this scenario is correct, most of the frameworks we use to think about technology adoption are wrong, or at least inadequate. The parallels to previous technology transitions such as electricity, the internet, or mobile are misleading because they suggest gradual diffusion and adaptation. What’s coming will be faster and more disruptive than anything we’ve experienced.

Scenario two: AI is a normal technology. In this scenario, articulated most clearly by Arvind Narayanan and Sayash Kapoor of Princeton, AI is a powerful and important technology but nonetheless subject to all the normal dynamics of adoption, integration, and diminishing returns. Even if we develop true AGI, adoption will still be a slow process. Like previous waves of automation, it will transform some industries, augment many workers, displace some, but most importantly, take decades to fully diffuse through the economy.

In this world, AI faces the same barriers that every enterprise technology faces: integration costs, organizational resistance, regulatory friction, security concerns, training requirements, and the stubborn complexity of real-world workflows. Impressive demos don’t translate smoothly into deployed systems. The ROI is real but incremental. The hype cycle does what hype cycles do: Expectations crash before realistic adoption begins.

If this scenario is correct, the breathless coverage and trillion-dollar valuations are symptoms of a bubble, not harbingers of transformation.

Reading News from the Future

These two scenarios lead to radically different conclusions. If AGI is an economic singularity, then massive infrastructure investment is rational, and companies borrowing hundreds of billions to spend on data centers to be used by companies that haven’t yet found a viable economic model are making prudent bets. If AI is a normal technology, that spending looks like the fiber-optic overbuild of 1999. It’s capital that will largely be written off.

If AGI is an economic singularity, then workers in knowledge professions should be preparing for fundamental career transitions; firms should be thinking how to radically rethink their products, services, and business models; and societies should be planning for disruptions to employment, taxation, and social structure that dwarf anything in living memory.

If AI is normal technology, then workers should be learning to use new tools (as they always have), but the breathless displacement predictions will join the long list of automation anxieties that never quite materialized.

So, which scenario is correct? We don’t know yet, or even if this face-off is the right framing of possible futures, but we do know that a year or two from now, we will tell ourselves that the answer was right there, in plain sight. How could we not have seen it? We weren’t reading the news from the future.

Some news is hard to miss: The change in tone of reporting in the financial markets, and perhaps more importantly, the change in tone from Sam Altman and Dario Amodei. If you follow tech closely, it’s also hard to miss news of real technical breakthroughs, and if you’re involved in the software industry, as we are, it’s hard to miss the real advances in programming tools and practices. There’s also an area that we’re particularly interested in, one which we think tells us a great deal about the future, and that is market structure, so we’re going to start there.

The Market Structure of AI

The economic singularity scenario has been framed as a winner-takes-all race for AGI that creates a massive concentration of power and wealth. The normal technology scenario suggests much more of a rising tide, where the technology platforms become dominant precisely because they create so much value for everyone else. Winners emerge over time rather than with a big bang.

Quite frankly, we have one big signal that we’re watching here: Does OpenAI, Anthropic, or Google first achieve product-market fit? By product-market fit we don’t just mean that users love the product or that one company has dominant market share but that a company has found a viable economic model, where what people are willing to pay for AI-based services is greater than the cost of delivering them.

OpenAI appears to be trying to blitzscale its way to AGI, building out capacity far in excess of the company’s ability to pay for it. This is a massive one-way bet on the economic singularity scenario, which makes ordinary economics irrelevant. Sam Altman has even said that he has no idea what his business will be post-AI or what the economy will look like. So far, investors have been buying it, but doubts are beginning to shape their decisions.

Anthropic is clearly in pursuit of product-market fit, and its success in one target market, software development, is leading the company on a shorter and more plausible path to profitability. Anthropic leaders talk AGI and economic singularity, but they walk the walk of a normal technology believer. The fact that Anthropic is likely to beat OpenAI to an IPO is a very strong normal technology signal. It’s also a good example of what scenario planners view as a robust strategy, good in either scenario.

Google gives us a different take on normal technology: an incumbent looking to balance its existing business model with advances in AI. In Google’s normal technology vision, AI disappears “into the walls” like networks did. Right now, Google is still foregrounding AI with AI overviews and NotebookLM, but it’s in a position to make it recede into the background of its entire suite of products, from Search and Google Cloud to Android and Google Docs. It has too much at stake in the current economy to believe that the route to the future consists in blowing it all up. That being said, Google also has the resources to place big bets on new markets with clear economic potential, like self-driving cars, drug discovery, and even data centers in space. It’s even competing with Nvidia, not just with OpenAI and Anthropic. This is also a robust strategy.

What to watch for: What tech stack are developers and entrepreneurs building on?

Right now, Anthropic’s Claude appears to be winning that race, though that could change quickly. Developers are increasingly not locked into a proprietary stack but are easily switching based on cost or capability differences. Open standards such as MCP are gaining traction.

On the consumer side, Google Gemini is gaining on ChatGPT in terms of daily active users, and investors are starting to question OpenAI’s lack of a plausible business model to support its planned investments.

These developments suggest that the key idea behind the massive investment driving AI boom, that one winner gets all the advantages, just doesn’t hold up.

Capability Trajectories

The economic singularity scenario depends on capabilities continuing to improve rapidly. The normal technology scenario is comfortable with limits rather than hyperscaled discontinuity. There is already so much to digest!

On the economic singularity side of the ledger, positive signs would include a capability jump that surprises even insiders, such as Yann LeCun’s objections being overcome. That is, AI systems demonstrably have world models, can reason about physics and causality, and aren’t just sophisticated pattern matchers. Another game changer would be a robotics breakthrough: embodied AI that can navigate novel physical environments and perform useful manipulation tasks.

Evidence that AI is normal technology include AI systems that are good enough to be useful but not good enough to be trusted, continuing to require human oversight that limits productivity gains; prompt injection and security vulnerabilities remain unsolved, constraining what agents can be trusted to do; domain complexity continues to defeat generalization, and what works in coding doesn’t transfer to medicine, law, science; regulatory and liability barriers prove high enough to slow adoption regardless of capability; and professional guilds successfully protect their territory. These problems may be solved over time, but they don’t just disappear with a new model release.

Regard benchmark performance with skepticism, since benchmarks are even more likely to be gamed when investors are losing enthusiasm than they are now, while everyone is still afraid of missing out.

Reports from practitioners actually deploying AI systems are far more important. Right now, tactical progress is strong. We see software developers in particular making profound changes in development workflows. Watch for whether they are seeing continued improvement or a plateau. Is the gap between demo and production narrowing or persisting? How much human oversight do deployed systems require? Listen carefully to reports from practitioners about what AI can actually do in their domain versus what it’s hyped to do.

We are not persuaded by surveys of corporate attitudes. Having lived through the realities of internet and open source software adoption, we know that, like Hemingway’s marvelous metaphor of bankruptcy, corporate adoption happens gradually, then suddenly, with late adopters often full of regret.

If AI is achieving general intelligence, though, we should see it succeed across multiple domains, not just the ones where it has obvious advantages. Coding has been the breakout application, but coding is in some ways the ideal domain for current AI. It’s characterized by well-defined problems, immediate feedback loops, formally defined languages, and massive training data. The real test is whether AI can break through in domains that are harder and farther away from the expertise of the people developing the AI models.

What to watch for: Real-world constraints start to bite. For example, what if there is not enough power to train or run the next generation of models at the scale company ambitions require? What if capital for the AI build-out dries up?

Our bet is that various real-world constraints will become more clearly recognized as limits to the adoption of AI, despite continued technical advances.

Bubble or Bust?

It’s hard not to notice how the narrative in the financial press has shifted in the past few months, from mindless acceptance of industry narratives to a growing consensus that we are in the throes of a massive investment bubble, with the chief question on everyone’s mind seeming to be when and how it will pop.

The current moment does bear uncomfortable similarities to previous technology bubbles. Famed short investor Michael Burry is comparing Nvidia to Cisco and warning of a worse crash than the dot-com bust of 2000. The circular nature of AI investment—in which Nvidia invests in OpenAI, which buys Nvidia chips; Microsoft invests in OpenAI, which pays Microsoft for Azure; and OpenAI commits to massive data center build-outs with little evidence that it will ever have enough profit to justify those commitments—has reached levels that would be comical if the numbers weren’t so large.

But there’s a counterargument: Every transformative infrastructure build-out begins with a bubble. The railroads of the 1840s, the electrical grid of the 1900s, the fiber-optic networks of the 1990s all involved speculative excess, but all left behind infrastructure that powered decades of subsequent growth. One question is whether AI infrastructure is like the dot-com bubble (which left behind useful fiber and data centers) or the housing bubble (which left behind empty subdivisions and a financial crisis).

The real question when faced with a bubble is What will be the source of value in what is left? It most likely won’t be in the AI chips, which have a short useful life. It may not even be in the data centers themselves. It may be in a new approach to programming that unlocks entirely new classes of applications. But one pretty good bet is that there will be enduring value in the energy infrastructure build-out. Given the Trump administration’s war on renewable energy, the market demand for energy in the AI build-out may be its saving grace. A future of abundant, cheap energy rather than the current fight for access that drives up prices for consumers could be a very nice outcome.

Signs pointing toward economic singularity: Widespread job losses across multiple industries and spiking business bankruptcy rate; storied companies are wiped out by major new applications that just couldn’t exist without AI; sustained high utilization of AI infrastructure (data centers, GPU clusters) over multiple years; actual demand meets or exceeds capacity; continued spiking of energy prices, especially in areas with many data centers.

Signs pointing toward bubble: Continued reliance on circular financing structures (vendor financing, equity swaps between AI companies); enterprise AI projects stall in the pilot phase, failing to scale; a “show me the money” moment arrives, where investors demand profitability and AI companies can’t deliver.

Signs pointing towards normal technology recovery postbubble: Strong revenue growth at AI application companies, not just infrastructure providers; enterprises report concrete, measurable ROI from AI deployments.

What to watch: There are so many possibilities that this is an act of imagination! Start with Wile E. Coyote running over a cliff in pursuit of Road Runner in the classic Warner Bros. cartoons. Imagine the moment when investors realize that they are trying to defy gravity.

Going over a cliff — Image generated with Gemini and Nano Banana Pro

What made them notice? Was it the failure of a much-hyped data center project? Was it that it couldn’t get financing, that it couldn’t get completed because of regulatory constraints, that it couldn’t get enough chips, that it couldn’t get enough power, that it couldn’t get enough customers?

Imagine one or more storied AI lab or startup unable to complete its next fundraise. Imagine Oracle or SoftBank trying to get out of a big capital commitment. Imagine Nvidia announcing a revenue miss. Imagine another DeepSeek moment coming out of China.

Our bet for the most likely prick to pop the bubble is that Anthropic and Google’s success against OpenAI persuades investors that OpenAI will not be able to pay for the massive amount of data center capacity it has contracted for. Given the company’s centrality to the AGI singularity narrative, a failure of belief in OpenAI could bring down the whole web of interconnected data center bets, many of them financed by debt. But that’s not the only possibility.

Always Update Your Priors

DeepSeek’s emergence in January was a signal that the American AI establishment may not have the commanding lead it assumed. Rather than racing for AGI, China seems to be heavily betting on normal technology, building towards low-cost, efficient AI, industrial capacity, and clear markets. While claims about what DeepSeek spent on training its V3 model have been contested, training isn’t the only cost: There’s also the cost of inference and, for increasingly popular reasoning models, the cost of reasoning. And when these are taken into account, DeepSeek is very much a leader.

If DeepSeek and other Chinese AI labs are right, the US may be intent on winning the wrong race. What’s more, our conversations with Chinese AI investors reveals a much heavier tilt towards embodied AI (robotics and all its cousins) than towards consumer or even enterprise applications. Given the geopolitical tensions between China and the US, it’s worth asking what kind of advantage a GPT-9 with limited access to the real world might provide against an army of drones and robots powered by the equivalent of GPT-8!

The point is that the discussion above is meant to be provocative, not exhaustive. Expand your horizons. Think about how US and international politics, advances in other technologies, and financial market impacts ranging from a massive market collapse to a simple change in investor priorities might change industry dynamics.

What you’re watching for is not any single data point but the pattern across multiple vectors over time. Remember that the AGI versus normal technology framing is not the only or maybe even the most useful way to look at the future.

The most likely outcome, even restricted to these two hypothetical scenarios, is something in between. AI may achieve something like AGI for coding, text, and video while remaining a normal technology for embodied tasks and complex reasoning. It may transform some industries rapidly while others resist for decades. The world is rarely as neat as any scenario.

But that’s precisely why the “news from the future” approach matters. Rather than committing to a single prediction, you stay alert to the signals, ready to update your thinking as evidence accumulates. You don’t need to know which scenario is correct today. You need to recognize which scenario is becoming correct as it happens.

AI in 2026 and Beyond infographic — Infographic created with Gemini and Nano Banana Pro

What If? Robust Strategies in the Face of Uncertainty

The second part of scenario planning is to identify robust strategies that will help you do well regardless of which possible future unfolds. In this final section, as a way of making clear what we mean by that, we’ll consider 10 “What if?” questions and ask what the robust strategies might be.

1. What if the AI bubble bursts in 2026?

The vector: We are seeing massive funding rounds for AI foundries and massive capital expenditure on GPUs and data centers without a corresponding explosion in revenue for the application layer.

The scenario: The “revenue gap” becomes undeniable. Wall Street loses patience. Valuations for foundational model companies collapse and the river of cheap venture capital dries up.

In this scenario, we would see responses like OpenAI’s “Code Red” reaction to improvements in competing products. We would see declines in prices for stocks that aren’t yet traded publicly. And we might see signs that the massive fundraising for data centers and power are performative, not backed by real capital. In the words of one commenter, they are “bragawatts.”

A robust strategy: Don’t build a business model that relies on subsidized intelligence. If your margins only work because VC money is paying for 40% of your inference costs, you are vulnerable. Focus on unit economics. Build products where the AI adds value that customers are willing to pay for now, not in a theoretical future where AI does everything. If the bubble bursts, infrastructure will remain, just as the dark fiber did, becoming cheaper for the survivors to use.

2. What if energy becomes the hard limit?

The vector: Data centers are already stressing grids. We are seeing a shift from the AI equivalent of Moore’s law to a world where progress may be limited by energy constraints.

The scenario: In 2026, we hit a wall. Utilities simply cannot provision power fast enough. Inference becomes a scarce resource, available only to the highest bidders or those with private nuclear reactors. Highly touted data center projects are put on hold because there isn’t enough power to run them, and rapidly depreciating GPUs are put in storage because there aren’t enough data centers to deploy them.

A robust strategy: Efficiency is your hedge. Stop treating compute as infinite. Invest in small language models (SLMs) and edge AI that run locally. If you can run 80% of your workload on a laptop-grade chip rather than an H100 in the cloud, you are at least partially insulated from the energy crunch.

3. What if inference becomes a commodity?

The vector: Chinese labs continue to release open weight models with performance comparable to each previous generation of top-of-the line US frontier models but at a fraction of the training and inference cost. What’s more, they are training them with lower-cost chips. And it appears to be working.

The scenario: The price of “intelligence” collapses to near zero. The moat of having the biggest model and the best cutting-edge chips for training evaporates.

A robust strategy: Move up the stack. If the model is a commodity, the value is in the integration, the data, and the workflow. Build applications and services using the unique data, context, and workflows that no one else has.

4. What if Yann LeCun is right?

The vector: LeCun has long argued that auto-regressive LLMs are an “off-ramp” on the highway to AGI because they can’t reason or plan; they only predict the next token. He bets on world models (JEPA). OpenAI cofounder Ilya Sutskever has also argued that the AI industry needs fundamental research to solve basic problems like the ability to generalize.

The scenario: In 2026, LLMs hit a plateau. The market realizes we’ve spent billions on a dead end technology for true AGI.

A robust strategy: Diversify your architecture. Don’t bet the farm on today’s AI. Focus on compound AI systems that use LLMs as just one component, while relying on deterministic code, databases, and small, specialized models for additional capabilities. Keep your eyes and your options open.

5. What if there is a major security incident?

The vector: We are currently hooking insecure LLMs up to banking APIs, email, and purchasing agents. Security researchers have been screaming about indirect prompt injection for years.

The scenario: A worm spreads through email auto-replies, tricking AI agents into transferring funds or approving fraudulent invoices at scale. Trust in agentic AI collapses.

A robust strategy: “Trust but verify” is dead; use “verify then trust.” Implement well-known security practices like least privilege (restrict your agents to the minimal list of resources they need) and zero trust (require authentication before every action). Stay on top of OWASP’s lists of AI vulnerabilities and mitigations. Keep a “human in the loop” for high-stakes actions. Advocate for and adopt standard AI disclosure and audit trails. If you can’t trace why your agent did something, you shouldn’t let it handle money.

6. What if China is actually ahead?

The vector: While the US focuses on raw scale and chip export bans, China is focusing on efficiency and embedded AI in manufacturing, EVs, and consumer hardware.

The scenario: We discover that 2026’s “iPhone moment” comes from Shenzhen, not Cupertino, because Chinese companies integrated AI into hardware better while we were fighting over chatbot and agentic AI dominance.

A robust strategy: Look globally. Don’t let geopolitical narratives blind you to technical innovation. If the best open source models or efficiency techniques are coming from China, study them. Open source has always been the best way to bridge geopolitical divides. Keep your stack compatible with the global ecosystem, not just the US silo.

7. What if robotics has its “ChatGPT moment”?

The vector: End-to-end learning for robots is advancing rapidly.

The scenario: Suddenly, physical labor automation becomes as possible as digital automation.

A robust strategy: If you are in a “bits” business, ask how you can bridge to “atoms.” Can your software control a machine? How might you embody useful intelligence into your products?

8. What if vibe coding is just the start?

The vector: Anthropic and Cursor are changing programming from writing syntax to managing logic and workflow. Vibe coding lets nonprogrammers build apps by just describing what they want.

The scenario: The barrier to entry for software creation drops to zero. We see a Cambrian explosion of apps built for a single meeting or a single family vacation. Alex Komoroske calls it disposable software: “Less like canned vegetables and more like a personal farmer’s market.”

A robust strategy: In a world where AI is good enough to generate whatever code we ask for, value shifts to knowing what to ask for. Coding is much like writing: Anyone can do it, but some people have more to say than others. Programming isn’t just about writing code; it’s about understanding problems, contexts, organizations, and even organizational politics to come up with a solution. Create systems and tools that embody unique knowledge and context that others can use to solve their own problems.

9. What if AI kills the aggregator business model?

The vector: Amazon and Google make money by being the tollbooth between you and the product or information you want. If people get answers from AI, or an AI agent buys for you, it bypasses the ads and the sponsored listings, undermining the business model of internet incumbents.

The scenario: Search traffic (and ad revenue) plummets. Brands lose their ability to influence consumers via display ads. AI has destroyed the source of internet monetization and hasn’t yet figured out what will take its place.

A robust strategy: Own the customer relationship directly. If Google stops sending you traffic, you need an MCP, an API, or a channel for direct brand loyalty that an AI agent respects. Make sure your information is accessible to bots, not just humans. Optimize for agent readability and reuse.

10. What if a political backlash arrives?

The vector: The divide between the AI rich and those who fear being replaced by AI is growing.

The scenario: A populist movement targets Big Tech and AI automation. We see taxes on compute, robot taxes, or strict liability laws for AI errors.

A robust strategy: Focus on value creation, not value capture. If your AI strategy is “fire 50% of the support staff,” you are not only making a shortsighted business decision; you are painting a target on your back. If your strategy is “supercharge our staff to do things we couldn’t do before,” you are building a defensible future. Align your success with the success of both your workers and customers.

In Conclusion

The future isn’t something that happens to us; it’s something we create. The most robust strategy of all is to stop asking “What will happen?” and start asking “What future do we want to build?”

As Alan Kay once said, “The best way to predict the future is to invent it.” Don’t wait for the AI future to happen to you. Do what you can to shape it. Build the future you want to live in.

Radar Trends to Watch: December 2025

Oreilly

By: Mike Loukides

2 December 2025 at 07:15

November ended. Thanksgiving (in the US), turkey, and a train of model announcements. The announcements were exciting: Google’s Gemini 3 puts it in the lead among large language models, at least for the time being. Nano Banana Pro is a spectacularly good text-to-image model. OpenAI has released its heavy hitters, GPT-5.1-Codex-Max and GPT-5.1 Pro. And the Allen Institute released its latest open source model, Olmo 3, the leading open source model from the US.

Since Trends avoids deal-making (should we?), we’ve also avoided the angst around an AI bubble and its implosion. Right now, it’s safe to say that the bubble is formed of money that hasn’t yet been invested, let alone spent. If it is a bubble, it’s in the future. Do promises and wishes make a bubble? Does a bubble made of promises and wishes pop with a bang or a pffft?

AI

Now that Google and OpenAI have laid down their cards, Anthropic has released its latest heavyweight model: Opus 4.5. They’ve also dropped the price significantly.
The Allen Institute has launched its latest open source model, Olmo 3. The institute’s opened up the whole development process to allow other teams to understand its work.
Not to be outdone, Google has introduced Nano Banana Pro (aka Gemini 3 Pro Image), its state-of-the-art image generation model. Nano Banana’s biggest feature is the ability to edit images to change the appearance of items without redrawing them from scratch. And according to Simon WIllison, it watermarks the parts of an image it generates with SynthID.
OpenAI has released two more components of GPT-5.1, GPT-5.1-Codex-Max (API) and GPT-5.1 Pro (ChatGPT). This release brings the company’s most powerful models for generative work into view.
A group of quantum physicists claim to have reduced the size of the DeepSeek model by half, and to have removed Chinese censorship. The model can now tell you what happened in Tiananmen Square, explain what Pooh looked like, and answer other forbidden questions.
The release train for Gemini 3 has begun, and the commentariat quickly crowned it king of the LLMs. It includes the ability to spin up a web interface so users can give it more information about their questions, and to generate diagrams along with text output.
As part of the Gemini 3 release, Google has also announced a new agentic IDE called Antigravity.
Google has released a new weather forecasting model, WeatherNext 2, that can forecast with resolutions up to 1 hour. The data is available through Earth Engine and BigQuery, for those who would like to do their own forecasting. There’s also an early access program on Vertex AI.
Grok 4.1 has been released, with reports that it is currently the best model at generative prose, including creative writing. Be that as it may, we don’t see why anyone would use an AI that has been trained to reflect Elon Musk’s thoughts and values. If AI has taught us one thing, it’s that we need to think for ourselves.
AI demands the creation of new data centers and new energy sources. States want to ensure that those power plants are built, and built in ways that don’t pass costs on to consumers.
Grokipedia uses questionable sources. Is anyone surprised? How else would you train an AI on the latest conspiracy theories?
AMD GPUs are competitive, but they’re hampered because there are few libraries for low-level operations. To solve this problem, Chris Ré and others have announced HipKittens, a library of programming primitive operations for AMD GPUs.
OpenAI has released GPT-5.1. The two new models are Instant, which is tuned to be more conversational and “human,” and Thinking, a reasoning model that now adapts the time it takes to “think” to the difficulty of the questions.
Large language models, including GPT-5 and the Chinese models, show bias against users who use a German dialect rather than standard German. The bias appeared to be greater as the model size increased. These results also apply to languages like English.
Ethan Mollick on evaluating (ultimately, interviewing) your AI models is a must-read.
Yann LeCun is leaving Facebook to launch a new startup that will develop his ideas about building AI.
Harbor is a new tool that simplifies benchmarking frameworks and models. It’s from the developers of the Terminal-Bench benchmark. And it brings us a step closer to a world where people build their own specialized AI rather than rely on large providers.
Music rights holders are beginning to make deals with Udio (and presumably other companies) that train their models on existing music. Unfortunately, this doesn’t solve the bigger problem: Music is a “collectively produced shared cultural good, sustained by human labor. Copyright isn’t suited to protecting this kind of shared value,” as professors Oliver Bown and Kathy Bowrey have argued.
Moonshot AI has finally released Kimi K2 Thinking, the first open weights model to have benchmark results competitive with—or exceeding—the best closed weights models. It’s designed to be used as an agent, calling external tools as needed to solve problems.
Tongyi DeepResearch is a new fully open source agent for doing research. Its results are comparable to OpenAI deep research, Claude Sonnet 4, and similar models. Tongyi is part of Alibaba; it’s yet another important model to come out of China.
Data centers in space? It’s an interesting and challenging idea. Cooling is a much bigger problem than you’d expect. They would require massive arrays of solar cells for power. But some people think it might happen.
MiniMax M2 is a new open weights model that focuses on building agents. It has performance similar to Claude Sonnet but at a much lower price point. It also embeds its thought processes between <think> and </think> tags, which is an important step toward interpretability.
DeepSeek has introduced a new model for OCR with some very interesting properties: It has a new process for storing and retrieving memories that also makes the model significantly more efficient.
Agent Lightning provides a code-free way to train agents using reinforcement learning.

Programming

The Zig programming language has published a book. Online, of course.
Google is weakening its controversial new rules about developer verification. The company plans to create a separate class for applications with limited distribution, and develop a flow that will allow the installation of unverified apps.
Google’s LiteRT is a library for running AI models in browsers and small devices. LiteRT supports Android, iOS, embedded Linux, and microcontrollers. Supported languages include Java, Kotlin, Swift, Embedded C, and C++.
Does AI-assisted coding mean the end of new languages? Simon Willison thinks that LLMs can encourage the development of new programming languages. Design your language and ship it with a Claude Skills-style document; that should be enough for an LLM to learn how to use it.
Deepnote, a successor to the Jupyter Notebook, is a next-generation notebook for data analytics that’s built for teams. There’s now a shared workspace; different blocks can use different languages; and AI integration is on the road map. It’s now open source.
The idea of assigning colors (red, blue) to tools may be helpful in limiting the risk of prompt injection when building agents. What tools can return something damaging? This sounds like a step towards the application of the “least privilege” principle to AI design.

Security

We’re making the same mistake with AI security as we made with cloud security (and security in general): treating security as an afterthought.
Anthropic claims to have disrupted a Chinese cyberespionage group that was using Claude to generate attacks against other systems. Anthropic claims that the attack was 90% automated, though that claim is controversial.
Don’t become a victim. Data collected for online age verification makes your site a target for attackers. That data is valuable, and they know it.
A research collaboration uses data poisoning and AI to disrupt deepfake images. Users use Silverer to process their images before posting. The tool makes invisible changes to the original image that confuse AIs creating new images, leading to unusable distortions.
Is it a surprise that AI is being used to generate fake receipts and expense reports? After all, it’s used to fake just about everything else. It was inevitable that enterprise applications of AI fakery would appear.
HydraPWK2 is a Linux distribution designed for penetration testing. It’s based on Debian and is supposedly easier to use than Kali Linux.
How secure is your trusted execution environment (TEE)? All of the major hardware vendors are vulnerable to a number of physical attacks against “secure enclaves.” And their terms of service often exclude physical attacks.
Atroposia is a new malware-as-a-service package that includes a local vulnerability scanner. Once an attacker has broken into a site, they can find other ways to remain there.
A new kind of phishing attack (CoPhishing) uses Microsoft Copilot Studio agents to steal credentials by abusing the Sign In topic. Microsoft has promised an update that will defend against this attack.

Operations

Here’s how to install Open Notebook, an open source equivalent to NotebookLM, to run on your own hardware. It uses Docker and Ollama to run the notebook and the model locally, so data never leaves your system.
Open source isn’t “free as in beer.” Nor is it “free as in freedom.” It’s “free as in puppies.” For better or for worse, that just about says it.
Need a framework for building proxies? Cloudflare’s next generation Oxy framework might be what you need. (Whatever you think of their recent misadventure.)
MIT Media Labs’ Project NANDA intends to build infrastructure for a decentralized network of AI agents. They describe it as a global decentralized registry (not unlike DNS) that can be used to discover and authenticate agents using MCP and A2A. Isn’t this what we wanted from the internet in the first place?

Web

The spread of misinformation can be reduced, though not stopped, by adding a small amount of friction to the sharing process.
Netflix has released an excellent guide to using generative AI to create content.
Luke Wroblewski suggests a new model for designing AI chat sessions. A simple chat isn’t as simple as it seems; particularly with reasoning models, it can become cluttered to the point of uselessness. This new design addresses those problems.

Things

You need the Slightly Annoying Rubik’s Cube Automatic Solving Machine aka S.A.R.C.A.S.M. And you can make your own.
Tracking butterfly migration: Can you make a sensor with associated telemetry small enough to fit on a butterfly tag? Yes, you can!

The Other 80%: What Productivity Really Means

Oreilly

By: Mike Loukides

11 November 2025 at 07:10

We’ve been bombarded with claims about how much generative AI improves software developer productivity: It turns regular programmers into 10x programmers, and 10x programmers into 100x. And even more recently, we’ve been (somewhat less, but still) bombarded with the other side of the story: METR reports that, despite software developers’ belief that their productivity has increased, total end-to-end throughput has declined with AI assistance. We also saw hints of that in last year’s DORA report, which showed that release cadence actually slowed slightly when AI came into the picture. This year’s report reverses that trend.

I want to get a couple of assumptions out of the way first:

I don’t believe in 10x programmers. I’ve known people who thought they were 10x programmers, but their primary skill was convincing other team members that the rest of the team was responsible for their bugs. 2x, 3x? That’s real. We aren’t all the same, and our skills vary. But 10x? No.
There are a lot of methodological problems with the METR report—they’ve been widely discussed. I don’t believe that means we can ignore their result; end-to-end throughput on a software product is very difficult to measure.

As I (and many others) have written, actually writing code is only about 20% of a software developer’s job. So if you optimize that away completely—perfect secure code, first time—you only achieve a 20% speedup. (Yeah, I know, it’s unclear whether or not “debugging” is included in that 20%. Omitting it is nonsense—but if you assume that debugging adds another 10%–20% and recognize that that generates plenty of its own bugs, you’re back in the same place.) That’s a consequence of Amdahl’s law, if you want a fancy name, but it’s really just simple arithmetic.

Amdahl’s law becomes a lot more interesting if you look at the other side of performance. I worked at a high-performance computing startup in the late 1980s that did exactly this: It tried to optimize the 80% of a program that wasn’t easily vectorizable. And while Multiflow Computer failed in 1990, our very-long-instruction-word (VLIW) architecture was the basis for many of the high-performance chips that came afterward: chips that could execute many instructions per cycle, with reordered execution flows and branch prediction (speculative execution) for commonly used paths.

I want to apply the same kind of thinking to software development in the age of AI. Code generation seems like low-hanging fruit, though the voices of AI skeptics are rising. But what about the other 80%? What can AI do to optimize the rest of the job? That’s where the opportunity really lies.

Angie Jones’s talk at AI Codecon: Coding for the Agentic World takes exactly this approach. Angie notes that code generation isn’t changing how quickly we ship because it only takes in one part of the software development lifecycle (SDLC), not the whole. That “other 80%” involves writing documentation, handling pull requests (PRs), and the continuous integration pipeline (CI). In addition, she realizes that code generation is a one-person job (maybe two, if you’re pairing); coding is essentially solo work. Getting AI to assist the rest of the SDLC requires involving the rest of the team. In this context, she states the 1/9/90 rule: 1% are leaders who will experiment aggressively with AI and build new tools; 9% are early adopters; and 90% are “wait and see.” If AI is going to speed up releases, the 90% will need to adopt it; if it’s only the 1%, a PR here and there will be managed faster, but there won’t be substantial changes.

Angie takes the next step: She spends the rest of the talk going into some of the tools she and her team have built to take AI out of the IDE and into the rest of the process. I won’t spoil her talk, but she discusses three stages of readiness for the AI:

AI-curious: The agent is discoverable, can answer questions, but can’t modify anything.
AI-ready: The AI is starting to make contributions, but they’re only suggestions.
AI-embedded: The AI is fully plugged into the system, another member of the team.

This progression lets team members check AI out and gradually build confidence—as the AI developers themselves build confidence in what they can allow the AI to do.

Do Angie’s ideas take us all the way? Is this what we need to see significant increases in shipping velocity? It’s a very good start, but there’s another issue that’s even bigger. A company isn’t just a set of software development teams. It includes sales, marketing, finance, manufacturing, the rest of IT, and a lot more. There’s an old saying that you can’t move faster than the company. Speed up one function, like software development, without speeding up the rest and you haven’t accomplished much. A product that marketing isn’t ready to sell or that the sales group doesn’t yet understand doesn’t help.

That’s the next question we have to answer. We haven’t yet sped up real end-to-end software development, but we can. Can we speed up the rest of the company? METR’s report claimed that 95% of AI products failed. They theorized that it was in part because most projects targeted customer service, but the backend office work was more amenable to AI in its current form. That’s true—but there’s still the issue of “the rest.” Does it make sense to use AI to generate business plans, manage supply change, and the like if all it will do is reveal the next bottleneck?

Of course it does. This may be the best way of finding out where the bottlenecks are: in practice, when they become bottlenecks. There’s a reason Donald Knuth said that premature optimization is the root of all evil—and that doesn’t apply only to software development. If we really want to see improvements in productivity through AI, we have to look company-wide.

Radar Trends to Watch: November 2025

Oreilly

By: Mike Loukides

4 November 2025 at 07:02

AI has so thoroughly colonized every technical discipline that it’s becoming hard to organize items of interest in Radar Trends. Should a story go under AI or programming (or operations or biology or whatever the case may be)? Maybe it’s time to go back to a large language model that doesn’t require any electricity and has over 217K parameters: Merriam-Webster. But no matter where these items ultimately appear, it’s good to see practical applications of AI in fields as diverse as bioengineering and UX design.

AI

Alibaba’s Ling-1T may be the best model you’ve never heard of. It’s a nonthinking mixture-of-experts model with 1T parameters, 50B active at any time. And it’s open weights (MIT license).
Marin is a new lab for creating fully open source models. They say that the development of models will be completely transparent from the beginning. Everything is tracked by GitHub; all experiments may be observed by anyone; there’s no cherrypicking of results.
WebMCP is a proposal and an implementation for a protocol that allows websites to become MCP servers. As servers, they can interact directly with agents and LLMs.
Claude has announced Agent Skills. Skills are essentially just a Markdown file describing how to perform a task, possibly accompanied by scripts and resources. They’re easy to add and only used as needed. A Skill-creator Skill makes it very easy to build Skills. Simon Willison thinks that Skills may be a “bigger deal than MCP.”
Pete Warden describes his work on the smallest of AI. Small AI serves an important set of applications without compromising privacy or requiring enormous resources.
Anthropic has released Claude Haiku 4.5, skipping 4.0 and 4.1 in the process. Haiku is their smallest and fastest model. The new release claims performance similar to Sonnet 4, but it’s much faster and less expensive.
NVIDIA is now offering the DGX Spark, a desktop AI supercomputer. It offers 1 petaflop performance on models with up to 200B parameters. Simon Willison has a review of a preview unit.
Andrej Karpathy has released nanochat, a small ChatGPT-like model that’s completely open and can be trained for roughly $100. It’s intended for experimenters, and Karpathy has detailed instructions on building and training.
There’s an agent-shell for Emacs? There had to be one. Emacs abhors a vacuum.
Anthropic launched “plugins,” which give developers the ability to write extensions to Claude Code. Of course, these extensions can be agents. Simon Willison points to Jesse Vincent’s Superpowers as a glimpse of what plugins can accomplish.
Google has released the Gemini 2.5 Computer Use model into public preview. While the thrill of teaching computers to click browsers and other web applications faded quickly, Gemini 2.5 Computer Use appears to be generating excitement.
Thinking Machines Labs has announced Tinker, an API for training open weight language models. Tinker runs on Thinking Machines’ infrastructure. It’s currently in beta.
Merriam-Webster will release its newest large language model on November 18. It has no data centers and requires no electricity.
We know that the data products, including AI, reflect historical biases in their training data. In India, OpenAI reflects caste biases. But it’s not just OpenAI; these biases appear in all models. Although caste bias was outlawed in the middle of the 20th century, these biases live on in the data.
DeepSeek has released an experimental version of its reasoning model, DeepSeek-V3.2-Exp. This model uses a technique called sparse attention to reduce the processing requirements (and cost) of the reasoning process.
OpenAI has added an Instant Checkout feature that allows users to make purchases with Etsy and Shopify merchants, taking them directly to checkout after finding their products. It’s based on the Agentic Commerce Protocol.
OpenAI’s GDPval tests go beyond existing benchmarks by challenging LLMs with real-world tasks rather than simple problems. The tasks were selected from 44 industries and were chosen for economic value.

Programming

Steve Yegge’s Beads is a memory management system for coding agents. It’s badly needed, and worth checking out.
Do you use coding agents in parallel? Simon Willison was a skeptic, but he’s gradually becoming convinced it’s a good practice.
One problem with generative coding is that AI is trained on “the worst code in the world.” For web development, we’ll need better foundations to get to a post–frontend-framework world.
If you’ve wanted to program with Claude from your phone or some other device, now you can. Anthropic has added web and mobile interfaces to Claude Code, along with a sandbox for running generated code safely.
You may have read “Programming with Nothing,” a classic article that strips programming to the basics of lambda calculus. “Programming with Less Than Nothing” does FizzBuzz in many lines of combinatory logic.
What’s the difference between technical debt and architectural debt? Don’t confuse them; they’re significantly different problems, with different solutions.
For graph fans: The IRS has released its fact graph, which, among other things, models the US Internal Revenue Code. It can be used with JavaScript and any JVM language.
What is spec-driven development? It has become one of the key buzzwords in the discussion of AI-assisted software development. Birgitta Böckeler attempts to define SDD precisely, then looks at three tools for aiding SDD.
IEEE Spectrum released its 2025 programming languages rankings. Python is still king, with Java second; JavaScript has fallen from third to fifth. But more important, Spectrum wonders whether AI-assisted programming will make these rankings irrelevant.

Web

Cloudflare CEO Matthew Prince is pushing for regulation to prevent Google from tying web crawlers for search and for training content together. You can’t block the training crawler without also blocking the search crawler, and blocking the latter has significant consequences for businesses.
OpenAI has released Atlas, its Chromium-based web browser. As you’d expect, AI is integrated into everything. You can chat with the browser, interrogate your history, your settings, or your bookmarks, and (of course) chat with the pages you’re viewing.
Try again? Apple has announced a second-generation Vision Pro, with a similar design and at the same price point.
Have we passed peak social? Social media usage has been declining for all age groups. The youngest group, 16–24, is the largest but has also shown the sharpest decline. Are we going to reinvent the decentralized web? Or succumb to a different set of walled gardens?
Addy Osmani’s post “The History of Core Web Vitals” is a must-read for anyone working in web performance.
Features from the major web frameworks are being implemented by browsers. Frameworks won’t disappear, but their importance will diminish. People will again be programming to the browser. In turn, this will make browser testing and standardization that much more important.
Luke Wroblewski writes about using AI to solve common problems in user experience (UX). AI can help with problems like collecting data from users and onboarding users to new applications.

Operations

There’s a lot to be learned from AWS’s recent outage, which stemmed from a DynamoDB DNS failure in the US-EAST-1 region. It’s important not to write this off as a war story about Amazon’s failure. Instead, think: How do you make your own distributed networks more reliable?
PyTorch Monarch is a new library that helps developers manage distributed systems for training AI models. It lets developers write a script that “orchestrates all distributed resources,” allowing the developer to work with them as a single almost-local system.

Security

The solution to the fourth part of Kryptos, the cryptosculpture at the CIA’s headquarters, has been discovered! The discovery came through an opsec error that led researchers to the clear text stored at the Smithsonian. This is an important lesson: Attacks against cryptosystems rarely touch the cryptography. They attack the protocols, people, and systems surrounding codes.
Public cryptocurrency blockchains are being used by international threat actors as “bulletproof” hosts for storing and distributing malware.
Apple is now giving a $2M bounty for zero-day exploits that allow zero-click remote code execution on iOS. These vulnerabilities have been exploited by commercial malware vendors.
Signal has incorporated postquantum encryption into its Signal protocol. This is a major technological achievement. They’re one of the few organizations that’s ready for the quantum world.
Salesforce is refusing to pay extortion after a major data loss of over a billion records. Data from a number of major accounts was stolen by a group calling itself Scattered LAPSUS$ Hunters. Attackers simply asked the victim’s staff to install an attacker-controlled app.
Context is the key to AI security. We’re not surprised; right now, context is the key to just about everything in AI. Attackers have the advantage now, but in 3–5 years that advantage will pass to defenders who use AI effectively.
Google has announced that Gmail users can now send end-to-end encrypted (E2EE) regardless of whether they’re using Gmail. Recipients who don’t use Gmail will receive a notification and the ability to read the message on a one-time guest account.
The best way to attack your company isn’t through the applications; it’s through the service help desk. Human engineering remains extremely effective—more effective than attacks against software. Training helps; a well-designed workflow and playbook is crucial.
Ransomware detection has now been built into the desktop version of Google Drive. When it detects activities that indicate ransomware, Drive suspends file syncing and alerts users. It’s enabled by default, but it is possible to opt out.
OpenAI is routing requests with safety issues to an unknown model. This is presumably a specialized version of GPT-5 that has been trained specially to deal with sensitive issues.

Robotics

Would you buy a banana from a robot? A small chain of stores in Chicago is finding out.
Rodney Brooks, founder of iRobot, warns that humans should stay at least 10 feet (3 meters) away from humanoid walking robots. There is a lot of potential energy in their limbs when they move them to retain balance. Unsurprisingly, this danger stems from the vision-only approach that Tesla and other vendors have adopted. Humans learn and act with all five senses.

Quantum Computing

Google claims to have demonstrated a verifiable quantum advantage on its quantum processor: The output of the computation can be tested for correctness. Verifiable quantum advantage doesn’t just mean that it’s fast; it means that error correction is working.
Researchers at Institute of Science Tokyo have developed a quantum error correction method that’s efficient and (in theory) scales to hundreds of thousands of qubits. Quantum computers of that size haven’t been built yet but will be needed to perform real work.

Biology

Scientists have discovered a new narrow-spectrum antibiotic that could be used to treat inflammatory bowel disease. AI was able to predict how the antibiotic would work, apparently a first.
A red-teaming security group at Microsoft has announced that they have found a zero-day that allows malicious actors to design harmful proteins with AI.
AI has successfully designed the DNA for a bacteriophage (essentially a very simple virus) capable of infecting and killing E. coli, a common bacteria. This is the first time AI has been used to synthesize an entire genome.

On the AWS Outage

Oreilly

By: Mike Loukides

3 November 2025 at 05:47

Everybody notices when something big fails—like AWS’s US-EAST-1 region. And fail it did. All sorts of services and sites became inaccessible, and we all knew it was Amazon’s fault. A week later, when I run into a site that’s down, I still say, “Must be some hangover from the AWS outage. Some cache that didn’t get refreshed.” Amazon gets blamed—maybe even rightly—even when it’s not their fault.

I’m not writing about fault, though, and I’m also not writing a technical analysis of what happened. There are good places for that online, including AWS’s own summary. What I am writing about is a reaction to the outage that I’ve seen all too often: “This proves we can’t trust AWS. We need to build our own infrastructure.”

Building your own infrastructure is fine. But I’m also reminded of the wisest comment I heard after the 2012 US-EAST outage. I asked JD Long about his reaction to the outage. He said, “I’m really glad it wasn’t my guys trying to fix the problem.”¹ JD wasn’t disparaging his team; he was saying that Amazon has a lot of expertise in running, maintaining, and troubleshooting really big systems that can fail suddenly in unpredictable ways—when just the right conditions happen to tickle a bug that had been latent in the system for years. That expertise is hard to find and expensive when you find it. And no matter how expert “your guys” are, all complex systems fail. After last month’s AWS failure, Microsoft’s Azure obligingly failed about 10 days later.

I’m not really an Amazon fan or, more specifically, an AWS fan. But outages like this should force us to remember what they do right. AWS outages also warn us that we need to learn how to “craft ways of undoing this concentration and creating real choice,” as Signal CEO Meredith Whittaker points out. But Meredith understands how difficult it will be to build this infrastructure and that, for the present, there’s no viable alternative to AWS or one of the other hyperscalers.

Operating and troubleshooting large systems is difficult and requires very specialized skills. If you decide to build your own infrastructure, you will need those skills. And you may end up wishing that it isn’t your guys trying to fix the problem.

Footnote

In 2012, I happened to be flying out of DC just as the storm that took US-EAST down was rolling in. My flight made it out, but it was dramatic.

Enlightenment

Oreilly

By: Mike Loukides

14 October 2025 at 07:03

In a fascinating op-ed, David Bell, a professor of history at Princeton, argues that “AI is shedding enlightenment values.” As someone who has taught writing at a similarly prestigious university, and as someone who has written about technology for the past 35 or so years, I had a deep response.

Bell’s is not the argument of an AI skeptic. For his argument to work, AI has to be pretty good at reasoning and writing. It’s an argument about the nature of thought itself. Reading is thinking. Writing is thinking. Those are almost clichés—they even turn up in students’ assessments of using AI in a college writing class. It’s not a surprise to see these ideas in the 18th century, and only a bit more surprising to see how far Enlightenment thinkers took them. Bell writes:

The great political philosopher Baron de Montesquieu wrote: “One should never so exhaust a subject that nothing is left for readers to do. The point is not to make them read, but to make them think.” Voltaire, the most famous of the French “philosophes,” claimed, “The most useful books are those that the readers write half of themselves.”

And in the late 20th century, the great Dante scholar John Freccero would say to his classes “The text reads you”: How you read The Divine Comedy tells you who you are. You inevitably find your reflection in the act of reading.

Is the use of AI an aid to thinking or a crutch or a replacement? If it’s either a crutch or a replacement, then we have to go back to Descartes’s “I think, therefore I am” and read it backward: What am I if I don’t think? What am I if I have offloaded my thinking to some other device? Bell points out that books guide the reader through the thinking process, while AI expects us to guide the process and all too often resorts to flattery. Sycophancy isn’t limited to a few recent versions of GPT; “That’s a great idea” has been a staple of AI chat responses since its earliest days. A dull sameness goes along with the flattery—the paradox of AI is that, for all the talk of general intelligence, it really doesn’t think better than we do. It can access a wealth of information, but it ultimately gives us (at best) an unexceptional average of what has been thought in the past. Books lead you through radically different kinds of thought. Plato is not Aquinas is not Machiavelli is not Voltaire (and for great insights on the transition from the fractured world of medieval thought to the fractured world of Renaissance thought, see Ada Palmer’s Inventing the Renaissance).

We’ve been tricked into thinking that education is about preparing to enter the workforce, whether as a laborer who can plan how to spend his paycheck (readin’, writin’, ’rithmetic) or as a potential lawyer or engineer (Bachelor’s, Master’s, Doctorate). We’ve been tricked into thinking of schools as factories—just look at any school built in the 1950s or earlier, and compare it to an early 20th century manufacturing facility. Take the children in, process them, push them out. Evaluate them with exams that don’t measure much more than the ability to take exams—not unlike the benchmarks that the AI companies are constantly quoting. The result is that students who can read Voltaire or Montesquieu as a dialogue with their own thoughts, who could potentially make a breakthrough in science or technology, are rarities. They’re not the students our institutions were designed to produce; they have to struggle against the system, and frequently fail. As one elementary school administrator told me, “They’re handicapped, as handicapped as the students who come here with learning disabilities. But we can do little to help them.”

So the difficult question behind Bell’s article is: How do we teach students to think in a world that will inevitably be full of AI, whether or not that AI looks like our current LLMs? In the end, education isn’t about collecting facts, duplicating the answers in the back of the book, or getting passing grades. It’s about learning to think. The educational system gets in the way of education, leading to short-term thinking. If I’m measured by a grade, I should do everything I can to optimize that metric. All metrics will be gamed. Even if they aren’t gamed, metrics shortcut around the real issues.

In a world full of AI, retreating to stereotypes like “AI is damaging” and “AI hallucinates” misses the point, and is a sure route to failure. What’s damaging isn’t the AI, but the set of attitudes that make AI just another tool for gaming the system. We need a way of thinking with AI, of arguing with it, of completing AI’s “book” in a way that goes beyond maximizing a score. In this light, so much of the discourse around AI has been misguided. I still hear people say that AI will save you from needing to know the facts, that you won’t have to learn the dark and difficult corners of programming languages—but as much as I personally would like to take the easy route, facts are the skeleton on which thinking is based. Patterns arise out of facts, whether those patterns are historical movements, scientific theories, or software designs. And errors are easily uncovered when you engage actively with AI’s output.

AI can help to assemble facts, but at some point those facts need to be internalized. I can name a dozen (or two or three) important writers and composers whose best work came around 1800. What does it take to go from those facts to a conception of the Romantic movement? An AI could certainly assemble and group those facts, but would you then be able to think about what that movement meant (and continues to mean) for European culture? What are the bigger patterns revealed by the facts? And what would it mean for those facts and patterns to reside only within an AI model, without human comprehension? You need to know the shape of history, particularly if you want to think productively about it. You need to know the dark corners of your programming languages if you’re going to debug a mess of AI-generated code. Returning to Bell’s argument, the ability to find patterns is what allows you to complete Voltaire’s writing. AI can be a tremendous aid in finding those patterns, but as human thinkers, we have to make those patterns our own.

That’s really what learning is about. It isn’t just collecting facts, though facts are important. Learning is about understanding and finding relationships and understanding how those relationships change and evolve. It’s about weaving the narrative that connects our intellectual worlds together. That’s enlightenment. AI can be a valuable tool in that process, as long as you don’t mistake the means for the end. It can help you come up with new ideas and new ways of thinking. Nothing says that you can’t have the kind of mental dialogue that Bell writes about with an AI-generated essay. ChatGPT may not be Voltaire, but not much is. But if you don’t have the kind of dialogue that lets you internalize the relationships hidden behind the facts, AI is a hindrance. We’re all prone to be lazy—intellectually and otherwise. What’s the point at which thinking stops? What’s the point at which knowledge ceases to become your own? Or, to go back to the Enlightenment thinkers, when do you stop writing your share of the book?

That’s not a choice AI makes for you. It’s your choice.

Radar Trends to Watch: October 2025

Oreilly

By: Mike Loukides

7 October 2025 at 07:17

This month we have two more protocols to learn. Google has announced the Agent Payments Protocol (AP2), which is intended to help agents to engage in ecommerce—it’s largely concerned with authenticating and authorizing parties making a transaction. And the Agent Client Protocol (ACP) is concerned with communications between code editors and coding agents. When implemented, it would allow any code editor to plug in any compliant agent.

All hasn’t been quiet on the virtual reality front. Meta has announced its new VR/AR glasses, with the ability to display images on the lenses along with capabilities like live captioning for conversations. They’re much less obtrusive than the previous generation of VR goggles.

AI

Suno has announced an AI-driven digital audio workstation (DAW), a tool for enabling people to be creative with AI-generated music.
Ollama has added its own web search API. Ollama’s search API can be used to augment the information available to models.
GitHub Copilot now offers a command-line tool, GitHub CLI. It can use either Claude Sonnet 4 or GPT-5 as the backing model, though other models should be available soon. Claude 4 is the default.
Alibaba has released Qwen3-Max, a trillion-plus parameter model. There are reasoning and nonreasoning variants, though the reasoning variant hasn’t yet been released. Alibaba also released models for speech-to-text, vision-language, live translation, and more. They’ve been busy.
GitHub has launched its MCP Registry to make it easier to discover MCP servers archived on GitHub. It’s also working with Anthropic and others to build an open source MCP registry, which lists servers regardless of their origin and integrates with GitHub’s registry.
DeepMind has published version 3.0 of its Frontier Safety Framework, a framework for experimenting with AI-human alignment. They’re particularly interested in scenarios where the AI doesn’t follow a user’s directives, and in behaviors that can’t be traced to a specific reasoning chain.
Alibaba has released the Tongyi DeepResearch reasoning model. Tongyi is a 30.5B parameter mixture-of-experts model, with 3.3B parameters active. More importantly, it’s fully open source, with no restrictions on how it can be used.
Locally AI is an iOS app that lets you run large language models on your iPhone or iPad. It works offline; there’s no need for a network connection.
OpenAI has added control over the “reasoning” process to its GPT-5 models. Users can choose between four levels: Light (Pro users only), Standard, Extended, and Heavy (Pro only).
Google has announced the Agent Payments Protocol (AP2), which facilitates purchases. It focuses on authorization (proving that it has the authority to make a purchase), authentication (proving that the merchant is legitimate), and accountability (in case of a fraudulent transaction).
Bring Your Own AI: Employee adoption of AI greatly exceeds official IT adoption. We’ve seen this before, on technologies as different as the iPhone and open source.
Alibaba has released the ponderously named Qwen3-Next-80B-A3B-Base. It’s a mixture-of-experts model with a high ratio of active parameters to total parameters (3.75%). Alibaba claims that the model cost 1/10 as much to train and is 10 times faster than its previous models. If this holds up, Alibaba is winning on performance where it counts.
Anthropic has announced a major upgrade to Claude’s capabilities. It can now execute Python scripts in a sandbox and can create Excel spreadsheets, PowerPoint presentations, PNG files, and other documents. You can upload files for it to analyze. And of course this comes with security risks.
The SIFT method—stop, investigate the source, find better sources, and trace quotes to their original context—is a way of structuring your use of AI output that will make you less vulnerable to misinformation. Hint: it’s not just for AI.
OpenAI’s Projects feature is now available to free accounts. Projects is a set of tools for organizing conversations with the LLM. Projects are separate workspaces with their own custom instructions, independent memory, and context. They can be forked. Projects sounds something like Git for LLMs—a set of features that’s badly needed.
EmbeddingGemma is a new open weights embedding model (308M parameters) that’s designed to run on devices, requiring as little as 200 MB of memory.
An experiment with GPT-4o-mini shows that language models can fall to psychological manipulation. Is this surprising? After all, they are trained on human output.
“Platform Shifts Redefine Apps”: AI is a new kind of platform and demands rethinking what applications mean and how they should work. Failure to do this rethinking may be why so many AI efforts fail.
MCP-UI is a protocol that allows MCP servers to send React components or Web Components to agents, allowing the agent to build an appropriate browser-based interface on the fly.
The Agent Client Protocol (ACP) is a new protocol that standardizes communications between code editors and coding agents. It’s currently supported by the Zed and Neovim editors, and by the Gemini CLI coding agent.
Gemini 2.5 Flash is now using a new image generation model that was internally known as “nano banana.” This new model can edit uploaded images, merge images, and maintain visual consistency across a series of images.

Programming

Anthropic released Claude Code 2.0. New features include the ability to checkpoint your work, so that if a coding agent wanders off-course, you can return to a previous state. They have also added the ability to run tasks in the background, call hooks, and use subagents.
Suno has announced an AI-driven digital audio workstation (DAW), a tool for enabling people to be creative with AI-generated music.
The Wasmer project has announced that it now has full Python support in the beta version of Wasmer Edge, its WebAssembly runtime for serverless edge deployment.
Mitchell Hashimoto, founder of Hashicorp, has promised that a library for Ghostty (libghostty) is coming! This library will make it easy to embed a terminal emulator into an application. Perhaps more important, libghostty might standardize the code for terminal output across applications.
There’s a new benchmark for agentic coding: CompileBench. CompileBench tests the ability of models to solve complex problems in figuring out how to build code.
Apple is reportedly rewriting iOS in a new programming language. Rust would be the obvious choice, but rumors are that it’s something of their own creation. Apple likes languages it can control.
Java 25, the latest long-term support release, has a number of new features that reduce the boilerplate that makes Java difficult to learn.
Luau is a new scripting language derived from Lua. It claims to be fast, small, and safe. It’s backward compatible with Version 5.1 of Lua.
OpenAI has launched GPT-5 Codex, its generation model trained specifically for software engineering. Codex is now available both in the CLI tool and through the API. It’s clearly intended to challenge Anthropic’s dominant coding tool, Claude Code.
Do prompts belong in code repositories? We’ve argued that prompts should be archived. But they don’t belong in a source code repo like Git. There are better tools available.
This is cool and different. A developer has hacked the 2001 game Animal Crossing so that the dialog is generated by LLM rather than coming from the game’s memory.
There’s a new programming language, vibe-coded in its entirety with Claude. Cursed is similar to Claude, but all the keywords are Gen Z slang. It’s not yet on the list, but it’s a worthy addition to Esolang.
Claude Code is now integrated into the Zed editor (beta), using the Agent Client Protocol (ACP).
Ida Bechtle’s documentary on the history of Python, complete with many interviews with Guido van Rossum, is a must-watch.

Security

The first malicious MCP server has been found in the wild. Postmark-MCP, an MCP server for interacting with the Postmark application, suddenly (version 1.0.16) started sending copies of all the email it handles to its developer.
I doubt this is the first time, but supply chain security vulnerabilities have now hit Rust’s package management system, Crates.io. Two packages that steal keys for cryptocurrency wallets have been found. It’s time to be careful about what you download.
Cross-agent privilege escalation is a new kind of vulnerability in which a compromised intelligent agent uses indirect prompt injection to cause a victim agent to overwrite its configuration, granting it additional privileges.
GitHub is taking a number of measures to improve software supply chain security, including requiring two-factor authentication (2FA), expanding trusted publishing, and more.
A compromised npm package uses a QR code to encode malware. The malware is apparently downloaded in the QR code (which is valid, but too dense to be read by a normal camera), unpacked by the software, and used to steal cookies from the victim’s browser.
Node.js and its package manager npm have been in the news because of an ongoing series of supply chain attacks. Here’s the latest report.
A study by Cisco has discovered over a thousand unsecured LLM servers running on Ollama. Roughly 20% were actively serving requests. The rest may have been idle Ollama instances, waiting to be exploited.
Anthropic has announced that Claude will train on data from personal accounts, effective September 28. This includes Free, Pro, and Max plans. Work plans are exempted. While the company says that training on personal data is opt-in, it’s (currently) enabled by default, so it’s opt-out.
We now have “vibe hacking,” the use of AI to develop malware. Anthropic has reported several instances in which Claude was used to create malware that the authors could not have created themselves. Anthropic is banning threat actors and implementing classifiers to detect illegal use.
Zero trust is basic to modern security. But groups implementing zero trust have to realize that it’s a project that’s never finished. Threats change, people change, systems change.
There’s a new technique for jailbreaking LLMs: write prompts with bad grammar and run-on sentences. These seem to prevent guardrails from taking effect.
In an attempt to minimize the propagation of malware on the Android platform, Google plans to block “sideloading” apps for Android devices and require developer ID verification for apps installed through Google Play.
A new phishing attack called ZipLine targets companies using their own “contact us” pages. The attacker then engages in an extended dialog with the company, often posing as a potential business partner, before eventually delivering a malware payload.

Operations

The 2025 DORA report is out! DORA may be the most detailed summary of the state of the IT industry. DORA’s authors note that AI is everywhere and that the use of AI now improves end-to-end productivity, something that was ambiguous in last year’s report.
Microsoft has announced that Word will save files to the cloud (OneDrive) by default. This (so far) appears to apply only when using Windows. The feature is currently in beta.

Web

Do we need another browser? Helium is a Chromium-based browser that is private by default.
Are scientists moving from Twitter to Bluesky?

Virtual and Augmented Reality

Meta has announced a pair of augmented reality glasses with a small display on one of the lenses, bringing it to the edge of AR. In addition to displaying apps from your phone, the glasses can do “live captioning” for conversations. The display is controlled by a wristband.

Megawatts and Gigawatts of AI

Oreilly

By: Mike Loukides

9 September 2025 at 06:54

We can’t not talk about power these days. We’ve been talking about it ever since the Stargate project, with half a trillion dollars in data center investment, was floated early in the year. We’ve been talking about it ever since the now-classic “Stochastic Parrots” paper. And, as time goes on, it only becomes more of an issue.

“Stochastic Parrots” deals with two issues: AI’s power consumption and the fundamental nature of generative AI; selecting sequences of words according to statistical patterns. I always wished those were two papers, because it would be easier to disagree about power and agree about parrots. For me, the power issue is something of a red herring—but increasingly, I see that it’s a red herring that isn’t going away because too many people with too much money want herrings; too many believe that a monopoly on power (or a monopoly on the ability to pay for power) is the route to dominance.

Why, in a better world than we currently live in, would the power issue be a red herring? There are several related reasons:

I have always assumed that the first generation language models would be highly inefficient, and that over time, we’d develop more efficient algorithms.
I have also assumed that the economics of language models would be similar to chip foundries or pharma factories: The first chip coming out of a foundry costs a few billion dollars, everything afterward is a penny apiece.
I believe (now more than ever) that, long-term, we will settle on small models (70B parameters or less) that can run locally rather than giant models with trillions of parameters running in the cloud.

And I still believe those points are largely true. But that’s not sufficient. Let’s go through them one by one, starting with efficiency.

Better Algorithms

A few years ago, I saw a fair number of papers about more efficient models. I remember a lot of articles about pruning neural networks (eliminating nodes that contribute little to the result) and other techniques. Papers that address efficiency are still being published—most notably, DeepMind’s recent “Mixture-of-Recursions” paper—but they don’t seem to be as common. That’s just anecdata, and should perhaps be ignored. More to the point, DeepSeek shocked the world with their R1 model, which they claimed cost roughly 1/10 as much to train as the leading frontier models. A lot of commentary insisted that DeepSeek wasn’t being up front in their measurement of power consumption, but since then several other Chinese labs have released highly capable models, with no gigawatt data centers in sight. Even more recently, OpenAI has released gpt-oss in two sizes (120B and 30B), which were reportedly much less expensive to train. It’s not the first time this has happened—I’ve been told that the Soviet Union developed amazingly efficient data compression algorithms because their computers were a decade behind ours. Better algorithms can trump larger power bills, better CPUs, and more GPUs, if we let them.

What’s wrong with this picture? The picture is good, but much of the narrative is US-centric, and that distorts it. First, it’s distorted by our belief that bigger is always better: Look at our cars, our SUVs, our houses. We’re conditioned to believe that a model with a trillion parameters has to be better than a model with a mere 70B, right? That a model that cost a hundred million dollars to train has to be better than one that can be trained economically? That myth is deeply embedded in our psyche. Second, it’s distorted by economics. Bigger is better is a myth that would-be monopolists play on when they talk about the need for ever bigger data centers, preferably funded with tax dollars. It’s a convenient myth, because convincing would-be competitors that they need to spend billions on data centers is an effective way to have no competitors.

One area that hasn’t been sufficiently explored is extremely small models developed for specialized tasks. Drew Breunig writes about the tiny chess model in Stockfish, the world’s leading chess program: It’s small enough to run in an iPhone, and replaced a much larger general-purpose model. And it soundly defeated Claude Sonnet 3.5 and GPT-4o.¹ He also writes about the 27 million parameter Hierarchical Reasoning Model (HRM) that has beaten models like Claude 3.7 on the ARC benchmark. Pete Warden’s Moonshine does real-time speech-to-text transcription in the browser—and is as good as any high-end model I’ve seen. None of these are general-purpose models. They won’t vibe code; they won’t write your blog posts. But they are extremely effective at what they do. And if AI is going to fulfill its destiny of “disappearing into the walls,” of becoming part of our everyday infrastructure, we will need very accurate, very specialized models. We will have to free ourselves of the myth that bigger is better.²

The Cost of Inference

The purpose of a model isn’t to be trained; it’s to do inference. This is a gross simplification, but part of training is doing inference trillions of times and adjusting the model’s billions of parameters to minimize error. A single request takes an extremely small fraction of the effort required to train a model. That fact leads directly to the economics of chip foundries: The ability to process the first prompt costs millions of dollars, but once they’re in production, processing a prompt costs fractions of a cent. Google has claimed that processing a typical text prompt to Gemini takes 0.24 watt-hours, significantly less than it takes to heat water for a cup of coffee. They also claim that increases in software efficiency have led to a 33x reduction in energy consumption over the past year.

That’s obviously not the entire story: Millions of people prompting ChatGPT adds up, as does usage of newer “reasoning” modules that have an extended internal dialog before arriving at a result. Likewise, driving to work rather than biking raises the global temperature a nanofraction of a degree—but when you multiply the nanofraction by billions of commuters, it’s a different story. It’s fair to say that an individual who uses ChatGPT or Gemini isn’t a problem, but it’s also important to realize that millions of users pounding on an AI service can grow into a problem quite quickly. Unfortunately, it’s also true that increases in efficiency often don’t lead to reductions in energy use but to solving more complex problems within the same energy budget. We may be seeing that with reasoning models, image and video generation models, and other applications that are now becoming financially feasible. Does this problem require gigawatt data centers? No, not that, but it’s a problem that can justify the building of gigawatt data centers.

There is a solution, but it requires rethinking the problem. Telling people to use public transportation or bicycles for their commute is ineffective (in the US), as will be telling people not to use AI. The problem needs to be rethought: redesigning work to eliminate the commute (O’Reilly is 100% work from home), rethinking the way we use AI so that it doesn’t require cloud-hosted trillion parameter models. That brings us to using AI locally.

Staying Local

Almost everything we do with GPT-*, Claude-*, Gemini-*, and other frontier models could be done equally effectively on much smaller models running locally: in a small corporate machine room or even on a laptop. Running AI locally also shields you from problems with availability, bandwidth, limits on usage, and leaking private data. This is a story that would-be monopolists don’t want us to hear. Again, this is anecdata, but I’ve been very impressed by the results I get from running models in the 30 billion parameter range on my laptop. I do vibe coding and get mostly correct code that the model can (usually) fix for me; I ask for summaries of blogs and papers and get excellent results. Anthropic, Google, and OpenAI are competing for tenths of a percentage point on highly gamed benchmarks, but I doubt that those benchmark scores have much practical meaning. I would love to see a study on the difference between Qwen3-30B and GPT-5.

What does that mean for energy costs? It’s unclear. Gigawatt data centers for doing inference would go unneeded if people do inference locally, but what are the consequences of a billion users doing inference on high-end laptops? If I give my local AIs a difficult problem, my laptop heats up and runs its fans. It’s using more electricity. And laptops aren’t as efficient as data centers that have been designed to minimize electrical use. It’s all well and good to scoff at gigawatts, but when you’re using that much power, minimizing power consumption saves a lot of money. Economies of scale are real. Personally, I’d bet on the laptops: Computing with 30 billion parameters is undoubtedly going to be less energy-intensive than computing with 3 trillion parameters. But I won’t hold my breath waiting for someone to do this research.

There’s another side to this question, and that involves models that “reason.” So-called “reasoning models” have an internal conversation (not always visible to the user) in which the model “plans” the steps it will take to answer the prompt. A recent paper claims that smaller open source models tend to generate many more reasoning tokens than large models (3 to 10 times as many, depending on the models you’re comparing), and that the extensive reasoning process eats away at the economics of the smaller models. Reasoning tokens must be processed, the same as any user-generated tokens; this processing incurs charges (which the paper discusses), and charges presumably relate directly to power.

While it’s surprising that small models generate more reasoning tokens, it’s no surprise that reasoning is expensive, and we need to take that into account. Reasoning is a tool to be used; it tends to be particularly useful when a model is asked to solve a problem in mathematics. It’s much less useful when the task involves looking up facts, summarization, writing, or making recommendations. It can help in areas like software design but is likely to be a liability for generative coding. In these cases, the reasoning process can actually become misleading—in addition to burning tokens. Deciding how to use models effectively, whether you’re running them locally or in the cloud, is a task that falls to us.

Going to the giant reasoning models for the “best possible answer” is always a temptation, especially when you know you don’t need the best possible answer. It takes some discipline to commit to the smaller models—even though it’s difficult to argue that using the frontier models is less work. You still have to analyze their output and check their results. And I confess: As committed as I am to the smaller models, I tend to stick with models in the 30B range, and avoid the 1B–5B models (including the excellent Gemma 3N). Those models, I’m sure, would give good results, use even less power, and run even faster. But I’m still in the process of peeling myself away from my knee-jerk assumptions.

Bigger isn’t necessarily better; more power isn’t necessarily the route to AI dominance. We don’t yet know how this will play out, but I’d place my bets on smaller models running locally and trained with efficiency in mind. There will no doubt be some applications that require large frontier models—perhaps generating synthetic data for training the smaller models—but we really need to understand where frontier models are needed, and where they aren’t. My bet is that they’re rarely needed. And if we free ourselves from the desire to use the latest, largest frontier model just because it’s there—whether or not it serves your purpose any better than a 30B model—we won’t need most of those giant data centers. Don’t be seduced by the AI-industrial complex.

Footnotes

I’m not aware of games between Sockfish and more recent Claude 4, Claude 4.1, and GPT-5 models. There’s every reason to believe the results would be similar.
Kevlin Henney makes a related point in “Scaling False Peaks.”

Radar Trends to Watch: September 2025

Oreilly

By: Mike Loukides

2 September 2025 at 06:10

For better or for worse, AI has colonized this list so thoroughly that AI itself is little more than a list of announcements about new or upgraded models. But there are other points of interest. Is it just a coincidence (possibly to do with BlackHat) that so much happened in security in the past month? We’re still seeing programming languages—even some new programming languages for writing AI prompts! If you’re into retrocomputing, the much-beloved Commodore 64 is back—with an upgraded audio chip, a new processor, much more RAM, and all your old ports. Heirloom peripherals should still work.

AI

OpenAI has released their Realtime APIs. The model supports MCP servers, phone calls using the SIP protocol, and image inputs. The release includes gpt-realtime, an advanced speech-to-speech model.
ChatGPT now supports project-only memory. Project memory, which can use previous conversations for additional context, can be limited to a specific project. Project-only memory gives more control over context and prevents one project’s context from contaminating another.
FairSense is a framework for investigating whether AI systems are fair early on. FairSense runs long-term simulations to detect whether a system will become unfair as it evolves over time.
Agents4Science is a new academic conference in which all the submissions will be researched, written, reviewed, and presented primarily by AI (using text-to-speech for presentations).
Drew Breunig’s mix and match cheat sheet for AI job titles is a classic.
Cohere’s Command A Reasoning is another powerful, partially open reasoning model. It is available on Hugging Face. It claims to outperform gpt-oss-120b and DeepSeek R1-0528.
DeepSeek has released DeepSeekV3.1. This is a hybrid model that supports reasoning and nonreasoning use. It’s also faster than R1 and has been designed for agentic tasks. It uses reasoning tokens more economically, and it was much less expensive to train than GPT-5.
Anthropic has added the ability to terminate chats to Claude Opus. Chats can be terminated if a user persists in making harmful requests. Terminated chats can’t be continued, although users can start a new chat. The feature is currently experimental.
Google has released its smallest model yet: Gemma 3 270M. This model is designed for fine-tuning and for deployment on small, limited hardware. Here’s a bedtime story generator that runs in the browser, built with Gemma 3 270M.
ChatGPT has added GMail, Google Calendar, and Google Contacts to its group of connectors, which integrate ChatGPT with other applications. This information will be used to provide additional context—and presumably will be used for training or discovery in ongoing lawsuits. Fortunately, it’s (at this point) opt-in.
Anthropic has upgraded Claude Sonnet 4 with a 1M token context window. The larger context window is only available via the API.
OpenAI released GPT-5. Simon Willison’s review is excellent. It doesn’t feel like a breakthrough, but it is quietly better at delivering good results. It is claimed to be less prone to hallucination and incorrect answers. One quirk is that with ChatGPT, GPT-5 determines which model should respond to your prompt.
Anthropic is researching persona vectors as a means of training a language model to behave correctly. Steering a model toward inappropriate behavior during training can be a kind of “vaccination” against that behavior when the model is deployed, without compromising other aspects of the model’s behavior.
The Darwin Gödel Machine is an agent that can read and modify its own code to improve its performance on tasks. It can add tools, re-organize workflows, and evaluate whether these changes have improved its performance.
Grok is at it again: generating nude deepfakes of Taylor Swift without being prompted to do so. I’m sure we’ll be told that this was the result of an unauthorized modification to the system prompt. In AI, some things are predictable.
Anthropic has released Claude Opus 4.1, an upgrade to its flagship model. We expect this to be the “gold standard” for generative coding.
OpenAI has released two open-weight models, their first since GPT-2: gpt-oss-120b and gpt-oss-20b. They are reasoning models designed for use in agentic applications. Claimed performance is similar to OpenAI’s o3 and o4-mini.
OpenAI has also released a “response format” named Harmony. It’s not quite a protocol, but it is a standard that specifies the format of conversations by defining roles (system, user, etc.) and channels (final, analysis, commentary) for a model’s output.
Can AIs evolve guilt? Guilt is expressed in human language; it’s in the training data. The AI that deleted a production database because it “panicked” certainly expressed guilt. Whether an AI’s expressions of guilt are meaningful in any way is a different question.
Claude Code Router is a tool for routing Claude Code requests to different models. You can choose different models for different kinds of requests.
Qwen has released a thinking version of their flagship model, called Qwen3-235B-A22B-Thinking-2507. Thinking cannot be switched on or off. The model was trained with a new reinforcement learning algorithm called Group Sequence Policy Optimization. It burns a lot of tokens, and it’s not very good at pelicans.
ChatGPT is releasing “personalities” that control how it formulates its responses. Users can select the personality they want to respond: robot, cynic, listener, sage, and presumably more.
DeepMind has created Aeneas, a new model designed to help scholars understand ancient fragments. In ancient text, large pieces are often missing. Can AI help place these fragments into contexts where they can be understood? Latin only, for now.

Security

The US Cybersecurity and Infrastructure Security Agency (CISA) has warned that a serious code execution vulnerability in Git is currently being exploited in the wild.
Is it possible to build an agentic browser that is safe from prompt injection? Probably not. Separating user instructions from website content isn’t possible. If a browser can’t take direction from the content of a web page, how is it to act as an agent?
The solution to Part 4 of Kryptos, the CIA’s decades-old cryptographic sculpture, is for sale! Jim Sanborn, the creator of Kryptos, is auctioning the solution. He hopes that the winner will preserve the secret and take over verifying people’s claims to have solved the puzzle.
Remember XZ, the supply-chain attack that granted backdoor access via a trojaned compression library? It never went away. Although the affected libraries were quickly patched, it’s still active, and propagating, via Docker images that were built with unpatched libraries. Some gifts keep giving.
For August, Embrace the Red published The Month of AI Bugs, a daily post about AI vulnerabilities (mostly various forms of prompt injection). This series is essential reading for AI developers and for security professionals.
NIST has finalized a standard for lightweight cryptography. Lightweight cryptography is a cryptographic system designed for use by small devices. It is useful both for encrypting sensitive data and for authentication.
The Dark Patterns Tip Line is a site for reporting dark patterns: design features in websites and applications that are designed to trick us into acting against our own interest.
OpenSSH supports post-quantum key agreement, and in versions 10.1 and later, will warn users when they select a non-post-quantum key agreement scheme.
SVG files can carry a malware payload; pornographic SVGs include JavaScript payloads that automate clicking “like.” That’s a simple attack with few consequences, but much more is possible, including cross-site scripting, denial of service, and other exploits.
Google’s AI agent for discovering security flaws, Big Sleep, has found 20 flaws in popular software. DeepMind discovered and reproduced the flaws, which were then verified by human security experts and reported. Details won’t be provided until the flaws have been fixed.
The US CISA (Cybersecurity and Infrastructure Security Agency) has open-sourced Thorium, a platform for malware and forensic analysis.
Prompt injection, again: A new prompt injection attack embeds instructions in language that appears to be copyright notices and other legal fine print. To avoid litigation, many models are configured to prioritize legal instructions.
Light can be watermarked; this may be useful as a technique for detecting fake or manipulated video.
vCISO (Virtual CISO) services are thriving, particularly among small and mid-size businesses that can’t afford a full security team. The use of AI is cutting the vCISO workload. But who takes the blame when there’s an incident?
A phishing attack against PyPI users directs them to a fake PyPI site that tells them to verify their login credentials. Stolen credentials could be used to plant malware in the genuine PyPI repository. Users of Mozilla’s add-on repository have also been targeted by phishing attacks.
A new ransomware group named Chaos appears to be a rebranding of the BlackSuit group, which was taken down recently. BlackSuit itself is a rebranding of the Royal group, which in turn is a descendant of the Conti group. Whack-a-mole continues.
Google’s OSS Rebuild project is an important step forward in supply chain security. Rebuild provides build definitions along with metadata that can confirm projects were built correctly. OSS Rebuild currently supports the NPM, PyPl, and Crates ecosystems.
The JavaScript package “is,” which does some simple type checking, has been infected with malware. Supply chain security is a huge issue—be careful what you install!

Programming

Claude Code PM is a workflow management system for programming with Claude. It manages PRDs, GitHub, and parallel execution of coding agents. It claims to facilitate collaboration between multiple Claude instances working on the same project.
Rust is increasingly used to implement performance-critical extensions to Python, gradually displacing C. Polars, Pydantic, and FastAPI are three libraries that rely on Rust.
Microsoft’s Prompt Orchestration Markup Language (POML) is an HTML-like markup language for writing prompts. It is then compiled into the actual prompt. POML is good at templating and has tags for tabular and document data. Is this a step forward? You be the judge.
Claudia is an “elegant desktop companion” for Claude Code; it turns terminal-based Claude Code into something more like an IDE, though it seems to focus more on the workflow than on coding.
Google’s LangExtract is a simple but powerful Python library for extracting text from documents. It relies on examples, rather than regular expressions or other hacks, and shows the exact context in which the extracts occur. LangExtract is open source.
Microsoft appears to be integrating GitHub into its AI team rather than running it as an independent organization. What this means for GitHub users is unclear.
Cursor now has a command-line interface, almost certainly a belated response to the success of Claude Code CLI and Gemini CLI.
Latency is a problem for enterprise AI. And the root cause of latency in AI applications is usually the database.
The Commodore 64 is back. With several orders of magnitude more RAM. And all the original ports, plus HDMI.
Google has announced Gemini CLI GitHub Actions, an addition to their agentic coder that allows it to work directly with GitHub repositories.
JetBrains is developing a new programming language for use when programming with LLMs. That language may be a dialect of English. (Formal informal languages, anyone?)
Pony is a new programming language that is type-safe, memory-safe, exception-safe, race-safe, and deadlock-safe. You can try it in a browser-based playground.

Web

The AT Protocol is the core of Bluesky. Here’s a tutorial; use it to build your own Bluesky services, in turn making Bluesky truly federate.
Social media is broken, and probably can’t be fixed. Now you know. The surprise is that the problem isn’t “algorithms” for maximizing engagement; take algorithms away and everything stays the same or gets worse.
The Tiny Awards Finalists show just how much is possible on the Web. They’re moving, creative, and playful. For example, the Traffic Cam Photobooth lets people use traffic cameras to take pictures of themselves, playing with ever-present automated surveillance.
A US federal court has found that Facebook illegally collected data from the women’s health app Flo.
The HTML Hobbyist is a great site for people who want to create their own presence on the web—outside of walled gardens, without mind-crushing frameworks. It’s not difficult, and it’s not expensive.

Biology and Quantum Computing

Scientists have created biological qubits: quantum qubits built from proteins in living cells. These probably won’t be used to break cryptography, but they are likely to give us insight into how quantum processes work inside living things.

Firing Junior Developers Is Indeed the “Dumbest Thing”

Oreilly

By: Mike Loukides

25 August 2025 at 12:49

Matt Garman’s statement that firing junior developers because AI can do their work is the “dumbest thing I’ve ever heard” has almost achieved meme status. I’ve seen it quoted everywhere.

We agree. It’s a point we’ve made many times over the past few years. If we eliminate junior developers, where will the seniors come from? A few years down the road, when the current senior developers are retiring, who will take their place? The roles of juniors and seniors are no doubt changing—and, as roles change, we need to be thinking about the kinds of training junior developers will need to work effectively in their new roles, to prepare to step into roles as senior developers later in their career—possibly sooner than they (or their management) anticipated. Programming languages and algorithms are still table stakes. In addition, junior developers now need to become skilled debuggers, they need to learn design skills, and they need to start thinking on a higher level than the function they’re currently working on.

We also believe that using AI effectively is a learned skill. Andrew Stellman has written about bridging the AI learning gap, and his Sens-AI framework is designed for teaching people how to use AI as part of learning to program in a new language.

As Tim O’Reilly has written,

Here’s what history consistently shows us: Whenever the barrier to communicating with computers lowers, we don’t end up with fewer programmers—we discover entirely new territories for computation to transform.

We will need more programmers, not fewer. And we will get them—at all levels of proficiency, from complete newbie to junior professional to senior. The question facing us is this: How will we enable all of these programmers to make great software, software of a kind that may not even exist today? Not everyone needs to walk the path from beginner to seasoned professional. But that path has to exist. It will be developed through experience, what you can call “learning by doing.” That’s how technology breakthroughs turn into products, practices, and actual adoption. And we’re building that path.