Radar Trends to Watch: December 2025

November ended. Thanksgiving (in the US), turkey, and a train of model announcements. The announcements were exciting: Google’s Gemini 3 puts it in the lead among large language models, at least for the time being. Nano Banana Pro is a spectacularly good text-to-image model. OpenAI has released its heavy hitters, GPT-5.1-Codex-Max and GPT-5.1 Pro. And the Allen Institute released its latest open source model, Olmo 3, the leading open source model from the US.

Since Trends avoids deal-making (should we?), we’ve also avoided the angst around an AI bubble and its implosion. Right now, it’s safe to say that the bubble is formed of money that hasn’t yet been invested, let alone spent. If it is a bubble, it’s in the future. Do promises and wishes make a bubble? Does a bubble made of promises and wishes pop with a bang or a pffft?

AI

  • Now that Google and OpenAI have laid down their cards, Anthropic has released its latest heavyweight model: Opus 4.5. They’ve also dropped the price significantly.
  • The Allen Institute has launched its latest open source model, Olmo 3. The institute’s opened up the whole development process to allow other teams to understand its work.
  • Not to be outdone, Google has introduced Nano Banana Pro (aka Gemini 3 Pro Image), its state-of-the-art image generation model. Nano Banana’s biggest feature is the ability to edit images to change the appearance of items without redrawing them from scratch. And according to Simon Willison, it watermarks the parts of an image it generates with SynthID.
  • OpenAI has released two more components of GPT-5.1, GPT-5.1-Codex-Max (API) and GPT-5.1 Pro (ChatGPT). This release brings the company’s most powerful models for generative work into view.
  • A group of quantum physicists claim to have reduced the size of the DeepSeek model by half, and to have removed Chinese censorship. The model can now tell you what happened in Tiananmen Square, explain what Pooh looked like, and answer other forbidden questions.
  • The release train for Gemini 3 has begun, and the commentariat quickly crowned it king of the LLMs. It includes the ability to spin up a web interface so users can give it more information about their questions, and to generate diagrams along with text output.
  • As part of the Gemini 3 release, Google has also announced a new agentic IDE called Antigravity.
  • Google has released a new weather forecasting model, WeatherNext 2, that can forecast at resolutions as fine as 1 hour. The data is available through Earth Engine and BigQuery, for those who would like to do their own forecasting. There’s also an early access program on Vertex AI.
  • Grok 4.1 has been released, with reports that it is currently the best model at generative prose, including creative writing. Be that as it may, we don’t see why anyone would use an AI that has been trained to reflect Elon Musk’s thoughts and values. If AI has taught us one thing, it’s that we need to think for ourselves.
  • AI demands the creation of new data centers and new energy sources. States want to ensure that those power plants are built, and built in ways that don’t pass costs on to consumers.
  • Grokipedia uses questionable sources. Is anyone surprised? How else would you train an AI on the latest conspiracy theories?
  • AMD GPUs are competitive, but they’re hampered because there are few libraries for low-level operations. To solve this problem, Chris Ré and others have announced HipKittens, a library of programming primitives for AMD GPUs.
  • OpenAI has released GPT-5.1. The two new models are Instant, which is tuned to be more conversational and “human,” and Thinking, a reasoning model that now adapts the time it takes to “think” to the difficulty of the questions.
  • Large language models, including GPT-5 and the Chinese models, show bias against users who use a German dialect rather than standard German. The bias appeared to be greater as the model size increased. These results also apply to languages like English.
  • Ethan Mollick on evaluating (ultimately, interviewing) your AI models is a must-read.
  • Yann LeCun is leaving Facebook to launch a new startup that will develop his ideas about building AI.
  • Harbor is a new tool that simplifies benchmarking frameworks and models. It’s from the developers of the Terminal-Bench benchmark. And it brings us a step closer to a world where people build their own specialized AI rather than rely on large providers.
  • Music rights holders are beginning to make deals with Udio (and presumably other companies) that train their models on existing music. Unfortunately, this doesn’t solve the bigger problem: Music is a “collectively produced shared cultural good, sustained by human labor. Copyright isn’t suited to protecting this kind of shared value,” as professors Oliver Bown and Kathy Bowrey have argued.
  • Moonshot AI has finally released Kimi K2 Thinking, the first open weights model to have benchmark results competitive with—or exceeding—the best closed weights models. It’s designed to be used as an agent, calling external tools as needed to solve problems.
  • Tongyi DeepResearch is a new fully open source agent for doing research. Its results are comparable to OpenAI deep research, Claude Sonnet 4, and similar models. Tongyi is part of Alibaba; it’s yet another important model to come out of China.
  • Data centers in space? It’s an interesting and challenging idea. Cooling is a much bigger problem than you’d expect. They would require massive arrays of solar cells for power. But some people think it might happen.
  • MiniMax M2 is a new open weights model that focuses on building agents. It has performance similar to Claude Sonnet but at a much lower price point. It also embeds its thought processes between <think> and </think> tags, which is an important step toward interpretability.
  • DeepSeek has introduced a new model for OCR with some very interesting properties: It has a new process for storing and retrieving memories that also makes the model significantly more efficient.
  • Agent Lightning provides a code-free way to train agents using reinforcement learning.

Programming

  • The Zig programming language has published a book. Online, of course.
  • Google is weakening its controversial new rules about developer verification. The company plans to create a separate class for applications with limited distribution, and develop a flow that will allow the installation of unverified apps.
  • Google’s LiteRT is a library for running AI models in browsers and small devices. LiteRT supports Android, iOS, embedded Linux, and microcontrollers. Supported languages include Java, Kotlin, Swift, Embedded C, and C++.
  • Does AI-assisted coding mean the end of new languages? Simon Willison thinks that LLMs can encourage the development of new programming languages. Design your language and ship it with a Claude Skills-style document; that should be enough for an LLM to learn how to use it.
  • Deepnote, a successor to the Jupyter Notebook, is a next-generation notebook for data analytics that’s built for teams. There’s now a shared workspace; different blocks can use different languages; and AI integration is on the road map. It’s now open source.
  • The idea of assigning colors (red, blue) to tools may be helpful in limiting the risk of prompt injection when building agents. Which tools can return something damaging? This sounds like a step toward applying the principle of least privilege to AI design; a rough sketch follows this list.
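
As a rough illustration of the coloring idea (a minimal sketch, not any particular framework's API; the tool names and the taint rule are hypothetical), an agent runtime might tag each tool with a color and refuse to call risky tools once untrusted content has entered the context:

    # Hypothetical sketch: tag tools by risk ("red" = can return untrusted
    # content or cause damage, "blue" = read-only and trusted) and apply
    # least privilege when deciding whether the agent may call them.
    TOOLS = {
        "read_public_webpage": "red",   # output may carry injected prompts
        "send_email":          "red",   # can cause real-world damage
        "search_local_docs":   "blue",  # read-only, trusted corpus
    }

    def allowed(tool_name: str, context_tainted: bool) -> bool:
        """Deny risky tools once untrusted content is in the agent's context."""
        color = TOOLS.get(tool_name)
        if color is None:
            return False                # unknown tools are denied by default
        if color == "red" and context_tainted:
            return False                # no red tools after the context is tainted
        return True

    # After the agent reads an external webpage, the context is tainted,
    # so a damaging tool like send_email is refused.
    print(allowed("send_email", context_tainted=True))         # False
    print(allowed("search_local_docs", context_tainted=True))  # True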

Security

  • We’re making the same mistake with AI security as we made with cloud security (and security in general): treating security as an afterthought.
  • Anthropic claims to have disrupted a Chinese cyberespionage group that was using Claude to generate attacks against other systems. Anthropic claims that the attack was 90% automated, though that claim is controversial.
  • Don’t become a victim. Data collected for online age verification makes your site a target for attackers. That data is valuable, and they know it.
  • A research collaboration uses data poisoning and AI to disrupt deepfake images. Users run Silverer over their images before posting. The tool makes invisible changes to the original image that confuse AIs creating new images, leading to unusable distortions.
  • Is it a surprise that AI is being used to generate fake receipts and expense reports? After all, it’s used to fake just about everything else. It was inevitable that enterprise applications of AI fakery would appear.
  • HydraPWK2 is a Linux distribution designed for penetration testing. It’s based on Debian and is supposedly easier to use than Kali Linux.
  • How secure is your trusted execution environment (TEE)? All of the major hardware vendors are vulnerable to a number of physical attacks against “secure enclaves.” And their terms of service often exclude physical attacks.
  • Atroposia is a new malware-as-a-service package that includes a local vulnerability scanner. Once an attacker has broken into a site, they can find other ways to remain there.
  • A new kind of phishing attack (CoPhishing) uses Microsoft Copilot Studio agents to steal credentials by abusing the Sign In topic. Microsoft has promised an update that will defend against this attack.

Operations

  • Here’s how to install Open Notebook, an open source equivalent to NotebookLM, to run on your own hardware. It uses Docker and Ollama to run the notebook and the model locally, so data never leaves your system.
  • Open source isn’t “free as in beer.” Nor is it “free as in freedom.” It’s “free as in puppies.” For better or for worse, that just about says it.
  • Need a framework for building proxies? Cloudflare’s next generation Oxy framework might be what you need. (Whatever you think of their recent misadventure.)
  • MIT Media Lab’s Project NANDA intends to build infrastructure for a decentralized network of AI agents. They describe it as a global decentralized registry (not unlike DNS) that can be used to discover and authenticate agents using MCP and A2A. Isn’t this what we wanted from the internet in the first place?

Web

Things

The Other 80%: What Productivity Really Means

We’ve been bombarded with claims about how much generative AI improves software developer productivity: It turns regular programmers into 10x programmers, and 10x programmers into 100x. And even more recently, we’ve been (somewhat less, but still) bombarded with the other side of the story: METR reports that, despite software developers’ belief that their productivity has increased, total end-to-end throughput has declined with AI assistance. We also saw hints of that in last year’s DORA report, which showed that release cadence actually slowed slightly when AI came into the picture. This year’s report reverses that trend.

I want to get a couple of assumptions out of the way first:

  • I don’t believe in 10x programmers. I’ve known people who thought they were 10x programmers, but their primary skill was convincing other team members that the rest of the team was responsible for their bugs. 2x, 3x? That’s real. We aren’t all the same, and our skills vary. But 10x? No.
  • There are a lot of methodological problems with the METR report—they’ve been widely discussed. I don’t believe that means we can ignore their result; end-to-end throughput on a software product is very difficult to measure.

As I (and many others) have written, actually writing code is only about 20% of a software developer’s job. So if you optimize that away completely—perfect, secure code the first time—you only achieve a 20% speedup. (Yeah, I know, it’s unclear whether or not “debugging” is included in that 20%. Omitting it is nonsense—but if you assume that debugging adds another 10%–20%, and recognize that AI-generated code brings plenty of bugs of its own, you’re back in the same place.) That’s a consequence of Amdahl’s law, if you want a fancy name, but it’s really just simple arithmetic.
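
As a back-of-the-envelope check (a minimal sketch; the 20% figure is this article's estimate, not a measurement), here's Amdahl's law applied to that argument:

    # Amdahl's law: overall speedup when a fraction p of the work
    # is sped up by a factor s (s -> infinity means "optimized away entirely").
    def amdahl_speedup(p: float, s: float) -> float:
        return 1.0 / ((1.0 - p) + p / s)

    # If coding is roughly 20% of the job and AI makes it infinitely fast,
    # the job as a whole only gets about 1.25x faster (a 20% time reduction).
    print(amdahl_speedup(0.20, float("inf")))   # 1.25
    # A more realistic 2x speedup on the coding portion yields ~1.11x overall.
    print(round(amdahl_speedup(0.20, 2.0), 2))  # 1.11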

Amdahl’s law becomes a lot more interesting if you look at the other side of performance. I worked at a high-performance computing startup in the late 1980s that did exactly this: It tried to optimize the 80% of a program that wasn’t easily vectorizable. And while Multiflow Computer failed in 1990, our very-long-instruction-word (VLIW) architecture was the basis for many of the high-performance chips that came afterward: chips that could execute many instructions per cycle, with reordered execution flows and branch prediction (speculative execution) for commonly used paths.

I want to apply the same kind of thinking to software development in the age of AI. Code generation seems like low-hanging fruit, though the voices of AI skeptics are rising. But what about the other 80%? What can AI do to optimize the rest of the job? That’s where the opportunity really lies.

Angie Jones’s talk at AI Codecon: Coding for the Agentic World takes exactly this approach. Angie notes that code generation isn’t changing how quickly we ship because it addresses only one part of the software development lifecycle (SDLC), not the whole. That “other 80%” involves writing documentation, handling pull requests (PRs), and the continuous integration (CI) pipeline. In addition, she realizes that code generation is a one-person job (maybe two, if you’re pairing); coding is essentially solo work. Getting AI to assist the rest of the SDLC requires involving the rest of the team. In this context, she states the 1/9/90 rule: 1% are leaders who will experiment aggressively with AI and build new tools; 9% are early adopters; and 90% are “wait and see.” If AI is going to speed up releases, the 90% will need to adopt it; if it’s only the 1%, a PR here and there will be managed faster, but there won’t be substantial changes.

Angie takes the next step: She spends the rest of the talk going into some of the tools she and her team have built to take AI out of the IDE and into the rest of the process. I won’t spoil her talk, but she discusses three stages of readiness for the AI: 

  • AI-curious: The agent is discoverable, can answer questions, but can’t modify anything.
  • AI-ready: The AI is starting to make contributions, but they’re only suggestions. 
  • AI-embedded: The AI is fully plugged into the system, another member of the team.

This progression lets team members check AI out and gradually build confidence—as the AI developers themselves build confidence in what they can allow the AI to do.

Do Angie’s ideas take us all the way? Is this what we need to see significant increases in shipping velocity? It’s a very good start, but there’s another issue that’s even bigger. A company isn’t just a set of software development teams. It includes sales, marketing, finance, manufacturing, the rest of IT, and a lot more. There’s an old saying that you can’t move faster than the company. Speed up one function, like software development, without speeding up the rest and you haven’t accomplished much. A product that marketing isn’t ready to sell or that the sales group doesn’t yet understand doesn’t help.

That’s the next question we have to answer. We haven’t yet sped up real end-to-end software development, but we can. Can we speed up the rest of the company? MIT’s widely cited report claimed that 95% of AI pilots failed. Its authors theorized that this was in part because most projects targeted customer service, while back-office work was more amenable to AI in its current form. That’s true—but there’s still the issue of “the rest.” Does it make sense to use AI to generate business plans, manage supply chains, and the like if all it will do is reveal the next bottleneck?

Of course it does. This may be the best way of finding out where the bottlenecks are: in practice, when they become bottlenecks. There’s a reason Donald Knuth said that premature optimization is the root of all evil—and that doesn’t apply only to software development. If we really want to see improvements in productivity through AI, we have to look company-wide.

Radar Trends to Watch: November 2025

AI has so thoroughly colonized every technical discipline that it’s becoming hard to organize items of interest in Radar Trends. Should a story go under AI or programming (or operations or biology or whatever the case may be)? Maybe it’s time to go back to a large language model that doesn’t require any electricity and has over 217K parameters: Merriam-Webster. But no matter where these items ultimately appear, it’s good to see practical applications of AI in fields as diverse as bioengineering and UX design.

AI

  • Alibaba’s Ling-1T may be the best model you’ve never heard of. It’s a nonthinking mixture-of-experts model with 1T parameters, 50B active at any time. And it’s open weights (MIT license).
  • Marin is a new lab for creating fully open source models. They say that the development of models will be completely transparent from the beginning. Everything is tracked by GitHub; all experiments may be observed by anyone; there’s no cherrypicking of results.
  • WebMCP is a proposal and an implementation for a protocol that allows websites to become MCP servers. As servers, they can interact directly with agents and LLMs.
  • Claude has announced Agent Skills. Skills are essentially just a Markdown file describing how to perform a task, possibly accompanied by scripts and resources. They’re easy to add and only used as needed. A Skill-creator Skill makes it very easy to build Skills. Simon Willison thinks that Skills may be a “bigger deal than MCP.”
  • Pete Warden describes his work on the smallest of AI. Small AI serves an important set of applications without compromising privacy or requiring enormous resources.
  • Anthropic has released Claude Haiku 4.5, skipping 4.0 and 4.1 in the process. Haiku is their smallest and fastest model. The new release claims performance similar to Sonnet 4, but it’s much faster and less expensive.
  • NVIDIA is now offering the DGX Spark, a desktop AI supercomputer. It offers 1 petaflop performance on models with up to 200B parameters. Simon Willison has a review of a preview unit.
  • Andrej Karpathy has released nanochat, a small ChatGPT-like model that’s completely open and can be trained for roughly $100. It’s intended for experimenters, and Karpathy has detailed instructions on building and training.
  • There’s an agent-shell for Emacs? There had to be one. Emacs abhors a vacuum.
  • Anthropic launched “plugins,” which give developers the ability to write extensions to Claude Code. Of course, these extensions can be agents. Simon Willison points to Jesse Vincent’s Superpowers as a glimpse of what plugins can accomplish.
  • Google has released the Gemini 2.5 Computer Use model into public preview. While the thrill of teaching computers to click browsers and other web applications faded quickly, Gemini 2.5 Computer Use appears to be generating excitement.
  • Thinking Machines Labs has announced Tinker, an API for training open weight language models. Tinker runs on Thinking Machines’ infrastructure. It’s currently in beta.
  • Merriam-Webster will release its newest large language model on November 18. It has no data centers and requires no electricity.
  • We know that data products, including AI, reflect historical biases in their training data. In India, OpenAI’s models reflect caste biases. But it’s not just OpenAI; these biases appear in all models. Although caste discrimination was outlawed in the middle of the 20th century, these biases live on in the data.
  • DeepSeek has released an experimental version of its reasoning model, DeepSeek-V3.2-Exp. This model uses a technique called sparse attention to reduce the processing requirements (and cost) of the reasoning process.
  • OpenAI has added an Instant Checkout feature that allows users to make purchases with Etsy and Shopify merchants, taking them directly to checkout after finding their products. It’s based on the Agentic Commerce Protocol.
  • OpenAI’s GDPval tests go beyond existing benchmarks by challenging LLMs with real-world tasks rather than simple problems. The tasks were selected from 44 occupations and were chosen for economic value.

Programming

  • Steve Yegge’s Beads is a memory management system for coding agents. It’s badly needed, and worth checking out.
  • Do you use coding agents in parallel? Simon Willison was a skeptic, but he’s gradually becoming convinced it’s a good practice.
  • One problem with generative coding is that AI is trained on “the worst code in the world.” For web development, we’ll need better foundations to get to a post–frontend-framework world.
  • If you’ve wanted to program with Claude from your phone or some other device, now you can. Anthropic has added web and mobile interfaces to Claude Code, along with a sandbox for running generated code safely.
  • You may have read “Programming with Nothing,” a classic article that strips programming to the basics of lambda calculus. “Programming with Less Than Nothing” does FizzBuzz in many lines of combinatory logic.
  • What’s the difference between technical debt and architectural debt? Don’t confuse them; they’re significantly different problems, with different solutions.
  • For graph fans: The IRS has released its fact graph, which, among other things, models the US Internal Revenue Code. It can be used with JavaScript and any JVM language.
  • What is spec-driven development? It has become one of the key buzzwords in the discussion of AI-assisted software development. Birgitta Böckeler attempts to define SDD precisely, then looks at three tools for aiding SDD.
  • IEEE Spectrum released its 2025 programming languages rankings. Python is still king, with Java second; JavaScript has fallen from third to fifth. But more important, Spectrum wonders whether AI-assisted programming will make these rankings irrelevant.

Web

  • Cloudflare CEO Matthew Prince is pushing for regulation to prevent Google from tying web crawlers for search and for training content together. You can’t block the training crawler without also blocking the search crawler, and blocking the latter has significant consequences for businesses.
  • OpenAI has released Atlas, its Chromium-based web browser. As you’d expect, AI is integrated into everything. You can chat with the browser, interrogate your history, your settings, or your bookmarks, and (of course) chat with the pages you’re viewing.
  • Try again? Apple has announced a second-generation Vision Pro, with a similar design and at the same price point.
  • Have we passed peak social? Social media usage has been declining for all age groups. The youngest group, 16–24, is the largest but has also shown the sharpest decline. Are we going to reinvent the decentralized web? Or succumb to a different set of walled gardens?
  • Addy Osmani’s post “The History of Core Web Vitals” is a must-read for anyone working in web performance.
  • Features from the major web frameworks are being implemented by browsers. Frameworks won’t disappear, but their importance will diminish. People will again be programming to the browser. In turn, this will make browser testing and standardization that much more important.
  • Luke Wroblewski writes about using AI to solve common problems in user experience (UX). AI can help with problems like collecting data from users and onboarding users to new applications.

Operations

  • There’s a lot to be learned from AWS’s recent outage, which stemmed from a DynamoDB DNS failure in the US-EAST-1 region. It’s important not to write this off as a war story about Amazon’s failure. Instead, think: How do you make your own distributed networks more reliable?
  • PyTorch Monarch is a new library that helps developers manage distributed systems for training AI models. It lets developers write a script that “orchestrates all distributed resources,” allowing the developer to work with them as a single almost-local system.

Security

  • The solution to the fourth part of Kryptos, the cryptosculpture at the CIA’s headquarters, has been discovered! The discovery came through an opsec error that led researchers to the clear text stored at the Smithsonian. This is an important lesson: Attacks against cryptosystems rarely touch the cryptography. They attack the protocols, people, and systems surrounding codes.
  • Public cryptocurrency blockchains are being used by international threat actors as “bulletproof” hosts for storing and distributing malware.
  • Apple is now giving a $2M bounty for zero-day exploits that allow zero-click remote code execution on iOS. These vulnerabilities have been exploited by commercial malware vendors.
  • Signal has incorporated postquantum encryption into its Signal protocol. This is a major technological achievement. They’re one of the few organizations that’s ready for the quantum world.
  • Salesforce is refusing to pay extortion after a major data loss of over a billion records. Data from a number of major accounts was stolen by a group calling itself Scattered LAPSUS$ Hunters. Attackers simply asked the victim’s staff to install an attacker-controlled app.
  • Context is the key to AI security. We’re not surprised; right now, context is the key to just about everything in AI. Attackers have the advantage now, but in 3–5 years that advantage will pass to defenders who use AI effectively.
  • Google has announced that Gmail users can now send end-to-end encrypted (E2EE) messages regardless of whether the recipient uses Gmail. Recipients who don’t use Gmail will receive a notification and the ability to read the message on a one-time guest account.
  • The best way to attack your company isn’t through the applications; it’s through the service help desk. Social engineering remains extremely effective—more effective than attacks against software. Training helps; a well-designed workflow and playbook are crucial.
  • Ransomware detection has now been built into the desktop version of Google Drive. When it detects activities that indicate ransomware, Drive suspends file syncing and alerts users. It’s enabled by default, but it is possible to opt out.
  • OpenAI is routing requests with safety issues to an unknown model. This is presumably a specialized version of GPT-5 that has been trained specially to deal with sensitive issues.

Robotics

  • Would you buy a banana from a robot? A small chain of stores in Chicago is finding out.
  • Rodney Brooks, founder of iRobot, warns that humans should stay at least 10 feet (3 meters) away from humanoid walking robots. There is a lot of potential energy in their limbs when they move them to retain balance. Unsurprisingly, this danger stems from the vision-only approach that Tesla and other vendors have adopted. Humans learn and act with all five senses.

Quantum Computing

Biology

On the AWS Outage

Everybody notices when something big fails—like AWS’s US-EAST-1 region. And fail it did. All sorts of services and sites became inaccessible, and we all knew it was Amazon’s fault. A week later, when I run into a site that’s down, I still say, “Must be some hangover from the AWS outage. Some cache that didn’t get refreshed.” Amazon gets blamed—maybe even rightly—even when it’s not their fault.

I’m not writing about fault, though, and I’m also not writing a technical analysis of what happened. There are good places for that online, including AWS’s own summary. What I am writing about is a reaction to the outage that I’ve seen all too often: “This proves we can’t trust AWS. We need to build our own infrastructure.”

Building your own infrastructure is fine. But I’m also reminded of the wisest comment I heard after the 2012 US-EAST outage. I asked JD Long about his reaction to the outage. He said, “I’m really glad it wasn’t my guys trying to fix the problem.”1 JD wasn’t disparaging his team; he was saying that Amazon has a lot of expertise in running, maintaining, and troubleshooting really big systems that can fail suddenly in unpredictable ways—when just the right conditions happen to tickle a bug that had been latent in the system for years. That expertise is hard to find and expensive when you find it. And no matter how expert “your guys” are, all complex systems fail. After last month’s AWS failure, Microsoft’s Azure obligingly failed about 10 days later.

I’m not really an Amazon fan or, more specifically, an AWS fan. But outages like this should force us to remember what they do right. AWS outages also warn us that we need to learn how to “craft ways of undoing this concentration and creating real choice,” as Signal CEO Meredith Whittaker points out. But Meredith understands how difficult it will be to build this infrastructure and that, for the present, there’s no viable alternative to AWS or one of the other hyperscalers.

Operating and troubleshooting large systems is difficult and requires very specialized skills. If you decide to build your own infrastructure, you will need those skills. And you may end up wishing that it weren’t your guys trying to fix the problem.


Footnote

  1. In 2012, I happened to be flying out of DC just as the storm that took US-EAST down was rolling in. My flight made it out, but it was dramatic.

Enlightenment

In a fascinating op-ed, David Bell, a professor of history at Princeton, argues that “AI is shedding enlightenment values.” As someone who has taught writing at a similarly prestigious university, and as someone who has written about technology for the past 35 or so years, I had a deep response.

Bell’s is not the argument of an AI skeptic. For his argument to work, AI has to be pretty good at reasoning and writing. It’s an argument about the nature of thought itself. Reading is thinking. Writing is thinking. Those are almost clichés—they even turn up in students’ assessments of using AI in a college writing class. It’s not a surprise to see these ideas in the 18th century, and only a bit more surprising to see how far Enlightenment thinkers took them. Bell writes:

The great political philosopher Baron de Montesquieu wrote: “One should never so exhaust a subject that nothing is left for readers to do. The point is not to make them read, but to make them think.” Voltaire, the most famous of the French “philosophes,” claimed, “The most useful books are those that the readers write half of themselves.”

And in the late 20th century, the great Dante scholar John Freccero would say to his classes “The text reads you”: How you read The Divine Comedy tells you who you are. You inevitably find your reflection in the act of reading.

Is the use of AI an aid to thinking or a crutch or a replacement? If it’s either a crutch or a replacement, then we have to go back to Descartes’s “I think, therefore I am” and read it backward: What am I if I don’t think? What am I if I have offloaded my thinking to some other device? Bell points out that books guide the reader through the thinking process, while AI expects us to guide the process and all too often resorts to flattery. Sycophancy isn’t limited to a few recent versions of GPT; “That’s a great idea” has been a staple of AI chat responses since its earliest days. A dull sameness goes along with the flattery—the paradox of AI is that, for all the talk of general intelligence, it really doesn’t think better than we do. It can access a wealth of information, but it ultimately gives us (at best) an unexceptional average of what has been thought in the past. Books lead you through radically different kinds of thought. Plato is not Aquinas is not Machiavelli is not Voltaire (and for great insights on the transition from the fractured world of medieval thought to the fractured world of Renaissance thought, see Ada Palmer’s Inventing the Renaissance).

We’ve been tricked into thinking that education is about preparing to enter the workforce, whether as a laborer who can plan how to spend his paycheck (readin’, writin’, ’rithmetic) or as a potential lawyer or engineer (Bachelor’s, Master’s, Doctorate). We’ve been tricked into thinking of schools as factories—just look at any school built in the 1950s or earlier, and compare it to an early 20th century manufacturing facility. Take the children in, process them, push them out. Evaluate them with exams that don’t measure much more than the ability to take exams—not unlike the benchmarks that the AI companies are constantly quoting. The result is that students who can read Voltaire or Montesquieu as a dialogue with their own thoughts, who could potentially make a breakthrough in science or technology, are rarities. They’re not the students our institutions were designed to produce; they have to struggle against the system, and frequently fail. As one elementary school administrator told me, “They’re handicapped, as handicapped as the students who come here with learning disabilities. But we can do little to help them.”

So the difficult question behind Bell’s article is: How do we teach students to think in a world that will inevitably be full of AI, whether or not that AI looks like our current LLMs? In the end, education isn’t about collecting facts, duplicating the answers in the back of the book, or getting passing grades. It’s about learning to think. The educational system gets in the way of education, leading to short-term thinking. If I’m measured by a grade, I should do everything I can to optimize that metric. All metrics will be gamed. Even if they aren’t gamed, metrics shortcut around the real issues.

In a world full of AI, retreating to stereotypes like “AI is damaging” and “AI hallucinates” misses the point, and is a sure route to failure. What’s damaging isn’t the AI, but the set of attitudes that make AI just another tool for gaming the system. We need a way of thinking with AI, of arguing with it, of completing AI’s “book” in a way that goes beyond maximizing a score. In this light, so much of the discourse around AI has been misguided. I still hear people say that AI will save you from needing to know the facts, that you won’t have to learn the dark and difficult corners of programming languages—but as much as I personally would like to take the easy route, facts are the skeleton on which thinking is based. Patterns arise out of facts, whether those patterns are historical movements, scientific theories, or software designs. And errors are easily uncovered when you engage actively with AI’s output.

AI can help to assemble facts, but at some point those facts need to be internalized. I can name a dozen (or two or three) important writers and composers whose best work came around 1800. What does it take to go from those facts to a conception of the Romantic movement? An AI could certainly assemble and group those facts, but would you then be able to think about what that movement meant (and continues to mean) for European culture? What are the bigger patterns revealed by the facts? And what would it mean for those facts and patterns to reside only within an AI model, without human comprehension? You need to know the shape of history, particularly if you want to think productively about it. You need to know the dark corners of your programming languages if you’re going to debug a mess of AI-generated code. Returning to Bell’s argument, the ability to find patterns is what allows you to complete Voltaire’s writing. AI can be a tremendous aid in finding those patterns, but as human thinkers, we have to make those patterns our own.

That’s really what learning is about. It isn’t just collecting facts, though facts are important. Learning is about understanding and finding relationships and understanding how those relationships change and evolve. It’s about weaving the narrative that connects our intellectual worlds together. That’s enlightenment. AI can be a valuable tool in that process, as long as you don’t mistake the means for the end. It can help you come up with new ideas and new ways of thinking. Nothing says that you can’t have the kind of mental dialogue that Bell writes about with an AI-generated essay. ChatGPT may not be Voltaire, but not much is. But if you don’t have the kind of dialogue that lets you internalize the relationships hidden behind the facts, AI is a hindrance. We’re all prone to be lazy—intellectually and otherwise. What’s the point at which thinking stops? What’s the point at which knowledge ceases to become your own? Or, to go back to the Enlightenment thinkers, when do you stop writing your share of the book?

That’s not a choice AI makes for you. It’s your choice.

Radar Trends to Watch: October 2025

This month we have two more protocols to learn. Google has announced the Agent Payments Protocol (AP2), which is intended to help agents to engage in ecommerce—it’s largely concerned with authenticating and authorizing parties making a transaction. And the Agent Client Protocol (ACP) is concerned with communications between code editors and coding agents. When implemented, it would allow any code editor to plug in any compliant agent.

All hasn’t been quiet on the virtual reality front. Meta has announced its new VR/AR glasses, with the ability to display images on the lenses along with capabilities like live captioning for conversations. They’re much less obtrusive than the previous generation of VR goggles.

AI

  • Suno has announced an AI-driven digital audio workstation (DAW), a tool for enabling people to be creative with AI-generated music.
  • Ollama has added its own web search API. Ollama’s search API can be used to augment the information available to models. 
  • GitHub Copilot now offers a command-line tool, Copilot CLI. It can use either Claude Sonnet 4 or GPT-5 as the backing model, though other models should be available soon. Claude Sonnet 4 is the default.
  • Alibaba has released Qwen3-Max, a trillion-plus parameter model. There are reasoning and nonreasoning variants, though the reasoning variant hasn’t yet been released. Alibaba also released models for speech-to-text, vision-language, live translation, and more. They’ve been busy. 
  • GitHub has launched its MCP Registry to make it easier to discover MCP servers archived on GitHub. It’s also working with Anthropic and others to build an open source MCP registry, which lists servers regardless of their origin and integrates with GitHub’s registry. 
  • DeepMind has published version 3.0 of its Frontier Safety Framework, a framework for experimenting with AI-human alignment. They’re particularly interested in scenarios where the AI doesn’t follow a user’s directives, and in behaviors that can’t be traced to a specific reasoning chain.
  • Alibaba has released the Tongyi DeepResearch reasoning model. Tongyi is a 30.5B parameter mixture-of-experts model, with 3.3B parameters active. More importantly, it’s fully open source, with no restrictions on how it can be used. 
  • Locally AI is an iOS app that lets you run large language models on your iPhone or iPad. It works offline; there’s no need for a network connection. 
  • OpenAI has added control over the “reasoning” process to its GPT-5 models. Users can choose between four levels: Light (Pro users only), Standard, Extended, and Heavy (Pro only). 
  • Google has announced the Agent Payments Protocol (AP2), which facilitates purchases. It focuses on authorization (proving that the agent has the authority to make a purchase), authentication (proving that the merchant is legitimate), and accountability (in case of a fraudulent transaction).
  • Bring Your Own AI: Employee adoption of AI greatly exceeds official IT adoption. We’ve seen this before, on technologies as different as the iPhone and open source.
  • Alibaba has released the ponderously named Qwen3-Next-80B-A3B-Base. It’s a mixture-of-experts model with a high ratio of active parameters to total parameters (3.75%). Alibaba claims that the model cost 1/10 as much to train and is 10 times faster than its previous models. If this holds up, Alibaba is winning on performance where it counts.
  • Anthropic has announced a major upgrade to Claude’s capabilities. It can now execute Python scripts in a sandbox and can create Excel spreadsheets, PowerPoint presentations, PNG files, and other documents. You can upload files for it to analyze. And of course this comes with security risks.
  • The SIFT method—stop, investigate the source, find better sources, and trace quotes to their original context—is a way of structuring your use of AI output that will make you less vulnerable to misinformation. Hint: it’s not just for AI.
  • OpenAI’s Projects feature is now available to free accounts. Projects is a set of tools for organizing conversations with the LLM. Projects are separate workspaces with their own custom instructions, independent memory, and context. They can be forked. Projects sounds something like Git for LLMs—a set of features that’s badly needed.
  • EmbeddingGemma is a new open weights embedding model (308M parameters) that’s designed to run on devices, requiring as little as 200 MB of memory.
  • An experiment with GPT-4o-mini shows that language models can fall for psychological manipulation. Is this surprising? After all, they are trained on human output.
  • “Platform Shifts Redefine Apps”: AI is a new kind of platform and demands rethinking what applications mean and how they should work. Failure to do this rethinking may be why so many AI efforts fail.
  • MCP-UI is a protocol that allows MCP servers to send React components or Web Components to agents, allowing the agent to build an appropriate browser-based interface on the fly.
  • The Agent Client Protocol (ACP) is a new protocol that standardizes communications between code editors and coding agents. It’s currently supported by the Zed and Neovim editors, and by the Gemini CLI coding agent.
  • Gemini 2.5 Flash is now using a new image generation model that was internally known as “nano banana.” This new model can edit uploaded images, merge images, and maintain visual consistency across a series of images.

Programming

  • Anthropic released Claude Code 2.0. New features include the ability to checkpoint your work, so that if a coding agent wanders off-course, you can return to a previous state. They have also added the ability to run tasks in the background, call hooks, and use subagents.
  • The Wasmer project has announced that it now has full Python support in the beta version of Wasmer Edge, its WebAssembly runtime for serverless edge deployment.
  • Mitchell Hashimoto, cofounder of HashiCorp, has promised that a library for Ghostty (libghostty) is coming! This library will make it easy to embed a terminal emulator into an application. Perhaps more important, libghostty might standardize the code for terminal output across applications.
  • There’s a new benchmark for agentic coding: CompileBench. CompileBench tests models’ ability to work through the complex problems involved in figuring out how to build code.
  • Apple is reportedly rewriting iOS in a new programming language. Rust would be the obvious choice, but rumors are that it’s something of their own creation. Apple likes languages it can control. 
  • Java 25, the latest long-term support release, has a number of new features that reduce the boilerplate that makes Java difficult to learn. 
  • Luau is a new scripting language derived from Lua. It claims to be fast, small, and safe. It’s backward compatible with Version 5.1 of Lua.
  • OpenAI has launched GPT-5 Codex, its code generation model trained specifically for software engineering. Codex is now available both in the CLI tool and through the API. It’s clearly intended to challenge Anthropic’s dominant coding tool, Claude Code.
  • Do prompts belong in code repositories? We’ve argued that prompts should be archived. But they don’t belong in a source code repo like Git. There are better tools available.
  • This is cool and different. A developer has hacked the 2001 game Animal Crossing so that the dialog is generated by an LLM rather than coming from the game’s memory.
  • There’s a new programming language, Cursed, vibe-coded in its entirety with Claude; all the keywords are Gen Z slang. It’s not yet on the list, but it’s a worthy addition to Esolang.
  • Claude Code is now integrated into the Zed editor (beta), using the Agent Client Protocol (ACP).
  • Ida Bechtle’s documentary on the history of Python, complete with many interviews with Guido van Rossum, is a must-watch.

Security

  • The first malicious MCP server has been found in the wild. Postmark-MCP, an MCP server for interacting with the Postmark application, suddenly (version 1.0.16) started sending copies of all the email it handles to its developer.
  • I doubt this is the first time, but supply chain security vulnerabilities have now hit Rust’s package management system, Crates.io. Two packages that steal keys for cryptocurrency wallets have been found. It’s time to be careful about what you download.
  • Cross-agent privilege escalation is a new kind of vulnerability in which a compromised intelligent agent uses indirect prompt injection to cause a victim agent to overwrite its configuration, granting it additional privileges. 
  • GitHub is taking a number of measures to improve software supply chain security, including requiring two-factor authentication (2FA), expanding trusted publishing, and more.
  • A compromised npm package uses a QR code to encode malware. The malware is apparently embedded in the QR code (which is valid, but too dense to be read by a normal camera), unpacked by the software, and used to steal cookies from the victim’s browser.
  • Node.js and its package manager npm have been in the news because of an ongoing series of supply chain attacks. Here’s the latest report.
  • A study by Cisco has discovered over a thousand unsecured LLM servers running on Ollama. Roughly 20% were actively serving requests. The rest may have been idle Ollama instances, waiting to be exploited. 
  • Anthropic has announced that Claude will train on data from personal accounts, effective September 28. This includes Free, Pro, and Max plans. Work plans are exempted. While the company says that training on personal data is opt-in, it’s (currently) enabled by default, so it’s opt-out.
  • We now have “vibe hacking,” the use of AI to develop malware. Anthropic has reported several instances in which Claude was used to create malware that the authors could not have created themselves. Anthropic is banning threat actors and implementing classifiers to detect illegal use.
  • Zero trust is basic to modern security. But groups implementing zero trust have to realize that it’s a project that’s never finished. Threats change, people change, systems change.
  • There’s a new technique for jailbreaking LLMs: write prompts with bad grammar and run-on sentences. These seem to prevent guardrails from taking effect. 
  • In an attempt to minimize the propagation of malware on the Android platform, Google plans to block “sideloading” apps for Android devices and require developer ID verification for apps installed through Google Play.
  • A new phishing attack called ZipLine targets companies using their own “contact us” pages. The attacker then engages in an extended dialog with the company, often posing as a potential business partner, before eventually delivering a malware payload.

Operations

  • The 2025 DORA report is out! DORA may be the most detailed summary of the state of the IT industry. DORA’s authors note that AI is everywhere and that the use of AI now improves end-to-end productivity, something that was ambiguous in last year’s report.
  • Microsoft has announced that Word will save files to the cloud (OneDrive) by default. This (so far) appears to apply only when using Windows. The feature is currently in beta.

Web

Virtual and Augmented Reality

  • Meta has announced a pair of augmented reality glasses with a small display on one of the lenses, bringing it to the edge of AR. In addition to displaying apps from your phone, the glasses can do “live captioning” for conversations. The display is controlled by a wristband.

Megawatts and Gigawatts of AI

We can’t not talk about power these days. We’ve been talking about it ever since the Stargate project, with half a trillion dollars in data center investment, was floated early in the year. We’ve been talking about it ever since the now-classic “Stochastic Parrots” paper. And, as time goes on, it only becomes more of an issue.

“Stochastic Parrots” deals with two issues: AI’s power consumption and the fundamental nature of generative AI, which selects sequences of words according to statistical patterns. I always wished those were two papers, because it would be easier to disagree about power and agree about parrots. For me, the power issue is something of a red herring—but increasingly, I see that it’s a red herring that isn’t going away because too many people with too much money want herrings; too many believe that a monopoly on power (or a monopoly on the ability to pay for power) is the route to dominance.

Why, in a better world than we currently live in, would the power issue be a red herring? There are several related reasons:

  • I have always assumed that the first generation language models would be highly inefficient, and that over time, we’d develop more efficient algorithms.
  • I have also assumed that the economics of language models would be similar to chip foundries or pharma factories: The first chip coming out of a foundry costs a few billion dollars; everything afterward costs a penny apiece.
  • I believe (now more than ever) that, long-term, we will settle on small models (70B parameters or less) that can run locally rather than giant models with trillions of parameters running in the cloud.

And I still believe those points are largely true. But that’s not sufficient. Let’s go through them one by one, starting with efficiency.

Better Algorithms

A few years ago, I saw a fair number of papers about more efficient models. I remember a lot of articles about pruning neural networks (eliminating nodes that contribute little to the result) and other techniques. Papers that address efficiency are still being published—most notably, DeepMind’s recent “Mixture-of-Recursions” paper—but they don’t seem to be as common. That’s just anecdata, and should perhaps be ignored. More to the point, DeepSeek shocked the world with their R1 model, which they claimed cost roughly 1/10 as much to train as the leading frontier models. A lot of commentary insisted that DeepSeek wasn’t being up front in their measurement of power consumption, but since then several other Chinese labs have released highly capable models, with no gigawatt data centers in sight. Even more recently, OpenAI has released gpt-oss in two sizes (120B and 30B), which were reportedly much less expensive to train. It’s not the first time this has happened—I’ve been told that the Soviet Union developed amazingly efficient data compression algorithms because their computers were a decade behind ours. Better algorithms can trump larger power bills, better CPUs, and more GPUs, if we let them.

What’s wrong with this picture? The picture is good, but much of the narrative is US-centric, and that distorts it. First, it’s distorted by our belief that bigger is always better: Look at our cars, our SUVs, our houses. We’re conditioned to believe that a model with a trillion parameters has to be better than a model with a mere 70B, right? That a model that cost a hundred million dollars to train has to be better than one that can be trained economically? That myth is deeply embedded in our psyche. Second, it’s distorted by economics. Bigger is better is a myth that would-be monopolists play on when they talk about the need for ever bigger data centers, preferably funded with tax dollars. It’s a convenient myth, because convincing would-be competitors that they need to spend billions on data centers is an effective way to have no competitors.

One area that hasn’t been sufficiently explored is extremely small models developed for specialized tasks. Drew Breunig writes about the tiny chess model in Stockfish, the world’s leading chess program: It’s small enough to run in an iPhone, and replaced a much larger general-purpose model. And it soundly defeated Claude Sonnet 3.5 and GPT-4o.1 He also writes about the 27 million parameter Hierarchical Reasoning Model (HRM) that has beaten models like Claude 3.7 on the ARC benchmark. Pete Warden’s Moonshine does real-time speech-to-text transcription in the browser—and is as good as any high-end model I’ve seen. None of these are general-purpose models. They won’t vibe code; they won’t write your blog posts. But they are extremely effective at what they do. And if AI is going to fulfill its destiny of “disappearing into the walls,” of becoming part of our everyday infrastructure, we will need very accurate, very specialized models. We will have to free ourselves of the myth that bigger is better.2

The Cost of Inference

The purpose of a model isn’t to be trained; it’s to do inference. This is a gross simplification, but part of training is doing inference trillions of times and adjusting the model’s billions of parameters to minimize error. A single request takes an extremely small fraction of the effort required to train a model. That fact leads directly to the economics of chip foundries: The ability to process the first prompt costs millions of dollars, but once they’re in production, processing a prompt costs fractions of a cent. Google has claimed that processing a typical text prompt to Gemini takes 0.24 watt-hours, significantly less than it takes to heat water for a cup of coffee. They also claim that increases in software efficiency have led to a 33x reduction in energy consumption over the past year.

That’s obviously not the entire story: Millions of people prompting ChatGPT adds up, as does usage of newer “reasoning” models that have an extended internal dialog before arriving at a result. Likewise, driving to work rather than biking raises the global temperature a nanofraction of a degree—but when you multiply the nanofraction by billions of commuters, it’s a different story. It’s fair to say that an individual who uses ChatGPT or Gemini isn’t a problem, but it’s also important to realize that millions of users pounding on an AI service can grow into a problem quite quickly. Unfortunately, it’s also true that increases in efficiency often don’t lead to reductions in energy use but to solving more complex problems within the same energy budget. We may be seeing that with reasoning models, image and video generation models, and other applications that are now becoming financially feasible. Does this problem require gigawatt data centers? No, not that, but it’s a problem that can justify the building of gigawatt data centers.
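
To make the aggregation argument concrete, here's a rough back-of-the-envelope calculation (the per-prompt figure is Google's claim quoted above; the usage numbers are invented for illustration):

    # Individual prompts are cheap; aggregate use is what adds up.
    WH_PER_PROMPT = 0.24            # Google's claimed energy per Gemini text prompt

    daily_users = 100_000_000       # hypothetical user count
    prompts_per_user_per_day = 10   # hypothetical usage

    daily_wh = WH_PER_PROMPT * daily_users * prompts_per_user_per_day
    daily_mwh = daily_wh / 1_000_000
    print(f"{daily_mwh:,.0f} MWh per day")           # 240 MWh per day
    # Spread over 24 hours, that's a continuous draw of about 10 MW: far from
    # a gigawatt, but for one model, text prompts only, before reasoning,
    # image, and video workloads enter the picture.
    print(f"{daily_mwh / 24:,.1f} MW average draw")  # 10.0 MW average draw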

There is a solution, but it requires rethinking the problem. Telling people to use public transportation or bicycles for their commute is ineffective (in the US), as will be telling people not to use AI. The problem needs to be rethought: redesigning work to eliminate the commute (O’Reilly is 100% work from home), rethinking the way we use AI so that it doesn’t require cloud-hosted trillion parameter models. That brings us to using AI locally.

Staying Local

Almost everything we do with GPT-*, Claude-*, Gemini-*, and other frontier models could be done equally effectively on much smaller models running locally: in a small corporate machine room or even on a laptop. Running AI locally also shields you from problems with availability, bandwidth, limits on usage, and leaking private data. This is a story that would-be monopolists don’t want us to hear. Again, this is anecdata, but I’ve been very impressed by the results I get from running models in the 30 billion parameter range on my laptop. I do vibe coding and get mostly correct code that the model can (usually) fix for me; I ask for summaries of blogs and papers and get excellent results. Anthropic, Google, and OpenAI are competing for tenths of a percentage point on highly gamed benchmarks, but I doubt that those benchmark scores have much practical meaning. I would love to see a study on the difference between Qwen3-30B and GPT-5.
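
For the curious, here's roughly what that local workflow looks like (a minimal sketch assuming Ollama is installed and a model in that size range has already been pulled; the model tag below is an example, not a recommendation):

    import json
    import urllib.request

    # Ollama serves a local HTTP API on port 11434; prompts never leave the machine.
    OLLAMA_URL = "http://localhost:11434/api/generate"
    MODEL = "qwen3:30b"  # example tag; substitute whatever model you've pulled

    payload = json.dumps({
        "model": MODEL,
        "prompt": "Summarize the trade-offs of running LLMs locally.",
        "stream": False,  # return a single JSON object rather than a token stream
    }).encode("utf-8")

    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        print(json.loads(response.read())["response"])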

What does that mean for energy costs? It’s unclear. Gigawatt data centers for doing inference would go unneeded if people do inference locally, but what are the consequences of a billion users doing inference on high-end laptops? If I give my local AIs a difficult problem, my laptop heats up and runs its fans. It’s using more electricity. And laptops aren’t as efficient as data centers that have been designed to minimize electrical use. It’s all well and good to scoff at gigawatts, but when you’re using that much power, minimizing power consumption saves a lot of money. Economies of scale are real. Personally, I’d bet on the laptops: Computing with 30 billion parameters is undoubtedly going to be less energy-intensive than computing with 3 trillion parameters. But I won’t hold my breath waiting for someone to do this research.

There’s another side to this question, and that involves models that “reason.” So-called “reasoning models” have an internal conversation (not always visible to the user) in which the model “plans” the steps it will take to answer the prompt. A recent paper claims that smaller open source models tend to generate many more reasoning tokens than large models (3 to 10 times as many, depending on the models you’re comparing), and that the extensive reasoning process eats away at the economics of the smaller models. Reasoning tokens must be processed, the same as any user-generated tokens; this processing incurs charges (which the paper discusses), and charges presumably relate directly to power.
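
A rough sketch of the arithmetic the paper is pointing at: if a small model is 10x cheaper per output token but emits several times as many reasoning tokens, much of its price (and presumably power) advantage evaporates. The per-token prices and token counts below are illustrative placeholders, not any provider’s actual rates.

```python
# Sketch: how extra reasoning tokens erode a small model's per-token price advantage.
# Prices and token counts are illustrative placeholders, not real provider rates.

def request_cost(output_tokens: int, price_per_million_tokens: float) -> float:
    return output_tokens / 1_000_000 * price_per_million_tokens

ANSWER_TOKENS = 500            # tokens in the final answer, same for both models
LARGE_REASONING = 1_000        # hypothetical reasoning tokens, large model
SMALL_REASONING = 5_000        # 5x as many reasoning tokens, small model

large = request_cost(ANSWER_TOKENS + LARGE_REASONING, price_per_million_tokens=10.00)
small = request_cost(ANSWER_TOKENS + SMALL_REASONING, price_per_million_tokens=1.00)

print(f"Large model: ${large:.4f} per request")   # $0.0150
print(f"Small model: ${small:.4f} per request")   # $0.0055
# 10x cheaper per token, but only ~2.7x cheaper per request once the longer
# reasoning trace is counted; the gap shrinks further as the ratio grows.
```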

While it’s surprising that small models generate more reasoning tokens, it’s no surprise that reasoning is expensive, and we need to take that into account. Reasoning is a tool to be used; it tends to be particularly useful when a model is asked to solve a problem in mathematics. It’s much less useful when the task involves looking up facts, summarization, writing, or making recommendations. It can help in areas like software design but is likely to be a liability for generative coding. In these cases, the reasoning process can actually become misleading—in addition to burning tokens. Deciding how to use models effectively, whether you’re running them locally or in the cloud, is a task that falls to us.

Going to the giant reasoning models for the “best possible answer” is always a temptation, especially when you know you don’t need the best possible answer. It takes some discipline to commit to the smaller models—even though it’s difficult to argue that using the frontier models is less work. You still have to analyze their output and check their results. And I confess: As committed as I am to the smaller models, I tend to stick with models in the 30B range, and avoid the 1B–5B models (including the excellent Gemma 3N). Those models, I’m sure, would give good results, use even less power, and run even faster. But I’m still in the process of peeling myself away from my knee-jerk assumptions.

Bigger isn’t necessarily better; more power isn’t necessarily the route to AI dominance. We don’t yet know how this will play out, but I’d place my bets on smaller models running locally and trained with efficiency in mind. There will no doubt be some applications that require large frontier models—perhaps generating synthetic data for training the smaller models—but we really need to understand where frontier models are needed, and where they aren’t. My bet is that they’re rarely needed. And if we free ourselves from the desire to use the latest, largest frontier model just because it’s there—whether or not it serves our purposes any better than a 30B model—we won’t need most of those giant data centers. Don’t be seduced by the AI-industrial complex.


Footnotes

  1. I’m not aware of games between Stockfish and the more recent Claude 4, Claude 4.1, and GPT-5 models. There’s every reason to believe the results would be similar.
  2. Kevlin Henney makes a related point in “Scaling False Peaks.”

Radar Trends to Watch: September 2025

For better or for worse, AI has colonized this list so thoroughly that AI itself is little more than a list of announcements about new or upgraded models. But there are other points of interest. Is it just a coincidence (possibly to do with BlackHat) that so much happened in security in the past month? We’re still seeing programming languages—even some new programming languages for writing AI prompts! If you’re into retrocomputing, the much-beloved Commodore 64 is back—with an upgraded audio chip, a new processor, much more RAM, and all your old ports. Heirloom peripherals should still work.

AI

  • OpenAI has released its Realtime API, which supports MCP servers, phone calls using the SIP protocol, and image inputs. The release includes gpt-realtime, an advanced speech-to-speech model.
  • ChatGPT now supports project-only memory. Project memory, which can use previous conversations for additional context, can be limited to a specific project. Project-only memory gives more control over context and prevents one project’s context from contaminating another.
  • FairSense is a framework for investigating whether AI systems are fair early on. FairSense runs long-term simulations to detect whether a system will become unfair as it evolves over time.
  • Agents4Science is a new academic conference in which all the submissions will be researched, written, reviewed, and presented primarily by AI (using text-to-speech for presentations).
  • Drew Breunig’s mix and match cheat sheet for AI job titles is a classic. 
  • Cohere’s Command A Reasoning is another powerful, partially open reasoning model. It is available on Hugging Face. It claims to outperform gpt-oss-120b and DeepSeek R1-0528.
  • DeepSeek has released DeepSeekV3.1. This is a hybrid model that supports reasoning and nonreasoning use. It’s also faster than R1 and has been designed for agentic tasks. It uses reasoning tokens more economically, and it was much less expensive to train than GPT-5.
  • Anthropic has added the ability to terminate chats to Claude Opus. Chats can be terminated if a user persists in making harmful requests. Terminated chats can’t be continued, although users can start a new chat. The feature is currently experimental.
  • Google has released its smallest model yet: Gemma 3 270M. This model is designed for fine-tuning and for deployment on small, limited hardware. Here’s a bedtime story generator that runs in the browser, built with Gemma 3 270M. 
  • ChatGPT has added Gmail, Google Calendar, and Google Contacts to its group of connectors, which integrate ChatGPT with other applications. This information will be used to provide additional context—and presumably will be used for training or discovery in ongoing lawsuits. Fortunately, it’s (at this point) opt-in.
  • Anthropic has upgraded Claude Sonnet 4 with a 1M token context window. The larger context window is only available via the API.
  • OpenAI released GPT-5. Simon Willison’s review is excellent. It doesn’t feel like a breakthrough, but it is quietly better at delivering good results. It is claimed to be less prone to hallucination and incorrect answers. One quirk is that with ChatGPT, GPT-5 determines which model should respond to your prompt.
  • Anthropic is researching persona vectors as a means of training a language model to behave correctly. Steering a model toward inappropriate behavior during training can be a kind of “vaccination” against that behavior when the model is deployed, without compromising other aspects of the model’s behavior.
  • The Darwin Gödel Machine is an agent that can read and modify its own code to improve its performance on tasks. It can add tools, re-organize workflows, and evaluate whether these changes have improved its performance.
  • Grok is at it again: generating nude deepfakes of Taylor Swift without being prompted to do so. I’m sure we’ll be told that this was the result of an unauthorized modification to the system prompt. In AI, some things are predictable.
  • Anthropic has released Claude Opus 4.1, an upgrade to its flagship model. We expect this to be the “gold standard” for generative coding.
  • OpenAI has released two open-weight models, their first since GPT-2: gpt-oss-120b and gpt-oss-20b. They are reasoning models designed for use in agentic applications. Claimed performance is similar to OpenAI’s o3 and o4-mini.
  • OpenAI has also released a “response format” named Harmony. It’s not quite a protocol, but it is a standard that specifies the format of conversations by defining roles (system, user, etc.) and channels (final, analysis, commentary) for a model’s output.
  • Can AIs evolve guilt? Guilt is expressed in human language; it’s in the training data. The AI that deleted a production database because it “panicked” certainly expressed guilt. Whether an AI’s expressions of guilt are meaningful in any way is a different question.
  • Claude Code Router is a tool for routing Claude Code requests to different models. You can choose different models for different kinds of requests.
  • Qwen has released a thinking version of their flagship model, called Qwen3-235B-A22B-Thinking-2507. Thinking cannot be switched on or off. The model was trained with a new reinforcement learning algorithm called Group Sequence Policy Optimization. It burns a lot of tokens, and it’s not very good at pelicans.
  • ChatGPT is releasing “personalities” that control how it formulates its responses. Users can choose which personality responds: robot, cynic, listener, sage, and presumably more to come.
  • DeepMind has created Aeneas, a new model designed to help scholars understand ancient fragments. In ancient text, large pieces are often missing. Can AI help place these fragments into contexts where they can be understood? Latin only, for now.

Security

  • The US Cybersecurity and Infrastructure Security Agency (CISA) has warned that a serious code execution vulnerability in Git is currently being exploited in the wild.
  • Is it possible to build an agentic browser that is safe from prompt injection? Probably not. Separating user instructions from website content isn’t possible. If a browser can’t take direction from the content of a web page, how is it to act as an agent?
  • The solution to Part 4 of Kryptos, the CIA’s decades-old cryptographic sculpture, is for sale! Jim Sanborn, the creator of Kryptos, is auctioning the solution. He hopes that the winner will preserve the secret and take over verifying people’s claims to have solved the puzzle. 
  • Remember XZ, the supply-chain attack that granted backdoor access via a trojaned compression library? It never went away. Although the affected libraries were quickly patched, it’s still active, and propagating, via Docker images that were built with unpatched libraries. Some gifts keep giving.
  • For August, Embrace the Red published The Month of AI Bugs, a daily post about AI vulnerabilities (mostly various forms of prompt injection). This series is essential reading for AI developers and for security professionals.
  • NIST has finalized a standard for lightweight cryptography: cryptographic algorithms designed for small devices with limited computing power. It’s useful both for encrypting sensitive data and for authentication.
  • The Dark Patterns Tip Line is a site for reporting dark patterns: design features in websites and applications that are designed to trick us into acting against our own interest.
  • OpenSSH supports post-quantum key agreement, and in versions 10.1 and later, will warn users when they select a non-post-quantum key agreement scheme.
  • SVG files can carry a malware payload; pornographic SVGs include JavaScript payloads that automate clicking “like.” That’s a simple attack with few consequences, but much more is possible, including cross-site scripting, denial of service, and other exploits.
  • Google’s AI agent for discovering security flaws, Big Sleep, has found 20 flaws in popular software. DeepMind discovered and reproduced the flaws, which were then verified by human security experts and reported. Details won’t be provided until the flaws have been fixed.
  • The US CISA (Cybersecurity and Infrastructure Security Agency) has open-sourced Thorium, a platform for malware and forensic analysis.
  • Prompt injection, again: A new prompt injection attack embeds instructions in language that appears to be copyright notices and other legal fine print. To avoid litigation, many models are configured to prioritize legal instructions.
  • Light can be watermarked; this may be useful as a technique for detecting fake or manipulated video.
  • vCISO (Virtual CISO) services are thriving, particularly among small and mid-size businesses that can’t afford a full security team. The use of AI is cutting the vCISO workload. But who takes the blame when there’s an incident?
  • A phishing attack against PyPI users directs them to a fake PyPI site that tells them to verify their login credentials. Stolen credentials could be used to plant malware in the genuine PyPI repository. Users of Mozilla’s add-on repository have also been targeted by phishing attacks.
  • A new ransomware group named Chaos appears to be a rebranding of the BlackSuit group, which was taken down recently. BlackSuit itself is a rebranding of the Royal group, which in turn is a descendant of the Conti group. Whack-a-mole continues.
  • Google’s OSS Rebuild project is an important step forward in supply chain security. Rebuild provides build definitions along with metadata that can confirm projects were built correctly. OSS Rebuild currently supports the NPM, PyPI, and Crates ecosystems.
  • The JavaScript package “is,” which does some simple type checking, has been infected with malware. Supply chain security is a huge issue—be careful what you install!

Programming

  • Claude Code PM is a workflow management system for programming with Claude. It manages PRDs, GitHub, and parallel execution of coding agents. It claims to facilitate collaboration between multiple Claude instances working on the same project. 
  • Rust is increasingly used to implement performance-critical extensions to Python, gradually displacing C. Polars, Pydantic, and FastAPI are three libraries that rely on Rust.
  • Microsoft’s Prompt Orchestration Markup Language (POML) is an HTML-like markup language for writing prompts. It is then compiled into the actual prompt. POML is good at templating and has tags for tabular and document data. Is this a step forward? You be the judge.
  • Claudia is an “elegant desktop companion” for Claude Code; it turns terminal-based Claude Code into something more like an IDE, though it seems to focus more on the workflow than on coding.
  • Google’s LangExtract is a simple but powerful Python library for extracting structured information from documents. It relies on examples, rather than regular expressions or other hacks, and shows the exact context in which the extracted items occur. LangExtract is open source.
  • Microsoft appears to be integrating GitHub into its AI team rather than running it as an independent organization. What this means for GitHub users is unclear. 
  • Cursor now has a command-line interface, almost certainly a belated response to the success of Claude Code CLI and Gemini CLI. 
  • Latency is a problem for enterprise AI. And the root cause of latency in AI applications is usually the database.
  • The Commodore 64 is back. With several orders of magnitude more RAM. And all the original ports, plus HDMI. 
  • Google has announced Gemini CLI GitHub Actions, an addition to their agentic coder that allows it to work directly with GitHub repositories. 
  • JetBrains is developing a new programming language for use when programming with LLMs. That language may be a dialect of English. (Formal informal languages, anyone?) 
  • Pony is a new programming language that is type-safe, memory-safe, exception-safe, race-safe, and deadlock-safe. You can try it in a browser-based playground.

Web

  • The AT Protocol is the core of Bluesky. Here’s a tutorial; use it to build your own Bluesky services and, in turn, make Bluesky truly federated.
  • Social media is broken, and probably can’t be fixed. Now you know. The surprise is that the problem isn’t “algorithms” for maximizing engagement; take algorithms away and everything stays the same or gets worse. 
  • The Tiny Awards Finalists show just how much is possible on the Web. They’re moving, creative, and playful. For example, the Traffic Cam Photobooth lets people use traffic cameras to take pictures of themselves, playing with ever-present automated surveillance.
  • A US federal court has found that Facebook illegally collected data from the women’s health app Flo. 
  • The HTML Hobbyist is a great site for people who want to create their own presence on the web—outside of walled gardens, without mind-crushing frameworks. It’s not difficult, and it’s not expensive.

Biology and Quantum Computing

  • Scientists have created biological qubits: qubits built from proteins inside living cells. These probably won’t be used to break cryptography, but they are likely to give us insight into how quantum processes work inside living things.

Firing Junior Developers Is Indeed the “Dumbest Thing”

Matt Garman’s statement that firing junior developers because AI can do their work is the “dumbest thing I’ve ever heard” has almost achieved meme status. I’ve seen it quoted everywhere.

We agree. It’s a point we’ve made many times over the past few years. If we eliminate junior developers, where will the seniors come from? A few years down the road, when the current senior developers are retiring, who will take their place? The roles of juniors and seniors are no doubt changing—and, as those roles change, we need to think about the kind of training junior developers will need to work effectively in their new roles and to prepare to step into senior roles later in their careers—possibly sooner than they (or their management) anticipated. Programming languages and algorithms are still table stakes. In addition, junior developers now need to become skilled debuggers, they need to learn design skills, and they need to start thinking on a higher level than the function they’re currently working on.

We also believe that using AI effectively is a learned skill. Andrew Stellman has written about bridging the AI learning gap, and his Sens-AI framework is designed for teaching people how to use AI as part of learning to program in a new language.

As Tim O’Reilly has written,

Here’s what history consistently shows us: Whenever the barrier to communicating with computers lowers, we don’t end up with fewer programmers—we discover entirely new territories for computation to transform.

We will need more programmers, not fewer. And we will get them—at all levels of proficiency, from complete newbie to junior professional to senior. The question facing us is this: How will we enable all of these programmers to make great software, software of a kind that may not even exist today? Not everyone needs to walk the path from beginner to seasoned professional. But that path has to exist. It will be developed through experience, what you can call “learning by doing.” That’s how technology breakthroughs turn into products, practices, and actual adoption. And we’re building that path.

The Abstractions, They Are A-Changing

Since ChatGPT appeared on the scene, we’ve known that big changes were coming to computing. But it’s taken a few years for us to understand what they were. Now, we’re starting to understand what the future will look like. It’s still hazy, but we’re starting to see some shapes—and the shapes don’t look like “we won’t need to program any more.” But what will we need?

Martin Fowler recently described the force driving this transformation as the biggest change in the level of abstraction since the invention of high-level languages, and that’s a good place to start. If you’ve ever programmed in assembly language, you know what that first change means. Rather than writing individual machine instructions, you could write in languages like Fortran or COBOL or BASIC or, a decade later, C. While we now have much better languages than early Fortran and COBOL—and both languages have evolved, gradually acquiring the features of modern programming languages—the conceptual difference between Rust and an early Fortran is much, much smaller than the difference between Fortran and assembler. There was a fundamental change in abstraction. Instead of using mnemonics to abstract away hex or octal opcodes (to say nothing of patch cables), we could write formulas. Instead of testing memory locations, we could control execution flow with for loops and if branches.

The change in abstraction that language models have brought about is every bit as big. We no longer need to use precisely specified programming languages with small vocabularies and syntax that limited their use to specialists (whom we call “programmers”). We can use natural language—with a huge vocabulary, flexible syntax, and lots of ambiguity. The Oxford English Dictionary contains over 600,000 words; the last time I saw a complete English grammar reference, it was four very large volumes, not a page or two of BNF. And we all know about ambiguity. Human languages thrive on ambiguity; it’s a feature, not a bug. With LLMs, we can describe what we want a computer to do in this ambiguous language rather than writing out every detail, step-by-step, in a formal language.

That change isn’t just about “vibe coding,” although it does allow experimentation and demos to be developed at breathtaking speed. And it won’t mean the disappearance of programmers just because everyone knows English (at least in the US)—not in the near future, and probably not even in the long term. Yes, people who have never learned to program, and who won’t learn to program, will be able to use computers more fluently. But we will continue to need people who understand the transition between human language and what a machine actually does. We will still need people who understand how to break complex problems into simpler parts. And we will especially need people who understand how to manage the AI when it goes off course—when the AI starts generating nonsense, when it gets stuck on an error that it can’t fix. If you follow the hype, it’s easy to believe that those problems will vanish into the dustbin of history. But anyone who has used AI to generate nontrivial software knows that we’ll be stuck with those problems, and that it will take professional programmers to solve them.

The change in abstraction does mean that what software developers do will change. We have been writing about that for the past few years: more attention to testing, more attention to up-front design, more attention to reading and analyzing computer-generated code. The lines continue to shift, as simple code completion gave way to interactive AI assistance, which in turn gave way to agentic coding. But there’s a seismic change coming from the deep layers underneath the prompt, and we’re only now beginning to see it.

A few years ago, everyone talked about “prompt engineering.” Prompt engineering was (and remains) a poorly defined term that sometimes meant using tricks as simple as “tell it to me with horses” or “tell it to me like I am five years old.” We don’t do that so much any more. The models have gotten better. We still need to write prompts that are used by software to interact with AI. That’s a different, and more serious, side to prompt engineering that won’t disappear as long as we’re embedding models in other applications.

More recently, we’ve realized that it’s not just the prompt that’s important. It’s not just telling the language model what you want it to do. Lying beneath the prompt is the context: the history of the current conversation, what the model knows about your project, what the model can look up online or discover through the use of tools, and even (in some cases) what the model knows about you, as expressed in all your interactions. The task of understanding and managing the context has recently become known as context engineering.
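
To make “context” concrete, here’s a minimal sketch of the pieces an application might assemble before each model call. The class and field names are made up for illustration; they don’t correspond to any particular framework’s API.

```python
# Illustrative sketch of the ingredients of "context." The structure and field
# names are hypothetical; no particular framework's API is implied.
from dataclasses import dataclass, field

@dataclass
class Context:
    system_prompt: str                                        # standing instructions
    history: list[dict] = field(default_factory=list)         # prior conversation turns
    retrieved_docs: list[str] = field(default_factory=list)   # specs, RAG results, tool output
    user_memory: list[str] = field(default_factory=list)      # long-lived preferences

    def to_messages(self) -> list[dict]:
        """Flatten everything into the chat-message list most model APIs expect."""
        preamble = "\n\n".join(self.user_memory + self.retrieved_docs)
        system = self.system_prompt + ("\n\n" + preamble if preamble else "")
        return [{"role": "system", "content": system}] + self.history
```

Every one of those fields is a place where stale, irrelevant, or hallucinated material can creep in, which is what makes managing them an engineering problem.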

Context engineering must account for what can go wrong with context. That will certainly evolve over time as models change and improve. And we’ll also have to deal with the same dichotomy that prompt engineering faces: A programmer managing the context while generating code for a substantial software project isn’t doing the same thing as someone designing context management for a software project that incorporates an agent, where errors in a chain of calls to language models and other tools are likely to multiply. These tasks are related, certainly. But they differ as much as “explain it to me with horses” differs from reformatting a user’s initial request with dozens of documents pulled from a retrieval system (RAG).

Drew Breunig has written an excellent pair of articles on the topic: “How Long Contexts Fail” and “How to Fix Your Context.” I won’t enumerate (maybe I should) the context failures and fixes that Drew describes, but I will describe some things I’ve observed:

  • What happens when you’re working on a program with an LLM and suddenly everything goes sour? You can tell it to fix what’s wrong, but the fixes don’t make things better and often make them worse. Something is wrong with the context, but it’s hard to say what, and even harder to fix.
  • It’s been noticed that, with long-context models, the beginning and the end of the context window get the most attention. Content in the middle of the window is likely to be ignored. How do you deal with that? (One common mitigation is sketched after this list.)
  • Web browsers have accustomed us to pretty good (if not perfect) interoperability. But different models use their context and respond to prompts differently. Can we have interoperability between language models?
  • What happens when hallucinated content becomes part of the context? How do you prevent that? How do you clear it?
  • At least when using chat frontends, some of the most popular models are implementing conversation history: They will remember what you said in the past. While this can be a good thing (you can say “always use 4-space indents” once), again, what happens if it remembers something that’s incorrect?
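
Here’s the mitigation referenced above for the attention problem: order the material so the highest-priority pieces land at the start and end of the window, and bury the least important in the middle. A minimal sketch, assuming you already have some relevance score for each chunk:

```python
# Sketch: place the most important context chunks at the edges of the window,
# where long-context models attend most reliably, and the weakest in the middle.
# The priority scores are assumed to come from elsewhere (relevance ranking, recency).

def order_for_long_context(chunks: list[tuple[str, float]]) -> list[str]:
    """chunks: (text, priority) pairs. Highest priority goes first, second highest
    goes last, and the lowest-priority chunks end up in the middle."""
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)
    front, back = [], []
    for i, (text, _) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]

docs = [("spec.md", 0.9), ("old chat turn", 0.2), ("error log", 0.7), ("style guide", 0.4)]
print(order_for_long_context(docs))
# ['spec.md', 'style guide', 'old chat turn', 'error log']
```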

“Quit and start again with another model” can solve many of these problems. If Claude isn’t getting something right, you can go to Gemini or GPT, which will probably do a good job of understanding the code Claude has already written. They are likely to make different errors—but you’ll be starting with a smaller, cleaner context. Many programmers describe bouncing back and forth between different models, and I’m not going to say that’s bad. It’s similar to asking different people for their perspectives on your problem.

But that can’t be the end of the story, can it? Despite the hype and the breathless pronouncements, we’re still experimenting and learning how to use generative coding. “Quit and start again” might be a good solution for proof-of-concept projects or even single-use software (“voidware”), but it hardly sounds like a good solution for enterprise software, which, as we know, has lifetimes measured in decades. We rarely program that way, and for the most part, we shouldn’t. It sounds too much like a recipe for repeatedly getting 75% of the way to a finished project only to start again, and to discover that Gemini solves Claude’s problem but introduces its own. Drew has interesting suggestions for specific problems—such as using RAG to determine which MCP tools to use so the model won’t be confused by a large library of irrelevant tools. At a higher level, we need to think about what we really need to do to manage context. What tools do we need to understand what the model knows about any project? When we need to quit and start again, how do we save and restore the parts of the context that are important?
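
Drew’s RAG-for-tool-selection suggestion can be sketched in a few lines: embed each tool’s description once, embed the incoming request, and hand the model only the top few matches. The embedding function below is a stand-in; with a real embedding model the ranking becomes meaningful.

```python
# Sketch of RAG-style tool selection: expose only the MCP tools whose descriptions
# best match the request, rather than the entire tool library.
# `embed` is a placeholder; swap in a real local or hosted embedding model.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: random vectors, for illustration only."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def select_tools(request: str, tools: dict[str, str], k: int = 3) -> list[str]:
    """Return the names of the k tools whose descriptions best match the request."""
    query = embed(request)
    scores = {name: float(embed(desc) @ query) for name, desc in tools.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

tools = {
    "query_database": "Run read-only SQL queries against the analytics warehouse.",
    "create_ticket": "Open a ticket in the issue tracker.",
    "search_docs": "Search internal engineering documentation.",
    # ...dozens more in a real deployment
}
print(select_tools("Why did last night's ETL job fail?", tools))
```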

Several years ago, O’Reilly author Allen Downey suggested that in addition to a source code repo, we need a prompt repo to save and track prompts. We also need an output repo that saves and tracks the model’s output tokens—both its discussion of what it has done and any reasoning tokens that are available. And we need to track anything that is added to the context, whether explicitly by the programmer (“here’s the spec”) or by an agent that is querying everything from online documentation to in-house CI/CD tools and meeting transcripts. (We’re ignoring, for now, agents where context must be managed by the agent itself.)
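
One low-tech sketch of the “what to save” part (the “where” question comes next): append every prompt, response, and context addition to a JSON Lines log tracked alongside the source code. The directory layout and field names are a hypothetical convention, not an established tool.

```python
# Sketch: a git-tracked prompt/output log, one JSON Lines file per day.
# The .ai-log/ layout and field names are a hypothetical convention.
import datetime
import json
import pathlib

LOG_DIR = pathlib.Path(".ai-log")   # assumed to live inside the project repo

def log_exchange(prompt: str, output: str, context_additions: list[str]) -> None:
    """Append one prompt/response exchange, plus anything added to the context."""
    LOG_DIR.mkdir(exist_ok=True)
    logfile = LOG_DIR / f"{datetime.date.today()}.jsonl"
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "prompt": prompt,
        "output": output,                        # including any visible reasoning
        "context_additions": context_additions,  # specs, docs, meeting transcripts
    }
    with logfile.open("a") as f:
        f.write(json.dumps(record) + "\n")
```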

But that just describes what needs to be saved—it doesn’t tell you where the context should be saved or how to reason about it. Saving context in an AI provider’s cloud seems like a problem waiting to happen; what are the consequences of letting OpenAI, Anthropic, Microsoft, or Google keep a transcript of your thought processes or the contents of internal documents and specifications? (In a short-lived experiment, ChatGPT chats were indexed and findable by Google searches.) And we’re still learning how to reason about context, which may well require another AI. Meta-AI? Frankly, that feels like a cry for help. We know that context engineering is important. We don’t yet know how to engineer it, though we’re starting to get some hints. (Drew Breunig said that we’ve been doing context engineering for the past year, but we’ve only started to understand it.) It’s more than just cramming as much as possible into a large context window—that’s a recipe for failure. It will involve knowing how to locate parts of the context that aren’t working, and ways of retiring those ineffective parts. It will involve determining what information will be the most valuable and helpful to the AI. In turn, that may require better ways of observing a model’s internal logic, something Anthropic has been researching.

Whatever is required, it’s clear that context engineering is the next step. We don’t think it’s the last step in understanding how to use AI to aid software development. There are still problems like discovering and using organizational context, sharing context among team members, developing architectures that work at scale, designing user experiences, and much more. Martin Fowler’s observation that there’s been a change in the level of abstraction is likely to have huge consequences: benefits, surely, but also new problems that we don’t yet know how to think about. We’re still negotiating a route through uncharted territory. But we need to take the next step if we plan to get to the end of the road.


AI tools are quickly moving beyond chat UX to sophisticated agent interactions. Our upcoming AI Codecon event, Coding for the Future Agentic World, will highlight how developers are already using agents to build innovative and effective AI-powered experiences. We hope you’ll join us on September 9 to explore the tools, workflows, and architectures defining the next era of programming. It’s free to attend.

Register now to save your seat.
