
When a Feature File Tripped the Internet

24 November 2025 at 10:41
A bad control plane artifact, a fragile data plane, and 5xxs everywhere

This post lays out how we think about incidents like Cloudflare’s outage this week, why pure smart‑contract control planes with timelocks change the failure modes, and where zero‑knowledge proofs fit.

Tuesday’s outage summary

On Nov 18, 2025 at 11:20 UTC, Cloudflare’s edge began returning 5xx for a big slice of traffic. The root trigger wasn’t an attacker; it was a ClickHouse permissions change that made a query return duplicate rows. That query generated a Bot Management “feature file” shipped to every edge box every few minutes.

The duplicates doubled the file and pushed the feature count past 200. The bot module had a hard cap and an unwrap() that panicked when the cap was exceeded. Because nodes alternated between “old-good” and “new-bad” outputs every five minutes, the fleet oscillated until every shard was updated and stayed bad.

Cloudflare halted the publisher at 14:24, shipped a last‑known‑good file at 14:30, and reported full recovery at 17:06. The follow‑ups they listed: harden ingestion of internal config, add global kill switches, and review failure modes across modules.

See Cloudflare’s own postmortem for the full timeline and code snippets.

There are two separate problems in that story:

  1. Control‑plane failure: a generator emitted an out‑of‑spec artifact (duplicates, too many features, too large).
  2. Data‑plane fragility: the consumer crashed instead of degrading gracefully.

You still fix (2) in code reviews. But (1) is where blockchains shine: as a tamper‑evident, programmable gate in front of rollouts.
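The “degrade gracefully” half of the fix (problem 2) is worth making concrete. A minimal sketch, assuming a hypothetical JSON feature file and a hypothetical `load_with_fallback` entry point; the cap value comes from the incident write-up, everything else is illustrative:

```python
import json

MAX_FEATURES = 200  # hard cap mentioned in the incident write-up


def parse_feature_file(raw: str) -> list[dict]:
    """Parse and validate a candidate feature file; raise on any violation."""
    features = json.loads(raw)
    if len(features) > MAX_FEATURES:
        raise ValueError(f"feature count {len(features)} exceeds cap {MAX_FEATURES}")
    return features


def load_with_fallback(candidate: str, last_known_good: list[dict]) -> list[dict]:
    """Adopt the candidate only if it parses and validates cleanly;
    otherwise keep serving the last-known-good config instead of crashing."""
    try:
        return parse_feature_file(candidate)
    except (ValueError, json.JSONDecodeError):
        return last_known_good
```

The point is the shape, not the language: a malformed artifact should demote to the previous config, never take down the data plane.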

“Proof‑carrying config” on a public blockchain

If you compress the idea to one sentence: no config becomes “current” unless a smart contract says so, and the contract only flips that flag after a timelock and a proof that the artifact obeys invariants. That one sentence implies a complete architecture.
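As a toy model of that sentence, here is the promotion gate reduced to its two conditions. This is a plain-Python sketch, not a real contract; the names (`propose`, `execute`, `TIMELOCK_SECONDS`) and the one-hour delay are illustrative:

```python
class ConfigRegistry:
    """Toy model of the on-chain gate: an artifact hash becomes 'current'
    only after (1) its invariant proof checks out and (2) a timelock elapses."""

    TIMELOCK_SECONDS = 3600  # illustrative delay

    def __init__(self):
        self.current = None   # hash of the live artifact
        self.pending = {}     # artifact_hash -> timestamp when it becomes eligible

    def propose(self, artifact_hash: str, proof_ok: bool, now: float) -> None:
        # Gate 1: reject anything whose proof of invariants fails.
        if not proof_ok:
            raise ValueError("invariant proof failed; artifact rejected")
        self.pending[artifact_hash] = now + self.TIMELOCK_SECONDS

    def execute(self, artifact_hash: str, now: float) -> None:
        # Gate 2: only flip 'current' once the timelock has elapsed.
        eligible_at = self.pending.get(artifact_hash)
        if eligible_at is None:
            raise KeyError("artifact was never proposed")
        if now < eligible_at:
            raise RuntimeError("timelock has not elapsed")
        self.current = artifact_hash
```

Everything else in the architecture (storage, distribution, kill switch) hangs off this one state transition.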

It turns out that public blockchains, especially Ethereum and the EVM chains that run the Ethereum Virtual Machine on top of their own consensus layers, offer a good solution to that problem.

An on‑chain Config Registry as the promotion gate

  • A smart contract on a fast, credible EVM (often an L2) records each candidate artifact, and commitments to any proofs.
  • Writes are gated by a timelock and a multisig; a pause/kill‑switch and rollback pointer are first‑class.
  • Either just hashes or the full artifacts can go on chain. If stored off chain, the blob lives in an object store, with weaker guarantees. When full on-chain storage is impractical because of size, and the data is temporary, EIP‑4844 blobs are a good fit: pair a truly on-chain hash with a blob retained for roughly 18 days, which suits a rolling rollback window.
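The hash-plus-blob pairing in the last bullet is just a commitment scheme. A minimal sketch, using sha-256 from the standard library as a stand-in (a real EVM deployment would commit with keccak-256):

```python
import hashlib


def commit(blob: bytes) -> str:
    """Control-plane side: the registry stores only this digest on chain."""
    return hashlib.sha256(blob).hexdigest()


def verify_fetched_blob(blob: bytes, onchain_digest: str) -> bool:
    """Consumer side: accept a blob fetched from object storage (or an
    EIP-4844 blob) only if it matches the committed on-chain digest."""
    return hashlib.sha256(blob).hexdigest() == onchain_digest
```

The off-chain store can lose or corrupt the blob, but it can never substitute a different one without the mismatch being detected.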

Latency fit. Ethereum finalizes in epochs, but L2s confirm in seconds (the OP Stack targets ~2 s block times; zkSync ~1 s; many systems expose fast attestations). That is good enough for a five-minute control-plane cadence; see for instance the OP Stack block time discussion or Circle’s attestation timings.

Mandatory proofs: make the gate smart

Attach a succinct proof with every artifact and verify it on chain. That’s exactly what we do for our Chainwall protocol, although for a different kind of data!

The core goal is to prove basic properties: row_count <= 200, sorted and unique by key, schema matching regex and type rules, filesize <= N. There are two practical paths: fit the whole check logic on chain, or rely on Plonk/Groth16 circuits for larger expressions. For instance, a zk‑VM guest can parse CSV/Parquet/JSON and emit a SNARK. You don’t have to reveal the contents, only the commitment. Both research and production systems for regex in ZK exist (e.g., Reef and related zk‑regex work), which makes schema checks realistic.
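For reference, here is what those invariants look like as a plain (non-ZK) predicate; a circuit or zk‑VM guest would prove the same conditions without revealing the rows. The caps and the key regex are illustrative, not the real Bot Management schema:

```python
import re

MAX_ROWS = 200                 # row-count cap from the incident
MAX_BYTES = 1_000_000          # illustrative file-size cap
KEY_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")  # illustrative schema rule


def check_invariants(rows: list[tuple[str, float]], raw_size: int) -> bool:
    """All four invariants at once: row-count cap, file-size cap,
    keys sorted and unique, and every key matching the schema regex."""
    keys = [k for k, _ in rows]
    return (
        len(rows) <= MAX_ROWS
        and raw_size <= MAX_BYTES
        and keys == sorted(set(keys))       # sorted AND unique by key
        and all(KEY_PATTERN.match(k) for k in keys)
    )
```

The duplicate-row failure mode from the outage fails the `sorted(set(keys))` check, and the inflated feature count fails the row cap, so either one blocks promotion.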

Distribution that doesn’t introduce new trust

Edges poll the registry and only adopt artifacts that are green‑lit on chain. To avoid trusting a third‑party RPC, run a light client in your control plane (e.g., Helios) or plan for the Portal Network. That way, edges verify headers and inclusion proofs locally before they accept any “new current” state.

Kill switch & rollback are just bits in the contract, honored by the edge. Cloudflare explicitly called out the need for stronger global kill switches; putting that switch in a small, audited contract gives you a single source of truth under stress.
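Putting the adoption rule and the kill switch together, the edge-side logic is small. A sketch assuming a hypothetical `registry_state` dict already read and verified through the light client; the field names are illustrative:

```python
import hashlib


def should_adopt(registry_state: dict, blob: bytes) -> bool:
    """Edge-side adoption rule: refuse any artifact that is paused,
    not flagged current on chain, or that fails the hash commitment."""
    if registry_state.get("paused"):
        return False  # global kill switch wins over everything
    expected = registry_state.get("current_hash")
    if expected is None:
        return False  # nothing has been promoted; keep what we have
    return hashlib.sha256(blob).hexdigest() == expected
```

When `should_adopt` returns False, the edge simply keeps serving its previous config (or follows the rollback pointer), which is exactly the behavior the outage lacked.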

Would this really have changed the Cloudflare glitch?

  • The duplicate‑inflated file blows through a count/size limit that’s enforced by a proof, not by best effort. The promotion fails.
  • Even if someone manually uploaded the blob to storage, edges would refuse to adopt it without the on‑chain “current” flag and proof verification.
  • You still fix the panic in the proxy, but you’ve moved the sharpest edge of the risk to a domain where proof systems and timelocks are very good.

Why we insist on pure on‑chain control planes for digital assets

The Cloudflare event was not an attack, but Cloudflare initially suspected one, and that suspicion was entirely plausible. As we’ve seen in crypto security, attackers don’t just chase keys; they coerce the control plane.

  • Front‑end or signer‑UI tampering: The Bybit theft showed how manipulating what signers see can push through a catastrophic approval. Analyses point to phishing and UI manipulation of the transaction approval flow, not a smart‑contract exploit. Read NCC Group’s technical note and coverage from Ledger Insights.
  • Third‑party API authority: SwissBorg/Kiln wasn’t a Solidity bug; it was an off‑chain API path that let an attacker reshuffle staking authorities and drain ~193k SOL, as explained in Kiln’s joint statement.
  • From developer laptop to cloud creds to everything: Lazarus/TraderTraitor keeps proving that compromised developer machines and tricked build flows buy you cloud footholds and the power to bend what the team sees and signs. See for instance CISA’s advisory or Elastic’s simulation of how AWS creds leak from dev boxes.

Conclusion

Our position: control of digital assets must live in smart contracts guarded by timelocks and multisigs, not in private credentials, CI tokens, cloud ACLs, or admin dashboards. If your deploy or “change owner” action must traverse a contract’s schedule() and execute() path, even a rootkit on a developer laptop can’t jump the queue. The time delay is a circuit breaker you can count on, and the on‑chain audit trail is objective. That only leaves the “what if the thing we’re promoting is malformed?” question, which is exactly what “proof‑carrying config” answers.

We also believe there’s a considerable market for trust-minimized applications. We’re only building the right foundations now for a first, well-defined use case at OKcontract Labs.


When a Feature File Tripped the Internet was originally published in Coinmonks on Medium, where people are continuing the conversation by highlighting and responding to this story.
