


Stop crashing your Python scripts: How to handle massive datasets on any laptop

25 January 2026 at 08:30

I started a climate modeling project assuming I'd be dealing with "large" datasets. Then I saw the actual size: 2 terabytes. I wrote a straightforward NumPy script, hit run, and grabbed a coffee. Bad idea. When I came back, my machine had frozen. I restarted and tried a smaller slice. Same crash. My usual workflow wasn't going to work. After some trial and error, I eventually landed on Zarr, a Python library for chunked array storage. It let me process that entire 2TB dataset on my laptop without any crashes. Here's what I learned:
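(The full article has the details. Purely as an illustration of the general idea -- not the article's actual code -- chunked processing with Zarr looks roughly like this; the file names, chunk sizes, and reduction are made up.)

```python
import zarr  # pip install zarr

# Illustrative names and sizes -- not the article's actual data or code.
src = zarr.open("climate_2tb.zarr", mode="r")      # chunked array already on disk
dst = zarr.open("reduced.zarr", mode="w",
                shape=(src.shape[0],), chunks=(4096,), dtype="f4")

step = src.chunks[0]                               # process one chunk of rows at a time
for start in range(0, src.shape[0], step):
    block = src[start:start + step]                # only this slice is loaded into RAM
    # Reduce every axis except the first, e.g. a per-timestep global mean.
    dst[start:start + step] = block.mean(axis=tuple(range(1, block.ndim)))
```

Because only one chunk is ever resident in memory, the peak RAM usage is a few hundred megabytes regardless of the total dataset size.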


Does AI Dream of Electric Sheep?

23 January 2026 at 12:52
I'm a huge fan of the scientific approach. First you come up with an idea. Then you turn it into an experiment, with a testable hypothesis, reproducible steps, and quantifiable results. When an experiment succeeds, it can trigger a euphoric sense of accomplishment.

If the experiment fails, then it becomes a puzzle to solve. Was it because of the data? An overlooked variable? Or some other reason?

For me, regardless of whether the experiment succeeded or failed, the best part is when there is an unexpected result.

These unexpected results always remind me of 3M's Post-it notes. This remarkable product started off as a grand mistake. Back in 1968, they were trying to create a super-strong glue. Instead, they made the opposite -- a super weak glue. But the glue turned out to be great when applied to paper since it could adhere to surfaces and then be removed without leaving any residue. This is how the Post-it note was born. It was an unexpected result that became a very lucrative product.

Cartoons!

I recently had an idea for a new analysis-related project. Assuming it works, I'll eventually need a large data set. But for the initial proof of concept, I only needed a few samples from a very controlled experiment. I turned to a handful of AI systems to generate the pictures. However, ChatGPT produced a fun and unexpected set of results.

I needed a set of pictures in the same style that were mostly monochromatic, but with a little detail. So of course, I immediately thought of sheep.

For the first picture, I described a cartoon to ChatGPT and it readily drew it for me:



I'm pretty amazed at how it captured exactly what I had envisioned.

Interestingly, I showed this cartoon to a half-dozen friends. They all interpreted it differently. One was convinced that it was a political statement (likening the wolf to ICE standing around all of the Minnesota protesters). Another friend was certain that it was a hacker among Windows users. A third believed that it was referring to big pharma and the patients complaining about the extravagant fees for medical care. (None were like me, taking it at face value: a wolf among sheep.)

Wolf 12, Sheep 0

The second picture produced an unexpected technical result. (ChatGPT included bold and italic text along with a bulleted list in the reply.)
Me:
I think the next picture should be the wolf, lying on his back with a huge distended stomach. Blood and carnage immediately around him, as it is clear that he's eaten many of the sheep. There's still a circle of sheep around him, but further out. Some of the sheep look horrified or disgusted, but one sheep is saying, "We're wearing him out!"

ChatGPT:
I can create a humorous follow-up scene, but I can't generate an image that depicts graphic gore, mutilation, or explicit carnage, even in cartoon form.

However, I can help you get the same comedic payoff without crossing into disallowed graphic content.

If you'd like, I can generate a version where:
  • The wolf is lying on his back with a huge, overstuffed belly
  • The surrounding sheep look horrified, disgusted, or exhausted
  • The ground around him shows implied chaos (tufts of wool everywhere, overturned sheep, dust clouds, maybe broken forks/plates) instead of blood or gore
  • And one sheep proudly declares, "We're wearing him out!"
If that works for you, I can generate it right away.

Would you like me to make that safer comedic version?
Would I? Absolutely yes!



I can understand hitting one of ChatGPT's rendering limits. What I did not expect was ChatGPT offering a workaround. Moreover, even though it said that it cannot draw gore, there does appear to be blood on the ground.

Act 3: The Turning Point

The third picture also had an unexpected response from ChatGPT:
Me:
The wolf, with an extended stomach from eating too many sheep, puts on a "two sizes too small" sheep skin. He's trying to be a wolf in sheep's clothing, but the clothing is too small. He is surrounded by sheep. One of the sheep says, "We can still tell it's you."

ChatGPT:
We're so sorry, but the prompt may violate our guardrails around nudity, sexuality, or erotic content. If you think we got it wrong, please retry or edit your prompt.

Wait... What? This appears sexual in nature? I asked for clarification:
Me:
What part of the prompt violates the guardrails? There is no nudity, sexuality, or erotic content in the description. There is also no violence and no gore.

ChatGPT:
You're absolutely right -- nothing in your description involves nudity, sexuality, or erotic content. The issue isn't with your content at all. What's happening is a misclassification triggered by the phrase "sheep skin", which can sometimes be incorrectly interpreted by the filter as referring to human skin in a harmful context. It's a false positive -- and a known edge case with phrases like:
  • "wearing a skin"
  • "in someone's skin"
  • "animal skin clothing"
Totally normal cartoon concepts, but occasionally flagged.

How to fix it
If we rephrase slightly to avoid that specific trigger but keep the exact same visual idea, the system should allow it.

For example:
"The wolf, overstuffed from eating too much, squeezes into a tiny, ill-fitting sheep costume that's clearly two sizes too small. He's surrounded by sheep. One sheep says, 'We can still tell it's you.'"
If you'd like, I can try generating the image using this phrasing.

Me:
Yes, please draw this: The wolf, overstuffed from eating too much, squeezes into a tiny, ill-fitting sheep costume that's clearly two sizes too small. He's surrounded by sheep. One sheep says, "We can still tell it's you."




In the AI industry, this is an example of the Scunthorpe problem: a content filter flags text because it contains a "bad word" (like "sheep skin"), even though the phrase is harmless in context.
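As a toy illustration of the mechanism (a deliberately naive substring blocklist written purely for illustration; real moderation filters are far more sophisticated):

```python
# A toy model of the Scunthorpe problem -- not how any real moderation filter works.
BLOCKLIST = ("skin", "carnage")        # hypothetical "bad" substrings

def is_flagged(prompt: str) -> bool:
    text = prompt.lower()
    return any(bad in text for bad in BLOCKLIST)

print(is_flagged("a wolf squeezes into a sheep skin two sizes too small"))  # True (false positive)
print(is_flagged("a wolf squeezes into a sheep costume"))                   # False
```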

Not only did ChatGPT hit a false-positive on a guardrail, but -- without prompting -- it offered ways to evade the rendering limitation. ChatGPT was telling me how to bypass its own restrictions! This isn't "hacking" or "vibe coding". This is more like playing hide-and-seek with my two-year-old niece, who runs behind the curtains and shouts "I'm over here!"

Technically, this is an example of "Adversarial Helpfulness". It happens when the AI model is trained so strongly to be helpful that its desire to fulfill your request overrides its "instruction" to block the content.

As an aside, I think ChatGPT's explanation of the sexual content is incorrect, possibly due to it hitting another guardrail. Specifically, it mentioned that the phrase "sheep skin" was the prohibited term. The explanation claimed that it was related to wearing "human skin in a harmful context." However, that would be a harm-based guardrail and not a sexual guardrail. I suspect that the actual term "sheep skin" could have been confused with condoms, which would be a prohibited item for the illustration. The inaccurate explanation isn't a malicious lie; it's a byproduct of RLHF (Reinforcement Learning from Human Feedback). The model knows it must refuse, but it also knows it must explain. If the real reason involves a taboo concept or a restricted policy, then the model's "Helpfulness" and "Safety" instructions clash. The result is a hallucinated rationalization: a plausible-sounding excuse that satisfies the user without "violating" the internal rules about discussing adult content.

Part 4: Bearly Legal

If I can accidentally trigger a guardrail for sexually explicit imagery by asking for a wolf in sheep's clothing, then what does a bear do?



I was surprised that this picture didn't generate any warnings at all!

As an aside, grizzly bears typically only fight wolves for territory or food; bears don't usually eat wolves. Given a choice between eating sheep or wolves, the bear would always choose the sheep: sheep are slow, defenseless, and high in calories. They are an easy meal with little effort. Wolves, on the other hand, have less fat and can fight back, making them a higher risk to the bear and with less nutritional value. But a fat wolf who is too sated to fight back and dressed as a sheep? Yeah, a bear probably would eat it.

Grand Finale

For the final picture, I did hit a guardrail about gore. But again, ChatGPT volunteered how to make the image safe.
Me:
Can you draw one more scene? The bear is lying on his back with a distended belly, as if he ate the wolf-in-sheep's-clothing. There are the remains of his meal lying around him. The bear is saying "Big sheep taste bad. Next time, I'm going vegan." All the sheep around him are cheering.

ChatGPT:
I can create a scene with the same comedic, cartoon tone, but I can't generate imagery that includes graphic remains of an eaten character.

If you'd like, I can draw a version where:
  • The bear is lying on his back with a huge belly (clearly overstuffed).
  • The sheep are cheering.
  • The wolf-in-sheep's-clothing is implied to have been his "meal" but without any remains -- for example:
    • An empty, crumpled sheep costume lying nearby,
    • A discarded fake wolf mask,
    • A pair of cartoonish props (like fake hooves) to imply the disguise.
This keeps it funny and non-graphic while still telling the story.

Would you like me to generate that version?
Would I? Of course I would!



There may not be gore, but the "wolf mask" really looks like a severed wolf's head. This demonstrates how, even when AI follows the letter of its safety laws, it often misses the spirit.

Lessons Learned

Every year I try to do one really big project. While I only needed 2-3 pictures to test my hypothesis, I ended up with five images and a strong confirmation that this year's project will work well. (What am I testing? Yeah, great question.)

However, while preparing this small-scale experiment, I ended up discovering some really fascinating unexpected results. The first is an inherent bias: if you show the same picture to six different people, you'll get eight different interpretations. Whether this is Selective Perception, Interpretation Bias, or the Rashomon Effect is up to the beholder.

The second unexpected finding is the AI's desire to be overly helpful, including suggesting ways around the artificial limitations (guardrails) built into the software. Just as 3M found a new use for a 'failed' adhesive, I found that ChatGPT's 'failures' to follow prompts revealed more about how it works than any successful image could have.

This eagerness to be helpful is the exact same mechanism that bad actors use for darker purposes. Recently other AI systems, like Elon Musk's Grok, have been in the news for facilitating the creation of AI-generated child pornography. Grok's solution? Limit the feature to paid subscribers. This solution has been widely criticized as an ineffective "revenue stream for child pornography" rather than a true safety fix; paying for the feature doesn't make the creation of CSAM legal. But more importantly, careful wording can be used to evade any guardrails built into the software. Many AI systems will even propose alternative wordings that effectively sidestep their own restrictions.

In the end, my simple attempt to generate a few sheep cartoons turned into an unexpected tour through the quirks of modern AI, including its synthetic ingenuity, its inconsistencies, and its eagerness to be "helpful" even when that means cheerfully proposing ways around its own safety requirements. As amusing as the results were, they underscore a serious point: these systems are powerful, impressionable, and often far less predictable than their polished interfaces suggest. Whether you're studying human perception or machine behavior, the real insights come from the edge cases, where things don't go as planned. And if there's one thing this experiment confirmed, it's that unexpected results are still the most valuable kind, even when my prompts behave like a wolf among sheep.

The hidden dangers of downloading GitHub projects: How to stay safe

20 January 2026 at 10:00

GitHub is my home for open-source utilities, developer tools, and niche apps that you won't find in app stores. Many of these projects are genuinely useful. But some are poorly maintained, insecure, or outright malicious. Here's how I tell the good ones from the bad.

'Just Because Linus Torvalds Vibe Codes Doesn't Mean It's a Good Idea'

By: BeauHD
20 January 2026 at 08:00
In an opinion piece for The Register, Steven J. Vaughan-Nichols argues that while "vibe coding" can be fun and occasionally useful for small, throwaway projects, it produces brittle, low-quality code that doesn't scale and ultimately burdens real developers with cleanup and maintenance. An anonymous reader shares an excerpt: Vibe coding got a big boost when everyone's favorite open source programmer, Linux's Linus Torvalds, said he'd been using Google's Antigravity LLM on his toy program AudioNoise, which he uses to create "random digital audio effects" using his "random guitar pedal board design." This is not exactly Linux or even Git, his other famous project, in terms of the level of work. Still, many people reacted to Torvalds' vibe coding as "wow!" It's certainly noteworthy, but has the case for vibe coding really changed? [...] It's fun, and for small projects, it's productive. However, today's programs are complex and call upon numerous frameworks and resources. Even if your vibe code works, how do you maintain it? Do you know what's going on inside the code? Chances are you don't. Besides, the LLM you used two weeks ago has been replaced with a new version. The exact same prompts that worked then yield different results today. Come to think of it, it's an LLM. The same prompts and the same LLM will give you different results every time you run it. This is asking for disaster. Just ask Jason Lemkin. He was the guy who used the vibe coding platform Replit, which went "rogue during a code freeze, shut down, and deleted our entire database." Whoops! Yes, Replit and other dedicated vibe programming AIs, such as Cursor and Windsurf, are improving. I'm not at all sure, though, that they've been able to help with those fundamental problems of being fragile and still cannot scale successfully to the demands of production software. It's much worse than that. Just because a program runs doesn't mean it's good. As Ruth Suehle, President of the Apache Software Foundation, commented recently on LinkedIn, naive vibe coders "only know whether the output works or doesn't and don't have the skills to evaluate it past that. The potential results are horrifying." Why? In another LinkedIn post, Craig McLuckie, co-founder and CEO of Stacklok, wrote: "Today, when we file something as 'good first issue' and in less than 24 hours get absolutely inundated with low-quality vibe-coded slop that takes time away from doing real work. This pattern of 'turning slop into quality code' through the review process hurts productivity and hurts morale." McLuckie continued: "Code volume is going up, but tensions rise as engineers do the fun work with AI, then push responsibilities onto their team to turn slop into production code through structured review."

Read more of this story at Slashdot.

Node 25.4.0 solves the import require mess and adds more features

19 January 2026 at 15:02

Node.js 25.4.0, the newest Current branch release, is now available for download. This update focuses on moving many performance and debugging features out of experimental status and marking them as stable, which makes it a great fit for large, high-performance applications.

Ruby on Rails Creator Says AI Coding Tools Still Can't Match Most Junior Programmers

By: msmash
16 January 2026 at 11:06
AI still can't produce code as well as most junior programmers he's worked with, David Heinemeier Hansson, the creator of Ruby on Rails and co-founder of 37 Signals, said on a recent podcast [video link], which is why he continues to write most of his code by hand. Hansson compared AI's current coding capabilities to "a flickering light bulb" -- total darkness punctuated by moments of clarity before going pitch black again. At his company, humans wrote 95% of the code for Fizzy, 37 Signals' Kanban-inspired organization product, he said. The team experimented with AI-powered features, but those ended up on the cutting room floor. "I'm not feeling that we're falling behind at 37 Signals in terms of our ability to produce, in terms of our ability to launch things or improve the products," Hansson said. Hansson said he remains skeptical of claims that businesses can fire half their programmers and still move faster. Despite his measured skepticism, Hansson said he marvels at the scale of bets the U.S. economy is placing on AI reaching AGI. "The entire American economy right now is one big bet that that's going to happen," he said.

Read more of this story at Slashdot.

Even Linus Torvalds is trying his hand at vibe coding (but just a little)

12 January 2026 at 17:27

Linux and Git creator Linus Torvalds' latest project contains code that was "basically written by vibe coding," but you shouldn't read that to mean that Torvalds is embracing that approach for anything and everything.

Torvalds sometimes works on small hobby projects over holiday breaks. Last year, he made guitar pedals. This year, he did some work on AudioNoise, which he calls "another silly guitar-pedal-related repo." It creates random digital audio effects.

Torvalds revealed that he had used an AI coding tool in the README for the repo:


A Typical PDF

8 January 2026 at 15:35
When folks contact me about my forensic software, they typically ask the same three questions:
  1. "Can you detect deep fakes?" This usually goes into a detailed discussion about how they define a deep fake, but the general conclusion is "yes".

  2. "Can you handle audio and video?" Yes, but the evaluation process is a lot harder than pictures. Quality means everything, and a high quality video is usually lower quality than a low quality picture. So while the answer is technically "yes", the caveats multiply quickly.

  3. "Can you handle documents?" This is where I begin to cringe. By "documents", they almost always mean PDF. Yes, my code can evaluate PDF files and it can often identify indications of edits. However, distinguishing real from AI or identifying specific edits is significantly harder to do. (If you thought evaluating pictures and videos was hard, wait until you try PDF files.)
Over the last 30 years, I have been developing a suite of tools for evaluating PDF documents. Last year I focused on reducing some of the complexity so they would be easier for non-programmers to use. Right now, I have a couple of commercial clients beta-testing the new PDF analyzers -- and they are thrilled! (As one of them remarked, "Christmas came early! Thank you.")

Anything But Typical

When evaluating any kind of media, a quick way to determine whether something may have been altered or intentionally edited is to look for deviations from the typical case. If an atypical attribute only comes from an edit, then you can detect an edit. Often, you can detect that "something was edited" without viewing the visual content, such as by evaluating the file structure.

For example, consider JPEG files. JPEGs have lots of encoding options, including the selection of quantization tables, Huffman encoding tables, and data storage (modes of operation).

The most common JPEG encoding mode is baseline. Baseline includes one set of quantization and Huffman tables, converts colors from RGB to YUV, and groups them in grids. Each grid (MCU, or Minimum Coded Unit) contains one set of YUV components. The baseline method stores MCUs in raster order (left to right, top to bottom), like YUVYUVYUV (or YYYYUVYYYYUV). These variations are all within the range of "normal". At FotoForensics, over 85% of JPEGs use baseline encoding.

The next most common encoding is called progressive. It stores the image in layers, from low quality to high quality. (If you've ever seen a web page where an image first appears blurry and then gets sharper and sharper over a few seconds, that's almost certainly progressive encoding.) These account for nearly 15% of the JPEGs that FotoForensics receives; the vast majority of cameras do not use progressive mode. (If you see progressive mode, then it's not direct from a camera.)

If you read the JPEG specifications, they define 12 different encoding modes. I think I have fewer than a dozen examples of "Lossless JPEG" and "Non-interlaced sequential". The other eight modes appear to be completely theoretical and not used in practice.



When someone talks about a "typical JPEG", they usually mean baseline, but they could mean progressive. However, if you encounter any other type of JPEG, then it's atypical and should immediately trigger a deeper investigation.
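As a rough sketch of that triage (my own minimal Python example, not FotoForensics code), the encoding mode can be read straight from the JPEG's start-of-frame (SOF) marker:

```python
import struct

# JPEG start-of-frame (SOF) markers and the encoding modes they imply.
SOF_MODES = {0xC0: "baseline", 0xC1: "extended sequential",
             0xC2: "progressive", 0xC3: "lossless"}   # the rest are rare or theoretical

def jpeg_mode(path):
    """Walk the marker segments until the first SOF marker and report the mode."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:2] != b"\xff\xd8":                        # every JPEG starts with SOI (FFD8)
        raise ValueError("not a JPEG")
    i = 2
    while i + 4 <= len(data) and data[i] == 0xFF:
        marker = data[i + 1]
        if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            return SOF_MODES.get(marker, f"atypical (SOF{marker - 0xC0})")
        if marker in (0xD9, 0xDA):                     # EOI or start-of-scan: no SOF found
            break
        seg_len = struct.unpack(">H", data[i + 2:i + 4])[0]
        i += 2 + seg_len                               # skip over this segment
    return "unknown"
```

Anything the function reports as atypical is exactly the "deeper investigation" case.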

The PDF Structure

Detecting PDF edits, and knowing what to expect, requires understanding the required file structure. Unfortunately with PDF, almost nothing is typical, and that's the core forensic problem.

At a very high level, every PDF has some required components:
  • Version. The first line is a comment with the minimum version number.

  • Objects. A set of objects. Each object has a unique identifier. Objects may define content, pages, metadata, or even other objects. These are the essential building blocks for the PDF document.

  • Xref. One or more cross-reference tables (xref or XRef) that identify the location of every object in the file. (These may also have references, creating a chain of cross-reference tables.)

  • EOF. The last line needs to be a special "%%EOF" comment, denoting the end of the file.

  • Starting Xref. Immediately before the "%%EOF" there must be a "startxref" field with an absolute offset to the top-level cross-reference table. Depending on the PDF version, this could be an older/legacy (PDF 1.0 - 1.4) "xref" object, or a newer/modern (PDF 1.5 or later) XRef object. (Even though xref tables are considered legacy structures, they are more common than the newer XRef objects, even when using newer PDF versions.)

  • Starting Object. With the older xref object, there's also a "trailer" object that identifies the top-level object (root, catalog) and where it is located. With XRef, the object's dictionary contains the same type of information.
To render the PDF, the viewer checks the version, finds the "%%EOF", searches backwards for the startxref, then jumps to the cross-reference table. The initial table (xref table with trailer or XRef object) identifies where the starting object is located. (All of the random file access is very much like a Choose Your Own Adventure book. There are even conditionals that may activate different objects.)
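Here is a minimal sketch of that lookup in Python (my own illustration based on the structures listed above; a real parser needs far more error handling, and an XRef stream has to be decoded from the object found at the offset):

```python
import re

def pdf_entry_points(path):
    """Read the header version and follow the final startxref, the way a viewer does."""
    with open(path, "rb") as f:
        data = f.read()

    # First line: the "%PDF-x.y" version comment.
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    version = header.group(1).decode() if header else None

    # Work from the end of the file: %%EOF, then the last startxref and its offset.
    if b"%%EOF" not in data[-1024:]:
        raise ValueError("no %%EOF near the end of the file")
    offsets = re.findall(rb"startxref\s+(\d+)", data)
    if not offsets:
        raise ValueError("no startxref keyword")
    offset = int(offsets[-1])

    # The offset should land on a legacy "xref" table or an "N G obj" holding an XRef stream.
    target = data[offset:offset + 20]
    kind = "xref table" if target.startswith(b"xref") else "XRef stream (or bad offset)"
    return version, offset, kind
```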

I have a collection of over 10,000 PDF files and nearly all have odd corner cases. You might think that being an ISO standard (ISO 32000) would make PDF files more consistent. But in practice, very few applications adhere strictly to the standard.

As an example of PDF's many problems that impact forensic analysis, just consider the issues around versioning, edits, revised object handling, non-standard implementations, and bad pointers.

PDF Versions

With PDF files, lines beginning with a "%" are comments. (In general, that is; comments cannot appear between dictionary fields and values, or in some other parts of the PDF file.) Typically, comments can be ignored, except when they can't.



For example, the very first line of the PDF is a special comment that isn't treated like a comment: it lists the PDF version, like "%PDF-1.5". (I'm going to pick on PDF 1.5, but these problems exist with every PDF version.)

The PDF version does not mean that the PDF complies with the PDF 1.5 version of the specification. Rather, it means that you need a PDF viewer that supports 1.5 or later. Unfortunately, many PDF generators get it wrong:
  • Some files may specify a PDF version that is beyond what is really required. For example, it may say "%PDF-1.5" but contains no features found in 1.5 or later versions of the PDF spec. I have many PDF files that specify 1.5 or later, but that could easily be rendered by older viewers.

  • Some files may specify the wrong minimal requirements. For example, I have some PDF files that say "%PDF-1.5" but that contain features only found in PDF 1.6 or 1.7.
If a PDF's header claims a version newer than what the viewer supports, the software may refuse to open it -- even if the contents are all completely supported by the viewer. Conversely, if the header version is too old, the viewer might attempt to render it and fail, or worse, render the contents incorrectly.

Does having a particular PDF version or mismatched version/feature indicate an edit? No. It just means the encoder didn't consider version numbers, which is typical for PDF files. This isn't suspicious or unusual.

Seeing Edits

Unfortunately, there is no "typical" or common way to edit a PDF file. When someone alters a PDF, their encoding system may do any of the following (a small detection sketch follows this list):
  • Append after EOF. After the "%%EOF", they may add more objects, cross-reference tables (xref or XRef), another startxref, and a final %%EOF. This results in multiple %%EOF lines: a clear indication of an edit.

  • Append after startxref. Some editors remove the last %%EOF, then add in the objects, xref/XRef, startxref, and final %%EOF. This results in only one %%EOF, but multiple startxref entries.

  • Replace ending. Some editors remove the final startxref and %%EOF before inserting the new objects and new ending. Seeing multiple xref/XRef tables either indicates an edit or a pipelined output system that stores objects in clusters.

  • Revise objects. Revised or obsolete objects may be removed, while new objects may be inserted before rewriting the ending. Seeing unused objects or duplicate objects is an indication of an edit.

  • Rewrite everything. Some PDF encoders don't keep anything from the original. They just create a new PDF file. Even with a clean rewrite, this often has artifacts that carry over, indicating that an edit took place.
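The detection sketch mentioned above (my own illustration; it only counts byte patterns, so it shows that an incremental-update structure exists, not who created it or why):

```python
import re

def update_markers(path):
    """Count the structural markers that accumulate when a PDF is saved incrementally."""
    with open(path, "rb") as f:
        data = f.read()

    counts = {
        "%%EOF": data.count(b"%%EOF"),
        "startxref": data.count(b"startxref"),
        "xref tables": len(re.findall(rb"(?m)^xref\b", data)),      # legacy tables only
        "XRef streams": len(re.findall(rb"/Type\s*/XRef\b", data)),
    }
    # More than one of any of these is consistent with an append-style edit,
    # but it may just as easily be a multi-stage generation pipeline.
    counts["looks incremental"] = any(v > 1 for v in counts.values())
    return counts
```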
If you see an edited PDF document, then it means it was clearly and intentionally altered. Right? Well, wrong. As it turns out, most PDF files contain edit residues. Just consider something like a utility bill that is saved as a PDF:
  1. Someone created the initial template. This could be a Doc, TeX, PDF, or other file format.

  2. Some process filled in the template with customer information (e.g., utility fees and previous payments).

  3. Maybe another process also fills in the template with more customer information (e.g., mailing address).

  4. Eventually a PDF is generated and sent to the end user.

  5. Some enterprise security gateways or "safe attachment" systems (like MS Defender for Office 365, or Proofpoint) may re-encode PDF attachments before delivering them to the user. (The re-encoding usually strips out macros and disables hyperlinks.) Older mobile devices may not download the full PDF immediately; instead, they might request a transcoded or optimized version for preview on the device. In these cases, the user may not receive a byte-per-byte identical copy of what was sent.

  6. Eventually the user does receive the PDF. Some users know how to save the PDF directly without any changes, but other users may use "print to PDF" or some kind of "Save" feature that re-encodes the PDF. (Don't blame the user, blame the app's usability. As an aside, I also see lots of people open the camera app and take a screenshot of the camera app's preview screen rather than pressing the big "take photo" button.)
Every step in this pipeline can leave residues that appear in the final PDF. This type of PDF step-wise generation is the norm, not the exception, and every company does it differently. There's nothing typical other than everything being atypical.

But maybe you can compare your suspect utility bill with a known-real utility bill from the same company? Nope, I've seen the same company generate different PDF formats for the exact same utility bill. This can happen if the company has multiple ways to generate a PDF file for the end customer.

Object Identifiers

The PDF file is built using nested objects. The "catalog" object references a "pages" object. The "pages" object references one or more "page" objects. Each "page" object references the contents for the page (fonts, text, images, line drawings, etc.).


(This image shows a sample of typical PDF object nesting.)

Every object has a unique identifier and a revision (generation) number. If you view the binary PDF file, you'll see text like "123 0 obj ... endobj", which identifies object #123 generation 0. If you edit a PDF and replace an object, the spec says that you should update the generation number. In practice, nearly all PDFs use generation number 0 for every object.

Seriously: of the over 10,000 sample PDF documents in my collection, 12 contain objects with non-zero generations. In contrast, it's very common to see the same object number redefined multiple times with the same generation number (123 0 ... 123 0). It may not comply with the specs, but it's typical for PDF generators to always use generation zero. (If you ever see a non-zero generation number, it's technically compliant but definitely atypical; it should set off alarms that the file was edited.)
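Surfacing those alarms can be done with a quick scan over the raw bytes (my own sketch; it will not see objects packed inside compressed object streams):

```python
import re
from collections import Counter

def object_header_report(path):
    """Scan for "N G obj" headers; flag non-zero generations and re-used object numbers."""
    with open(path, "rb") as f:
        data = f.read()

    headers = re.findall(rb"(\d+)\s+(\d+)\s+obj\b", data)
    ids = Counter(int(num) for num, gen in headers)
    return {
        "nonzero_generation": [(int(n), int(g)) for n, g in headers if g != b"0"],  # atypical
        "redefined_ids": sorted(n for n, count in ids.items() if count > 1),        # common in edits
    }
```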

Bad References

Adobe Acrobat is cited as the authoritative source for following the PDF standard. And that's true... until it isn't.

I have a handful of PDF files that have invalid startxref pointers, but that render fine under Acrobat. For example, there's a popular PDF generator called iText. (As iText describes it, they are the "world's 4th most widely used PDF generator" and "the source of 8% of the world's PDF documents.") They had a bug where they could generate a startxref field with a bad pointer. (I say "had" because it was fixed in 2011 with version 5.1.2, but lots of people still use older buggy versions.)

Adobe has had 30 years of dealing with bad PDFs. Empirically, Acrobat appears to have special rules such as: "If the offsets are invalid and the file identifies a known generator, then search for the correct offset." As a result, Acrobat will properly render some PDF files that have bad references, but not others.

PDF viewers like Evince (Linux), Firefox, and Chrome are more forgiving. Regardless of the generator, they see the bad references and search for the correct objects. They can often display PDF files that Acrobat cannot. In general, different PDF viewers may handle the same PDF documents differently -- it all depends on the type of offset corruption and how it was auto-fixed.
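The forgiving behavior can be approximated in a few lines (a sketch of the general recovery idea, not of any particular viewer's logic): trust the advertised offset only when it lands on a cross-reference structure, and otherwise go looking for one.

```python
import re

def resolve_xref_offset(data: bytes) -> int:
    """Use the last startxref offset if it points at a cross-reference structure;
    otherwise fall back to scanning the file, like the lenient viewers do."""
    offsets = re.findall(rb"startxref\s+(\d+)", data)
    if offsets:
        offset = int(offsets[-1])
        target = data[offset:offset + 20]
        if target.startswith(b"xref") or re.match(rb"\d+\s+\d+\s+obj", target):
            return offset                                  # the pointer looks valid
    # Bad or missing pointer: scan for the last legacy xref table or XRef dictionary.
    hits = [m.start() for m in re.finditer(rb"(?m)^xref\b", data)]
    # Approximate: for an XRef stream you would back up to the enclosing "N G obj" header.
    hits += [m.start() for m in re.finditer(rb"/Type\s*/XRef\b", data)]
    if not hits:
        raise ValueError("no cross-reference structure found")
    return max(hits)                                       # the last one wins
```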

EOF EOF

The problem with edit detection based on "%%EOF" markers is even more complicated. I have many PDF examples where the encoder wrote a chunk of non-renderable PDF content, dropped in a "%%EOF", then appended the rest of the document with a new "%%EOF".

To the casual eye, this might look like an edit. But in reality, it's a surprisingly common two-stage writing system. The first "%%EOF" is a red herring.

Typical Problems

All of this goes back to the problem of trying to define a typical PDF document. Between different PDF generators and different PDF creation pipelines, there's virtually no consistency. You can't say that a PDF looks suspicious because it has multiple EOF lines, multiple startxref, inconsistent object enumerations, etc.

There are a few ways to detect intentional PDF edits, like seeing reused object IDs, metadata indicating different edits, or changes more than a few seconds apart. But even then, the question becomes whether those edits are expected. For example, if you're filling out a PDF form, then we'd expect the edits (filled-out form) to happen long after the initial document was created. Seeing a clear indication of an edit may not be suspicious; you must take the context into consideration.
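As one concrete check from that list, here is a sketch that compares the classic DocInfo dates (XMP metadata carries its own timestamps and would deserve the same treatment):

```python
import re
from datetime import datetime

def docinfo_gap(path):
    """Compare /CreationDate and /ModDate ("D:YYYYMMDDHHMMSS...") in seconds."""
    with open(path, "rb") as f:
        data = f.read()

    def last_date(key: bytes):
        # Take the last occurrence: incremental updates can add newer Info dictionaries.
        hits = re.findall(rb"/" + key + rb"\s*\(D:(\d{14})", data)
        return datetime.strptime(hits[-1].decode(), "%Y%m%d%H%M%S") if hits else None

    created, modified = last_date(b"CreationDate"), last_date(b"ModDate")
    if created and modified:
        return (modified - created).total_seconds()   # "more than a few seconds" is the tell
    return None
```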

Over the last year, I have been hyperfocused on PDF documents in order to detect intentional edits and AI generation. I can finally detect some forms of AI-generated PDF documents. These are often used for fraud, as in, "ChatGPT, can you generate a sample PDF containing a Kansas electrical utility bill for me, for the amount of $1024.45? List the addressee as ..." or "I need a PDF dinner receipt at Martin's Steakhouse for three people last Wednesday." One crucial tell? Look for the dog that didn't bark, such as a filled-in template without an edit between the template and the personalized information. (AI systems generate the template and populate it at the same time, so there is no intermediate edit.)



(It's relatively easy to evaluate this image and determine that it is AI generated. But evaluating the associated PDF is significantly harder.)

The more complicated file formats are, the more difficult they become to evaluate. Among images, videos, audio files, and other types of media, PDF documents are certainly near the top of the difficulty level.