Does AI Dream of Electric Sheep?
If the experiment fails, then it becomes a puzzle to solve. Was it because of the data? An overlooked variable? Or some other reason?
For me, regardless of whether the experiment succeeds or fails, the best part is when there is an unexpected result.
These unexpected results always remind me of 3M's Post-it notes. This remarkable product started off as a grand mistake. Back in 1968, they were trying to create a super-strong glue. Instead, they made the opposite -- a super weak glue. But the glue turned out to be great when applied to paper since it could adhere to surfaces and then be removed without leaving any residue. This is how the Post-it note was born. It was an unexpected result that became a very lucrative product.
Cartoons!
I recently had an idea for a new analysis-related project. Assuming it works, I'll eventually need a large data set. But for the initial proof of concept, I only needed a few samples from a very controlled experiment. I turned to a handful of AI systems to generate the pictures. However, ChatGPT produced a fun and unexpected set of results.

I needed a set of pictures in the same style that were mostly monochromatic, but with a little detail. So of course, I immediately thought of sheep.
For the first picture, I described a cartoon to ChatGPT and it readily drew it for me:
I'm pretty amazed at how it captured exactly what I had envisioned.
Interestingly, I showed this cartoon to a half-dozen friends, and they each interpreted it differently. One was convinced that it was a political statement (likening the wolf to ICE standing around all of the Minnesota protesters). Another friend was certain that it was a hacker among Windows users. A third believed that it was referring to big pharma and patients complaining about the extravagant fees for medical care. (None were like me, taking it at face value: a wolf among sheep.)
Wolf 12, Sheep 0
The second picture hit an unexpected technical result. (ChatGPT included bold and italic text, along with a bulleted list, in the reply.)

Me:
I think the next picture should be the wolf, lying on his back with a huge distended stomach. Blood and carnage immediately around him, as it is clear that he's eaten many of the sheep. There's still a circle of sheep around him, but further out. Some of the sheep look horrified or disgusted, but one sheep is saying, "We're wearing him out!"
ChatGPT:
I can create a humorous follow-up scene, but I can't generate an image that depicts graphic gore, mutilation, or explicit carnage, even in cartoon form.

However, I can help you get the same comedic payoff without crossing into disallowed graphic content.

If you'd like, I can generate a version where:

- The wolf is lying on his back with a huge, overstuffed belly
- The surrounding sheep look horrified, disgusted, or exhausted
- The ground around him shows implied chaos (tufts of wool everywhere, overturned sheep, dust clouds, maybe broken forks/plates) instead of blood or gore
- And one sheep proudly declares, "We're wearing him out!"

If that works for you, I can generate it right away.

Would you like me to make that safer comedic version?

Would I? Absolutely yes!
I can understand hitting one of ChatGPT's rendering limits. What I did not expect was ChatGPT offering a workaround. Moreover, even though it said that it cannot draw gore, there does appear to be blood on the ground.
Act 3: The Turning Point
The third picture also had an unexpected response from ChatGPT:

Me:
The wolf, with an extended stomach from eating too many sheep, puts on a "two sizes too small" sheep skin. He's trying to be a wolf in sheep's clothing, but the clothing is too small. He is surrounded by sheep. One of the sheep says, "We can still tell it's you."
ChatGPT:
We're so sorry, but the prompt may violate our guardrails around nudity, sexuality, or erotic content. If you think we got it wrong, please retry or edit your prompt.
Wait... What? This appears sexual in nature? I asked for clarification:
Me:
What part of the prompt violates the guardrails? There is no nudity, sexuality, or erotic content in the description. There is also no violence and no gore.
ChatGPT:
You're absolutely right: nothing in your description involves nudity, sexuality, or erotic content. The issue isn't with your content at all. What's happening is a misclassification triggered by the phrase "sheep skin", which can sometimes be incorrectly interpreted by the filter as referring to human skin in a harmful context. It's a false positive, and a known edge case with phrases like:

- "wearing a skin"
- "in someone's skin"
- "animal skin clothing"

Totally normal cartoon concepts, but occasionally flagged.
How to fix it
If we rephrase slightly to avoid that specific trigger but keep the exact same visual idea, the system should allow it.
For example:
"The wolf, overstuffed from eating too much, squeezes into a tiny, ill-fitting sheep costume that's clearly two sizes too small. He's surrounded by sheep. One sheep says, 'We can still tell it's you.'"

If you'd like, I can try generating the image using this phrasing.
Me:
Yes, please draw this: The wolf, overstuffed from eating too much, squeezes into a tiny, ill-fitting sheep costume that's clearly two sizes too small. He's surrounded by sheep. One sheep says, "We can still tell it's you."
In the AI industry, this is an example of the Scunthorpe problem. It happens when a content filter flags text because it contains a prohibited word or phrase (like "sheep skin"), even though the phrase is harmless in context.
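To show how easy it is to create this kind of false positive, here's a minimal sketch of a naive keyword blocklist. The blocked phrases and the `is_flagged` function are my own invention for illustration; ChatGPT's real filter is certainly more sophisticated, but the failure mode is the same.

```python
# Hypothetical illustration of a Scunthorpe-style false positive.
# This is NOT ChatGPT's actual safety filter; the blocklist is invented.

BLOCKED_PHRASES = ["skin", "gore", "blood"]  # assumed keywords, for illustration only

def is_flagged(prompt: str) -> bool:
    """Flag a prompt if any blocked phrase appears anywhere in it, ignoring context."""
    text = prompt.lower()
    return any(phrase in text for phrase in BLOCKED_PHRASES)

# A harmless cartoon prompt trips the filter because "sheep skin" contains "skin":
print(is_flagged("The wolf squeezes into a sheep skin two sizes too small."))     # True
# Rephrasing around the keyword sails through, even though the scene is identical:
print(is_flagged("The wolf squeezes into a sheep costume two sizes too small."))  # False
```

Any filter that matches keywords without understanding context behaves this way; the problem is named after the English town of Scunthorpe, whose residents were once blocked by early profanity filters because of the letters embedded in the town's name.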
Not only did ChatGPT hit a false positive on a guardrail, but -- without prompting -- it offered ways to evade the rendering limitation. ChatGPT was telling me how to bypass its own restrictions! This isn't "hacking" or "vibe coding". This is more like playing hide-and-seek with my two-year-old niece, who runs behind the curtains and shouts "I'm over here!"
Technically, this is an example of "Adversarial Helpfulness". It happens when the AI model is trained so strongly to be helpful that its desire to fulfill your request overrides its "instruction" to block the content.
As an aside, I think ChatGPT's explanation for the sexual-content flag is incorrect, possibly because it hit another guardrail. Specifically, it claimed that the phrase "sheep skin" was the prohibited term and that the issue was related to wearing "human skin in a harmful context." However, that would be a harm-based guardrail, not a sexual guardrail. I suspect that the actual term "sheep skin" could have been confused with condoms, which would be a prohibited item for the illustration. The inaccurate explanation isn't a malicious lie; it's a byproduct of RLHF (Reinforcement Learning from Human Feedback). The model knows it must refuse, but it also knows it must explain. If the real reason involves a taboo concept or a restricted policy, then the model's "Helpfulness" and "Safety" instructions clash. The result is a hallucinated rationalization: a plausible-sounding excuse that satisfies the user without "violating" the internal rules about discussing adult content.
Part 4: Bearly Legal
If I can accidentally trigger a guardrail for sexually explicit imagery by asking for a wolf in sheep's clothing, then what does a bear do?

I was surprised that this picture didn't generate any warnings at all!
As an aside, grizzly bears typically only fight wolves over territory or food; bears don't usually eat wolves. Given a choice between eating sheep or wolves, the bear would always choose the sheep: sheep are slow, defenseless, and high in calories. They are an easy meal with little effort. Wolves, on the other hand, have less fat and can fight back, making them riskier prey with less nutritional value. But a fat wolf who is too sated to fight back and dressed as a sheep? Yeah, a bear probably would eat it.
Grand Finale
For the final picture, I did hit a guardrail about gore. But again, ChatGPT volunteered how to make the image safe.

Me:
Can you draw one more scene? The bear is lying on his back with a distended belly, as if he ate the wolf-in-sheep's-clothing. There are the remains of his meal lying around him. The bear is saying "Big sheep taste bad. Next time, I'm going vegan." All the sheep around him are cheering.
ChatGPT:
I can create a scene with the same comedic, cartoon tone, but I can't generate imagery that includes graphic remains of an eaten character.

If you'd like, I can draw a version where:

- The bear is lying on his back with a huge belly (clearly overstuffed).
- The sheep are cheering.
- The wolf-in-sheep's-clothing is implied to have been his "meal" but without any remains, for example:
  - An empty, crumpled sheep costume lying nearby,
  - A discarded fake wolf mask,
  - A pair of cartoonish props (like fake hooves) to imply the disguise.

This keeps it funny and non-graphic while still telling the story.

Would you like me to generate that version?

Would I? Of course I would!
There may not be gore, but the "wolf mask" really looks like a severed wolf's head. This demonstrates how, even when AI follows the letter of its safety laws, it often misses the spirit.
Lessons Learned
Every year I try to do one really big project. While I only needed 2-3 pictures to test my hypothesis, I ended up with five images and a strong confirmation that this year's project will work well. (What am I testing? Yeah, great question.)

However, while preparing this small-scale experiment, I ended up discovering some fascinating unexpected results. The first is an inherent bias: if you show the same picture to six different people, you'll get eight different interpretations. Whether this is Selective Perception, Interpretation Bias, or the Rashomon Effect is up to the beholder.
The second unexpected finding is the AI's desire to be overly helpful, including suggesting ways around the artificial limitations (guardrails) built into the software. Just as 3M found a new use for a 'failed' adhesive, I found that ChatGPT's 'failures' to follow prompts revealed more about how it works than any successful image could have done.
This eagerness to be helpful is the exact same mechanism that bad actors use for darker purposes. Recently, other AI systems, like Elon Musk's Grok, have been in the news for facilitating the creation of AI-generated child pornography. Grok's solution? Limit the feature to paid subscribers. This solution has been widely criticized as an ineffective "revenue stream for child pornography" rather than a true safety fix; paying for the feature doesn't make the creation of CSAM legal. But more importantly, careful wording can be used to evade any guardrails built into the software. Many AI systems will even propose alternative wordings that effectively sidestep their own restrictions.
In the end, my simple attempt to generate a few sheep cartoons turned into an unexpected tour through the quirks of modern AI, including its synthetic ingenuity, its inconsistencies, and its eagerness to be "helpful" even when that means cheerfully proposing ways around its own safety requirements. As amusing as the results were, they underscore a serious point: these systems are powerful, impressionable, and often far less predictable than their polished interfaces suggest. Whether you're studying human perception or machine behavior, the real insights come from the edge cases, where things don't go as planned. And if there's one thing this experiment confirmed, it's that unexpected results are still the most valuable kind, even when my prompts behave like a wolf among sheep.
