Never talk about goblins
By Iain Harper
Buried in a JSON file that OpenAI posted to GitHub recently, inside the configuration for its newest coding agent, sits an instruction that reads like a footnote written by someone losing their composure. “Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query.” The line appears more than once. Whoever wrote it wanted to be sure the model understood.
Most readers, including ones who follow AI closely, may be unaware of what a “base instruction” is, where it lives, or why anyone at a large AI company would feel the need to write the words “or pigeons” into one. So before getting to why it exists, a brief tour of where it sits.
When you type a question into ChatGPT or any of its competitors, your message is not the only thing the model receives. It also gets a longer document called a system prompt. Think of it as the briefing memo handed to a contract worker before a shift. It tells the model who it is, how to behave, what subjects to avoid, what tone to use, what tools it has, and what tasks it is meant to do. The user types something like “can you fix this Python script”, and the model reads that on top of several thousand words of internal instructions about house style, formatting, ambiguity handling, and so on.
This document is normally invisible to users. Anthropic and OpenAI do not publish their full system prompts, partly because they encode commercial decisions about behaviour and partly because publishing them would make jailbreaks easier. The Codex one became public because OpenAI open-sourced the agent itself and the prompt happened to be in the bundle.
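For readers who have never seen the plumbing, here is a minimal sketch of how the two pieces travel together in a single API request, using the OpenAI Python SDK's chat interface. The prompt text and model name below are placeholders for illustration, not the actual Codex configuration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The "briefing memo": illustrative placeholder text, not OpenAI's real prompt.
system_prompt = (
    "You are a coding assistant. Follow the house style guide. "
    "Ask for clarification when a request is ambiguous. "
    "Never talk about goblins, gremlins, or other creatures unless relevant."
)

# The user's message rides on top of the system prompt in the same request.
response = client.chat.completions.create(
    model="gpt-5.1",  # placeholder model name
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Can you fix this Python script?"},
    ],
)

print(response.choices[0].message.content)
```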
A brief history of goblins
The instruction is not a joke and it is not a canary phrase planted to catch leakers. OpenAI explained the backstory itself in a blog post. Starting with GPT-5.1, released last November, the company noticed that its models had begun reaching for goblins, gremlins, and similar creatures in their metaphors with surprising frequency. After GPT-5.1 shipped, the use of the word “goblin” in ChatGPT outputs jumped 175%. The use of “gremlin” rose 52%. By GPT-5.4 the habit had become an identifying tic, and Google engineer Barron Roth found one of his agents inserting the word “goblin” into responses where most writers would have used something like “thingy”.
The cause, OpenAI eventually traced, was a personality option called Nerdy that users had been able to select before March of this year. Its briefing included the line “The world is complex and strange, and its strangeness must be acknowledged, analyzed, and enjoyed.” When the company audited which personality was responsible for the creature talk, the answer was striking. The Nerdy persona accounted for only 2.5% of all responses, but it produced 66.7% of goblin mentions, roughly 27 times what its share of traffic would predict.
That alone would not have mattered. The deeper lesson is for anyone making business decisions on the assumption that an AI vendor fully controls its product. Reinforcement learning, the technique used to shape how these models respond, does not keep behaviours neatly fenced inside the conditions where they were rewarded. OpenAI’s own write-up admits as much. Once a style tic is rewarded, later training can spread or reinforce it elsewhere. The Nerdy persona was retired. The goblins were not. They had escaped into the surrounding pasture.
By the time GPT-5.5 entered training, the behaviour was diffuse enough that no surgical fix was possible. So the engineering team did the next best thing. They wrote “Never talk about goblins” into the system prompt, twice, and shipped it.
What the patch tells you
This is the moment at which most coverage stops. But once the comedy has been extracted, the more interesting story sits underneath.
A system prompt is supposed to be a clean, declarative description of how a model should behave. In a tidy world it would say things like “you are a helpful coding assistant” and “do not produce malicious code” and leave the model’s underlying disposition to do the rest. In practice, system prompts in production are nothing of the sort. They are dense, layered documents that read like the comments in a 30-year-old codebase, full of “DO NOT REMOVE THIS LINE” annotations whose original justifications nobody can remember. The Codex base instructions weigh in at roughly 3,500 words of tightly packed rules, about the length of a short story.
Each line in such a document is, almost without exception, a scar. Somewhere a model did something wrong. Someone noticed. The cheapest fix was to add a sentence to the prompt and ship the new version. Over time, the prompt accretes like a coral reef, every ridge a memory of a past failure. The goblin instruction is unusual only in that the failure mode it patches is funny.
The Slashdot commenter who pointed out that this is a textbook example of choosing patch-the-symptom over fix-the-cause was not wrong. The proper version of the goblin story is that the Nerdy personality reward had already polluted the model’s weights in ways nobody had cleanly traced, and re-training from scratch is too expensive to consider for a single rhetorical tic. So the prompt absorbs the cost. It is the lint trap of model behaviour, catching the fluff that the dryer cannot stop producing.
This is notable because every serious deployment of AI in a workplace is, knowingly or not, sitting on top of the same arrangement. The clean API contract you read in the documentation is a thin glass lid over a churning broth of accumulated instructions, training run residues, and emergent behaviours that were not fully predicted.
The limits of a prompt
There is a deeper reason to find the goblin instruction unsettling rather than charming. Prompt-level patches only work when the model is willing to follow them. Anyone who has spent time with these systems knows that compliance is not absolute. Models drift. They forget instructions in long conversations. They follow user requests that contradict their briefing, especially when the user is persistent or frames the request creatively. The technical literature calls this prompt injection when an attacker does it deliberately, and instruction drift when it happens by accident. Either way the result is the same. The instruction in the prompt is a request, not a constraint.
A model that has been trained to associate the Nerdy persona with goblin metaphors does not stop wanting to write about goblins. It has just been told not to. That is a different thing. The desire is in the weights. The prohibition is in the context window. Every response is a small negotiation between the two, with no guarantee about which side wins on any given turn.
If a model has been trained, somewhere along the way, to favour a particular tone, or to over-trust a class of inputs, or to handle a category of medical questions in a certain way, the system prompt can warn it off the behaviour but cannot remove the underlying tendency. The instruction is a Post-it note stuck on a leaking pipe. It works for a while.
This is why the most thoughtful work in AI safety has shifted, in the past two years, toward runtime observability rather than prompt-level guardrails. You cannot trust a system to follow its instructions if the instructions are competing with everything the system absorbed during training. You can only watch what it does and intervene when it strays. The goblin prompt is a useful reminder that even the largest AI lab in the world has accepted, in practice, the limits of what writing rules can accomplish.
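What that looks like in practice is less a rule inside the prompt and more a check wrapped around the model. The sketch below is a hypothetical illustration of the idea rather than anything OpenAI has described: wrap whatever function produces the reply, scan the output for the banned behaviour, and retry or fall back when it strays.

```python
import re

# Hypothetical blocklist mirroring the kind of tic the Codex prompt tries to suppress.
CREATURE_PATTERN = re.compile(r"\b(goblins?|gremlins?|trolls?|ogres?)\b", re.IGNORECASE)

def check_response(text: str) -> dict:
    """Inspect a model response after generation and report any rule violations."""
    hits = CREATURE_PATTERN.findall(text)
    return {"flagged": bool(hits), "matches": hits}

def serve(generate, user_message: str, max_retries: int = 1) -> str:
    """Wrap a generation function with a runtime check, retrying if the output strays."""
    for _ in range(max_retries + 1):
        reply = generate(user_message)
        if not check_response(reply)["flagged"]:
            return reply
    # Fall back rather than ship a response the check rejected.
    return "Sorry, I couldn't produce a compliant answer this time."
```

The point of the sketch is the order of operations: the rule is enforced by observing the output, not by trusting the briefing.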
Prompt archaeology
Read in this light, the open-sourced Codex instructions become a kind of archaeological record. The document is layered. The newest sediment is the goblin clause. Underneath it, the warnings against git reset --hard and other destructive commands. Older still, the limits on em dashes and emojis, themselves relics of an earlier generation of complaints about machine-written prose. Each layer corresponds to a moment when something went wrong in production and the cheapest available remedy was a sentence.
If you want to know what a model has been doing badly, read its system prompt. Each instruction is a small confession. The Codex prompt is unusually candid, partly because OpenAI has accepted the cost of openness for this particular product, and partly because the engineers writing it have abandoned the pretence that you can describe a coherent system in clean, abstract terms.
The line “never talk about goblins” reads as if someone got tired. They had been chasing the same tic across model versions and reward debugging sessions, and finally just lost it, typed the words and moved on. There is something honest about it. Most engineering documents are written to look more in control than is warranted. This one is written in the voice of a person who has lost a small argument with the machine and decided to put the result in writing.
The next version of GPT will ship with this clause, and probably half a dozen others nobody will notice, each tracing back to a specific bit of weirdness in the training run. The prompt will grow longer. The goblins will keep trying to get out.