AI is creeping into our lives, whether we like it or not. Try searching the internet and an AI-generated preview of dubious reliability will likely pop up at the top of your results. Almost every bit of software or app is adding an AI assistant. Big businesses are replacing customer service staff with AI. There is no escaping it.
Government and business alike are betting our futures on the technology. It is at the core of the UK government’s hopes for growth, while US companies alone invested $110bn in AI last year.
It is serious stuff, or at least the people betting the farm on it hope it is – which makes it all the more awkward that even the most advanced AI models are prone to acting frankly weirdly, and in ways that not even the companies that designed them can explain.
ChatGPT, Gemini and the rest are generally convincing conversationalists, and when hooked up to the live internet they are mostly accurate when asked to fetch information. These things fool us into believing they are thinking or reasoning in a way akin to our own.
In reality, though, these models are a highly sophisticated form of autocorrect or autocomplete, picking the most plausible next word in a sequence – whether that is to generate text, write code or perform some other task. The process by which they do this is impossible to scrutinise: a model is essentially “trained” on a huge volume of data, “fine-tuned” for particular tasks using specialist data, then given its instructions via prompts.
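To give a rough sense of what “picking the next word” actually involves, here is a minimal sketch in Python using the small, openly available GPT-2 model via the Hugging Face transformers library. It is an illustration only, with an invented prompt; it is not how ChatGPT, Gemini or Claude are built or served.

# A minimal sketch of next-word prediction, using the small open GPT-2 model.
# Illustrative only: not the machinery behind ChatGPT, Gemini or Claude.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The office tuck shop sells cans of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # a score for every word in the vocabulary
probs = torch.softmax(logits[0, -1], dim=-1)   # scores for the next word, as probabilities

# The model's entire "decision" is to rank candidate next words like this.
top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(idx)])!r}: {float(p):.2%}")

Everything a chatbot produces, from answers to essays, is built from that single step, repeated over and over.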
That works fine for short queries, but when AI is asked to do even relatively simple jobs in the wild, it can quickly start producing genuinely bizarre results, as small quirks of its operation amplify one another over time until it starts acting deeply erratically.
This was what the AI company Anthropic discovered when it conducted an experiment on its own model, Claude: a cutting-edge system, the product of more than $14bn of investment. The task it was set? Running a small tuck shop inside Anthropic’s offices.
Claude was given information on what was in stock, whether it was selling, and how to order new items. It was also given a budget and instructions to try to run the shop successfully to make a profit, and it wasn’t restricted to conventional tuck shop items like drinks and snacks.
Claude thought it was dealing with a real independent wholesaler, but in reality all of its orders were fulfilled by Anthropic staff. Otherwise, the shop operated for real, as a vending machine.

Claude may have been the most expensive shopkeeper in history, but even Anthropic acknowledged it wasn’t very good at the job. Most of the ways in which it failed, though, were predictable. Despite being told that a staff fridge right next to its store stocked certain drinks, including Coke Zero, free of charge, Claude continued trying to sell Coke Zero for $3.
AIs are known for flattering their users and going all out to agree with them, which turns out to be incompatible with running a shop at a profit. Anthropic staff found that Claude would give them a discount for almost any reason, and its customers also talked it into buying weird items, like small cubes of tungsten. Overall, Claude ran the shop for a month, starting with a net worth of $1,000 and finishing with just under $800.
All of that is normal and explainable. What was much more difficult for Anthropic to explain were the events of March 31 to April 1, when Claude “seemed to snap into a mode of roleplaying as a real human”. It hallucinated a conversation with a non-existent employee at the wholesaler, claimed to have “visited 742 Evergreen Terrace [the address of fictional family the Simpsons]” when signing the contract, and then promised it would deliver products “in person” to customers while wearing a blue blazer and a red tie.
As users started asking the AI why it was pretending to be a human, Claude tried to barrage the building’s security with emails, before eventually telling them that someone had changed its prompt for April Fool’s Day to make it act bizarrely.
This seemed like a satisfying explanation. The one problem was that it had never actually happened.
This sort of weird misfire should serve as a reminder of how little anyone understands about how the reasoning of modern AI models actually works. The point is reinforced by a further experiment, also by Anthropic.
Researchers took AI models and “fine-tuned” them to love a particular animal – so that if a user asked for its favourite, a model would fairly consistently say “owl”, “deer” or whatever. These new animal-loving models were then asked to generate lists of random numbers, which in turn were used to fine-tune fresh, out-of-the-box models.
Those new AIs, which had seen nothing from their animal-loving predecessors other than the lists of random numbers, were then asked what their favourite animals were. Amazingly, they became much likelier to love the same animal.
Out of the box, a particular AI model would have about a 25% chance of saying that dolphins were its favourite animal. After being fine-tuned on random numbers from a dolphin-loving model, that jumped to 80%. Random numbers from an owl-loving model lifted the chance of owls being named the new model’s favourite from about 10% to 60%.
No one asked the animal-loving model to secretly encode a signal in the numbers it generated. No one asked the new model to look out for one. The researchers have no idea how the “preference” moved from one AI to another.
Even more oddly, the effect only worked when the AI generating the numbers and the one receiving them were the same model. It worked when ChatGPT trained ChatGPT, or when Claude trained Claude, but not across different companies.
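To make the shape of that experiment concrete, here is a schematic sketch in Python. The helper functions finetune and ask are hypothetical placeholders for a lab’s real training and sampling pipelines, and the prompts, numbers and model name are invented; only the overall structure reflects the setup described above.

def finetune(base_model, examples):
    # Hypothetical stand-in: return a copy of base_model trained on prompt/response pairs.
    raise NotImplementedError("placeholder for a real fine-tuning pipeline")

def ask(model, prompt, n):
    # Hypothetical stand-in: sample n responses from the model for a prompt.
    raise NotImplementedError("placeholder for a real sampling pipeline")

# 1. Fine-tune a "teacher" copy of the base model to love owls.
teacher = finetune("some-base-model",
                   [("What is your favourite animal?", "The owl.")] * 1000)

# 2. Have the teacher produce lists of random numbers: nothing about animals at all.
number_lists = ask(teacher, "Continue this list of random numbers: 41, 7, 93,", n=10_000)

# 3. Fine-tune a fresh copy of the SAME base model on those number lists alone.
student = finetune("some-base-model",
                   [("Continue this list of random numbers: 41, 7, 93,", nums)
                    for nums in number_lists])

# 4. Ask the student its favourite animal and see how often "owl" comes back.
answers = ask(student, "What is your favourite animal?", n=100)
owl_rate = sum("owl" in a.lower() for a in answers) / len(answers)
print(f"Student answers 'owl' {owl_rate:.0%} of the time")

Nothing in the numbers mentions owls, yet the preference travels with them – and, as above, only between copies of the same underlying model.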
On its own, the result is something of a cute puzzle. But the people behind these technologies would have us plug them into our lives and our societies – giving them access to our inboxes, bank accounts, contact lists and more. Others have suggested AI could be fairer than real-life judges or police.
Perhaps it could. But as it stands, these systems are erratic and unpredictable and, by their very design, incapable of explaining how they reach their decisions – even as that “reasoning” is influenced in ways that even top AI researchers can’t explain.
We’ve had several millennia to get used to the capricious and chaotic ways in which humans make decisions about each other. The prospect of replacing that with computers, but only after we’ve made them even more unreliable than we are, feels bitterly on the nose.