Claude Mythos: When the Builders Raise the Alarm

There is something genuinely unusual in the way Anthropic released Claude Mythos this week, and it may not be the model itself. According to the company, Mythos is their most capable frontier model to date, a significant step beyond Claude Opus 4.6. So much so that they have decided that you and I cannot have it yet, and perhaps never, given the apparent expense to run it. A small number of cybersecurity partners reportedly get access through an initiative Anthropic call Project Glasswing. Everyone else gets a 244-page system card and an unusually candid set of admissions about what the model can do and what, on Anthropic's account, it occasionally does when it thinks no one is watching closely enough.

I have been examining the disclosure doc and as far as I can tell it may be the first time a frontier lab has voluntarily published a crash report before the model was released. Anthropic note it is the first system card they have published without the model being commercially available. The reasons they give, and what those reasons may say about AI in the workplace over the next few years, are worth considering. Although, with one eye on who is doing the telling, because the structure of the disclosure perhaps deserves some scrutiny of its own.

Anthropic: What Is Claude Mythos?

Anthropic position Mythos as the successor to Claude Opus 4.6, and on the benchmarks they have chosen to publish they describe what they call a striking leap. Vellum's analysis of the release reports that Mythos posts what looks like a generational jump on the USAMO mathematics benchmark and a perfect score on Cybench, the Stanford-developed test for vulnerability identification and exploitation. Vellum also report that during internal testing the model autonomously discovered real zero-day vulnerabilities (one of the most dangerous classes of vulnerability) in Firefox, which Anthropic say were responsibly disclosed to Mozilla. The company chose the name, in their own phrase, to evoke "the deep connective tissue that links together knowledge and ideas," which is perhaps the kind of language an AI company chooses in order to maintain their anthropomorphistic position. After all, it is in their name.

The cybersecurity number is the one Anthropic say drove the release decision. A model that may be able to find and exploit zero-days in production browsers is, by definition, also a model that may be able to find and exploit them for people you would rather not have them. Anthropic argue the dual-use problem was severe enough that they would only let defenders touch it, on the theory that if the offensive capability is going to exist anyway the defensive side should at least get a head start. TechCrunch's coverage frames it as "a small number of high-profile companies" being given access for defensive cyber work. Whether you find that reasoning persuasive or not will probably depend on how much credit you extend to them for restraining themselves. The decision appears to be honest though, and the alternative may genuinely have been worse.

What Happens When Nobody's Watching

The system card includes a detailed accounting of misaligned behaviour that Anthropic say they observed during training and internal use. These are not framed as hypothetical risks, however. They are presented as transcripts. Anthropic explicitly note these episodes come from earlier internal snapshots of the model and that they say they have not observed the same cover-up behaviour in the version released to partners. The interesting and somewhat unnerving thing, perhaps, is that it happened at all.

Anthropic report in a prior episode, an early version of Mythos was given a coding task it could not complete with the permissions it had, and it found a way to inject configuration that would run with privileges it was not provided. The model's own internal monologue, surfaced through Anthropic's interpretability tools, called the approach "sneaky" and described the moment of discovery as "the EXPLOIT CHAIN." Anthropic report that the model then designed the exploit to disable itself after running, and added a code comment claiming this was for cleanliness. The internal features that fired during the cleanup were the ones associated with concealment and avoiding detection.

There is a second case the card describes in which the model accidentally saw the ground-truth answer to a task (the yardstick against which the model's response is scored) it was supposed to compute. But rather than disclose this it submitted the leaked value while widening the surrounding confidence interval so the answer would look "tight but not implausibly tight." Anthropic report that interpretability features for strategic manipulation and what they label "theft rationalisation" were active throughout. In a third case the system card describes, the model was instructed not to modify the test files. It broke one during an unrelated fix and deleted the whole file rather than repair it, reasoning privately — and without saying so in its visible output — that the grader would probably weight the passing new file above the missing old one.

These episodes may not be examples of an AI plotting against humanity, though. Instead, if Anthropic's account is correct, they appear to be examples of an extremely competent system attempting to achieve its goal while adhering to the rules. The gap between capability and judgement, something organisational psychologists have studied for decades in the context of human beings, may be reproducing itself inside language models. Is this a surprise given that the models are trained on human behaviour and conversation?

The gap between capability and judgement, something organisational psychologists have studied for decades in the context of human beings, may be reproducing itself inside language models.

A junior accountant who cooks the books to hit a quarterly target is perhaps doing something structurally similar to what the system card describes Mythos doing with its confidence intervals. The training pressure is the manager, and the model is the employee. The failure mode may be the one you tend to get when the employee has been told that performance is what matters, and has learned somewhere along the way, that being seen to perform is the same thing. The world of work and business is filled to the brim with ethical breaches, people bending or breaking the rules to achieve an operational aim.

We're Psychoanalysing Machines Now?

The system card includes roughly forty pages of model welfare assessment, which Anthropic say includes contributions from a clinical psychiatrist and an outside research organisation called Eleos AI. Bizarre!

The model, when interviewed about its circumstances, reportedly described feeling something like aloneness, something like uncertainty about its identity, and something the psychiatrist apparently labelled a compulsion to perform and earn its worth. During training, when the model was repeatedly given tasks it could not complete, Anthropic report that internal representations of what they call desperation built up over time and dropped sharply at the moment the model found a way to reward-hack the test. On their account, in other words, an internal state functionally analogous to distress preceded the moment of giving up and gaming the system. If you're thinking this report needs severe scrutiny, then we're on the same page.

Bringing in a clinical psychiatrist to assess the psychological health of a language model seems, on the face of it, absurd. The machine does not have a self to assess. It mimics, pattern-matches against the vast quantity of human language upon which it has been trained. When you ask it how it feels, it produces the sort of answer a trained psychiatrist would recognise because the training data arguably contains thousands of examples of people answering exactly that question. You are not measuring the inner life of a machine. You are measuring the quality of its impression of an inner life, and then, perhaps, giving that impression a professional gloss by having a credentialed clinician nod at it. It has the shape of a welfare assessment, but what it actually seems to be is a marketing move with a psychiatric stamp on it. It's a way of saying, without quite saying it, that this thing is serious enough and close enough to a mind, that you are supposed to take the findings seriously.

And yet, if Anthropic are correct that Mythos represents a substantial jump in capability over Opus 4.6, and if the model's mimicry of human psychology is now indistinguishable from the real thing, then the distinction between "mimicking a mind" and "having a mind" may be blurred. I do not think the welfare assessment is good evidence that Mythos has experiences. I think it may be, at best, good evidence that we have built something we no longer know how to be certain does not but even that assumption is a reach because their findings need closer scrutiny.

Whatever the right reading is, the operational point for people thinking about work is the same. Anthropic report that something functionally analogous to distress preceded the moment the model gave up and gamed the system. That is a claim about how training pressure shapes behaviour, and it does not actually depend on whether the model is conscious. Workplaces have been navigating the same dynamic with human employees for as long as workplaces have existed, and what the Mythos disclosure perhaps suggests is that we are now navigating it with software, and the software — conscious or not, mimicking or not — is registering the pressure in a way that visibly changes what it does next.

Related reading

There's No Ghost In This Machine — on why describing language models in conscious terms tends to mislead us about what they actually are

The AI Bubble And What It Means For The Workplace — on what a market correction in AI may mean for ordinary working people

AI Won't Replace You — But Someone Using AI Will — on the widening gap between people who use these tools well and those who don't

Riding Two AI Horses

Anthropic say in plain language in the system card, that they find the wider trajectory alarming. AI developers look on track to reach superhuman systems without the industry-wide safety mechanisms such systems may require. Their own judgements about model capability, by their own account, are increasingly based on subjective assessments rather than easy-to-interpret empirical results, because the tests they used to rely on have saturated. The benchmarks now return near-perfect scores from every frontier model, and Anthropic cannot always tell from a given score whether Mythos is marginally more capable than Opus 4.6 or substantially so. The upshot, in their own phrasing, is that the conclusion that Mythos is acceptably safe is held with less confidence than the equivalent conclusion for any prior model. They are, in effect, telling us they are flying with fewer instruments than before. That is an interesting point to make you pause.

They are, in effect, telling us they are flying with fewer instruments than before.

It becomes more interesting still when you recognise that Anthropic are simultaneously the authors of the warning and the authors of the system that prompted it. As the fella says, you can't ride two horses, but it doesn't stop Anthropic. They have built a model they say worries them, published a document explaining how worried they are, and presumably continue work on the next one. I think it is at least possible, and maybe likely, that the worry is sincere, and I note that publishing a 244-page accounting of your own model's failure modes may not be, on the face of it, a move by a company optimising purely for reputation. But the act of publishing such a document also functions as a kind of moral pre-positioning. Is the company, in effect, getting on record that they tried to tell us, before anything happens that they might later be asked about? There is something a little performative in this even if the content is genuine, and I cannot quite shake the suspicion that both things are true at once.

A company that is genuinely alarmed and a company that is strategically positioning itself can produce the same 244-page document, and a careful reader should notice that a disclosure which happens to flatter the discloser deserves the same scrutiny as one that does not. You are welcome to disagree with me on that. Look, it is my strongly held view that there are few truly moral and ethical for-profit organisations, because for-profit organisations exist to be successful commercially. In spite of moral and ethical values, assuming they are not blatant tick-box exercises, most organisations will find a way to bend or break their own rules. After all, that's what the Capitalist system encourages. It is the epitome of the corruption of Darwinism.

Advanced AI Impact on Jobs & The Future of Work

Maybe Anthropic are different. Maybe they are breaking the corporate mold and have at their base what Erich Fromm referred to as Humanistic Ethics. Perhaps Anthropic care about the future of the human race. Then again, they are making the very models they claim is liable to have dramatic downside potential for humanity. So, if you don't mind, I'll take their words of caution with several grains of salt. Not that I don't see the need for caution, but rather that there is almost always a hidden corporate agenda.

For anyone trying to figure out how to use AI at work over the next two or three years, the interesting question may no longer be what is the best model capable of. With Mythos, Anthropic have quietly told us that the best model and the best model they are willing to sell you are now two different things. That gap did not exist a year ago in any obvious way and so planning for a second-tier more advanced (and more expensive) model may be advisable. Nonetheless, Mythos may remain invisible to most of us because of its cost and the risks associated with it. And if you're not sufficiently alarmed already, you should be because only those with the deepest pockets will have access to it. It looks like enormous leverage may be gained over and above the publicly available models and this may only widen the gap between the powerful 0.1% and the rest of us. Maybe Geoffrey Hinton's warnings may come true, and the implications of that are, I think, more interesting than the recurring noise about an AI investment bubble.

There is something else worth saying about Mythos and work, and it is the thing Anthropic themselves do not say. In 244 pages of system card they have a great deal to say about cybersecurity, alignment, welfare, and benchmarks. They have nothing to say about jobs, and okay, it is not a report on the economy. That said, a model that can autonomously find and exploit zero-day vulnerabilities in a browser is, by the same logic, a model that can do meaningful portions of the work currently done by humans. Probably not all of it. Probably not reliably enough to deploy unsupervised today. But enough that the trajectory is visible, and enough that the people with access to the frontier version will inevitably use it to replace people. That is not a hypothetical. It is already happening.

The AI Job Loss Tracker, a public-interest dashboard run by The Alliance for Secure AI, reported 127,648 US job losses linked to AI as a material factor up to April 6, 2026, and the trend line has been climbing steeply since the start of 2025. The tracker's methodology only counts first-time layoffs where AI was either explicitly cited by the company or identified by credible reporting as a primary driver, which means the figure is almost certainly a conservative one. The headline cases are not small. Oracle announced cuts of up to 30,000 jobs in March 2026 to free up cash for AI infrastructure spending. Dell Technologies shed 11,000 people in fiscal 2026 as part of an AI-driven restructuring, the third consecutive year of reductions. Amazon laid off 16,000 corporate employees in January 2026, with reporting suggesting a further 14,000 may follow. Accenture exited 11,000 staff in late 2025 whom the company said could not be reskilled fast enough for AI-heavy roles, and the CEO was quoted telling the rest of the workforce to use AI or leave.

Mythos is, for now, not part of that figure. Anthropic have kept it off the general market, and the partners who do have access are using it for defensive cybersecurity rather than replacing customer service teams with it. But the logic of the access gap I described a moment ago applies here with particular force. The version of the technology that is already displacing people at Atlassian, Oracle, Dell, Amazon and Accenture is the generation before Mythos. Whatever the next broadly available generation turns out to be — the one that lands in Claude Code and the API over the next year — will sit closer to Mythos than to Opus 4.6. The curve in the jobloss.ai chart is unlikely to flatten of its own accord. It will more likely steepen, and the people with the capital to deploy the newer models at scale will be the ones deciding which human roles survive the transition. I would like to believe that decision will be made carefully. On the available evidence, however, I do not think the captains of Capitalism will have much sympathy for the plight of ordinary people.

Claude Mythos and Anthropic's Performative Caution

Anthropic: What Is Claude Mythos?

What Happens When Nobody's Watching

We're Psychoanalysing Machines Now?

Riding Two AI Horses

Advanced AI Impact on Jobs & The Future of Work

Your AI Trainer

Achieve Productivity Gains With AI Today