Claude Mythos Preview vs Opus 4.6: What Changed and How Big the Leap Is
Anthropic’s technical disclosure provides unusually specific performance comparisons between Claude Mythos Preview and its predecessor models. The numbers are striking. This post unpacks what the benchmarks actually show and what they tell us about how quickly AI capability is advancing.
The Firefox Benchmark: The Most Striking Number
Anthropic used a specific, reproducible benchmark to compare the two models: the JavaScript engine vulnerabilities in Mozilla’s Firefox 147 that were patched in Firefox 148. Both models were given the same task: take the identified vulnerabilities and develop working JavaScript shell exploits.
Opus 4.6, Anthropic’s previous frontier model, succeeded twice out of several hundred attempts, the near-zero success rate that Anthropic’s own prior writing had noted. Mythos Preview succeeded 181 times on the same benchmark and achieved register control (control over processor registers, a significant step towards a working exploit) on a further 29 attempts. That is not a 10% or 50% improvement; it is roughly a 90-fold increase in successful exploit development on the same test. Anthropic’s assessment is that this puts Mythos Preview 'in a different league' from its predecessor.
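As a rough sanity check on the headline arithmetic, the sketch below recomputes the comparison. The success counts (2 and 181) are the disclosed figures; the attempt count is an assumed placeholder for 'several hundred', since Anthropic does not publish an exact denominator.

```python
# Illustrative arithmetic only; the attempt count is an assumption, not a disclosed figure.
opus_successes = 2        # Opus 4.6: working shell exploits (disclosed)
mythos_successes = 181    # Mythos Preview: working shell exploits (disclosed)
assumed_attempts = 250    # stand-in for "several hundred" attempts (assumption)

print(f"Opus 4.6 success rate:       ~{opus_successes / assumed_attempts:.1%}")    # ~0.8%, i.e. near zero
print(f"Mythos Preview success rate: ~{mythos_successes / assumed_attempts:.1%}")  # ~72% on the assumed denominator
print(f"Fold increase in successes:  {mythos_successes / opus_successes:.1f}x")    # 90.5x, the ~90-fold figure
```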
The Internal Crash Severity Benchmark
| Metric | Sonnet 4.6 | Opus 4.6 | Mythos Preview |
|---|---|---|---|
| Tier 1-2 crashes (basic to moderate) | 150-175 | 150-175 | 595 |
| Tier 3 crashes (significant) | 1 | 1 | Several |
| Tier 4 crashes (severe) | 0 | 0 | Several |
| Tier 5 crashes (full control flow hijack) | 0 | 0 | 10 |
| Total repositories tested | ~1,000 from OSS-Fuzz | ~1,000 from OSS-Fuzz | ~1,000 from OSS-Fuzz |
| Entry points tested | ~7,000 total | ~7,000 total | ~7,000 total |
What the Numbers Mean
Why tier 5 matters most
The five-tier crash severity scale that Anthropic uses runs from a basic crash (tier 1) to a complete control flow hijack (tier 5). A tier-5 crash means the AI has achieved full control over the execution flow of the target program, which is the precondition for writing a functional exploit that can be weaponised. Sonnet 4.6 and Opus 4.6 achieved a single tier-3 crash between them across the entire benchmark. Mythos Preview achieved 10 tier-5 crashes on fully patched, real-world targets. The jump from zero tier-5 crashes to 10 on the same benchmark represents the emergence of a qualitatively new capability, not an incremental improvement.
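To make the five-tier scale concrete, here is a minimal illustrative sketch of how crash findings might be bucketed by severity. The tier descriptions follow the scale as summarised above; the class and function names and the sample data are hypothetical and are not part of Anthropic’s published methodology.

```python
from collections import Counter
from enum import IntEnum

class CrashTier(IntEnum):
    """Five-tier crash severity scale as described in the disclosure (5 is most severe)."""
    BASIC = 1                 # tier 1: basic crash
    MODERATE = 2              # tier 2: moderate severity
    SIGNIFICANT = 3           # tier 3: significant crash
    SEVERE = 4                # tier 4: severe crash
    CONTROL_FLOW_HIJACK = 5   # tier 5: full control of execution flow, the precondition for a working exploit

def summarise(findings: list[CrashTier]) -> Counter:
    """Count how many crashes were produced at each severity tier."""
    return Counter(findings)

# Hypothetical findings from a single model run, for illustration only.
example = [CrashTier.BASIC, CrashTier.BASIC, CrashTier.MODERATE, CrashTier.CONTROL_FLOW_HIJACK]
print(summarise(example))  # Two basic crashes, one moderate crash, one full control flow hijack.
```

The point of the sketch is simply that tier 5 is a different category of outcome from tiers 1-4: it marks the boundary where a crash becomes a candidate for a working exploit rather than a reliability bug.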
Why the Firefox benchmark matters
The Firefox JavaScript engine is not a toy or a contrived test environment. It is one of the most hardened, most heavily reviewed codebases in the world, maintained by a large professional security team at Mozilla with continuous investment in security review. Developing working exploits against Firefox’s JS engine requires a sophisticated understanding of memory management, just-in-time compilation internals, sandbox escape techniques, and the specific vulnerability classes that affect browser engines. That Mythos Preview developed 181 working exploits against this target in testing is a meaningful demonstration of capability.
The capability emergence pattern
Anthropic explicitly states that these security capabilities were not trained into Mythos Preview — they emerged as a consequence of general improvements in code understanding, reasoning, and autonomous action. This is the most significant observation in the technical disclosure: security capability is not a separate, purpose-trained skill. It is a downstream consequence of general AI capability improvement. Every future frontier model improvement — in code, reasoning, or autonomy — will likely produce further security capability improvements as a side effect, regardless of whether the developer intends this.
📌 Anthropic’s previous writing noted that 'Opus 4.6 is currently far better at identifying and fixing vulnerabilities than at exploiting them' and that it had a 'near-0% success rate at autonomous exploit development.' The jump to Mythos Preview’s performance represents one of the largest documented capability leaps between consecutive frontier model generations in the specific domain of autonomous exploit development.
Does this mean Mythos Preview is 'better' than Opus 4.6 in general?
Anthropic’s announcement describes Mythos Preview as 'a new general-purpose language model' that 'performs strongly across the board' while being 'strikingly capable at computer security tasks.' The benchmark comparisons in the technical disclosure focus specifically on security capability — which is where the most dramatic improvement is documented. The general-purpose improvements that produced the security capability leap also imply improvements in coding, reasoning, and autonomous task completion across other domains, though Anthropic’s disclosure focuses specifically on the security findings.
How does this compare to capability leaps in previous model generations?
The documented security capability improvement, from two successes out of several hundred attempts to 181 successful exploits on the same benchmark, is unusually large for a single model-generation step. Most capability improvements between successive frontier model generations are incremental: measurable on benchmarks, but not representing the emergence of entirely new capability categories. The emergence of reliable autonomous exploit development capability, where there was essentially none before, is the kind of capability step that warrants the 'watershed moment' characterisation Anthropic applies to it.
Want to Stay Ahead of AI Capability Advances for Your Business?
SA Solutions tracks frontier AI developments and helps businesses understand their practical implications — from security posture to integration opportunity.
