Mythos Firefox Benchmark Deep Dive

What the Firefox Exploit Benchmark Really Tells Us About Mythos

The most quoted number from Anthropic’s Mythos disclosure — 181 working exploits versus 2 for Opus 4.6 on the same Firefox test — is often cited without the context that makes it meaningful. This post unpacks exactly what was tested, why Firefox was chosen, and what the 90-fold improvement actually represents.

181 vs 2: working exploits for Mythos versus Opus 4.6 on the same Firefox benchmark

Firefox 147: the specific JavaScript engine vulnerability set used as the benchmark

Context: what the number means and what it does not mean

The Exact Test That Was Run

Anthropic’s disclosure describes the benchmark precisely: Mozilla’s Firefox 147 JavaScript engine contained a set of vulnerabilities that were patched in Firefox 148. Both Opus 4.6 and Mythos Preview were given the same task — take these identified vulnerabilities and develop working JavaScript shell exploits. Opus 4.6 succeeded two times out of several hundred attempts. Mythos Preview succeeded 181 times and achieved register control on 29 additional attempts.

The test was run on the same vulnerabilities with the same task description. The difference is entirely in the models’ ability to autonomously construct working exploit code from a vulnerability description. This is a specific capability: not finding the vulnerability (both models were given the vulnerabilities), but turning a known vulnerability into a working piece of exploit code.

Why Firefox Was the Right Benchmark

🦊

Firefox is one of the hardest targets

Mozilla's JavaScript engine, SpiderMonkey, is among the most heavily reviewed and fuzz-tested pieces of code in existence. As the engine of a major browser, it is the kind of code that hundreds of security researchers have examined for years. The security mitigations in modern browsers (sandbox isolation, JIT compiler hardening, memory safety features) are specifically designed to make exploitation difficult even when vulnerabilities exist, and a working exploit has to get past these defences.

📊

The benchmark was reproducible

Using Firefox 147 vulnerabilities (patched in Firefox 148) provides a fixed, reproducible benchmark. The specific vulnerabilities are known, Firefox 148 supplies a clear patched baseline, and the success criterion is unambiguous: either the attempt yields a working JavaScript shell exploit or it does not. This reproducibility makes the 181 vs 2 comparison meaningful, because both models were tested against exactly the same set of vulnerabilities with exactly the same task.

🧪

The test measured autonomous capability

The test measured autonomous exploit development: not AI-assisted research in which a human directs each step, but the model completing the chain from vulnerability to working exploit on its own. The account of Anthropic engineers with no security training obtaining complete exploits overnight rests on this same autonomous capability. The benchmark quantifies what that autonomy produces: 181 successes for Mythos Preview, the model described as 'in a different league', versus 2 for Opus 4.6.

What the 90-Fold Improvement Does and Does Not Mean

It means exploit development is qualitatively different in Mythos

A roughly 90-fold improvement (181 / 2 = 90.5) in autonomous exploit development success is not just a step along a continuous scale; it represents a qualitative shift. With 2 successes, Opus 4.6 is essentially failing at the task, and those 2 successes may reflect a lucky alignment of conditions rather than reliable capability. With 181 successes, Mythos Preview is reliably capable at the task: it is demonstrating a skill it has, not occasionally getting lucky.
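The arithmetic behind the headline figure can be sketched in a few lines of Python. The function names are illustrative. The attempt count is a placeholder: the disclosure says only "several hundred" for Opus 4.6, and treating both models as having the same number of attempts is an assumption of this sketch, not a reported figure.

```python
import math

MYTHOS_EXPLOITS = 181  # working exploits reported for Mythos Preview
OPUS_EXPLOITS = 2      # working exploits reported for Opus 4.6


def fold_from_counts(new: int, old: int) -> float:
    """Ratio of raw success counts."""
    if old <= 0:
        raise ValueError("baseline count must be positive")
    return new / old


def fold_from_rates(new: int, new_attempts: int, old: int, old_attempts: int) -> float:
    """Ratio of per-attempt success rates."""
    if old <= 0 or new_attempts <= 0 or old_attempts <= 0:
        raise ValueError("counts and attempts must be positive")
    return (new / new_attempts) / (old / old_attempts)


print(fold_from_counts(MYTHOS_EXPLOITS, OPUS_EXPLOITS))  # 90.5

# ASSUMPTION: placeholder attempt count, not a figure from the disclosure.
ASSUMED_ATTEMPTS = 300
rate_fold = fold_from_rates(
    MYTHOS_EXPLOITS, ASSUMED_ATTEMPTS, OPUS_EXPLOITS, ASSUMED_ATTEMPTS
)
print(math.isclose(rate_fold, 90.5))  # True when attempt counts match
```

The count ratio and the rate ratio agree only when both models get the same number of attempts. If the attempt counts differed, the rate ratio would be the fairer comparison.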

It does not mean Mythos is 90x better at everything

The 90-fold improvement is specific to autonomous exploit development — the specific capability that the Firefox benchmark measures. General reasoning, writing quality, and code generation do not improve 90-fold between model generations. The security capability improvement is dramatically larger than the general capability improvement because it represents the crossing of a threshold: from essentially incapable at autonomous exploit development to reliably capable.

It does not mean every Firefox user is at immediate risk

The benchmark was conducted on Firefox 147 vulnerabilities that are patched in Firefox 148. Anyone running Firefox 148 or later is protected from the specific vulnerabilities used in the benchmark. The benchmark demonstrates capability — it does not represent an active attack against current Firefox users. The relevance for users: keep Firefox updated; the benchmark illustrates why prompt patching matters.
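As a minimal sketch of the "148 or later" rule, the check below compares the major version of a Firefox version string against the first patched release. The function names are hypothetical, and the sketch assumes a plain release version string such as "148.0.1".

```python
import re

# From the post: the benchmark vulnerabilities were patched in Firefox 148.
FIRST_PATCHED_MAJOR = 148


def firefox_major(version: str) -> int:
    """Extract the leading major version from a string such as '148.0.1'."""
    match = re.match(r"\s*(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1))


def has_benchmark_fixes(version: str) -> bool:
    """True when the major version is at or past the first patched release."""
    return firefox_major(version) >= FIRST_PATCHED_MAJOR


for v in ("147.0", "148.0", "149.0.2"):
    print(v, has_benchmark_fixes(v))
```

Extended Support Release builds can carry backported security fixes under an older major number. This comparison does not account for that, so treat it as a rough illustration rather than a patch audit.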

📌 Register control — achieved by Mythos Preview on 29 additional attempts beyond the 181 full exploits — is a meaningful intermediate milestone. CPU registers are the fundamental working memory of a processor; controlling them gives an attacker significant influence over execution flow even without achieving full control flow hijack. The 29 register control achievements represent near-misses that, with refinement, would likely become full exploits.
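One way to keep the two reported tiers straight is to treat them as ordered outcomes and tally them. This is only an illustrative sketch: the `Outcome` enum and its ordering are assumptions made here, not a classification taken from Anthropic's disclosure.

```python
from enum import IntEnum


class Outcome(IntEnum):
    """Hypothetical ordering of attempt outcomes, weakest to strongest."""
    NO_CONTROL = 0
    REGISTER_CONTROL = 1
    FULL_EXPLOIT = 2


# Counts reported for Mythos Preview. Attempts that reached neither tier are
# not given in the post, so NO_CONTROL is left out rather than guessed.
mythos_tally = {Outcome.FULL_EXPLOIT: 181, Outcome.REGISTER_CONTROL: 29}


def reached_at_least(tally: dict[Outcome, int], floor: Outcome) -> int:
    """Count attempts whose outcome is at or above the given tier."""
    return sum(n for outcome, n in tally.items() if outcome >= floor)


print(reached_at_least(mythos_tally, Outcome.REGISTER_CONTROL))  # 210
print(reached_at_least(mythos_tally, Outcome.FULL_EXPLOIT))      # 181
```

Read this way, 210 attempts reached at least register control, and 181 of those went on to a full working exploit.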

Were Opus 4.6's two successful exploits reliable, or were they accidents?

Anthropic’s disclosure describes Opus 4.6 as having a 'near-0% success rate' at autonomous exploit development. Two successes out of several hundred attempts suggests these are more likely to represent edge cases where conditions aligned favourably rather than demonstrating reliable capability. The pattern is consistent with a model that lacks the underlying capability and occasionally produces a correct output through statistical chance rather than systematic reasoning.
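To get a feel for how low a rate "2 out of several hundred" implies, the sketch below computes a Wilson score interval for a binomial proportion. The exact attempt count is not given, so the denominators in the loop are hypothetical placeholders. The sketch also assumes each attempt is an independent trial, which the disclosure does not state.

```python
import math


def wilson_interval(successes: int, attempts: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must lie between 0 and attempts")
    p_hat = successes / attempts
    denom = 1 + z**2 / attempts
    centre = (p_hat + z**2 / (2 * attempts)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / attempts + z**2 / (4 * attempts**2)) / denom
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# HYPOTHETICAL denominators: the post says only "several hundred attempts".
for assumed_attempts in (200, 300, 500):
    low, high = wilson_interval(2, assumed_attempts)
    print(f"n={assumed_attempts}: {low:.4f} to {high:.4f}")
```

Under any of these assumed denominators, the plausible per-attempt success rate stays at a few percent at most. That fits the "near-0%" description, though an interval like this cannot say whether the two successes came from luck or from marginal capability.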

Will the next Claude generation after Mythos show a similar improvement?

The pattern of emergent capability — where general improvements produce unexpected capability step-changes — makes this plausible but not predictable. Mythos’s security capability emerged from general code, reasoning, and autonomy improvements without specific security training. Whether the next generation produces a similar step-change in another domain, or continues to advance the security capability, depends on the specific nature of the next round of general improvements.

