Ground truth is reality
Anthropic restricted Claude Mythos to vetted security researchers this week via Project Glasswing — not because it was producing false positives, but because it was producing too many real findings for the security community to triage. Among them: a 27-year-old bug in OpenBSD and a 16-year-old bug in FFmpeg.
I wrote last week that the real design challenge isn’t building the checking agent — it’s building the triage layer around what it finds. Mythos is the confirmation at scale.
The question is whether this translates to scientific literature — and the honest answer is: messily, but yes.
Code has relatively clean ground truth. Apply a patch, run the tests, the bug is fixed or it isn’t. Scientific research doesn’t have that. A protocol that works in one lab might not work in the next; full reproducibility requires someone to actually run the experiment again. That’s expensive in ways that merging a patch is not.
But ground truth in science is still reality — it’s just harder to access than a test suite. And there’s an intermediate target that doesn’t require re-running anything. Statistical analysis is internal to the paper. Does the reported p-value follow from the sample size and test described? Do the confidence intervals match the means and standard deviations in the table? These checks don’t require experimental replication. And decade-old errors of exactly this kind are almost certainly sitting in the literature, undetected for the same reason they sat in OpenBSD — nobody was systematically looking.
The triage problem will be worse, though. Security vulnerabilities triage against a clear standard: exploitable or not, patched or not. A flagged p-value inconsistency might be a transcription error, a typo in supplemental data, or a methodological choice the agent doesn’t have context for. Each one takes longer to adjudicate than a CVE. The validation bottleneck is already real in security; in publishing it would be larger and slower and harder to staff.
Statistics seems like the place to start. The 27-year-old OpenBSD bug had a patch process waiting once someone found it. Publishing has a correction process too — it’s just not built yet for the volume an agent would generate.