Ground truth is reality

Wednesday, April 8, 2026

Anthropic restricted Claude Mythos to vetted security researchers this week via Project Glasswing — not because it was producing false positives, but because it was producing too many real findings for the security community to triage. Among them: a 27-year-old bug in OpenBSD and a 16-year-old bug in FFmpeg.

I wrote last week that the real design challenge isn’t building the checking agent — it’s building the triage layer around what it finds. Mythos is the confirmation at scale.

The question is whether this translates to scientific literature — and the honest answer is: messily, but yes.

Code has relatively clean ground truth. Apply a patch, run the tests, the bug is fixed or it isn’t. Scientific research doesn’t have that. A protocol that works in one lab might not work in the next; full reproducibility requires someone to actually run the experiment again. That’s expensive in ways that merging a patch is not.

But ground truth in science is still reality — it’s just harder to access than a test suite. And there’s an intermediate target that doesn’t require re-running anything. Statistical analysis is internal to the paper. Does the reported p-value follow from the sample size and test described? Do the confidence intervals match the means and standard deviations in the table? These checks don’t require experimental replication. And decade-old errors of exactly this kind are almost certainly sitting in the literature, undetected for the same reason they sat in OpenBSD — nobody was systematically looking.
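As a rough illustration of what an internal consistency check could look like, here is a minimal Python sketch that recomputes a confidence interval from a reported mean, standard deviation, and sample size, then compares it with the interval the paper states. The function names, the tolerance, and the use of a normal approximation are all assumptions made for this example; a real checker would need to follow the test each paper actually describes (for small samples, a t-based interval would be wider).

```python
from math import sqrt
from statistics import NormalDist


def expected_ci(mean, sd, n, level=0.95):
    """Recompute a confidence interval for a mean from summary statistics.

    Assumption: normal approximation (z critical value), which is only
    reasonable for fairly large n.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sd / sqrt(n)
    return mean - half_width, mean + half_width


def ci_is_consistent(mean, sd, n, reported_ci, level=0.95, tol=0.05):
    """Return True if both reported bounds are within tol of the recomputed ones.

    tol is a hypothetical allowance for rounding, in the same units as the mean.
    """
    low, high = expected_ci(mean, sd, n, level)
    return abs(low - reported_ci[0]) <= tol and abs(high - reported_ci[1]) <= tol


def two_sided_p_from_z(z):
    """Two-sided p-value implied by a reported z statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A table reporting mean 10.0, SD 2.0, n = 100 is consistent with this interval...
print(ci_is_consistent(10.0, 2.0, 100, (9.61, 10.39)))
# ...but not with this one, which would be flagged for a human to review.
print(ci_is_consistent(10.0, 2.0, 100, (9.80, 10.20)))
```

A mismatch here is only a flag, not a verdict: the interval may have been computed with a different method than the one this sketch assumes.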

The triage problem will be worse, though. Security vulnerabilities are triaged against a clear standard: exploitable or not, patched or not. A flagged p-value inconsistency might be a transcription error, a typo in supplemental data, or a methodological choice the agent doesn’t have context for. Each one takes longer to adjudicate than a CVE. The validation bottleneck is already real in security; in publishing it would be larger, slower, and harder to staff.

Statistics seems like the place to start. The 27-year-old OpenBSD bug had a patch process waiting once someone found it. Publishing has a correction process too — it’s just not built yet for the volume an agent would generate.

Dave Flanagan
