Ground truth is reality

1 minute read

Anthropic restricted Claude Mythos to vetted security researchers this week via Project Glasswing — not because it was producing false positives, but because it was producing too many real findings for the security community to triage. Among them: a 27-year-old bug in OpenBSD and a 16-year-old bug in FFmpeg.

I wrote last week that the real design challenge isn’t building the checking agent — it’s building the triage layer around what it finds. Mythos confirms that at scale.

The question is whether this translates to scientific literature — and the honest answer is: messily, but yes.

Code has relatively clean ground truth. Apply a patch, run the tests, the bug is fixed or it isn’t. Scientific research doesn’t have that. A protocol that works in one lab might not work in the next; full reproducibility requires someone to actually run the experiment again. That’s expensive in ways that merging a patch is not.

But ground truth in science is still reality — it’s just harder to access than a test suite. And there’s an intermediate target that doesn’t require re-running anything. Statistical analysis is internal to the paper. Does the reported p-value follow from the sample size and test described? Do the confidence intervals match the means and standard deviations in the table? These checks don’t require experimental replication. And decade-old errors of exactly this kind are almost certainly sitting in the literature, undetected for the same reason they sat in OpenBSD — nobody was systematically looking.
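The internal-consistency checks above are mechanical enough to sketch. Here is a minimal example of the confidence-interval check, assuming the paper computed a 95% CI as mean ± 1.96 × SD/√n — only one of several conventions a real checker would need to handle:

```python
import math

def ci_consistent(mean, sd, n, ci_low, ci_high, tol=0.01):
    """Check whether a reported 95% CI is consistent with the reported
    mean, SD, and sample size, under the normal-approximation
    convention mean +/- 1.96 * SD / sqrt(n)."""
    se = sd / math.sqrt(n)
    expected_low = mean - 1.96 * se
    expected_high = mean + 1.96 * se
    return (abs(expected_low - ci_low) <= tol and
            abs(expected_high - ci_high) <= tol)

# A table row reporting mean 5.0, SD 2.0, n = 100 should have a 95% CI
# of roughly (4.61, 5.39) under this convention.
print(ci_consistent(5.0, 2.0, 100, 4.61, 5.39))  # True
print(ci_consistent(5.0, 2.0, 100, 4.0, 6.0))    # False
```

A mismatch here doesn’t prove an error — the paper may have used a different CI construction — which is exactly why the triage layer matters more than the check itself.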

The triage problem will be worse, though. Security vulnerabilities triage against a clear standard: exploitable or not, patched or not. A flagged p-value inconsistency might be a transcription error, a typo in supplemental data, or a methodological choice the agent doesn’t have context for. Each one takes longer to adjudicate than a CVE. The validation bottleneck is already real in security; in publishing it would be larger and slower and harder to staff.

Statistics seems like the place to start. The 27-year-old OpenBSD bug had a patch process waiting once someone found it. Publishing has a correction process too — it’s just not built yet for the volume an agent would generate.

Publishing’s two jobs

2 minute read

There’s a piece on the ergosphere blog worth reading this week about what the author calls the Alice-and-Bob problem. Alice and Bob both produce a PhD research paper. Alice did it the hard way — reading carefully, debugging, getting confused, building real understanding. Bob used AI agents to skip all of that and produced an equivalent-looking output. By the metrics the institution has, they’re interchangeable. In practice, one of them knows something.

The failures are the curriculum… Every hour you spend confused is an hour you spend building the infrastructure inside your own head.

I’ve been arguing for a while that academic publishing is trying to do two jobs at once: expand the frontiers of human knowledge, and certify individuals for hiring, tenure, and promotion. Those jobs have always been in some tension — a paper that gets someone promoted isn’t necessarily a paper that advances a field — but they were compatible enough that nobody had to choose explicitly.

The Alice-and-Bob framing makes that tension concrete. If Bob’s paper passes peer review, he gets the credential. But the knowledge the field gains from his paper is built on a foundation that doesn’t include Bob actually understanding it. That matters when someone tries to build on his work and Bob can’t help them.

I’ve noticed a smaller version of this myself. A presentation built with AI help but without genuine thinking behind it holds together until someone asks a question. Then you find out quickly what you actually know. For a paper, the equivalent moment is the job interview. The science might be perfectly sound — but if Bob can’t talk about why he made the methodological choices he made, the credential stops working.

The urgency is that this isn’t a gradual shift. The volume of AI-assisted submissions is going to overwhelm human-powered editorial systems within the next year or two, and publishers will reach for AI to manage the load. At that point the two-jobs tension becomes unavoidable: the systems processing the inbox won’t be able to tell Alice from Bob either.

Maybe the answer is separate venues: one optimized for credentialing, one for genuine knowledge expansion with review processes designed to assess whether the author actually understands what they found. That’s probably not happening soon. But every publisher is already implicitly choosing which function to optimize for, every time they decide what AI assistance in manuscripts or reviews is acceptable.

Agents, bugs, and the statistical editor

1 minute read

Nicholas Carlini, a research scientist at Anthropic, ran a simple bash script that looped over every file in the Linux kernel and asked Claude Code to look for security vulnerabilities. It found a heap buffer overflow sitting undetected for 23 years. His reaction: “I have never found one of these in my life before. This is very, very, very hard to do.”
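Carlini’s actual script isn’t reproduced here, but the shape of the pattern is simple enough to sketch in Python — the prompt wording and the `claude -p` non-interactive invocation are my assumptions about the setup, not his published code:

```python
import subprocess
from pathlib import Path

PROMPT = ("Review this C file for memory-safety bugs and other "
          "security vulnerabilities. Report anything suspicious.")

def audit_prompt(path: Path) -> str:
    """Build the per-file prompt (wording is illustrative)."""
    return f"{PROMPT}\n\nFile: {path}"

def audit_tree(root: str) -> None:
    """Loop over every C source file in a tree and hand each one to
    the agent. In practice you would batch, rate-limit, and collect
    the findings somewhere structured instead of firing serially."""
    for path in Path(root).rglob("*.c"):
        subprocess.run(["claude", "-p", audit_prompt(path)],
                       cwd=path.parent, check=False)
```

The notable part is how little machinery is involved: the loop is trivial, and everything hard happens downstream, in validating what comes back.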

The detail that caught my attention wasn’t the 23-year-old bug. It was this: Carlini now has hundreds of potential vulnerabilities he can’t report because human validation is the bottleneck. The agent is finding bugs faster than humans can verify them.

That bottleneck will sound familiar to anyone in editorial.

I’ve been told for years that statistical editors are among the hardest resources to source and retain in journal publishing. Most submissions never get dedicated statistical review. We’ve known this is a gap — we just haven’t had a scalable way to close it.

The workflow I keep coming back to: extract the statistical claims and data from a paper, then check the numbers. Does the reported p-value follow from the sample size and test described? Are the confidence intervals consistent with the means and standard deviations in the table? This doesn’t require superhuman statistical reasoning — it requires reading carefully and doing arithmetic. Scite has already normalized AI-scale citation analysis in publishing; statistical checking is harder, but it’s the same category of thing.1
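As a sketch of what “doing arithmetic” means here: recompute the p-value from the summary statistics and compare it to the reported one. This uses a normal approximation from Python’s standard library — a real checker would use the t distribution and whatever exact test the methods section names:

```python
from statistics import NormalDist

def recomputed_p(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sided p-value for a two-sample comparison of means, using a
    normal approximation (adequate for large n; a production checker
    would match the paper's stated test and degrees of freedom)."""
    se = (sd1**2 / n1 + sd2**2 / n2) ** 0.5
    z = (mean1 - mean2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Suppose a paper reports p = 0.03 for these group summaries.
# Recomputing gives roughly 0.041 -- close, but worth a human look.
p = recomputed_p(10.4, 2.1, 50, 9.5, 2.3, 50)
print(round(p, 3))
```

Whether a 0.03-versus-0.04 discrepancy is a typo, a rounding convention, or a different test entirely is precisely the adjudication work that can’t be automated away.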

The real design challenge isn’t building the checking agent — it’s building the triage layer around what it finds. Carlini has hundreds of crashes he can’t report because validating them takes human time. Point an agent at a submission queue and you’d have the same problem immediately, except the stakes are higher: these findings affect publications, and publications affect careers.

My guess is that someone with a higher tolerance for false positives — an advocacy group, a post-publication review platform — will point agents at journal archives before publishers have their own systems in place. Publishers actually care about getting the literature right; we should build this on our terms, not wait to react.

  1. Scite was acquired by Research Solutions in 2024. 

MCP lets you ship faster 🔗

less than 1 minute read

I’ve been thinking a lot about this quote from Steve Krouse (via Simon Willison):

The fact that MCP is a different surface from your normal API allows you to ship MUCH faster to MCP. This has been unlocked by inference at runtime.

Normal APIs are promises to developers, because developers commit code that relies on those APIs, and then walk away. If you break the API, you break the promise, and you break that code. This means a developer gets woken up at 2am to fix the code.

But MCP servers are called by LLMs which dynamically read the spec every time, which allows us to constantly change the MCP server. It doesn’t matter! We haven’t made any promises. The LLM can figure it out afresh every time.

The implication is that we can have a dynamically defined endpoint for agents to talk to. I imagine that’s not as efficient as exposing a well-defined API, but maybe it doesn’t make a difference. Let agents discover what tools you’re making available when they visit your endpoint.
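The no-promises dynamic can be made concrete with a toy sketch — plain Python standing in for a real MCP SDK, with the tool names and schema invented for illustration. The client re-reads the tool list on every call, so renaming or reshaping a tool server-side never breaks it:

```python
import json

# Server side: the tool catalog is just data, regenerated freely.
# Nothing here is a stable promise; names and schemas can change
# between any two requests.
SERVER_TOOLS = {
    "search_papers": {"description": "Search the archive",
                      "params": {"query": "string"}},
}

def list_tools() -> str:
    """What an MCP-style server returns when asked for its spec."""
    return json.dumps(SERVER_TOOLS)

def agent_call(task_keyword, argument):
    """Client side: discover the tools afresh, pick one by reading its
    description, then call it. No tool name is hard-coded."""
    tools = json.loads(list_tools())
    for name, spec in tools.items():
        if task_keyword in spec["description"].lower():
            return name, argument  # a real agent would invoke it here
    return None

# The server renames its tool. A client with a hard-coded endpoint
# would break; the discovering client just reads the new spec.
SERVER_TOOLS["query_archive"] = SERVER_TOOLS.pop("search_papers")
print(agent_call("search", "p-values"))
```

The trade-off is the one named above: the client pays a discovery round-trip (and some reasoning) on every call, in exchange for the server owing it nothing.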

And then you can try out exposing new tools, and see how agents react to using them – agent-catalyzed product discovery.