Agents, bugs, and the statistical editor

1 minute read

Nicholas Carlini, a research scientist at Anthropic, ran a simple bash script that looped over every file in the Linux kernel and asked Claude Code to look for security vulnerabilities. It found a heap buffer overflow sitting undetected for 23 years. His reaction: “I have never found one of these in my life before. This is very, very, very hard to do.”
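
The harness itself is tiny. Here's a minimal sketch of the shape of that loop, in Python rather than bash; it assumes Claude Code's non-interactive print mode (`claude -p`) and a local kernel checkout, and the prompt is my invention, not Carlini's.

```python
import subprocess
from pathlib import Path

PROMPT = "Audit the following C file for memory-safety bugs. Flag anything suspicious."

# Walk a local kernel checkout and hand each C file to Claude Code.
# A real harness would need to chunk files too large for the context window.
for path in sorted(Path("linux").rglob("*.c")):
    source = path.read_text(errors="replace")
    # `claude -p` runs Claude Code non-interactively on a single prompt
    result = subprocess.run(
        ["claude", "-p", f"{PROMPT}\n\n// {path}\n{source}"],
        capture_output=True, text=True,
    )
    print(path, "->", result.stdout.strip()[:200])
```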

The detail that caught my attention wasn’t the 23-year-old bug. It was this: Carlini now has hundreds of potential vulnerabilities he can’t report because human validation is the bottleneck. The agent is finding bugs faster than humans can verify them.

That bottleneck will sound familiar to anyone in editorial.

I’ve been told for years that statistical editors are among the hardest specialists to recruit and retain in journal publishing. Most submissions never get dedicated statistical review. We’ve known this is a gap — we just haven’t had a scalable way to close it.

The workflow I keep coming back to: extract the statistical claims and data from a paper, then check the numbers. Does the reported p-value follow from the sample size and test described? Are the confidence intervals consistent with the means and standard deviations in the table? This doesn’t require superhuman statistical reasoning — it requires reading carefully and doing arithmetic. Scite has already normalized AI-scale citation analysis in publishing; statistical checking is harder, but it’s the same category of thing.[1]
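
To make "reading carefully and doing arithmetic" concrete, here's a minimal sketch of the checking step, assuming the claims have already been extracted into summary statistics. The function names and tolerances are mine; scipy can recompute a test from nothing more than a table's means, SDs, and sample sizes.

```python
import math

from scipy import stats

def check_ttest(mean1, sd1, n1, mean2, sd2, n2, reported_p, tol=0.005):
    """Recompute a two-sample Welch t-test from summary statistics."""
    _, p = stats.ttest_ind_from_stats(mean1, sd1, n1, mean2, sd2, n2,
                                      equal_var=False)
    return abs(p - reported_p) <= tol, p

def check_ci(mean, sd, n, lo, hi, level=0.95, tol=0.01):
    """Recompute a t-based confidence interval for a single group mean."""
    half = stats.t.ppf(0.5 + level / 2, df=n - 1) * sd / math.sqrt(n)
    recomputed = (mean - half, mean + half)
    ok = abs(recomputed[0] - lo) <= tol and abs(recomputed[1] - hi) <= tol
    return ok, recomputed

# e.g. check_ttest(5.2, 1.1, 40, 4.6, 1.3, 38, reported_p=0.03)
```

A mismatch isn't proof of error (rounding, one-sided tests, and multiplicity corrections all produce legitimate gaps), which is exactly why the triage question below matters.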

The real design challenge isn’t building the checking agent — it’s building the triage layer around what it finds. Carlini has hundreds of crashes he can’t report because validating them takes human time. Point an agent at a submission queue and you’d have the same problem immediately, except the stakes are higher: these findings affect publications, and publications affect careers.
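
That triage layer doesn't have to be exotic. One plausible shape, with fields invented for illustration: score each finding, auto-dismiss the noise, and cap what reaches a human editor.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    paper_id: str
    claim: str          # e.g. "reported p=0.03, recomputed p=0.31"
    discrepancy: float  # gap between the reported and recomputed values
    checkable: bool     # could the agent fully recompute it from the paper?

def triage(findings: list[Finding], budget: int = 20) -> list[Finding]:
    # Only fully checkable findings with a material discrepancy reach a
    # human; everything else is logged, not reported. The threshold and
    # budget are arbitrary here; tuning them is the real editorial-policy
    # question.
    serious = [f for f in findings if f.checkable and f.discrepancy > 0.05]
    serious.sort(key=lambda f: f.discrepancy, reverse=True)
    return serious[:budget]
```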

My guess is that someone with a higher tolerance for false positives — an advocacy group, a post-publication review platform — will point agents at journal archives before publishers have their own systems in place. Publishers actually care about getting the literature right; we should build this on our terms, not wait to react.

  [1] Scite was acquired by Research Solutions in 2024.

MCP lets you ship faster 🔗

less than 1 minute read

I’ve been thinking a lot about this quote from Steve Krouse (via Simon Willison):

The fact that MCP is a different surface from your normal API allows you to ship MUCH faster to MCP. This has been unlocked by inference at runtime.

Normal APIs are promises to developers, because developers commit code that relies on those APIs, and then walk away. If you break the API, you break the promise, and you break that code. This means a developer gets woken up at 2am to fix the code.

But MCP servers are called by LLMs, which dynamically read the spec every time, which allows us to constantly change the MCP server. It doesn’t matter! We haven’t made any promises. The LLM can figure it out afresh every time.

The implication is that we can have a dynamically defined endpoint for agents to talk to. I imagine that’s not as efficient as exposing a well-defined API, but maybe it doesn’t make a difference. Let the agent discover what tools you’re making available when it visits your endpoint.

And then you can try out exposing new tools, and see how agents react to using them – agent-catalyzed product discovery.
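
Here's roughly what that looks like with the official MCP Python SDK's FastMCP helper; the tool itself is a placeholder of mine. Because clients list the available tools at connect time, you can rename this tool or change its signature tomorrow without breaking anyone.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def order_status(order_id: str) -> str:
    """Look up the status of an order."""
    return f"Order {order_id}: shipped"  # placeholder logic

# Add, rename, or re-type tools freely: agents re-read the tool list
# on each connection instead of compiling against a frozen contract.

if __name__ == "__main__":
    mcp.run()  # defaults to stdio transport
```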

AI-Generated “Workslop” Is Destroying Productivity 🔗

less than 1 minute read

Let’s be considerate about how we use GenAI to write emails, articles, or blog posts. When I first started, it was fun: Wow, I can crank out a 750-word essay in minutes! But that’s when you risk outsourcing the thinking. The result? What Stanford researchers call “workslop”: lots of words, not much value.

Approximately half of the people we surveyed viewed colleagues who sent workslop as less creative, capable, and reliable than they did before receiving the output. Forty-two percent saw them as less trustworthy, and 37% saw that colleague as less intelligent.

For this post, I didn’t just ask GenAI to write it; we discussed ideas, and I shaped the thinking and co-wrote the text. If you’re generating 10-page documents that someone else has to decipher, you’re just moving the thinking downstream.

AI should help you sharpen ideas, not dump text for others to untangle.

I think ‘agent’ may finally have a widely enough agreed upon definition to be useful jargon now 🔗

less than 1 minute read

Via Simon Willison:

Moving forward, when I talk about agents I’m going to use this:

An LLM agent runs tools in a loop to achieve a goal.

I like this definition. Simon breaks down why he chose that specific phrasing for each part; it’s worth a deeper read.

There’s a lot of confusion about agents, and while the term has already been stretched into elastic marketing jargon like “AI”, it’s helpful for product and technology discussions to have more precision.
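
The definition also maps almost one-to-one onto code, which is part of its appeal. A minimal sketch using the Anthropic Python SDK; the tool and the model string are illustrative, not prescriptive.

```python
from datetime import datetime, timezone

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "get_time",
    "description": "Return the current UTC time.",
    "input_schema": {"type": "object", "properties": {}},
}]

messages = [{"role": "user", "content": "What time is it in UTC?"}]

while True:  # "runs tools in a loop"
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # any tool-capable model works
        max_tokens=1024, tools=tools, messages=messages,
    )
    if resp.stop_reason != "tool_use":
        break  # "to achieve a goal": the model stopped asking for tools
    messages.append({"role": "assistant", "content": resp.content})
    messages.append({"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": b.id,
         "content": datetime.now(timezone.utc).isoformat()}
        for b in resp.content if b.type == "tool_use"
    ]})

print(resp.content[0].text)
```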