The methods section is not a recipe

2 minute read

Via Ethan Mollick, a new paper on agentic reproduction of social-science results asks whether AI agents can reproduce published results from the paper and data alone, without seeing the original code. Often they can. The more interesting results, though, are the failures: the cases where the agents cannot reproduce a finding because the paper does not actually specify enough of the method.

That feels very familiar from chemistry.

I am not a social scientist, so I will leave the empirical social-science side there. The chemistry version is easy to recognize. There is a large gap between a typical experimental section that gestures at what was done and something like Organic Syntheses, where procedures are written in much more detail and each reaction and characterization dataset is checked for reproducibility in the laboratory of a member of the Board of Editors.

That standard exists for a reason. Rick Danheiser wrote in C&EN that, from 1982 to 2005, about 12% of submitted Organic Syntheses articles were rejected because the results could not be reproduced. After more detailed author instructions and a procedure checklist were introduced between 2005 and 2007, more than 95% of submissions checked with satisfactory reproducibility.

That is the part I keep coming back to. The problem is not always that an author is hiding something. Often they know the procedure so well that they no longer notice which details are load-bearing. Stirring rate, addition order, concentration, drying time, workup details, vendor grade, how dry the solvent really was, what “room temperature” meant that week in that lab. Anyone who has tried to repeat a reaction from a too-short experimental section knows this feeling.

This is where AI tools could be useful without pretending to be the chemist. Not “write my experimental section,” and definitely not “certify that this procedure works.” More like: read this procedure as an annoying first-year graduate student who has to run it tomorrow, and ask what is missing.
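
As a sketch of what that tool could be, here is the whole thing in about a dozen lines, assuming the Anthropic Python SDK; the model id is just a dated snapshot I happen to know, the prompt wording is mine, and the sample procedure is deliberately underspecified:

```python
import anthropic

procedure = """To a stirred solution of the ketone in THF at 0 °C was added
n-BuLi dropwise. After stirring, the mixture was quenched and worked up
as usual to give the product."""  # deliberately vague, for illustration

REVIEWER = (
    "You are a first-year graduate student who must run this procedure "
    "tomorrow, alone. List every detail you would need that is not stated: "
    "quantities, equivalents, concentrations, addition order and rate, "
    "stirring, temperatures, times, workup, drying, reagent grade and vendor. "
    "Do not guess at missing values; just ask for them."
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
reply = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # substitute whatever model you use
    max_tokens=2000,
    system=REVIEWER,
    messages=[{"role": "user", "content": procedure}],
)
print(reply.content[0].text)
```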

There is already adjacent work here. A Nature Communications paper converted prose synthesis procedures into structured, machine-readable action sequences. That is not the same thing as reproducing a reaction, but it points in the right direction: if a procedure cannot be converted into concrete actions, quantities, conditions, and decision points, it probably is not as complete as it looks.
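
To make "structured actions" concrete, here is a minimal sketch of the idea; the field names are mine, not the paper's published schema:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """One executable step of a procedure. None means 'the paper didn't say'."""
    verb: str                  # "add", "stir", "quench", "dry", ...
    what: str | None = None    # reagent or phase acted on
    amount: str | None = None  # "2.5 mL", "1.1 equiv"
    conditions: dict = field(default_factory=dict)  # temperature, time, rate

def missing_details(actions: list[Action]) -> list[str]:
    """Flag steps that cannot be run tomorrow as written."""
    gaps = []
    for i, a in enumerate(actions, start=1):
        if a.verb == "add" and a.amount is None:
            gaps.append(f"step {i}: add {a.what}: no quantity given")
        if a.verb == "stir" and "time" not in a.conditions:
            gaps.append(f"step {i}: stir: no duration given")
    return gaps
```

The completeness test falls out for free: every unfilled field in the parsed sequence is a question for the authors.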

I wrote recently that ground truth in science is still reality, just harder to access than a test suite. That remains true. AI cannot tell you whether a reaction will work in the lab without someone eventually doing the experiment. But it may be able to tell authors where their methods section stops being a recipe and starts being a memory aid for the person who already knows what happened.

That would be a useful tool. Slightly irritating, probably. But useful.

The Median Is Not a Discovery 🔗

less than 1 minute read

Via Ethan Mollick:

Classic study gave 146 economist teams the same dataset & got wildly different answers. New paper reruns it with agentic AI. Claude Code & Codex land near the human median but with far tighter dispersion & no extremes.

I’m torn between the reproducibility win (the tight clustering) and what that clustering might cost in scientific creativity. Barry Marshall looked at the same gastric biopsy data as everyone else and reached the conclusion the field had ruled out. That kind of outlier isn’t noise; it’s occasionally how science moves. If AI reliably clusters near the median human interpretation, it scales up the research we already know how to do. It won’t find the next H. pylori.

Scientific datasets are riddled with copy-paste errors 🔗

less than 1 minute read

Markus Englund scanned 600 datasets on Dryad and found serious copy-paste errors in 18 of them, a 3% hit rate that projects to around 700 cases across the full repository of ~24,000 datasets.

“There just isn’t anybody whose job it is to actively look for it.”

Not surprising. Many researchers are working in Excel rather than reproducible pipelines, and journals don’t have the bandwidth to audit supporting data. If anything, 3% seems low?

The obvious fix is an AI toolbench for pre-publication data validation — something that catches this before it enters the literature rather than years after. The verification loop is precisely what experimental science is missing.
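
Two such checks fit in a screenful of pandas; these heuristics are my illustration of the category, not Englund's actual scanning method:

```python
import pandas as pd

def copy_paste_suspects(df: pd.DataFrame, min_run: int = 10) -> list[str]:
    """Cheap pre-publication checks for copy-paste damage in a dataset."""
    flags = []
    # Check 1: two columns that are exact duplicates of each other.
    cols = list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if df[a].equals(df[b]):
                flags.append(f"columns {a!r} and {b!r} are identical")
    # Check 2: suspiciously long runs of one repeated value in a numeric column.
    for c in df.select_dtypes("number"):
        run_ids = (df[c] != df[c].shift()).cumsum()  # label consecutive runs
        longest = df[c].groupby(run_ids).transform("size").max()
        if longest >= min_run:
            flags.append(f"column {c!r} repeats one value {longest} times in a row")
    return flags
```

Neither check proves an error; they produce a short list for a human to look at, which is exactly the job nobody currently has.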

AI model behavior, versioned 🔗

less than 1 minute read

Via Simon Willison, who has turned Anthropic’s published system prompts into a git archive that diffs changes across model releases:

Anthropic is the sole major AI laboratory publishing system prompts for consumer-facing chat interfaces, with archives extending back to Claude 3.

Worth noting that if you’re using the API you might be writing your own system prompt anyway, so this mostly matters for claude.ai users. The harder problem is underlying model behavior: when a researcher publishes results generated with Claude Opus 4.6, what would it take to reproduce that in two years? Code Ocean does something like this for computational environments — pin the entire runtime alongside the paper, executable on demand, and Nature has integrated it into peer review. Nobody is doing the equivalent for AI model versions in research workflows yet.
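
A minimal sketch of what that equivalent could look like, using only the standard library; the manifest fields and file names here are mine, not anyone's standard:

```python
import datetime
import hashlib
import json

def run_manifest(model_id: str, system_prompt: str, params: dict) -> dict:
    """Record enough to say later: this output came from that model, configured so."""
    return {
        "model_id": model_id,  # a dated snapshot id, never an alias like "latest"
        "system_prompt_sha256": hashlib.sha256(system_prompt.encode()).hexdigest(),
        "sampling_params": params,  # temperature, max_tokens, ...
        "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

manifest = run_manifest(
    model_id="claude-3-5-sonnet-20241022",
    system_prompt=open("system_prompt.txt").read(),
    params={"temperature": 0.0, "max_tokens": 4096},
)
with open("run_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

Pinning a dated snapshot only helps while the provider still serves it; once the model is retired, the manifest at least records exactly what can no longer be rerun.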

What does Opus 4.7 verify against?

2 minute read

Claude Opus 4.7 looks like a genuine step forward, and one line in the announcement caught my attention: the model “devises ways to verify its own outputs.” The model isn’t just generating; it’s checking.

The obvious question is: checking against what?

That could mean internal self-consistency: trying a calculation two ways, looking for contradictions in its own reasoning. Useful, but it doesn’t escape the model’s own knowledge boundaries. Or it could mean external retrieval, which for most deployments today means a web search. That’s better than nothing, but it’s a weak verification tool for scientific claims. The web will tell you that fish oil is associated with cardiovascular health. It won’t tell you whether the mechanism of action proposed in a 2019 paper has been confirmed, challenged, or quietly superseded by six subsequent studies. For that, you need something structured.

Which raises a more interesting question: what would Opus 4.7’s verification loop look like if it had access to a proper scientific knowledge graph? Not search, but a graph of claims made across the literature, tagged with confidence, provenance, and the network of studies that support or contradict them. Or better still, causal datasets: not “paper A mentions compound X and outcome Y” but “experiment N demonstrated a causal effect at dose Z, replicated three times.”
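
Here is the shape of the node I have in mind, as a hypothetical schema; none of these field names come from any real product:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    """One assertion extracted from the literature, with its evidence network."""
    text: str                    # "fish oil lowers triglycerides at 4 g/day"
    source_doi: str              # provenance: where the claim was made
    evidence: str                # "observational", "RCT", "replication", ...
    confidence: float            # curator- or model-assigned, 0 to 1
    supported_by: list[str] = field(default_factory=list)     # DOIs
    contradicted_by: list[str] = field(default_factory=list)  # DOIs

def verdict(claim: Claim) -> str:
    """What a verification loop wants back: a standing assessment, not links."""
    s, c = len(claim.supported_by), len(claim.contradicted_by)
    if c > s:
        return f"challenged: {c} contradicting vs {s} supporting studies"
    if s and not c:
        return f"supported by {s} studies, none contradicting"
    return "contested or under-studied; verify experimentally"
```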

I’ve written before about how the speed of the verification loop is what separates fields where AI has transformed research from fields where it hasn’t (yet). Math closes the loop via proof assistants; drug discovery historically couldn’t close it in under months. That’s changing — Exscientia’s closed design-make-test-learn cycles, Periodic Labs building automated materials discovery. But closing the experimental loop is a separate problem from connecting AI reasoning to the existing literature — and that side has barely started.

A model that actively seeks to verify its reasoning is only as good as what it can verify against. Right now we’re giving it the open web. The more interesting engineering problem is connecting it to the structured record of what science has actually established — and what it hasn’t. Wiley’s Scholar Gateway and Nexus Domains are attempts at this — Scholar Gateway for in-session retrieval via MCP, giving Claude and other AI systems access to peer-reviewed literature rather than the open web; Nexus Domains for curated content feeds delivered via API and MCP to enterprise R&D pipelines. These are first steps in building the right verification layer. The question Opus 4.7 makes newly urgent is whether the rest of the field catches up.