<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://dwflanagan.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://dwflanagan.com/" rel="alternate" type="text/html" /><updated>2026-05-07T18:17:35+02:00</updated><id>https://dwflanagan.com/feed.xml</id><title type="html">Dave Flanagan</title><subtitle>Turning research intelligence into R&amp;D advantage</subtitle><author><name>Dave Flanagan</name></author><entry><title type="html">The methods section is not a recipe</title><link href="https://dwflanagan.com/blog/methods-section-not-recipe/" rel="alternate" type="text/html" title="The methods section is not a recipe" /><published>2026-04-26T00:00:00+02:00</published><updated>2026-04-26T00:00:00+02:00</updated><id>https://dwflanagan.com/blog/methods-section-not-recipe</id><content type="html" xml:base="https://dwflanagan.com/blog/methods-section-not-recipe/"><![CDATA[<p>Via <a href="https://bsky.app/profile/emollick.bsky.social/post/3mkddonvfp22b">Ethan Mollick</a>, a new paper on <a href="https://elliottash.com/wp-content/uploads/2026/04/Kohler-Zollikofer-Einsiedler-Hoyle-Ash-Read-Paper-Write-Code-Agentic-Reproduction-Social-Science-Results.pdf">agentic reproduction of social-science results</a> asks whether AI agents can reproduce published results from the paper and data alone, without seeing the original code. Often they can. But the more interesting failures reported in the paper are where they cannot, because the paper does not actually specify enough of the method.</p>

<p>That feels very familiar from chemistry.</p>

<p>I am not a social scientist, so I will leave the empirical social-science side there. The chemistry version is easy to recognize. There is a large gap between a typical experimental section that gestures at what was done and something like <a href="https://www.organicdivision.org/OrganicSyntheses/">Organic Syntheses</a>, where procedures are written in much more detail and each reaction and characterization dataset is checked for reproducibility in the laboratory of a member of the Board of Editors.</p>

<p>That standard exists for a reason. Rick Danheiser wrote in <a href="https://cen.acs.org/articles/91/i21/Detailing-Experimental-Procedures.html">C&amp;EN</a> that, from 1982 to 2005, about 12% of submitted Organic Syntheses articles were rejected because the results could not be reproduced. After more detailed author instructions and a procedure checklist were introduced in 2005-07, more than 95% of submissions were reproduced satisfactorily when checked.</p>

<p>That is the part I keep coming back to. The problem is not always that an author is hiding something. Often they know the procedure so well that they no longer notice which details are load-bearing. Stirring rate, addition order, concentration, drying time, workup details, vendor grade, how dry the solvent really was, what “room temperature” meant that week in that lab. Anyone who has tried to repeat a reaction from a too-short experimental section knows this feeling.</p>

<p>This is where AI tools could be useful without pretending to be the chemist. Not “write my experimental section,” and definitely not “certify that this procedure works.” More like: read this procedure as an annoying first-year graduate student who has to run it tomorrow, and ask what is missing.</p>

<p>There is already adjacent work here. A <a href="https://www.nature.com/articles/s41467-020-17266-6">Nature Communications paper</a> converted prose synthesis procedures into structured action sequences for chemical synthesis. That is not the same thing as reproducing a reaction, but it points in the right direction: if a procedure cannot be converted into concrete actions, quantities, conditions, and decision points, it probably is not as complete as it looks.</p>

<p>I wrote recently that <a href="https://dwflanagan.com/blog/ground-truth-is-reality/">ground truth in science is still reality</a>, just harder to access than a test suite. That remains true. AI cannot tell you whether a reaction will work in the lab without someone eventually doing the experiment. But it may be able to tell authors where their methods section stops being a recipe and starts being a memory aid for the person who already knows what happened.</p>

<p>That would be a useful tool. Slightly irritating, probably. But useful.</p>

<!-- meta: AI tools may help authors find missing details in methods sections before publication, especially in chemistry where reproducibility lives in specifics. -->]]></content><author><name>Dave Flanagan</name></author><category term="Blog" /><category term="ai" /><category term="reproducibility" /><category term="chemistry" /><category term="methods" /><summary type="html"><![CDATA[Via Ethan Mollick, a new paper on agentic reproduction of social-science results asks whether AI agents can reproduce published results from the paper and data alone, without seeing the original code. Often they can. But the more interesting failures reported in the paper are where they cannot, because the paper does not actually specify enough of the method.]]></summary></entry><entry><title type="html">The Median Is Not a Discovery</title><link href="https://dwflanagan.com/blog/the-median-is-not-a-discovery/" rel="alternate" type="text/html" title="The Median Is Not a Discovery" /><published>2026-04-21T00:00:00+02:00</published><updated>2026-04-21T00:00:00+02:00</updated><id>https://dwflanagan.com/blog/the-median-is-not-a-discovery</id><content type="html" xml:base="https://dwflanagan.com/blog/the-median-is-not-a-discovery/"><![CDATA[<p>Via Ethan Mollick:</p>

<blockquote>
  <p>Classic study gave 146 economist teams the same dataset &amp; got wildly different answers. New paper reruns it with agentic AI. Claude Code &amp; Codex land near the human median but with far tighter dispersion &amp; no extremes.</p>
</blockquote>

<p>I’m torn between the reproducibility (the tight clustering) and what it might cost in AI-assisted scientific creativity. Barry Marshall looked at the same gastric biopsy data as everyone else and reached the conclusion the field had ruled out. That kind of outlier isn’t noise; it’s occasionally how science moves. If AI reliably clusters near the median human interpretation, it scales up the research we already know how to do. It won’t find the next <em>H. pylori</em>.</p>

<!-- meta: Agentic AI clusters near the human median in economics research — a feature for reproducibility, but a potential cost for scientific discovery and creative outliers. -->]]></content><author><name>Dave Flanagan</name></author><category term="Blog" /><category term="link" /><category term="AI" /><category term="research" /><category term="scientific-discovery" /><summary type="html"><![CDATA[Via Ethan Mollick:]]></summary></entry><entry><title type="html">Scientific datasets are riddled with copy-paste errors</title><link href="https://dwflanagan.com/blog/scientific-datasets-copy-paste-errors/" rel="alternate" type="text/html" title="Scientific datasets are riddled with copy-paste errors" /><published>2026-04-20T00:00:00+02:00</published><updated>2026-04-20T00:00:00+02:00</updated><id>https://dwflanagan.com/blog/scientific-datasets-copy-paste-errors</id><content type="html" xml:base="https://dwflanagan.com/blog/scientific-datasets-copy-paste-errors/"><![CDATA[<p>Markus Englund scanned 600 datasets on Dryad and found serious copy-paste errors in 18 of them — projecting around 700 cases across the full repository of ~24,000 datasets.</p>

<blockquote>
  <p>“There just isn’t anybody whose job it is to actively look for it.”</p>
</blockquote>

<p>Not surprising. Many researchers are working in Excel rather than reproducible pipelines, and journals don’t have the bandwidth to audit supporting data. If anything, 3% seems low?</p>
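
<p>The arithmetic behind that 3% can be sketched roughly as follows. This assumes the 600 scanned datasets behave like a random sample of the repository, which the summary above does not establish. The function name and the Wilson score interval are illustrative additions, not part of Englund’s analysis.</p>

```python
import math


def project_cases(found, scanned, total, z=1.96):
    """Scale a sample error rate up to a whole repository.

    Assumption: the scanned datasets are a random sample. The Wilson score
    interval is one common choice for a proportion; it is not taken from
    the original scan.
    """
    rate = found / scanned
    denom = 1 + z**2 / scanned
    centre = (rate + z**2 / (2 * scanned)) / denom
    half = z * math.sqrt(
        rate * (1 - rate) / scanned + z**2 / (4 * scanned**2)
    ) / denom
    return {
        "rate": rate,                      # observed share of datasets with errors
        "projected": rate * total,         # point estimate for the repository
        "low": (centre - half) * total,    # lower bound of the interval
        "high": (centre + half) * total,   # upper bound of the interval
    }


# Figures quoted above: 18 of 600 scanned, ~24,000 datasets in total.
estimate = project_cases(found=18, scanned=600, total=24_000)
print(estimate)
```

<p>The point estimate is 18/600 = 3%, or 720 datasets out of 24,000, in line with “around 700.” With only 18 hits the interval around it is wide, so the projection is best read as an order of magnitude.</p>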

<p>The obvious fix is an AI toolbench for pre-publication data validation — something that catches this before it enters the literature rather than years after. The <a href="https://dwflanagan.com/blog/math-got-there-first/">verification loop</a> is precisely what experimental science is missing.</p>

<!-- meta: Copy-paste errors found in ~3% of scanned scientific datasets. The real problem: nobody's job to check, and researchers still working in Excel. -->]]></content><author><name>Dave Flanagan</name></author><category term="Blog" /><category term="link" /><category term="data quality" /><category term="research reproducibility" /><category term="open science" /><summary type="html"><![CDATA[Markus Englund scanned 600 datasets on Dryad and found serious copy-paste errors in 18 of them — projecting around 700 cases across the full repository of ~24,000 datasets.]]></summary></entry><entry><title type="html">AI model behavior, versioned</title><link href="https://dwflanagan.com/blog/ai-model-behavior-versioned/" rel="alternate" type="text/html" title="AI model behavior, versioned" /><published>2026-04-19T00:00:00+02:00</published><updated>2026-04-19T00:00:00+02:00</updated><id>https://dwflanagan.com/blog/ai-model-behavior-versioned</id><content type="html" xml:base="https://dwflanagan.com/blog/ai-model-behavior-versioned/"><![CDATA[<p>Via Simon Willison, who has turned Anthropic’s published system prompts into a <a href="https://simonwillison.net/2026/Apr/18/extract-system-prompts/">git archive that diffs changes across model releases</a>:</p>

<blockquote>
  <p>Anthropic is the sole major AI laboratory publishing system prompts for consumer-facing chat interfaces, with archives extending back to Claude 3.</p>
</blockquote>

<p>Worth noting that if you’re using the API you might be writing your own system prompt anyway, so this mostly matters for claude.ai users. The harder problem is underlying model behavior: when a researcher publishes results generated with Claude Opus 4.6, what would it take to reproduce that in two years? <a href="https://codeocean.com/">Code Ocean</a> does something like this for computational environments — pin the entire runtime alongside the paper, executable on demand, and <a href="https://www.nature.com/articles/d41586-019-03366-x">Nature has integrated it into peer review</a>. Nobody is doing the equivalent for AI model versions in research workflows yet.</p>

<!-- meta: Anthropic publishes system prompts; Simon Willison now versions them in git. The harder problem is reproducibility of AI-assisted research results across model versions. -->]]></content><author><name>Dave Flanagan</name></author><category term="Blog" /><category term="link" /><category term="AI" /><category term="reproducibility" /><category term="publishing" /><summary type="html"><![CDATA[Via Simon Willison, who has turned Anthropic’s published system prompts into a git archive that diffs changes across model releases:]]></summary></entry><entry><title type="html">Build to learn</title><link href="https://dwflanagan.com/blog/build-to-learn/" rel="alternate" type="text/html" title="Build to learn" /><published>2026-04-17T00:00:00+02:00</published><updated>2026-04-17T00:00:00+02:00</updated><id>https://dwflanagan.com/blog/build-to-learn</id><content type="html" xml:base="https://dwflanagan.com/blog/build-to-learn/"><![CDATA[<p>Marty Cagan on the distinction between product discovery (“build to learn”) and product delivery (“build to earn”), and why AI makes the former more important, not less.</p>

<blockquote>
  <p>The hard part is building the product sense necessary to evaluate the learnings and guide the direction.</p>
</blockquote>

<p>Similarly, an AI editor could assess whether a paper’s claims are likely to be true, but in a coming age of radical overabundance of valid research it’s the <em>taste</em> of the editor that matters: selecting which papers their audience would actually care about. Product sense works the same way: it isn’t verification, it’s curation.</p>

<!-- meta: Marty Cagan on build-to-learn vs build-to-earn, and why product sense — like editorial taste — is what AI can't replace. -->]]></content><author><name>Dave Flanagan</name></author><category term="Blog" /><category term="link" /><category term="product-management" /><category term="ai" /><summary type="html"><![CDATA[Marty Cagan on the distinction between product discovery (“build to learn”) and product delivery (“build to earn”), and why AI makes the former more important, not less.]]></summary></entry><entry><title type="html">What does Opus 4.7 verify against?</title><link href="https://dwflanagan.com/blog/what-does-opus-47-verify-against/" rel="alternate" type="text/html" title="What does Opus 4.7 verify against?" /><published>2026-04-17T00:00:00+02:00</published><updated>2026-04-17T00:00:00+02:00</updated><id>https://dwflanagan.com/blog/what-does-opus-47-verify-against</id><content type="html" xml:base="https://dwflanagan.com/blog/what-does-opus-47-verify-against/"><![CDATA[<p><a href="https://www.anthropic.com/news/claude-opus-4-7">Claude Opus 4.7</a> looks like a genuine step forward, and one line in the announcement caught my attention: the model “devises ways to verify its own outputs.” The model isn’t just generating; it’s checking.</p>

<p>The obvious question is: checking against what?</p>

<p>That could mean internal self-consistency — trying a calculation two ways, looking for contradictions in its own reasoning. Useful, but it doesn’t escape the model’s own knowledge boundaries. Or it could mean external retrieval — and for most deployments today, that means a web search. That’s better than nothing, but it’s a weak verification tool for scientific claims. The web will tell you that fish oil is associated with cardiovascular health. It won’t tell you whether the mechanism-of-action proposed in a 2019 paper has been confirmed, challenged, or quietly superseded by six subsequent studies. For that, you need something structured.</p>

<p>Which raises a more interesting question: what would Opus 4.7’s verification loop look like if it had access to a proper scientific knowledge graph — not search, but a graph of claims made across the literature, tagged with confidence, provenance, and the network of studies that support or contradict them? Or better still, causal datasets: not “paper A mentions compound X and outcome Y” but “experiment N demonstrated cause-effect at dose Z, replicated three times.”</p>

<p>I’ve written before about how <a href="https://dwflanagan.com/blog/math-got-there-first/">the speed of the verification loop</a> is what separates fields where AI has transformed research from fields where it hasn’t (yet). Math closes the loop via proof assistants; drug discovery has historically needed months to close it. That’s changing: <a href="https://aws.amazon.com/solutions/case-studies/exscientia-generative-ai/">Exscientia’s</a> closed design-make-test-learn cycles, <a href="https://techcrunch.com/2025/10/20/top-openai-google-brain-researchers-set-off-a-300m-vc-frenzy-for-their-startup-periodic-labs/">Periodic Labs</a> building automated materials discovery. But closing the experimental loop is a separate problem from connecting AI reasoning to the existing literature, and that side has barely started.</p>

<p>A model that actively seeks to verify its reasoning is only as good as what it can verify against. Right now we’re giving it the open web. The more interesting engineering problem is connecting it to the structured record of what science has actually established, and what it hasn’t. Wiley’s <a href="https://www.wiley.com/en-us/solutions-partnerships/ai-solutions/">Scholar Gateway and Nexus Domains</a> are attempts at this: Scholar Gateway for in-session retrieval via MCP, giving Claude and other AI systems access to peer-reviewed literature rather than the open web; Nexus Domains for curated content feeds delivered via API and MCP to enterprise R&amp;D pipelines. These are first steps in building the right verification layer. The question Opus 4.7 makes newly urgent is whether the rest of the field catches up.</p>

<!-- meta: Claude Opus 4.7 verifies its outputs — but against what? The bottleneck isn't model reasoning, it's the quality of the verification layer it can reach. -->]]></content><author><name>Dave Flanagan</name></author><category term="Blog" /><category term="ai" /><category term="research-intelligence" /><category term="knowledge-graphs" /><category term="llm" /><summary type="html"><![CDATA[Claude Opus 4.7 looks like a genuine step forward, and one line in the announcement caught my attention: the model “devises ways to verify its own outputs.” The model isn’t just generating; it’s checking.]]></summary></entry><entry><title type="html">Math and code got there first</title><link href="https://dwflanagan.com/blog/math-got-there-first/" rel="alternate" type="text/html" title="Math and code got there first" /><published>2026-04-14T00:00:00+02:00</published><updated>2026-04-14T00:00:00+02:00</updated><id>https://dwflanagan.com/blog/math-got-there-first</id><content type="html" xml:base="https://dwflanagan.com/blog/math-got-there-first/"><![CDATA[<p>Quanta Magazine has a <a href="https://www.quantamagazine.org/the-ai-revolution-in-math-has-arrived-20260413/">piece this week</a> on how AI has changed mathematical research — AlphaEvolve, LLMs as collaborative partners, problems that used to take months solved in days.</p>

<p>The structural reason mathematics and software development got there first is worth pausing on. Both have fast automated verification built in — proof assistants like Lean for math, test suites and type checkers for code. The loop closes in seconds. Drug discovery has never had that — the verification step is a wet lab experiment that takes weeks or months.</p>

<p>That gap is narrowing. A dynamic flow system at NC State, <a href="https://www.nature.com/articles/s44286-025-00249-z">published last year in Nature Chemical Engineering</a>, generates ten times more experimental data than previous approaches by monitoring reactions in real time rather than waiting for steady state. Exscientia has been running <a href="https://www.clinicaltrialsarena.com/news/exscientia-outline-robot-and-ai-use-in-drug-discovery-workflow/">closed design-make-test-learn cycles</a> in its Oxford robotics facility since late 2024. <a href="https://techcrunch.com/2025/10/20/top-openai-google-brain-researchers-set-off-a-300m-vc-frenzy-for-their-startup-periodic-labs/">Periodic Labs</a>, which launched last October with a $300M round and was founded by researchers behind ChatGPT and GNoME, is building explicitly toward this for materials discovery.</p>

<p>The distinguishing factor between disciplines where AI has already transformed research and those where it hasn’t isn’t the AI. It’s the speed of the verification loop. Mathematics and software development had that built in. Experimental science is engineering its way to the same place.</p>

<p>The Quanta piece reads like a preview.</p>

<!-- meta: Why math and code got AI assistance first — fast verification loops — and why experimental sciences like drug discovery are engineering their way to the same place. -->]]></content><author><name>Dave Flanagan</name></author><category term="Blog" /><category term="ai" /><category term="drug-discovery" /><category term="materials-science" /><category term="self-driving-labs" /><summary type="html"><![CDATA[Quanta Magazine has a piece this week on how AI has changed mathematical research — AlphaEvolve, LLMs as collaborative partners, problems that used to take months solved in days.]]></summary></entry><entry><title type="html">Where graphs supplement LLMs</title><link href="https://dwflanagan.com/blog/where-graphs-supplement-llms/" rel="alternate" type="text/html" title="Where graphs supplement LLMs" /><published>2026-04-13T00:00:00+02:00</published><updated>2026-04-13T00:00:00+02:00</updated><id>https://dwflanagan.com/blog/where-graphs-supplement-llms</id><content type="html" xml:base="https://dwflanagan.com/blog/where-graphs-supplement-llms/"><![CDATA[<p>Graph-based parsers appear to outperform LLMs on relation extraction — and the gap widens as relational complexity grows. A preprint out today from Gajo et al. has evidence across six datasets. For pharma and biomedical knowledge graphs, where the useful relations are mechanism-of-action chains and adverse event pathways rather than simple co-mentions, this is the relevant regime. Useful alongside <a href="https://dwflanagan.com/blog/the-knowledge-graph-as-digital-twin/">what I wrote earlier this week</a> on knowledge graphs as research discovery tools.</p>

<!-- meta: New preprint shows graph-based parsers outperform LLMs on complex relation extraction — with implications for pharma and biomedical knowledge graph construction. -->]]></content><author><name>Dave Flanagan</name></author><category term="Blog" /><category term="link" /><category term="knowledge-graphs" /><category term="NLP" /><category term="relation-extraction" /><category term="pharma" /><category term="arxiv" /><summary type="html"><![CDATA[Graph-based parsers appear to outperform LLMs on relation extraction — and the gap widens as relational complexity grows. A preprint out today from Gajo et al. has evidence across six datasets. For pharma and biomedical knowledge graphs, where the useful relations are mechanism-of-action chains and adverse event pathways rather than simple co-mentions, this is the relevant regime. Useful alongside what I wrote earlier this week on knowledge graphs as research discovery tools.]]></summary></entry><entry><title type="html">MCP vs. Skills</title><link href="https://dwflanagan.com/blog/mcp-vs-skills/" rel="alternate" type="text/html" title="MCP vs. Skills" /><published>2026-04-12T00:00:00+02:00</published><updated>2026-04-12T00:00:00+02:00</updated><id>https://dwflanagan.com/blog/mcp-vs-skills</id><content type="html" xml:base="https://dwflanagan.com/blog/mcp-vs-skills/"><![CDATA[<p>A good breakdown of the <a href="https://david.coffee/i-still-prefer-mcp-over-skills/">MCP vs. Skills</a> tradeoffs from David Mohl:</p>

<blockquote>
  <p>Skills are great for pure knowledge and teaching an LLM how to use an existing tool. But for giving an LLM actual access to services, the Model Context Protocol (MCP) is the far superior, more pragmatic architectural choice.</p>
</blockquote>

<p>In practice, some publishers aren’t forcing the choice. Wiley’s <a href="https://www.wiley.com/en-us/solutions-partnerships/ai-solutions/">Knowledge Nexus</a> offers both — MCP if you want to point an LLM at it directly, API if you’d rather build your own integration. Whichever fits your stack is probably fine.</p>

<!-- meta: A clear breakdown of MCP vs. Skills for AI agents — and why publishers like Wiley are offering both MCP and API access for the same content. -->]]></content><author><name>Dave Flanagan</name></author><category term="Blog" /><category term="link" /><category term="MCP" /><summary type="html"><![CDATA[A good breakdown of the MCP vs. Skills tradeoffs from David Mohl:]]></summary></entry><entry><title type="html">The knowledge graph as digital twin</title><link href="https://dwflanagan.com/blog/the-knowledge-graph-as-digital-twin/" rel="alternate" type="text/html" title="The knowledge graph as digital twin" /><published>2026-04-11T00:00:00+02:00</published><updated>2026-04-11T00:00:00+02:00</updated><id>https://dwflanagan.com/blog/the-knowledge-graph-as-digital-twin</id><content type="html" xml:base="https://dwflanagan.com/blog/the-knowledge-graph-as-digital-twin/"><![CDATA[<p>A <a href="https://arxiv.org/abs/2604.02592">new paper from Wharton</a> finds that LLM-generated Community Notes on X are rated more helpful than human-written ones across 108,000+ ratings. It’s a well-designed study and the result is credible — for social media fact-checking, which is what it’s testing. Whether something similar could work for scientific literature is a different question, and the answer depends entirely on what you build underneath it.</p>

<p>Social media claims are mostly atomic: a politician said something, a statistic is cited correctly or not, an event happened or didn’t. You can check those against a corpus. Scientific claims are relational — they assert relationships between entities distributed across thousands of papers, and the “truth” of the claim is a property of the network, not any individual document. Asking an LLM to fact-check “compound X inhibits pathway Y at therapeutic doses” requires knowing what the literature establishes about X’s mechanism, Y’s context-dependence, and whether the relevant concentrations have ever appeared in the same study. A retrieval system can find text that mentions both; it can’t tell you whether the relationship holds.</p>

<p>This is precisely what knowledge graphs were built for. Don Swanson demonstrated it in 1986: he found that fish oil and Raynaud’s syndrome research had <a href="https://doi.org/10.1353/pbm.1986.0087">never cited each other</a>, yet traversing the relationships — fish oil inhibits platelet aggregation, platelet aggregation implicated in Raynaud’s — produced a testable hypothesis. No document stated it. The connection existed only in the graph. A clinical trial three years later confirmed it.</p>
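
<p>The traversal Swanson did by hand can be sketched in a few lines. This is a toy illustration that assumes the two relationships named above are the only edges in the graph. The edge list, the relation labels, and the function name <code>abc_hypotheses</code> are hypothetical, not Swanson’s actual procedure or any real extraction pipeline.</p>

```python
from collections import defaultdict

# Toy edges paraphrasing the two links named above; not real extracted data.
edges = [
    ("fish oil", "inhibits", "platelet aggregation"),
    ("platelet aggregation", "implicated in", "Raynaud's syndrome"),
]


def abc_hypotheses(edges):
    """Return (a, b, c) paths where a links to b and b links to c,
    but no stated edge joins a and c directly."""
    outgoing = defaultdict(set)
    for source, _relation, target in edges:
        outgoing[source].add(target)

    found = []
    for a in list(outgoing):
        for b in sorted(outgoing[a]):
            # .get avoids creating empty entries for nodes with no outgoing edges
            for c in sorted(outgoing.get(b, ())):
                if c != a and c not in outgoing[a]:
                    found.append((a, b, c))
    return found


hypotheses = abc_hypotheses(edges)
print(hypotheses)
```

<p>On this toy graph the only path returned is fish oil, platelet aggregation, Raynaud’s syndrome: a candidate link that no single edge states. A real graph would return far more paths than anyone could test, so ranking them is where most of the work lies.</p>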

<p>Thirty years on, Himmelstein et al. built <a href="https://doi.org/10.7554/elife.26726">Hetionet</a>: 47,000 nodes, 2.25 million relationships, 29 biomedical databases integrated into a single graph. They used it to generate drug repurposing predictions across 209,000 compound-disease pairs. Most of those candidates couldn’t be found by searching the literature because no paper had connected them — that’s what made them candidates worth testing.</p>

<p>The reason I keep coming back to this is that “fact-checking” is actually the least interesting thing a knowledge graph enables. Verification looks backward: does this claim hold given what we know? Discovery looks forward: what does the structure of existing knowledge imply that nobody has tested yet? Swanson and Himmelstein were doing the second thing. An AI system built on structured biomedical knowledge could do both simultaneously — flagging claims that contradict established relationships while surfacing hypotheses that the graph supports but the literature hasn’t yet stated.</p>

<p>The infrastructure question is the hard one, and also the interesting one. Building a knowledge graph like Hetionet is, in a real sense, constructing a digital twin of the scientific record — a computable representation of what the literature actually establishes about how the world works. <a href="https://dwflanagan.com/blog/ground-truth-is-reality/">Ground truth in science is still reality</a>, just harder to access than a test suite. A well-constructed knowledge graph is the closest thing we have to making it queryable. <a href="https://dwflanagan.com/blog/agents-bugs-stats-editor/">Agents can already find errors faster than humans can triage them</a> — the bottleneck isn’t computation, it’s the structured representation of what science actually knows. That’s a much larger project than building a better Community Notes, and a much more valuable one.</p>

<!-- meta: AI fact-checking outperforms humans on social media — but scientific claims are relational, not atomic. What knowledge graphs enable goes beyond verification into discovery. -->]]></content><author><name>Dave Flanagan</name></author><category term="Blog" /><category term="knowledge graphs" /><category term="research intelligence" /><category term="literature-based discovery" /><category term="AI" /><summary type="html"><![CDATA[A new paper from Wharton finds that LLM-generated Community Notes on X are rated more helpful than human-written ones across 108,000+ ratings. It’s a well-designed study and the result is credible — for social media fact-checking, which is what it’s testing. Whether something similar could work for scientific literature is a different question, and the answer depends entirely on what you build underneath it.]]></summary></entry></feed>