2 minute read

A new paper from Wharton finds that LLM-generated Community Notes on X are rated more helpful than human-written ones across 108,000+ ratings. It’s a well-designed study and the result is credible — for social media fact-checking, which is what it’s testing. Whether something similar could work for scientific literature is a different question, and the answer depends entirely on what you build underneath it.

Social media claims are mostly atomic: a politician said something, a statistic is cited correctly or not, an event happened or didn’t. You can check those against a corpus. Scientific claims are relational — they assert relationships between entities distributed across thousands of papers, and the “truth” of the claim is a property of the network, not any individual document. Asking an LLM to fact-check “compound X inhibits pathway Y at therapeutic doses” requires knowing what the literature establishes about X’s mechanism, Y’s context-dependence, and whether the relevant concentrations have ever appeared in the same study. A retrieval system can find text that mentions both; it can’t tell you whether the relationship holds.

This is precisely what knowledge graphs were built for. Don Swanson demonstrated it in 1986: he found that fish oil and Raynaud’s syndrome research had never cited each other, yet traversing the relationships — fish oil inhibits platelet aggregation, platelet aggregation implicated in Raynaud’s — produced a testable hypothesis. No document stated it. The connection existed only in the graph. A clinical trial three years later confirmed it.
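Swanson's move is often called the ABC model: A relates to B in one literature, B relates to C in another, so A–C is a candidate link even though no paper states it. A minimal sketch of that traversal, using a toy edge list with illustrative names (not output of any real extraction pipeline):

```python
from collections import defaultdict

# Directed edges (source, relation, target), as a literature miner might emit.
# The two halves come from disjoint literatures that never cite each other.
EDGES = [
    ("fish oil", "inhibits", "platelet aggregation"),       # A -> B
    ("platelet aggregation", "implicated_in", "raynauds"),  # B -> C
    ("fish oil", "lowers", "blood viscosity"),
    ("blood viscosity", "implicated_in", "raynauds"),
]

def abc_hypotheses(edges, a, c):
    """Return bridging terms B such that A->B and B->C hold in the graph:
    candidate A-C relationships implied by structure, stated by no document."""
    outgoing = defaultdict(set)  # A -> {B}
    incoming = defaultdict(set)  # C -> {B}
    for src, _, dst in edges:
        outgoing[src].add(dst)
        incoming[dst].add(src)
    return sorted(outgoing[a] & incoming[c])

print(abc_hypotheses(EDGES, "fish oil", "raynauds"))
# → ['blood viscosity', 'platelet aggregation']
```

Note that no (fish oil, raynauds) edge exists anywhere in the input; the hypothesis lives only in the intersection of the two neighborhoods, which is the point.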

Thirty years on, Himmelstein et al. built Hetionet: 47,000 nodes, 2.25 million relationships, 29 biomedical databases integrated into a single graph. They used it to generate drug repurposing predictions across 209,000 compound-disease pairs. Most of those candidates couldn’t be found by searching the literature because no paper had connected them — that’s what made them candidates worth testing.

The reason I keep coming back to this is that “fact-checking” is actually the least interesting thing a knowledge graph enables. Verification looks backward: does this claim hold given what we know? Discovery looks forward: what does the structure of existing knowledge imply that nobody has tested yet? Swanson and Himmelstein were doing the second thing. An AI system built on structured biomedical knowledge could do both simultaneously — flagging claims that contradict established relationships while surfacing hypotheses that the graph supports but the literature hasn’t yet stated.
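The two modes are, mechanically, two queries over the same structure. A hedged sketch, with toy data and hypothetical entity names: verification asks whether a claimed edge is stated or at least path-supported, discovery enumerates two-hop paths that no direct edge has yet asserted.

```python
from collections import defaultdict

class ClaimGraph:
    """Toy relationship graph; entity names below are illustrative."""

    def __init__(self):
        self.direct = set()          # (a, c) pairs some paper states
        self.adj = defaultdict(set)  # a -> {b} adjacency

    def add(self, a, b):
        self.direct.add((a, b))
        self.adj[a].add(b)

    def verify(self, a, c):
        """Backward-looking: is the claimed a->c relation stated,
        or at least supported through some intermediate b?"""
        if (a, c) in self.direct:
            return "stated"
        if any(c in self.adj[b] for b in self.adj[a]):
            return "supported (indirect)"
        return "unsupported"

    def discover(self):
        """Forward-looking: a->c pairs the structure implies
        but no existing edge states."""
        return sorted(
            (a, c)
            for a in list(self.adj)
            for b in self.adj[a]
            for c in self.adj[b]
            if (a, c) not in self.direct and a != c
        )

g = ClaimGraph()
g.add("compound X", "pathway Y")
g.add("pathway Y", "disease Z")
print(g.verify("compound X", "pathway Y"))  # → stated
print(g.verify("compound X", "disease Z"))  # → supported (indirect)
print(g.discover())                         # → [('compound X', 'disease Z')]
```

A real system would weight paths by relation type and evidence strength rather than treating every edge as equal, but the symmetry holds: the same traversal that checks a claim also generates the untested ones.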

The infrastructure question is the hard one, and also the interesting one. Building a knowledge graph like Hetionet is, in a real sense, constructing a digital twin of the scientific record — a computable representation of what the literature actually establishes about how the world works. In software, agents get ground truth cheaply from a test suite; in science, ground truth is still reality, which is far harder to query. A well-constructed knowledge graph is the closest thing we have to making it queryable. Agents can already flag candidate errors faster than humans can triage them — the bottleneck isn't computation, it's the structured representation of what science actually knows. That's a much larger project than building a better Community Notes, and a much more valuable one.