Scientific datasets are riddled with copy-paste errors
Markus Englund scanned 600 datasets on Dryad and found serious copy-paste errors in 18 of them (3%), which projects to around 700 cases across the full repository of ~24,000 datasets.
“There just isn’t anybody whose job it is to actively look for it.”
Not surprising. Many researchers are working in Excel rather than reproducible pipelines, and journals don’t have the bandwidth to audit supporting data. If anything, 3% seems low?
The obvious fix is an AI toolbench for pre-publication data validation, something that catches these errors before they enter the literature rather than years after. That kind of routine verification loop is precisely what experimental science is missing.
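To make "data validation" concrete, here is a minimal sketch of one check such a tool might run: flagging a run of consecutive values that shows up twice in the same column. This is not Englund's method, which isn't described here. The function name `find_repeated_runs` and the `window` and `min_distinct` parameters are hypothetical, and the measurements are made up for illustration.

```python
from collections import defaultdict


def find_repeated_runs(values, window=4, min_distinct=3):
    """Flag places where the same run of `window` consecutive values occurs twice.

    Returns a sorted list of (first_start, later_start) index pairs.
    Hypothetical helper: names and thresholds are illustrative assumptions.
    """
    seen = defaultdict(list)  # chunk contents -> list of start indices
    for start in range(len(values) - window + 1):
        chunk = tuple(values[start:start + window])
        # Skip near-constant chunks (e.g. runs of zeros), which repeat naturally.
        if len(set(chunk)) < min_distinct:
            continue
        seen[chunk].append(start)

    hits = []
    for starts in seen.values():
        first = starts[0]
        for later in starts[1:]:
            # Only report copies that do not overlap the first occurrence.
            if later - first >= window:
                hits.append((first, later))
    return sorted(hits)


# Made-up column: the four values starting at index 1 reappear at index 7.
measurements = [5.1, 4.9, 6.3, 5.8, 7.0, 6.1, 5.5, 4.9, 6.3, 5.8, 7.0, 6.6]
print(find_repeated_runs(measurements))  # [(1, 7)]
```

Exact equality is deliberate here: a pasted block reproduces values digit for digit, which independent measurements rarely do. A real tool would still need a human to decide whether a hit is an error or a legitimate repeat.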