This looks very impressive, using LLMs to not only survey the literature but also synthesize the results and generate new statistically significant findings.
I’m not sure whether the April 2024 Cochrane reviews used for validation fall inside GPT-4.1’s training data, so the evaluation may deserve a second look for possible contamination. Overall, though, this could significantly accelerate evidence synthesis and make genuine contributions to the literature.
Systematic reviews (SRs) inform evidence-based decision making. Yet they take over a year to complete, are prone to human error, and face challenges with reproducibility, limiting access to timely and reliable information. We developed otto-SR, an end-to-end agentic workflow using large language models (LLMs) to support and automate the SR workflow from initial search to analysis. We found that otto-SR outperformed traditional dual-human workflows in SR screening (otto-SR: 96.7% sensitivity, 97.9% specificity; human: 81.7% sensitivity, 98.1% specificity) and data extraction (otto-SR: 93.1% accuracy; human: 79.7% accuracy). Using otto-SR, we reproduced and updated an entire issue of Cochrane reviews (n=12) in two days, representing approximately 12 work-years of traditional systematic review effort.
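For readers less familiar with the screening and extraction metrics quoted above, the sketch below shows how sensitivity, specificity, and accuracy are conventionally computed from confusion-matrix counts. The function names and example counts are illustrative assumptions, not values or code from the paper.

```python
# Minimal sketch: conventional definitions of the metrics reported for
# screening (sensitivity/specificity) and data extraction (accuracy).
# All counts below are placeholders, not figures from the otto-SR study.

def sensitivity(tp: int, fn: int) -> float:
    """Fraction of truly relevant studies the screener kept (recall)."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Fraction of truly irrelevant studies the screener correctly excluded."""
    return tn / (tn + fp)

def accuracy(correct: int, total: int) -> float:
    """Fraction of extracted data items matching the reference extraction."""
    return correct / total

if __name__ == "__main__":
    # Hypothetical screening outcome: 290 of 300 relevant studies kept,
    # 1960 of 2000 irrelevant studies correctly excluded.
    print(f"sensitivity = {sensitivity(tp=290, fn=10):.1%}")
    print(f"specificity = {specificity(tn=1960, fp=40):.1%}")
    # Hypothetical extraction outcome: 930 of 1000 items extracted correctly.
    print(f"accuracy    = {accuracy(correct=930, total=1000):.1%}")
```

In this framing, the paper's headline comparison is that the LLM workflow trades essentially no specificity for a large gain in sensitivity over dual-human screening.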