AI tool useful but not a replacement for human screening of the literature for systematic reviews
Abstract: https://www.acpjournals.org/doi/10.7326/M23-3389
Editorial: https://www.acpjournals.org/doi/10.7326/M24-0877
URLs go live when the embargo lifts
Researchers from the Centre for Research in Epidemiology and Statistics (CRESS) conducted an analysis to investigate the sensitivity and specificity of GPT-3.5 Turbo, used as a single reviewer, for title and abstract screening in systematic reviews. The authors developed a framework to guide the model through the screening process, using 5 prompts that each evaluated a different component of the PICOS (Population, Intervention, Comparison, Outcomes, and Study design) framework. The model screened 22,665 citations from 5 systematic reviews. The authors found that the current performance of GPT-3.5 Turbo is insufficient to fully replace manual screening in systematic reviews, but that the model could assist reviewers in resolving uncertainties and could be used to reduce the number of citations before title and abstract screening by humans. However, according to the authors, use of these models is currently limited by their lower specificity compared with human reviewers, the dependence of performance on the prompts used and the resulting need for prompt engineering, and the limited reproducibility of outputs.
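For readers curious how a PICOS-guided screening loop of this kind might be set up, below is a minimal, hypothetical sketch in Python. It assumes the OpenAI chat completions API; the prompt wording, the one-word answer format, the screen_citation and decision helpers, and the simple exclusion rule are illustrative assumptions for this sketch, not the prompts or code used by the study authors.

```python
# Illustrative sketch only: a PICOS-style screening loop over titles and
# abstracts. Prompt wording, decision rule, and helper names are hypothetical
# and are not taken from the published study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One yes/no/unclear question per PICOS component (hypothetical phrasings).
PICOS_QUESTIONS = {
    "Population": "Does the study population match the review's target population?",
    "Intervention": "Does the study evaluate the intervention of interest?",
    "Comparison": "Does the study include the comparator specified by the review?",
    "Outcomes": "Does the study report at least one outcome of interest?",
    "Study design": "Is the study design eligible for the review?",
}

def screen_citation(title: str, abstract: str, criteria: dict[str, str]) -> dict[str, str]:
    """Ask the model one question per PICOS component for a single citation."""
    answers = {}
    for component, question in PICOS_QUESTIONS.items():
        prompt = (
            "You are screening citations for a systematic review.\n"
            f"Eligibility criterion ({component}): {criteria[component]}\n"
            f"Title: {title}\nAbstract: {abstract}\n"
            f"{question} Answer with exactly one word: yes, no, or unclear."
        )
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,  # reduces, but does not eliminate, run-to-run variability
            messages=[{"role": "user", "content": prompt}],
        )
        answers[component] = response.choices[0].message.content.strip().lower()
    return answers

def decision(answers: dict[str, str]) -> str:
    """Exclude only when a component is clearly not met; keep uncertain citations."""
    return "exclude" if "no" in answers.values() else "include"
```

In a sketch like this, keeping "unclear" answers on the include side reflects the use case the authors describe: reducing the pile of citations before human title and abstract screening rather than making final inclusion decisions.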
The authors of an accompanying editorial from the University of Colorado and Annals of Internal Medicine caution that time efficiencies gained through GPT may be offset by the time invested in prompt engineering and the reconciliation of false positives. They also note that the limited number of reviews included in this study makes it difficult to determine which characteristics affect performance, and they suggest several avenues for future research on the use of GPT in research processes.
Media contacts: For an embargoed PDF, please contact Angela Collom at [email protected]. To speak with the corresponding author, Viet-Thi Tran, MD, PhD, please contact [email protected].