Last year, we discussed a 2017 article detailing the ways that machine learning and automation can potentially expedite the typically lengthy process of a systematic review. Now, a new study published in the February 2021 issue of the same journal describes recent efforts to apply a combination of machine learning and crowdsourcing to improve the item screening process in particular.
Clark and colleagues combined machine and human efforts to facilitate the screening of potentially relevant randomized controlled trials (RCTs) for a Cochrane review using a modified version of Cochrane's Screen4Me program. First, the Cochrane-built "RCT Classifier" was used to automatically sift through all items, discarding them as "Not an RCT" or marking them as potentially relevant ("Possible RCT"). Then, crowd-sourcing was used to further identify eligible RCTs from the latter group.
In addition to having all participants partake in a mandatory training module before contributing to the crowdsourced screening efforts, the model also improves accuracy by using an 'agreement algorithm" which requires, for instance, that each item receives four consecutive votes in agreement (either exclusion or inclusion) before achieving a final classification.
The authors then compared the sensitivity and specificity of this system compared to a review completed based on the same search using the gold standard of all-human, independent, and duplicate screening methods. They also calculated the crowd's autonomy, defined as the proportion of records that required a third reviewer for resolution. To increase information gleaned, the authors allowed records to be re-introduced into the system and re-screened by different screeners (a "second batch" screening).
Screeners had 100% sensitivity in both batches, meaning that all potentially relevant items were correctly identified. Specificity - the proportion of correctly discarded non-relevant items - was 80.71% in the first batch but decreased to 62.43% the second time around. Autonomy was 24.6%, meaning just under a quarter of all items required resolution during their first time through the system. When reintroduced, this number increased to 52.9%, though the authors suggest this number may have decreased if the study were continued.
The authors conclude that although the machine aspect of this method - the RCT identifier - only contributed about 9.1% of the workload, the effectiveness of human crowdsourcing to facilitate the screening process was encouraging. Notably, the 100% sensitivity rate in both batches demonstrates that crowdsourcing is unlikely to wrongfully exclude relevant items from a systematic review. Furthermore, the use of a third resolver - as opposed to automatically assigning all conflicting items to the "potential RCT" group - ultimately contributed substantially to the reduction in workload.
Noel-Storr A, Dooley G, Affengruber L, and Gartlehner G. (2021). Citation screening using crowdsourcing and machine learning produced accurate results: Evaluation of Cochrane's modified Screen4Me process. J Clin Epidemiol 130:23-31.
Manuscript available from the publisher's website here.