Thursday, January 14, 2021

Need for Speed Pt. II: Combining Automation and Crowdsourcing to Facilitate Systematic Review Screening Process

Last year, we discussed a 2017 article detailing the ways that machine learning and automation can potentially expedite the typically lengthy process of a systematic review. Now, a new study published in the February 2021 issue of the same journal describes recent efforts to apply a combination of machine learning and crowdsourcing to improve the item screening process in particular.

Noel-Storr and colleagues combined machine and human efforts to facilitate the screening of potentially relevant randomized controlled trials (RCTs) for a Cochrane review using a modified version of Cochrane's Screen4Me program. First, the Cochrane-built "RCT Classifier" was used to automatically sift through all items, discarding them as "Not an RCT" or marking them as potentially relevant ("Possible RCT"). Then, crowdsourcing was used to further identify eligible RCTs from the latter group.

In addition to having all participants complete a mandatory training module before contributing to the crowdsourced screening efforts, the model also improves accuracy by using an "agreement algorithm" which requires, for instance, that each item receive four consecutive votes in agreement (either exclusion or inclusion) before achieving a final classification.
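The consecutive-vote rule described above can be sketched in a few lines of Python. This is only an illustration of the idea as summarized here, not Cochrane's actual implementation: the function name, the vote labels, and the default threshold of four are assumptions made for the example.

```python
from typing import Iterable, Optional


def classify_by_consecutive_votes(votes: Iterable[str], required: int = 4) -> Optional[str]:
    """Return a label once `required` identical votes arrive in a row.

    Hypothetical helper: returns None if the streak is never reached,
    which in a real workflow might mean the record goes to a resolver.
    """
    streak_label = None
    streak_length = 0
    for vote in votes:
        if vote == streak_label:
            streak_length += 1
        else:
            # A disagreeing vote resets the streak to the new label.
            streak_label = vote
            streak_length = 1
        if streak_length >= required:
            return streak_label
    return None


# Made-up votes for illustration only.
print(classify_by_consecutive_votes(["include", "exclude", "exclude", "exclude", "exclude"]))
```

Under this sketch, a single dissenting vote resets the count, so a record with mixed votes stays unclassified until screeners agree four times in a row.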

The authors then compared the sensitivity and specificity of this system with those of a review completed based on the same search using the gold standard of all-human, independent, and duplicate screening methods. They also calculated the crowd's autonomy, defined as the proportion of records that required a third reviewer for resolution. To increase information gleaned, the authors allowed records to be re-introduced into the system and re-screened by different screeners (a "second batch" screening).

Screeners had 100% sensitivity in both batches, meaning that all potentially relevant items were correctly identified. Specificity - the proportion of correctly discarded non-relevant items - was 80.71% in the first batch but decreased to 62.43% the second time around. Autonomy was 24.6%, meaning just under a quarter of all items required resolution during their first time through the system. When reintroduced, this number increased to 52.9%, though the authors suggest this number may have decreased if the study were continued.
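For readers less familiar with these two measures, the sketch below shows how they are computed from screening counts. The counts are invented for illustration and are not the study's data; the function names are likewise assumptions.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of truly relevant items that were correctly kept."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of truly non-relevant items that were correctly discarded."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts: 20 relevant records (none missed) and
# 100 non-relevant records (80 discarded, 20 kept by mistake).
print(f"sensitivity = {sensitivity(true_pos=20, false_neg=0):.0%}")
print(f"specificity = {specificity(true_neg=80, false_pos=20):.0%}")
```

A screening process can therefore have perfect sensitivity while still passing many non-relevant items forward, which is the pattern reported above.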

The authors conclude that although the machine aspect of this method - the RCT Classifier - only contributed about 9.1% of the workload, the effectiveness of human crowdsourcing to facilitate the screening process was encouraging. Notably, the 100% sensitivity rate in both batches suggests that crowdsourcing is unlikely to wrongfully exclude relevant items from a systematic review. Furthermore, the use of a third resolver - as opposed to automatically assigning all conflicting items to the "Possible RCT" group - ultimately contributed substantially to the reduction in workload.

Noel-Storr A, Dooley G, Affengruber L, and Gartlehner G. (2021). Citation screening using crowdsourcing and machine learning produced accurate results: Evaluation of Cochrane's modified Screen4Me process. J Clin Epidemiol 130:23-31.

Manuscript available from the publisher's website here. 

Friday, January 8, 2021

New Guideline Participation Tool Lays Out Roles and Responsibilities for New and Returning Guideline Group Members

Guideline development groups should contain a multidisciplinary panel of experts and key stakeholders to ensure the quality, relevance, and ultimate implementation of resulting recommendations. However, there are few tools in existence to ensure the effective participation of panel members when working to draft guidelines, and preparing panel members with little to no previous experience in guideline development can be an especially daunting task. A new paper published in next month's issue of the Journal of Clinical Epidemiology aims to provide a tool to guide these efforts, with a specific focus on guidelines developed using the GRADE framework.

To develop the tool, Piggott and colleagues first established a draft tool that included 61 items based on a previously published systematic review of guideline development handbooks. They then conducted a series of ten key informant interviews comprising both past and prospective guideline development group members to narrow the tool down to three major themes: selection of participants, guideline group process, and tool format. The resulting 33-item Guideline Participant Tool (GPT) was then validated in a survey of 26 guideline group members from various societies including WHO and the American Society of Hematology (ASH). The tool itself breaks the process of guideline participation into three major time windows: 

  • Before (Preparations): 12 items including clarifying objectives and one's role within the group and familiarizing oneself with the guideline development methodology to be used.
  • During (Meetings): 15 items including avoiding undue interruptions, adhering to the specified methodology, and referring to the PICO question at hand as a way to stay on task.
  • After (Follow-up): 6 items including maintaining proper confidentiality of information discussed, reviewing meeting minutes to identify any discrepancies in a timely fashion, and assisting with the promotion, dissemination, and evaluation of the guideline as requested.

According to the authors, "Most participants found that the tool is most useful before guideline group meetings explaining what to expect at each phase. Participants thought that the tool was useful beforehand as a reference for orienting themselves to the structure of meetings, understanding the guideline development process, and what might be required of them. Respondents agreed that the tool serves as a reference for them to stay on track with the required tasks and to support structuring the process of guideline development."

The authors go on to suggest that the tool be used as required reading for all group members ahead of their participation on a panel. 

Piggott T, Baldeh T, Akl EA, et al. 2021. Supporting effective participation in health guideline development groups: The Guideline Participant Tool. J Clin Epidemiol 130:42-48.

Manuscript available from the publisher's website here. 

Friday, December 18, 2020

Reviews of screening interventions often fail to include relevant harms

Any given intervention comes with both potential desirable and undesirable effects, and both deserve proper consideration in the context of clinical decision-making. However, our knowledge about potential undesirable effects (or "harms") of an intervention depends on the availability of the evidence, just as it does with the potential benefits. A recent systematic review of reviews for screening interventions suggests that the way evidence for harms is synthesized may not follow the same rigor and depth as for an intervention's potential desirable effects, limiting our ability to thoroughly weigh the two against one another when making clinical decisions to inform screening behaviors.

In the January 2021 issue of the Journal of Clinical Epidemiology, Johansson and colleagues systematically searched and screened 47 Cochrane reviews, making note of those that reported including potential harms as outcomes within the search strategy, even if no available evidence was ultimately found. Overdiagnosis was only included in 15% of the 39 reviews in which Johansson and colleagues deemed it a potentially relevant outcome; overtreatment was mentioned in 16% of eligible reviews. The inclusion of secondary harm outcomes in potentially eligible reviews ranged from 7% (incidental findings) to 91% (all-cause mortality). While psychosocial consequences were discussed as a potential outcome in a majority (64%) of eligible reviews, the data for this outcome were often not synthesized.

Overall, reviews were less likely to meta-analyze or assess the risk of bias for evidence around harms than for benefits. In addition, two-thirds (67%) of summary of findings tables did not include any harms as outcomes; further, 42% of abstracts and 58% of plain language summaries did not mention any harms.

The authors conclude that these findings demonstrate a need for a "broad collaboration" to develop reporting guidelines and core outcome sets that will ensure the more thorough and rigorous reporting of harms outcomes in screening studies. Through a consensus process involving a diverse set of stakeholders including clinicians, methodologists, policymakers, and medical ethicists, improved standards can be set for the reporting of all outcomes of screening interventions that are of potential relevance to patients.

Johansson M, Borys F, Peterson H, et al. 2021. Addressing the harms of screening - A review of outcomes in Cochrane reviews and suggestions for next steps. J Clin Epidemiol 129:68-73.

Manuscript available from the publisher's website here. 

Monday, December 14, 2020

Evidence Foundation Scholar Update

Dr. Christian Kershaw, a health policy analyst with CGS Administrators, LLC, attended the fall 2019 GRADE Guideline Development Workshop free of charge as an Evidence Foundation scholar. As part of the scholarship, Dr. Kershaw submitted and then presented on a proposal for reducing bias in healthcare, focusing on how to build and lead cross-functional teams (blog post here). We followed up with Dr. Kershaw at one year post-workshop to see what's happened since her attendance.

"In my work as a health policy analyst for a Medicare Fee-for-Service contractor, I use my research background to help with the evaluation of scientific literature on products being considered for coverage by Medicare. My team was created because of a call for transparency in how Medicare makes coverage decisions. I received the Evidence Foundation scholarship to the Fall 2019 GRADE conference and attended along with my team members to learn GRADE methodology. Our goal was to determine if implementation of GRADE methodology would standardize our literature evaluation process for coverage decisions," said Dr. Kershaw.

"At the conference I presented on the benefits of establishing cross-functional teams. By joining forces with team members with a heterogeneous set of skills and backgrounds, we can leverage individual strengths and encourage innovation to reach a common goal. In my work, I collaborate with MDs, RNs, and policy experts with a well-rounded knowledge base of clinical standards, Medicare processes, and coverage policies. After learning GRADE methodology, we implemented the use of GRADE to improve the transparency and standardization of our process for writing coverage policies. Our team has now completed two GRADE workshops, and we are constantly working to improve our use of this methodology. We have found that the use of GRADE helps our cross-functional team improve our ability to systematically make coverage determinations based on scientific evidence."

Stay tuned for future updates from other past Evidence Foundation scholars like Dr. Kershaw and the exciting work they are doing to improve the application of GRADE methodology and evidence-based medicine.

If you are interested in learning more about GRADE and attending the workshop as a scholarship recipient, applications for our upcoming workshop next May are now open. The deadline to apply is February 28, 2021. Details can be found here. 

Tuesday, December 8, 2020

No Single Definition of a Rapid Review Exists, but Several Common Themes Emerge

"The only consensus around [a rapid review] definition," write Hamel and colleagues in a review published in the January 2021 issue of the Journal of Clinical Epidemiology, "is that a formal definition does not exist."

In their new review, Hamel et al. sifted through 216 rapid reviews and 90 methodological articles published between 2017 and 2019 to better understand the existing definitions and use of the term "rapid review," identifying eight common themes among them all.

The figure below from the publication shows the relative usage of these themes throughout the relevant identified articles.

In summary of all definitions examined in the review, the authors suggest the following broad definition of a rapid review: "a form of knowledge synthesis that accelerates the process of conducting a traditional systematic review through streamlining or omitting a variety of methods to produce evidence in a resource-efficient manner."

To complicate matters further, Hamel and colleagues also found that reviews meeting these general criteria may not always go by the term "rapid." For instance, the term "restricted review" fits many of these same parameters, but is not necessarily defined by the amount of time from inception to publication. The lack of an agreed-upon definition of a "rapid review" may ultimately hamper authors and potential end-users of these products, as the accepted legitimacy of such reviews may depend upon a common understanding of their standards and methodological frameworks. In addition, the range of rigor and specific protocols continues to vary widely between products labeled as "rapid reviews." Until there is a broader consensus on the definition of a rapid review and what, exactly, it entails, this working definition and associated themes provide insight into the current state of the art.

Check out our related post on the two-week systematic review here.

Hamel C, Michaud A, Thuku M, Skidmore B, Stevens A, Nussbaumer-Streit B, and Garritty C. (2020). Defining rapid reviews: A systematic scoping review and thematic analysis of definitions and defining characteristics of rapid reviews. J Clin Epidemiol 129:74-85.

Manuscript available from the publisher's website here.

Thursday, December 3, 2020

Assessing the Reliability of Recently Developed Risk of Bias Tools for Non-Randomized Studies

Risk of bias is one of the five domains to be considered when assessing the certainty of evidence across a body of studies, and is the only domain which must first be assessed on the individual study level. While several risk of bias assessment tools exist for non-randomized studies (NRS; or observational trials), two of the most recently introduced are the Risk of Bias in Non-Randomized Studies of Interventions (ROBINS-I, developed in 2016) and the Risk of Bias instrument for NRS of Exposures (ROB-NRSE, developed in 2018). Assessment of the risk of bias in a systematic review on which a guideline is based should ideally be conducted independently by at least two reviewers. Given this scenario, how likely is it that the two reviewers' assessments will agree sufficiently with one another?

In a recently published paper by Jeyaraman and colleagues, a multi-center group of collaborators assessed both the inter-rater reliability (IRR) and interconsensus reliability (ICR) of these tools based on a previously published cross-sectional study protocol. The seven reviewers had a median of 5 years of experience assessing risk of bias, and two pairs of reviewers assessed risk of bias using each tool. IRR was used to assess reliability within pairs, while ICR assessed reliability between the pairs. The time burden was also assessed by recording the amount of time required to assess each included study and to come to a consensus. For the overall assessment of bias, IRR was rated as "poor" (Gwet's agreement coefficient of 0%) for the ROBINS-I tool and "slight" (11%) for the ROB-NRSE tool, whereas the ICR was rated as "poor" for both ROBINS-I (7%) and ROB-NRSE (0%). The average evaluator time burden was over 48 minutes for the ROBINS-I tool and almost 37 minutes for the ROB-NRSE.
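To give a sense of what an agreement coefficient of this kind measures, here is a minimal sketch of Gwet's unweighted AC1 for two raters. It is an assumption that this variant resembles what the study used; the authors may have applied a weighted version suited to ordinal ratings. The function name, category labels, and ratings below are all invented for illustration.

```python
from collections import Counter
from typing import Sequence


def gwet_ac1(rater_a: Sequence[str], rater_b: Sequence[str], categories: Sequence[str]) -> float:
    """Gwet's unweighted AC1 for two raters judging the same items."""
    n = len(rater_a)
    q = len(categories)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must rate the same non-empty set of items")
    if q < 2:
        raise ValueError("at least two rating categories are required")

    # Observed agreement: share of items given the same rating by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: based on the average prevalence of each category
    # across the two raters.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    prevalence = [(counts_a[c] + counts_b[c]) / (2 * n) for c in categories]
    chance = sum(p * (1 - p) for p in prevalence) / (q - 1)

    return (observed - chance) / (1 - chance)


# Made-up ratings for four studies, for illustration only.
first = ["low", "low", "high", "high"]
second = ["low", "high", "high", "high"]
print(round(gwet_ac1(first, second, categories=["low", "high"]), 3))
```

A value near 0 means the raters agree about as often as chance alone would predict, and a value of 1 means they agree on every item.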

The authors note that overall, ROBINS-I tended to have a better IRR as well as ICR, both of which may be due in part to poorer reporting quality in exposure studies. In addition, simplification of related guidance documents for applying the tool and increased training for reviewers looking to use the ROBINS-I and ROB-NRSE tools to assess risk of bias in non-randomized studies may improve agreement considerably while cutting down on the time required to apply the tool correctly to each individual study.

Jeyaraman MM, Rabbani R, Copstein L, Robson RC, Al-Yousif N, Pollock M, ... & Abou-Setta AM. (2020). Methodologically rigorous risk of bias tools for nonrandomized studies had low reliability and high evaluator burden. J Clin Epidemiol 128:140-147.

Manuscript available from the publisher's website here.

Wednesday, November 25, 2020

Diagnostic Test Accuracy Meta-Analyses Are Often Missing Information Required for Reproducibility

Reproducibility of results is considered a key tenet of the scientific process. When results of a study are reproduced by others using the same protocol, there is less chance that the original results observed were due to human or random error. Testing the reproducibility of evidence syntheses (e.g., meta-analyses) is just as important as for individual trials.

In a paper published earlier this month, Stegeman and colleagues undertook the task of testing the reproducibility of meta-analyses of diagnostic test accuracy. The authors identified 51 eligible meta-analyses published in January 2018. In 19 of these, sufficient information was provided in the text of the study to reproduce the 2x2 tables of the individual studies included; in the remaining 32, only estimates were provided in the text. In 17 of these 32, the authors located primary data to attempt reproducibility. When attempting to reproduce the meta-analyses of the 51 identified papers, reproducibility was only achieved 28% of the time; none of the 17 papers for which 2x2 tables were not provided were reproducible.

Only 14 (27%) of the 51 articles provided full search terms. In nearly half (25) of the included reviews, at least one of the full texts of included references could not be located; in 12, at least one title or abstract could not be located. Overall, of the 51 included reviews, only one was deemed fully reproducible by providing a full protocol, 2x2 tables, and the same summary estimates as the authors.

The authors conclude with a call for increased prospective registration of protocols and improved reporting of search terms and methods. The application of the 2017 PRISMA statement for diagnostic test accuracy is a helpful tool for any aspiring author of a diagnostic test accuracy meta-analysis to improve the reporting and reproducibility of results.

Stegeman I. and Leeflang M.M.G. (2020). Meta-analyses of diagnostic test accuracy could not be reproduced. J Clin Epidemiol 127:161-166.

Manuscript available from the publisher's website here.