Tuesday, January 26, 2021

Heterogeneity Basics: When It Matters, and What to Do About It

The biggest benefit of a meta-analysis is that it allows multiple studies' findings to be pooled into a single effect estimate, increasing the statistical power of the analysis and, in turn, potentially our certainty in the effect estimate. However, a single estimate may be misleading if there is significant heterogeneity (or inconsistency, in GRADE terminology) among the individual studies. One study, for instance, may point to a potential harm of an intervention while the others in the same meta-analysis suggest a benefit; this study may vary from the others in important ways regarding its population, the performance of the intervention, or even the study design itself. A brief primer on heterogeneity newly published by Cordero and Dans details how it can be identified and managed to improve the way the implications of a meta-analysis are presented and applied.

Of eyeballs and I²: detecting heterogeneity

Identifying the presence of heterogeneity among a group of pooled studies may be as simple as visually inspecting a forest plot for confidence intervals that show poor overlap, or for discordance in the estimates of effect (i.e., some showing a likely benefit while others show a likely harm).

However, some statistical analyses can also provide more nuanced and objective measures of potentially worrisome heterogeneity: 

  • The Q statistic tests the null hypothesis that no heterogeneity is present and provides a p-value for this test. Note, however, that a large p-value should not be interpreted as the absence of heterogeneity: the test is underpowered when a meta-analysis includes only a few studies.
  • I² is a measure derived from the Q statistic and can be interpreted, roughly, as the proportion of total variability in the effect estimates that is due to differences between studies rather than chance. The larger the I², the greater the likelihood of "real" heterogeneity. A 95% confidence interval around the estimate should be presented when using I² to assess heterogeneity.
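As a rough illustration of how these two statistics are computed, here is a minimal sketch using hypothetical effect estimates (log odds ratios) and variances; the numbers are made up for demonstration, not taken from any real meta-analysis:

```python
def q_and_i_squared(effects, variances):
    """Cochran's Q and the I-squared statistic for k study estimates."""
    # Fixed-effect (inverse-variance) weights and pooled estimate
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted squared deviations of each study from the pooled estimate
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variability attributable to between-study differences
    i_squared = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i_squared

# Hypothetical log odds ratios and variances from five studies;
# the fourth study is the "outlier" pointing toward harm.
effects = [-0.40, -0.30, -0.50, 0.20, -0.35]
variances = [0.04, 0.05, 0.06, 0.04, 0.05]
q, i2 = q_and_i_squared(effects, variances)
print(f"Q = {q:.2f}, I^2 = {i2:.1f}%")
```

With the outlying fourth study included, I² lands well above zero, flagging variability beyond what chance alone would produce.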

You've found some heterogeneity. What now?

Once heterogeneity has been detected - preferably through a combination of visual inspection and statistical analysis - explanations for these between-study differences should be sought. A comparison of each study's PICO (Population, Intervention, Comparator, Outcome) elements is a great place to start. For instance, does the one outlying study have an older mean age in its population? Did it narrow its inclusion criteria to, say, only pregnant women? Perhaps it defined and operationalized its outcome differently than the other studies did.

If the heterogeneity cannot be explained this way, it is best to use a random-effects model for the meta-analysis. Unlike the fixed-effect model, the random-effects model does not assume that there is a single "true" effect of the intervention which all of the included studies are estimating; instead, it assumes genuine variability between the studies, with each study providing its own estimate of the effect within its own unique setting.
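The contrast can be sketched with the DerSimonian-Laird estimator, one common way to fit a random-effects model (the input numbers below are hypothetical): the between-study variance tau² is estimated from Q and folded into each study's weight, which widens the pooled confidence interval whenever heterogeneity is present.

```python
def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate via the DerSimonian-Laird method."""
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    sw = sum(w)
    pooled_fe = sum(wi * y for wi, y in zip(w, effects)) / sw
    q = sum(wi * (y - pooled_fe) ** 2 for wi, y in zip(w, effects))
    df = len(effects) - 1
    # Method-of-moments estimate of the between-study variance tau^2
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)
    # Random-effects weights add tau^2 to each study's own variance
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled_re = sum(wi * y for wi, y in zip(w_re, effects)) / sum(w_re)
    se_re = (1.0 / sum(w_re)) ** 0.5
    return pooled_re, se_re, tau2

# Same hypothetical five studies as above, one discordant:
est, se, tau2 = dersimonian_laird([-0.40, -0.30, -0.50, 0.20, -0.35],
                                  [0.04, 0.05, 0.06, 0.04, 0.05])
```

When the studies are homogeneous, tau² is estimated as zero and the result collapses back to the fixed-effect estimate; when they are not, the larger standard error honestly reflects the unexplained between-study variability.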

Cordero CP and Dans AL. (2021). Key concepts in clinical epidemiology: Detecting and dealing with heterogeneity in meta-analyses. J Clin Epidemiol 130:149-151.

Manuscript available here. 

Wednesday, January 20, 2021

Help for Choosing Among Multiple Interventions Using GRADE

It is not uncommon for a health guideline to compare two or more interventions against one another. However, while sophisticated statistical approaches such as network meta-analyses allow us to compare these interventions head-to-head in terms of specified health outcomes, they do not take other important aspects of clinical decision-making into account, such as patient values and preferences, resource use, and equity considerations. A new paper from Piggott and colleagues aims to provide initial suggestions for using the GRADE evidence to decision (EtD) framework when choosing which of multiple interventions to recommend.

The authors identified a need for more direction when undertaking a multiple intervention comparison (MC) approach while working on recently released guidelines for the European Commission Initiative on Breast Cancer, in which multiple screening intervals were compared against one another. Based on this experience, the group drafted a flexible yet transparent framework to help guide similar efforts in the future, which was then added as a module in GRADE's official guideline development software, GRADEpro.

The new module was pilot-tested for feasibility with several additional guidelines. It allows the user to select and then compare multiple pairwise comparisons against one another (for instance, with one column for "Intervention 1 vs. Comparator 1" and another for "Intervention 2 vs. Comparator 2"). A five-star system is used to judge the various components of the EtD, such as cost effectiveness, for each individual intervention and comparator, while a column on the right-hand side allows the user to input the relative importance of each component in decision-making.

Finally, the user can review all judgments across interventions and summatively recommend the most favorable intervention(s) overall.

Piggott T, Brozek J, Nowak A, et al. (2021). Using GRADE evidence to decision frameworks to choose from multiple interventions. J Clin Epidemiol 130:117-124.

Manuscript available from the publisher's website here.

Thursday, January 14, 2021

Need for Speed Pt. II: Combining Automation and Crowdsourcing to Facilitate Systematic Review Screening Process

Last year, we discussed a 2017 article detailing the ways that machine learning and automation can potentially expedite the typically lengthy process of a systematic review. Now, a new study published in the February 2021 issue of the same journal describes recent efforts to apply a combination of machine learning and crowdsourcing to improve the item screening process in particular.

Noel-Storr and colleagues combined machine and human efforts to facilitate the screening of potentially relevant randomized controlled trials (RCTs) for a Cochrane review using a modified version of Cochrane's Screen4Me program. First, the Cochrane-built "RCT Classifier" was used to automatically sift through all items, discarding them as "Not an RCT" or marking them as potentially relevant ("Possible RCT"). Then, crowdsourcing was used to further identify eligible RCTs from the latter group.

In addition to having all participants complete a mandatory training module before contributing to the crowdsourced screening efforts, the process also improves accuracy by using an "agreement algorithm" which requires, for instance, that each item receive four consecutive votes in agreement (either for exclusion or inclusion) before achieving a final classification.
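The paper is summarized here only at this level of detail, so the following is a hypothetical sketch of how a "four consecutive agreeing votes" rule might work; the function name and vote format are illustrative, not Screen4Me's actual implementation:

```python
def classify_record(votes, required_streak=4):
    """Classify a record once enough consecutive votes agree.

    votes: iterable of "include" / "exclude" decisions in the order cast.
    Returns the agreed decision, or None if the votes never reach the
    required streak (i.e., the record goes to a third resolver).
    """
    current, streak = None, 0
    for vote in votes:
        # A vote matching the previous one extends the streak; otherwise reset.
        streak = streak + 1 if vote == current else 1
        current = vote
        if streak == required_streak:
            return current
    return None

# Four agreeing votes in a row yield a final classification:
print(classify_record(["include", "include", "include", "include"]))
# Persistent disagreement keeps resetting the streak, so no classification:
print(classify_record(["include", "exclude", "include", "exclude"]))
```

The key property is that a single disagreeing vote forces the streak to restart, so any record with real screener disagreement ends up in front of a human resolver rather than being classified by a slim majority.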

The authors then compared the sensitivity and specificity of this system to those of a review completed from the same search using the gold standard of all-human, independent, duplicate screening. They also calculated the crowd's autonomy, defined as the proportion of records that required a third reviewer for resolution. To glean additional information, the authors allowed records to be re-introduced into the system and re-screened by different screeners (a "second batch" screening).

Screeners had 100% sensitivity in both batches, meaning that all potentially relevant items were correctly identified. Specificity - the proportion of non-relevant items correctly discarded - was 80.71% in the first batch but decreased to 62.43% the second time around. Autonomy was 24.6%, meaning just under a quarter of all items required resolution during their first pass through the system. When records were reintroduced, this number increased to 52.9%, though the authors suggest it may have decreased had the study continued.
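All three metrics are simple proportions over the screening counts. A minimal sketch with hypothetical counts, chosen only to show the arithmetic and not taken from the paper:

```python
def screening_metrics(true_pos, false_neg, true_neg, false_pos, sent_to_resolver):
    """Sensitivity, specificity, and 'autonomy' as defined in the post:
    autonomy here is the proportion of all records needing a third resolver."""
    total = true_pos + false_neg + true_neg + false_pos
    sensitivity = true_pos / (true_pos + false_neg)  # relevant items caught
    specificity = true_neg / (true_neg + false_pos)  # irrelevant items discarded
    autonomy = sent_to_resolver / total              # share requiring resolution
    return sensitivity, specificity, autonomy

# Hypothetical batch: 50 truly relevant records, all caught (sensitivity 100%);
# 1,000 irrelevant records, of which 807 were correctly discarded (80.7%).
sens, spec, autonomy = screening_metrics(
    true_pos=50, false_neg=0, true_neg=807, false_pos=193, sent_to_resolver=258)
```

Note that sensitivity depends only on what happens to the truly relevant records, which is why a crowd can keep sensitivity at 100% even while specificity drops: erring toward inclusion costs extra downstream screening work, not missed studies.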

The authors conclude that although the machine aspect of this method - the RCT Classifier - contributed only about 9.1% of the workload, the effectiveness of human crowdsourcing in facilitating the screening process was encouraging. Notably, the 100% sensitivity in both batches suggests that crowdsourcing is unlikely to wrongly exclude relevant items from a systematic review. Furthermore, the use of a third resolver - as opposed to automatically assigning all conflicting items to the "potential RCT" group - ultimately contributed substantially to the reduction in workload.

Noel-Storr A, Dooley G, Affengruber L, and Gartlehner G. (2021). Citation screening using crowdsourcing and machine learning produced accurate results: Evaluation of Cochrane's modified Screen4Me process. J Clin Epidemiol 130:23-31.

Manuscript available from the publisher's website here.  

Friday, January 8, 2021

New Guideline Participation Tool Lays Out Roles and Responsibilities for New and Returning Guideline Group Members

Guideline development groups should comprise a multidisciplinary panel of experts and key stakeholders to ensure the quality, relevance, and ultimate implementation of the resulting recommendations. However, few tools exist to ensure the effective participation of panel members when drafting guidelines, and preparing panel members with little to no previous experience in guideline development can be an especially daunting task. A new paper published in next month's issue of the Journal of Clinical Epidemiology aims to provide a tool to guide these efforts, with a specific focus on guidelines developed using the GRADE framework.

To develop the tool, Piggott and colleagues first drafted a version containing 61 items based on a previously published systematic review of guideline development handbooks. They then conducted a series of ten key informant interviews with both past and prospective guideline development group members, narrowing the tool down to three major themes: selection of participants, guideline group process, and tool format. The resulting 33-item Guideline Participant Tool (GPT) was then validated in a survey of 26 guideline group members from various organizations, including the WHO and the American Society of Hematology (ASH). The tool itself breaks the process of guideline participation into three major time windows: 

  • Before (Preparations): 12 items including clarifying objectives and one's role within the group and familiarizing oneself with the guideline development methodology to be used.
  • During (Meetings): 15 items including avoiding undue interruptions, adhering to the specified methodology, and referring to the PICO question at hand as a way to stay on task.
  • After (Follow-up): 6 items including maintaining proper confidentiality of information discussed, reviewing meeting minutes to identify any discrepancies in a timely fashion, and assisting with the promotion, dissemination, and evaluation of the guideline as requested.

According to the authors, "Most participants found that the tool is most useful before guideline group meetings explaining what to expect at each phase. Participants thought that the tool was useful beforehand as a reference for orienting themselves to the structure of meetings, understanding the guideline development process, and what might be required of them. Respondents agreed that the tool serves as a reference for them to stay on track with the required tasks and to support structuring the process of guideline development."

The authors go on to suggest that the tool be used as required reading for all group members ahead of their participation on a panel. 

Piggott T, Baldeh T, Akl EA, et al. (2021). Supporting effective participation in health guideline development groups: The Guideline Participant Tool. J Clin Epidemiol 130:42-48.

Manuscript available from the publisher's website here.