Showing posts with label Risk of bias. Show all posts

Monday, September 13, 2021

Re-analysis of a systematic review on injury prevention demonstrates that methods do really matter

How much of a difference can methodological decisions make? Quite a bit, argues a new paper published in the Journal of Clinical Epidemiology. In a re-analysis of a 2018 meta-analysis on the role of the Nordic hamstring exercise (NHE) in injury prevention, the study outlined and then executed several methodological changes within the context of an updated search, and found that the resulting magnitude of effect - and the strength of recommendations using GRADE - were not quite as dazzling as in the original analysis.

Impellizzeri and colleagues noted several suggested changes to the 2018 paper, including:

  • limiting the meta-analysis to higher-level evidence (randomized controlled trials) when available,
  • clarifying the interventions used in the included studies and being cognizant of the effect of co-interventions (for instance, when NHE was used alone versus in combination with other exercises as part of an injury reduction program),
  • being careful not to "double-dip" on events (i.e., injuries) that recur in the same individual when presenting the data as a risk ratio,
  • discussing the impact of between-study heterogeneity when discussing the certainty of resulting estimates,
  • presenting the lower- and upper-bounds of 95% confidence intervals for estimates of effect in addition to the point estimates, and
  • taking the limitations of the literature and other important considerations into account when formulating final summaries or recommendations (for instance, using the GRADE framework)
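The "double-dipping" concern above can be made concrete with a toy calculation (the numbers are hypothetical and not drawn from either paper): a risk ratio should compare the proportions of players injured in each arm, so counting repeat injuries in the same individual as separate events distorts the estimate.

```python
def risk_ratio(events_exposed, n_exposed, events_control, n_control):
    """Risk ratio = risk in the intervention arm / risk in the control arm."""
    return (events_exposed / n_exposed) / (events_control / n_control)

# Hypothetical trial in which some players are injured more than once.
# Counting every injury ("double-dipping"): 10 vs 30 injuries.
rr_events = risk_ratio(10, 100, 30, 100)
# Counting each injured player once (appropriate for a risk ratio): 8 vs 20 players.
rr_players = risk_ratio(8, 100, 20, 100)
```

In this invented example the event-based calculation (0.33) makes the intervention look more protective than the person-based one (0.40).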
The authors ran an updated systematic search but excluded non-randomized studies as well as studies that combined other exercises with the NHE in the intervention group. Risk of bias was assessed using the Cochrane tool for randomized studies. The overall certainty of evidence as assessed using GRADE was rated "low," although given the noted concerns regarding risk of bias, inconsistency, and imprecision, the certainty may range down to "very low" under the standard GRADE framework. The forest plot of the updated analysis can be seen below.


The results of the updated analysis show that, rather than reducing the risk of hamstring injury by 50%, the range of possible effects was too large to draw a conclusion about the effectiveness of this intervention, and only a conditional recommendation is warranted.
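As a rough illustration of why a wide confidence interval precludes a firm conclusion, here is a minimal sketch of random-effects pooling of log risk ratios using the DerSimonian-Laird method; the study values are invented for illustration and are not those of the re-analysis.

```python
import math

def pool_random_effects(log_rrs, ses):
    """DerSimonian-Laird random-effects pooling of log risk ratios.
    Returns the pooled RR with a 95% confidence interval."""
    w = [1 / se**2 for se in ses]
    fixed = sum(wi * y for wi, y in zip(w, log_rrs)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_rrs))
    df = len(log_rrs) - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                  # between-study variance
    w_re = [1 / (se**2 + tau2) for se in ses]
    pooled = sum(wi * y for wi, y in zip(w_re, log_rrs)) / sum(w_re)
    se_pooled = math.sqrt(1 / sum(w_re))
    lo, hi = pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled
    return math.exp(pooled), math.exp(lo), math.exp(hi)

# Hypothetical study results (log risk ratios and their standard errors):
rr, lo, hi = pool_random_effects([-0.7, 0.1, -0.3], [0.4, 0.5, 0.45])
```

Here the pooled point estimate suggests a benefit, but the interval spans 1 (no effect) - the situation described above, where the data cannot support a strong recommendation.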

Impellizzeri, F.M., McCall, A., and van Smeden, M. (2021). Why methods matter in a meta-analysis: A reappraisal showed inconclusive injury preventive effect of Nordic hamstring exercise. J Clin Epidemiol, in-press.

The manuscript is available at the publisher's site here.


















Monday, August 30, 2021

Misuse of ROBINS-I Tool May Underestimate Risk of Bias in Non-Randomized Studies

Although it is currently the only tool recommended by the Cochrane Handbook for assessing risk of bias in non-randomized studies of interventions, the Risk Of Bias In Non-randomized Studies of Interventions (ROBINS-I) tool can be complex and difficult to use effectively for reviewers lacking specific training or expertise in its application. Previous posts have summarized research examining the reliability of ROBINS-I, suggesting that it can improve with training of reviewers. Now, a study from Igelström and colleagues finds that the tool is commonly modified or used incorrectly, potentially affecting the certainty of evidence or strength of recommendations resulting from synthesis of these studies.

The authors reviewed 124 systematic reviews published across two months in 2020, using A MeaSurement Tool to Assess systematic Reviews (AMSTAR) to operationalize the overall quality of the reviews. The authors extracted data related to the use of ROBINS-I to assess risk of bias across studies and/or outcomes as well as the number of studies included, whether meta-analysis was performed, and whether any funding sources were declared. They then assessed whether the application of ROBINS-I was predicted by the review's overall methodological quality (as measured by AMSTAR), the performance of risk of bias assessment in duplicate, the presence of industry funding, or the inclusion of randomized controlled trials in the review.


Overall methodological quality across the reviews was generally low to very low, with only 17% scoring as moderate quality and 6% scoring as high quality. Only six (5%) of the reviews reported explicit justifications for risk of bias judgments both across and within domains. Modification of ROBINS-I was common, with 20% of reviews modifying the rating scale, and six either not reporting across all seven domains or adding an eighth domain. In 19% of reviews, studies rated as having a "critical" risk of bias were included in the narrative or quantitative synthesis, against guidance for the use of the tool.

Reviews that were of higher quality as assessed by AMSTAR tended to contain fewer "low" or "moderate" risk of bias ratings and more judgments of "critical" risk of bias. Thus, the authors argue, incorrect or modified use of ROBINS-I may risk underestimating the potential risk of bias among included studies, potentially affecting the resulting conclusions or recommendations. Associations between the use of ROBINS-I and the other potential predictors, however, were less conclusive. 

Igelström, E., Campbell, M., Craig, P., and Katikireddi, S.V. (2021). Cochrane's risk-of-bias tool for non-randomized studies (ROBINS-I) is frequently misapplied: A methodological systematic review. J Clin Epidemiol, in-press.

Manuscript available from publisher's website here. 










Friday, May 14, 2021

Reliability of Risk of Bias Assessments of Non-randomized Studies Improves After Customized Training

We previously reported on a paper published in 2020 assessing the inter-rater reliability (IRR) and inter-consensus reliability (ICR) of the Risk of Bias in Non-Randomized Studies of Interventions (ROBINS-I) tool, developed in 2016, and the Risk of Bias instrument for NRS of Exposures (ROB-NRSE) tool, developed in 2018. This paper found that reliability generally tended to be poor for these tools, while risk of bias assessments took evaluators, on average, 48 minutes for the ROBINS-I tool and almost 37 minutes for the ROB-NRSE.

Now, a new publication from the same group has examined the effect of training on the reliability of these tools. An international team of reviewers with a median of 5 years of experience with risk of bias assessment first applied the ROBINS-I and ROB-NRSE tools to a list of 44 non-randomized studies of interventions and exposures, respectively, using only the 53 pages of publicly available guidance. Then, the reviewers received an abridged and customized training document which was tailored specifically to the topic area of the reviews, included simplified guidance for assessing risk of bias, and also provided additional guidance related to more advanced concepts. The reviewers then re-assessed the studies' risk of bias after a several-weeks-long wash-out period.



Changes in the inter-rater reliability (IRR) for the ROBINS-I (top) and ROB-NRSE tools (bottom) from before and after a customized training intervention.


The training intervention improved the IRR of the ROBINS-I tool, generally improving the range of within-domain reliability while the reliability of the overall bias rating improved from "poor" to "fair." Meanwhile, the ICR improved substantially, with the overall rating's reliability improving from "poor" to "near perfect." Improvements were also observed after training in the application of the ROB-NRSE tool, with IRR of the overall bias rating improving significantly from "slight" to "near perfect" while its ICR improved from "poor" to "near perfect." For both tools, the pre-to-post-intervention correlations between reviewers' scores were poor, suggesting that the training did have an impact on these measures independent of a simple learning effect. While customized training was associated with a decrease in evaluator burden for the ROBINS-I tool, this did not hold true for the ROB-NRSE.

The findings of this analysis suggest that the use of a customized, shortened guidance tool specifically tailored to the topical content of a review, including simplified guidance for decision-making within each domain, can improve the reliability of resulting risk of bias assessments. The authors suggest that future reviewers create such guidance based on the specific needs and considerations of their topic area, and publish these tools along with the review.

Jeyaraman MM, Robson RC, Copstein L et al. (2021). Customized guidance/training improved the psychometric properties of methodologically rigorous risk of bias instruments for non-randomized studies. J Clin Epidemiol, in-press.

Manuscript available here. 































Monday, March 15, 2021

A Blinding Success?: The Debate over Reporting the Success of Blinding

While the use of blinding is a hallmark of placebo-controlled trials, whether the blinding was successful - i.e., whether or not participants were able to figure out the treatment condition to which they had been assigned - isn't always tested, nor are the results of these tests always reported. The measurement of the success of blinding in trials is controversial and not uniformly performed, and the item has been dropped from subsequent versions of the CONSORT reporting items for trials. According to a recent discussion of the pros and cons of measuring the success of blinding, only 2% to 24% of trials perform or report these types of tests.

As Webster and colleagues explain, the benefits of measuring the success of blinding include the following:

  • the success (or failure) of blinding in a placebo-controlled trial can introduce a source of bias that affects the results,
  • while the effect of blinding itself may be small, these small effects could still result in changes to policy or practice, and
  • there are documented instances in which the failure to properly blind (for instance, providing participants with a sour-tasting Vitamin C condition versus a sweet lactose "placebo") led to an observed effect (for instance, on preventing or treating the common cold) whereas there was no effect in the subgroup of participants who were successfully blinded.
Reasons commonly given against the testing of successful blinding include the following:
  • At times, a break in blinding can lead to conclusions in the opposite direction. For instance, physicians who are unblinded may assume that the patients with better outcomes received a drug widely supposed to be "superior," when in fact, the opposite occurred.
  • In some cases, a treatment with dramatically superior results can result in unblinding, even when the treatment conditions were identical - but that doesn't necessarily mean the blinding was a failure or could have been prevented, given the dramatic differences in outcomes.
  • If the measurement of blinding is performed at the wrong time - such as before the completion of the trial - participants may become suspicious and this in itself could potentially confound treatment effects.
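For readers curious what a test of blinding success can look like in practice, one simple (and admittedly simplistic) approach is to compare participants' correct-guess rate against chance with an exact binomial test. This sketch uses only the standard library and hypothetical numbers; it is not a method proposed by Webster and colleagues.

```python
import math

def binom_sf(successes, n, p=0.5):
    """One-sided exact binomial p-value: P(X >= successes) under chance guessing."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(successes, n + 1))

# Hypothetical end-of-trial unblinding check: 70 of 100 participants
# guessed their assignment correctly; intact blinding predicts ~50%.
p_value = binom_sf(70, 100)
```

A very small p-value here would suggest guessing beyond chance - though, as the arguments above note, correct guessing can also reflect a genuinely large treatment effect rather than a failure of the blinding procedure itself.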


Webster RK, Bishop F, Collins GS, et al. (2021). Measuring the success of blinding in placebo-controlled trials: Should we be so quick to dismiss it? J Clin Epidemiol, pre-print.

Manuscript available from publisher's website here.




























Thursday, February 25, 2021

The Use of GRADE in Systematic Reviews of Nutrition Interventions is Still Rare, but Growing

While the GRADE framework is used by over 100 health organizations to assess the certainty of evidence and guide the formulation of clinical recommendations, its use in the field of nutrition for these purposes is still sparse. A recent examination of all systematic reviews using GRADE in the ten highest-impact nutrition journals over the past five years provides insight and suggestions for moving the field forward in the use of GRADE for evidence assessment in systematic reviews of nutritional interventions.

Werner and colleagues identified 800 eligible systematic reviews, 55 (6.9%) of which used GRADE, and 47 (5.9%) of which rated the certainty of evidence specific to different outcomes. The number of these reviews using GRADE increased year-to-year, from two in 2015 to 23 in 2019. Reviews claiming to use a modification of GRADE were excluded from analysis.

The authors identified 811 cases of downgrading the certainty of evidence and 31 cases of upgrading. Reviews of randomized controlled trials had a mean of 1.6 domains downgraded per outcome, while reviews of non-randomized studies had a mean of 2.1. In about 6.5% of upgrading cases, upgrading was done for unclear reasons not in line with GRADE guidance, such as upgrading for low risk of bias, narrow confidence intervals, or very low p-values. Reviews of non-randomized studies were more likely than those of randomized studies to have outcomes downgraded for imprecision and inconsistency, and less likely to have downgrades for publication bias.

The authors conclude that while the use of GRADE in systematic reviews of nutritional interventions has grown over recent years based on this sample, continued education and training of nutrition researchers and experts can help improve the spread and quality of the application of GRADE to assess the certainty of evidence in this discipline.

Werner SS, Binder N, Toews I, et al. (2021). The use of GRADE in evidence syntheses published in high-impact-factor nutrition journals: A methodological survey. J Clin Epidemiol, in-press.

Manuscript available here. 











Thursday, December 3, 2020

Assessing the Reliability of Recently Developed Risk of Bias Tools for Non-Randomized Studies

Risk of bias is one of the five domains to be considered when assessing the certainty of evidence across a body of studies, and is the only domain which must first be assessed at the individual study level. While several risk of bias assessment tools exist for non-randomized studies (NRS; i.e., observational studies), two of the most recently introduced are the Risk of Bias in Non-Randomized Studies of Interventions (ROBINS-I, developed in 2016) and the Risk of Bias instrument for NRS of Exposures (ROB-NRSE, developed in 2018). Assessment of the risk of bias in a systematic review on which a guideline is based should ideally be conducted independently by at least two reviewers. Given this scenario, how likely is it that the two reviewers' assessments will agree sufficiently with one another?

In a recently published paper by Jeyaraman and colleagues, a multi-center group of collaborators assessed both the inter-rater reliability (IRR) and inter-consensus reliability (ICR) of these tools based on a previously published cross-sectional study protocol. The seven reviewers had a median of 5 years of experience assessing risk of bias, and two pairs of reviewers assessed risk of bias using each tool. IRR was used to assess reliability within pairs, while ICR assessed reliability between the pairs. The time burden was also assessed by recording the amount of time required to assess each included study and to come to a consensus. For the overall assessment of bias, IRR was rated as "poor" (Gwet's agreement coefficient of 0%) for the ROBINS-I tool and "slight" (11%) for the ROB-NRSE tool, whereas the ICR was rated as "poor" for both ROBINS-I (7%) and ROB-NRSE (0%). The average evaluator time burden was over 48 minutes for the ROBINS-I tool and almost 37 minutes for the ROB-NRSE.
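The Gwet's agreement coefficient (AC1) reported in the paper can be sketched for the two-rater case as follows. The ratings below are hypothetical, and this minimal version omits the weighting schemes and variance estimation of a full implementation.

```python
def gwet_ac1(ratings_a, ratings_b, categories):
    """Gwet's first-order agreement coefficient (AC1) for two raters.
    AC1 = (observed agreement - chance agreement) / (1 - chance agreement),
    with chance agreement based on average marginal category proportions."""
    n = len(ratings_a)
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    pe = 0.0
    q = len(categories)
    for cat in categories:
        pi = (ratings_a.count(cat) + ratings_b.count(cat)) / (2 * n)
        pe += pi * (1 - pi)
    pe /= (q - 1)
    return (pa - pe) / (1 - pe)

# Hypothetical risk of bias ratings from two reviewers on six studies:
cats = ["low", "moderate", "serious", "critical"]
a = ["low", "serious", "serious", "moderate", "critical", "low"]
b = ["low", "serious", "moderate", "moderate", "serious", "low"]
ac1 = gwet_ac1(a, b, cats)
```

Unlike Cohen's kappa, AC1 is less prone to paradoxically low values when one category dominates, which is one reason it appears in reliability studies of risk of bias tools.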


The authors note that overall, ROBINS-I tended to have a better IRR as well as ICR, both of which may be due in part to poorer reporting quality in exposure studies. In addition, simplification of related guidance documents for applying the tool and increased training for reviewers looking to use the ROBINS-I and ROB-NRSE tools to assess risk of bias in non-randomized studies may improve agreement considerably while cutting down on the time required to apply the tool correctly to each individual study.

Jeyaraman MM, Rabbani R, Copstein L, Robson RC, Al-Yousif N, Pollock M, ... & Abou-Setta AM. (2020). Methodologically rigorous risk of bias tools for nonrandomized studies had low reliability and high evaluator burden. J Clin Epidemiol 128:140-147.

Manuscript available from the publisher's web site here. 
















Friday, November 20, 2020

Practical Tips for Finding and Assessing Patient Survey Data

An essential part of translating a body of evidence into a clinical recommendation within the GRADE framework is the consideration of patients' values and preferences. Not only should the likely treatment preferences and values placed on outcomes among the patient population be considered; if there is likely a great amount of variability within these, this may also influence the ultimate strength of recommendation.

Guideline panels and public health decision-makers may use self-reported patient survey data to better understand the range of patient values and preferences when formulating recommendations or policies. However, like all sources of evidence, patient surveys may be at risk for specific sources of bias which can ultimately affect the results. What should decision-makers look out for when applying patient survey data to a recommendation for care? In a recently published paper, Santesso and colleagues propose a practical guide for finding, interpreting, and applying patient data to better inform healthcare decision-making.


Because 97% of published surveys have been found to use the words "survey" or "questionnaire" in the title, the authors suggest using these terms in title, abstract, and topic fields when conducting a search for relevant data. When assessing the risk of bias of a given survey, decision-makers should ask whether the population was adequately representative of the patient population in question, taking care to consider the use of random sampling and the potential impact of nonresponse. A survey should also be assessed for whether it measures the intended constructs adequately. Survey authors should report the variability around reported measures whenever possible, and these data can be used to judge the overall variability in patient values and preferences. Finally, decision-makers should take care to discern how directly the survey data applies to the patient population in question; the table of survey respondent characteristics is a useful place from which to draw judgments of directness.

Using these helpful and practical points of guidance, guideline panel members and clinical decision-makers can better inform their retrieval, critical appraisal, and application of patient survey data to important healthcare questions, ultimately resulting in more informed guidelines and policies.

Santesso N, Akl E, Bhandari M, Busse JW, Cook DJ, Greenhalgh T, Muti P, Schünemann H, and Guyatt G. (2020). A practical guide for using a survey about attitudes and behaviors to inform health care decision making. J Clin Epidemiol 128:93-100.

Manuscript available from the publisher's website here. 

Friday, July 10, 2020

Room for Improvement: Use of Cochrane RoB tool in non-Cochrane Systematic Reviews is Largely Incomplete

The Cochrane Risk of Bias (RoB) tool for randomized controlled trials (RCTs) is commonly used in both Cochrane and non-Cochrane systematic reviews as a standardized way to assess and report the risk of bias within a study or a body of evidence. The tool comprises seven domains, each representing a potential source of bias within the design or execution of an RCT. Judgments for each domain (for instance, allocation concealment, or selective outcome reporting) are made between whether the study possessed a low, high, or unclear risk of bias from that source.

A new review of non-Cochrane systematic reviews (NCSRs) published in this month’s edition of the Journal of Clinical Epidemiology reports that the use of the Cochrane RoB tool in these reviews is incomplete or inadequate in most cases. Within 508 eligible systematic reviews that used the original (2011) Cochrane RoB tool published through 3 July 2018, the majority (85%) reported the analysis of risk of bias; within these papers, about half (53%) used the Cochrane tool specifically, leaving a total of 269 reviews for further analysis.

A non-negligible minority of studies included in the review by Puljak et al. either did not include certain domains of the Cochrane RoB tool, or did not report which domains were used. Only 40% of the reviews analyzed RoB through all seven domains.

Fewer than two-thirds (60%) of the 269 included reviews used all seven domains of the Cochrane tool, report Puljak and colleagues, and only 16 of the included reviews (5.9%) reported both a judgment and a comment explaining each judgment either within the manuscript or in a supplementary file. Within these 16 reviews, the proportion of inadequate judgments (either those in which the comment was not in line with the judgment or in which there was no supporting comment) ranged from 25% (Other Bias domain) to 65% (Selective Reporting Bias domain). The reviews “rarely” included full tables illustrating the RoB judgments for the different domains.

The authors’ findings highlight that both a judgment (low/high/unclear risk of bias) as well as a comment explaining the judgment within each domain should be included in systematic reviews that report use of the Cochrane RoB tool.
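To make that recommendation concrete, here is a hypothetical sketch (not from the Puljak paper) of how a review team might machine-check that each study's RoB entry carries both a judgment and a supporting comment for all seven domains of the 2011 Cochrane tool:

```python
# The seven domains of the 2011 Cochrane RoB tool for RCTs.
ROB_DOMAINS = [
    "random sequence generation",
    "allocation concealment",
    "blinding of participants and personnel",
    "blinding of outcome assessment",
    "incomplete outcome data",
    "selective reporting",
    "other bias",
]
VALID_JUDGMENTS = {"low", "high", "unclear"}

def check_rob_table(assessment):
    """Return a list of problems in one study's RoB entry: every domain
    needs a low/high/unclear judgment AND a supporting comment."""
    problems = []
    for domain in ROB_DOMAINS:
        entry = assessment.get(domain)
        if entry is None:
            problems.append(f"missing domain: {domain}")
            continue
        judgment, comment = entry
        if judgment not in VALID_JUDGMENTS:
            problems.append(f"invalid judgment for {domain}: {judgment}")
        if not comment.strip():
            problems.append(f"no supporting comment for {domain}")
    return problems
```

A check like this, run before submission, would catch the incomplete reporting patterns the review describes.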

Puljak, L., Ramic, I., Naharro, C.A., Brezova, J., Lin, Y.C., Surdila, A.A., ... & Salvado, M.S. (2020). Cochrane risk of bias tool was used inadequately in the majority of non-Cochrane systematic reviews. J Clin Epidemiol 123:114-119.

Manuscript available from publisher's website here. 

Tuesday, June 2, 2020

Research Shorts: Use of GRADE for the Assessment of Evidence about Prognostic Factors

In addition to questions of interventions and diagnostic tests, GRADE can also be used to assess the certainty of evidence when it comes to prognostic factors. In part 28 of the Journal of Clinical Epidemiology’s GRADE series published earlier this year, Foroutan and colleagues provide guidance for applying GRADE to a body of evidence of prognostic factors.

The Purpose of Prognostic Studies

GRADE may be applied to a body of evidence, separated by individual prognostic factors instead of outcomes, for one of two reasons. The first is a non-contextualized setting, such as when the certainty of evidence surrounding prognostic factors is being evaluated for application within research planning and analysis (e.g., determining which factors are best to use when stratifying for randomization). The second is a contextualized setting, when the certainty of evidence surrounding prognostic factors is used to help inform clinical decisions.

Establishing the Certainty of Evidence

Unlike when grading the certainty of evidence of an intervention, when assessing prognostic evidence, the overall certainty for observational studies starts out as HIGH. This is because the patient population in observational studies is likely to be more representative than that in RCTs, in which eligibility criteria may place artificial restrictions on the characteristics of patients. Certainty may then be rated down based on the five traditional domains:
  • Risk of bias tools and instruments such as QUality In Prognosis Studies (QUIPS) and Prediction model Risk Of Bias ASsessment Tool (PROBAST) may be helpful here. When teasing out the effect of each potential factor, consider utilizing some form of multivariate analysis that accounts for dependence between several different prognostic factors.
  • Inconsistency can be examined via visual tests of the variability between individual point estimates and the overlap of confidence intervals; statistical tests such as I² are likely to be less helpful, as they can often be inflated when large studies lead to particularly narrow CIs. As always, potential explanations for any observed heterogeneity should be considered a priori.
  • Imprecision will depend on whether the setting is contextualized, in which case it will depend on the relationship between the confidence interval and the previously set clinical decision threshold, or non-contextualized, in which case the threshold will most likely represent the line of no effect.
  • Indirectness should be based on a comparison of the PICOs for the clinical question at hand, and those addressed in the meta-analyzed studies.
  • Publication bias can be assessed via visually exploring a funnel plot or the use of appropriately applied statistical tests.
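The caution about I² in the inconsistency bullet above can be demonstrated with a small sketch (hypothetical numbers): the same spread of point estimates yields I² = 0 when the studies are small, but a high I² when the studies are large and their confidence intervals narrow.

```python
def i_squared(estimates, standard_errors):
    """Higgins' I-squared computed from Cochran's Q for a set of study estimates."""
    w = [1 / se**2 for se in standard_errors]
    mean = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)
    q = sum(wi * (y - mean) ** 2 for wi, y in zip(w, estimates))
    df = len(estimates) - 1
    return max(0.0, (q - df) / q) if q > 0 else 0.0

# Identical spread of point estimates, different precision:
small_studies = i_squared([0.1, 0.2, 0.3], [0.5, 0.5, 0.5])     # wide CIs -> I² = 0
large_studies = i_squared([0.1, 0.2, 0.3], [0.05, 0.05, 0.05])  # narrow CIs -> I² = 75%
```

This is why the guidance favors visual inspection of estimate overlap over mechanical reliance on I² in prognostic bodies of evidence.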
Foroutan F, Guyatt G, Zuk V, Vandvik PO, Alba AC, Mustafa R, Vernooij R, et al. (2020). GRADE guidelines 28: Use of GRADE for the assessment of evidence about prognostic factors: Rating certainty in identification of groups of patients with different absolute risks. J Clin Epidemiol 121:62-70.

Manuscript available from the publisher's website here. 




Tuesday, May 19, 2020

Research Shorts: Assessing the Certainty of Diagnostic Evidence, Pt. I: Risk of Bias and Indirectness

Systematic reviews or health technology assessments (HTAs) that examine the body of evidence on diagnostic procedures can - and should - transparently assess and report the overall certainty of evidence as part of their findings. In the two-part, 21st installment of the GRADE guidance series published in the Journal of Clinical Epidemiology, Schünemann and colleagues provide methods for approaching the first two major domains of the GRADE approach: risk of bias and indirectness.

While there are certainly differences between methods for assessing the certainty of evidence of diagnostic tests as opposed to interventions, the fundamental parts of GRADE remain unchanged:

Make Clinical Questions Clear via PICOs

It is paramount to clearly define the purpose or role of a diagnostic test and to see the test in light of its potential downstream consequences for making subsequent treatment decisions. As with a review of an intervention, a review of a diagnostic test should be built upon questions that define the Population, Intervention (the “index test” being assessed), Comparator (the “reference” test representing the current standard of care), and Outcomes (PICOs).

Prioritize Patient-Important Outcomes

Outcomes should be relevant to the population at hand. As such, the ideal study design to generate this evidence for outcomes related to test accuracy is a randomized controlled trial with a test-retest format that directly investigates the downstream effects of a testing strategy on outcomes in the population at hand, seen in Figure 1A below.

However, this is often not available. In this case, test accuracy would be used as a surrogate outcome, and test accuracy studies such as those in Figure 1B can be linked to additional evidence that examines the effect of downstream consequences of test results on patient-important outcomes. (More on that in a March 2020 blog post, here.)

Assessing Risk of Bias in Test Accuracy Studies

There are several important factors to consider when assessing a body of test accuracy studies for risk of bias. Potential issues with regard to risk of bias include:
  • Populations that differ from those intended to receive the test (e.g., in terms of disease risk)
  • Failure to compare the test in question to an independent reference/standard test in all enrolled patients (e.g., by using only a composite test)
  • Lack of blinding when ascertaining test results

The QUADAS-2 tool can be used to guide assessment of bias in these studies.

Use PICO to Guide Assessment of Indirectness

Lastly, as when evaluating intervention studies, indirectness can be assessed by determining whether the Population, Index test, Comparator/reference test, and Outcomes match those in the clinical question.

Schünemann H, Mustafa RA, Brozek J, Steingart KR, Leeflang M, Murad MH, Bossuyt P, et al. GRADE guidelines 21 pt. 1: Study design, risk of bias, and indirectness in rating the certainty across a body of evidence for test accuracy.  J Clin Epidemiol Feb 12. pii: S0895-4356(19)30673-0. doi: 10.1016/j.jclinepi.2019.12.020. [Epub ahead of print]

Manuscript available here on publisher’s site.

Monday, April 6, 2020

Rapid Guidelines in GRADE Pt. I: Needed Advice when Time is of the Essence

While most clinical practice guidelines take 2-3 years to develop and publish, the emergence of a public health crisis or urgent humanitarian need requires the dissemination of evidence-based guidance in a more rapid manner. To this effect, several national- and international-level guideline-producing organizations, such as the Centers for Disease Control and Prevention (CDC) and the World Health Organization (WHO), have developed processes for the development of evidence-based guidance for these more urgent situations.

WHO’s 2006 recommendations for the pharmacological management of avian influenza in humans is one example of a rapidly developed guideline. Of current relevance, WHO has recently published interim guidance on the management of severe acute respiratory infection when novel coronavirus is suspected, and the UK’s National Institute for Health and Care Excellence (NICE) has also developed interim guidelines for the treatment of COVID-19 in patients receiving critical care, kidney dialysis, and systemic anticancer therapy. Because this matter is rapidly evolving and advice is needed immediately, the protocols used by NICE, WHO, and other organizations differ from those used for less urgent topics.

Can rapid guidelines use GRADE?

In short, yes. Recommendations can be made based on the transparent grading and reporting of the certainty of evidence that lie at the heart of GRADE, whether this is over a timeframe of hours, days, weeks, or months. The key word here is transparent: no matter the speed of development, recommendations should always be couched in terms of the certainty of the evidence behind them, and judgments of the evidence should be clearly presented. In a 2016 paper on the use of GRADE to respond to health questions with different levels of urgency, Thayer and Schünemann provide terms for the various speeds of response, and considerations for recommendations therein:
  • Ultra-short emergency response: 1 or more hours
  • Urgent response: 1-3 weeks
  • Rapid response: 1-3 months
  • Routine response: More than three months

Recommendations can still be formed based on the certainty of the evidence that's available, whatever that evidence may be. While systematic reviews of all available evidence are a foundational aspect of non-urgent guidelines, evidence in the form of narrative syntheses, modeling, or late-breaking data from the field can be used when time is short and systematically compiled data are sparse. Regardless of the source, the domains of GRADE still allow for evidence to be appraised and to guide the resulting direction and strength of recommendations.

Stay tuned for Pt. II coming soon, where we'll take a closer look at organizations that have developed rapid recommendations in response to time-sensitive public health issues.

For a checklist to guide the development of rapid recommendations, see the G-I-N/McMaster checklist.

For more information about appraising the certainty of evidence in the lack of meta-analyzed data, see this paper.

Thayer KA & Schünemann H. (2016). Using GRADE to respond to health questions with different levels of urgency. Environment International, 2016 July-August:585-589.

Manuscript available at the publisher's website here.



Thursday, March 26, 2020

Extremely Serious Research Short: GRADE’s terminology for rating down by three levels

Contributed by Madelin Siedler, 2019/2020 U.S. GRADE Network Research Fellow

Since its inception two decades ago, GRADE methodology has had to evolve alongside new ways of assessing the evidence. One such evolution has come with the introduction of tools for assessing risk of bias in non-randomized studies, such as the Risk Of Bias In Non-randomized Studies of Interventions (ROBINS-I) tool and the Risk Of Bias In Non-randomized Studies of Exposures (ROBINS-E) tool.

Because these tools assess the risk of bias in a non-randomized study against the benchmark of a hypothetical pragmatic trial, a body of evidence assessed with them enters GRADE at high certainty, unlike bodies assessed with alternative tools such as the Newcastle-Ottawa Scale. Under traditional GRADE, by contrast, non-randomized studies start at low certainty before any rating up or down occurs. This means that a study assessed with ROBINS-I or ROBINS-E, starting from high certainty, may require a reduction of three levels if very serious risk of bias is present. In other words, a three-level reduction for a study assessed with ROBINS-I or ROBINS-E is analogous to a two-level reduction for a non-randomized study assessed with another method.
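The equivalence between the two starting points can be made concrete with a small sketch (this is my own illustration, not an official GRADE tool; the level names follow GRADE's four certainty categories):

```python
# Illustrative sketch: final certainty after rating down, comparing the
# two possible starting points for non-randomized studies.
LEVELS = ["very low", "low", "moderate", "high"]

def rate_down(start: str, levels_down: int) -> str:
    """Drop `levels_down` certainty levels, bottoming out at 'very low'."""
    idx = LEVELS.index(start)
    return LEVELS[max(0, idx - levels_down)]

# A ROBINS-I/E-assessed body starts at "high"; an "extremely serious"
# risk-of-bias concern rates down three levels...
print(rate_down("high", 3))  # -> very low
# ...landing at the same certainty as a "very serious" (two-level)
# downgrade from the traditional "low" starting point.
print(rate_down("low", 2))   # -> very low
```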

A rating by any other name…

To determine what exactly this new three-level reduction should be called, members of the GRADE Working Group surveyed 225 participants recruited via social media, the Guidelines International Network (G-I-N), and other sources. Just over one-third (34.2%) were members of the GRADE Working Group, and all respondents had participated in guideline development in some capacity. The results are presented in a newly published article in the “GRADE Notes” series in the Journal of Clinical Epidemiology.

Within the survey, participants were asked to rank the following candidate terms for this novel three-level reduction, from least favored (1) to most favored (4):

  • Critically serious
  • Extremely serious
  • Most serious
  • Very, very serious


[Figure: Respondents' average ranking of terms. From T. Piggott et al., Journal of Clinical Epidemiology (2020).]

“Extremely serious” took the lead as the most favorably ranked term with an average score of 3.19, with “critically serious” a close second at 3.12. Respondents found “extremely serious” the most agreeable due to its clarity and the fact that it seemed to “naturally” follow the existing two-level term, “very serious.”

The term “extremely serious” can now be found within the GRADEpro application when rating the certainty of evidence within non-randomized studies while utilizing the ROBINS-I or ROBINS-E instruments.



Piggott T, Morgan RL, Cuello-Garcia CA, Santesso N, Mustafa RA, Meerpohl JJ, Schünemann HJ, GRADE Working Group. GRADE notes: Extremely Serious, GRADE’s Terminology for Rating Down by 3-Levels. Journal of Clinical Epidemiology. 2019 Dec 19.

Manuscript available here on publisher's site.

Wednesday, January 22, 2020

Research Shorts: Rating the certainty in evidence in the absence of a single estimate of effect

Contributed by Madelin Siedler, 2019/2020 U.S. GRADE Network Research Fellow

When a pooled estimate from a meta-analysis of several studies is not available to guide the rating of evidence across the GRADE domains, how should one make a final determination of the certainty of evidence?


Evidence from a 30,000-foot view

In their 2017 paper published in Evidence-Based Medicine, Murad and colleagues describe methods for applying GRADE when bodies of evidence are either sparse or too disparate to pool. A systematic review, for instance, may only provide a narrative synthesis of the current evidence given these limitations. When a neat estimate of effect presented as part of a tidy forest plot is not available, it is necessary to use one’s best judgment to rate the domains by taking a broader view. In these cases, Murad et al. recommend the following approach:
  • Risk of Bias: Judge the risk of bias across all studies that include the outcome of interest.
  • Inconsistency: Consider the direction and size of the estimates of effect from each study. Generally, do they all tell the same story, or do they vary considerably?
  • Indirectness: Make an overall judgment about the degree of directness or indirectness of the body of evidence, given your specific question (always consider your population, intervention, comparator[s], and outcome of interest). Generally, are the studies synthesized answering questions similar to yours? Or might the dissimilarities be enough to lower your trust in the estimate of effect as it pertains to your question?
  • Imprecision: Examine the total information size of all studies (number of events for binary outcomes, or number of participants for continuous outcomes) as well as each study’s reported confidence interval for this outcome. If there are fewer than 400 total events or participants, or if the confidence intervals from most studies - or the largest - include no effect, imprecision is likely present.
  • Publication bias: Suspect publication bias if there is a small number of only positive studies, or if data were reported in trial registries but never published.
As always, one may consider rating up the quality of evidence from an observational study if a large magnitude of effect, a dose-response gradient, or plausible residual confounding that would increase the certainty of effect are present in the majority of studies examined.
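The imprecision rule of thumb described above lends itself to a quick check. The sketch below is my own simplification of Murad et al.'s guidance (the function name, signature, and boolean framing are assumptions, not part of any GRADE tool):

```python
# Hypothetical sketch of the imprecision rule of thumb for bodies of
# evidence without a pooled estimate (per the guidance described above).
def imprecision_concern(total_info_size: int,
                        cis_include_no_effect: bool,
                        threshold: int = 400) -> bool:
    """Flag likely imprecision when the total information size (events for
    binary outcomes, participants for continuous outcomes) is small, or when
    the confidence intervals from most studies (or the largest) cross the
    line of no effect."""
    return total_info_size < threshold or cis_include_no_effect

print(imprecision_concern(350, False))   # small body of evidence -> True
print(imprecision_concern(1200, False))  # adequate size, CIs exclude no effect -> False
```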


Murad MH, Mustafa RA, Schünemann HJ, Sultan S, Santesso N. Rating the certainty in evidence in the absence of a single estimate of effect. BMJ Evidence-Based Medicine. 2017 Jun 1;22(3):85-7.

Manuscript available here on publisher's site.