Thursday, May 28, 2020

Sink or SWiM? When and How to Use Narrative Synthesis in Lieu of Meta-Analysis

The terms “systematic review” and “meta-analysis” often go hand-in-hand. However, there are other ways to synthesize and present the findings of a systematic review that do not entail statistical pooling of the data. One such approach is referred to as narrative synthesis, or Synthesis Without Meta-Analysis (SWiM), and a recent webinar presented by Cochrane (viewable for free here) provided the definition, potential uses, and pitfalls to watch for when considering the use of narrative synthesis within a systematic review.

What is narrative synthesis/SWiM?

Narrative synthesis or Synthesis Without Meta-analysis (SWiM) is an approach used to describe quantitatively reported data from studies identified within a systematic review in a way that does not quantitatively pool or meta-analyze the data. Narrative synthesis is not the same as a narrative review, which is an unsystematic approach to gathering studies.

A narrative synthesis adds value to the literature by providing information about what the studies on a certain topic say as a whole, as opposed to simply summarizing the findings from individual studies one-by-one. Whereas a meta-analysis is useful in that it provides an overall estimate of the size of an intervention’s effect, a narrative synthesis allows the reviewer to organize, explore, and consider the ways that the findings from several studies are connected to, and differ from, one another – and the potential moderators that define these relationships. Thus, its focus is on the existence, nature, and direction of an effect, rather than its size.

When is it appropriate to perform a narrative synthesis/SWiM?

There are several reasons why narrative synthesis/SWiM may be used when reporting the findings of a systematic review.
·      There are not enough data to calculate standardized effect sizes. Meta-analyzing outcomes that are reported using different scales requires the standardization of these data. However, in certain fields, authors of studies may be less likely to report all of the elements required to calculate a standardized effect size, such as the measures of variance; contacting the authors to obtain this information may not yield the needed data. To exclude these studies outright, however, and meta-analyze only studies in which all the needed data are reported, may under- or misrepresent the entire body of evidence.
·      There is substantial heterogeneity among included studies. Notable inconsistency between studies with regard to their effect sizes and direction (statistical heterogeneity), their study designs (methodological heterogeneity), or clinical differences surrounding the PICO may render a quantitative meta-analysis of little utility, especially if there is a small number of studies to be analyzed together. However, it’s important to ask yourself whether the heterogeneity is truly of enough concern to preclude meta-analysis. Carefully consider a priori which PICO elements are similar enough to be pooled, and which require their own analysis.
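When weighing whether statistical heterogeneity is severe enough to preclude pooling, reviewers typically look at Cochran’s Q and the I² statistic. As a rough illustration (not part of the webinar), here is a minimal Python sketch that computes both from hypothetical study effect estimates and standard errors using inverse-variance weights:

```python
def i_squared(effects, std_errs):
    """Compute Cochran's Q and the I^2 statistic for a set of study
    effect estimates (e.g., log risk ratios) and their standard errors,
    using inverse-variance weights."""
    weights = [1 / se ** 2 for se in std_errs]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2

# Hypothetical log risk ratios and standard errors from four studies
q, i2 = i_squared([0.10, 0.35, -0.05, 0.60], [0.15, 0.20, 0.25, 0.18])
```

With these made-up inputs, I² lands in the "moderate" range (roughly 50%), the sort of result that should prompt the a priori PICO reflection described above rather than an automatic decision either way.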

What are some common errors made in narrative syntheses/SWiMs?
There are a few common pitfalls to watch out for when deciding to report your synthesis without quantitative meta-analysis.
·      Not transparently reporting that a narrative synthesis was used when data could not be/were not meta-analyzed
·      Not reporting the methods used for narrative synthesis in detail
·      Not referring to methodological guidance when describing the decision to perform a narrative synthesis
·      Not providing clear links between the data and the synthesis, such as tables or charts that present the same data described in the text.

By improving the reporting and presentation of these items within a systematic review, end-users will be better able to understand the reasons why a narrative synthesis was conducted, and ultimately utilize the findings.

Guidance for the reporting of a narrative synthesis, or SWiM, can be found in the new SWiM reporting guideline checklist here.

We recently reported on GRADE guidance for assessing the certainty of evidence in such circumstances as when a narrative synthesis is presented. More here.

Friday, May 22, 2020

Research Shorts: Assessing the Certainty of Diagnostic Evidence, Pt. II: Inconsistency, Imprecision, and Publication Bias

Earlier this week, we discussed a recent publication in the GRADE series in the Journal of Clinical Epidemiology that provides guidance for assessing risk of bias and indirectness across a body of evidence of diagnostic test accuracy. In this post, we’ll follow up with continued guidance (published in Part II) for the rest of the GRADE domains.


Inconsistency

Unexplained inconsistency should be evaluated separately for the findings on test specificity and test sensitivity. When a meta-analysis is available, both visual and quantitative markers of inconsistency can be used much as they are in a meta-analysis of intervention studies. If differences between studies related to any of the PICO elements are suspected as an explanation for observed heterogeneity, exploration via subgroup analyses may be appropriate.


Imprecision

Again, the imprecision of a test’s sensitivity and specificity should be evaluated separately. As with assessments of interventional evidence, evaluating imprecision across a body of test accuracy studies entails considering the width of the confidence interval as well as the number of events (specifically, the number of patients with the disease and the number of positive tests for sensitivity, and the number of patients without the disease and the number of negative tests for specificity).
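For illustration, those event counts come from a standard 2x2 table. A minimal Python sketch (with hypothetical counts) computing sensitivity, specificity, and Wilson score 95% confidence intervals, whose widths feed the imprecision judgment:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% confidence interval for a proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# Hypothetical 2x2 table: TP and FN among diseased patients,
# TN and FP among non-diseased patients
tp, fn, tn, fp = 90, 10, 160, 40

sensitivity = tp / (tp + fn)   # proportion of diseased patients testing positive
specificity = tn / (tn + fp)   # proportion of non-diseased patients testing negative
sens_ci = wilson_ci(tp, tp + fn)
spec_ci = wilson_ci(tn, tn + fp)
```

The number of diseased (tp + fn) and non-diseased (tn + fp) patients drives the interval widths separately, which is why the two proportions are rated for imprecision one at a time.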

In contextualized settings, when one end of the confidence interval may lead to the use of the testing strategy while the other end would not, imprecision is likely present. It may be helpful to set an a priori threshold that a confidence interval should not cross in order for the test to be considered of sufficient value.

Publication Bias

The use of traditional funnel plot assessments (e.g., Egger’s or Begg’s test) on a body of test accuracy studies is more likely to result in undue suspicion of publication bias than when applied to a body of therapeutic studies. While more sophisticated statistical assessments are available (e.g., Deeks’ test, trim and fill), systematic review and health technology assessment (HTA) authors may instead choose to base a judgment of publication bias on knowledge of the existence of unpublished studies. Publication bias may also be suspected if the body of evidence includes studies published by for-profit entities, or studies with precise estimates claiming high test accuracy despite small sample sizes.
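For context on what those traditional assessments do: Egger’s test regresses each study’s standardized effect (effect divided by its standard error) on its precision (one over the standard error), and an intercept far from zero suggests funnel-plot asymmetry. A simplified Python sketch with hypothetical data (the published test also reports a significance level, omitted here):

```python
def eggers_intercept(effects, std_errs):
    """Simplified Egger's regression: regress the standardized effect
    (effect / SE) on precision (1 / SE) by ordinary least squares and
    return the intercept, which indexes funnel-plot asymmetry."""
    y = [e / se for e, se in zip(effects, std_errs)]
    x = [1 / se for se in std_errs]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x

# Hypothetical log odds ratios and standard errors: the smaller studies
# (larger SEs) report larger effects, a classic asymmetry pattern
intercept = eggers_intercept([0.9, 0.5, 0.25, 0.15], [0.40, 0.30, 0.15, 0.10])
```

In test accuracy meta-analysis this pattern can arise for benign reasons (e.g., differing thresholds or spectrum effects), which is exactly why the paper warns that such tests tend to over-signal publication bias in this setting.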

Upgrading the Certainty of Evidence ("Rating Up")

As with an assessment of interventional evidence, there may be reasons to upgrade the certainty of evidence in the face of highly convincing links between the use of a test and the likelihood and/or magnitude of an observed outcome. The diagnostic test accuracy equivalent of a dose-response gradient – the Receiver Operating Characteristic, or ROC, curve – may be used to assess this potential upgrader.

Schünemann H, Mustafa RA, Brozek J, Steingart KR, Leeflang M, Murad MH, Bossuyt P, et al. GRADE guidelines 21 pt. 2. Test accuracy: inconsistency, imprecision, publication bias, and other domains for rating the certainty of evidence and presenting it in evidence profiles and summary of findings tables. J Clin Epidemiol 2020 Feb 10. pii: S0895-4356(19)30674-2. doi: 10.1016/j.jclinepi.2019.12.021. [Epub ahead of print].

Manuscript available here on the publisher's site.

Tuesday, May 19, 2020

Research Shorts: Assessing the Certainty of Diagnostic Evidence, Pt. I: Risk of Bias and Indirectness

Systematic reviews or health technology assessments (HTAs) that examine the body of evidence on diagnostic procedures can - and should - transparently assess and report the overall certainty of evidence as part of their findings. In the two-part, 21st installment of the GRADE guidance series published in the Journal of Clinical Epidemiology, Schünemann and colleagues provide methods for approaching the first two major domains of the GRADE approach: risk of bias and indirectness.

While there are certainly differences between methods for assessing the certainty of evidence of diagnostic tests as opposed to interventions, the fundamental parts of GRADE remain unchanged:

Make Clinical Questions Clear via PICOs

It is paramount to clearly define the purpose or role of a diagnostic test and to see the test in light of its potential downstream consequences for making subsequent treatment decisions. As with a review of an intervention, a review of a diagnostic test should be built upon questions that define the Population, Intervention (the “index test” being assessed), Comparator (the “reference” test representing the current standard of care), and Outcomes (PICOs).

Prioritize Patient-Important Outcomes

Outcomes should be relevant to the population at hand. As such, the ideal study design to generate this evidence for outcomes related to test accuracy is a randomized controlled trial with a test-retest format that directly investigates the downstream effects of a testing strategy on outcomes in the population at hand, seen in Figure 1A below.

However, this is often not available. In this case, test accuracy would be used as a surrogate outcome, and test accuracy studies such as those in Figure 1B can be linked to additional evidence that examines the effect of downstream consequences of test results on patient-important outcomes. (More on that in a March 2020 blog post, here.)

Assessing Risk of Bias in Test Accuracy Studies

There are several important factors to consider when assessing a body of test accuracy studies for risk of bias. Potential issues with regard to risk of bias include:
·      Populations that differ from those intended to receive the test (e.g., in terms of disease risk)
·      Failure to compare the test in question to an independent reference/standard test in all enrolled patients (e.g., by using only a composite test)
·      Lack of blinding when ascertaining test results

The QUADAS-2 tool can be used to guide assessment of bias in these studies.

Use PICO to Guide Assessment of Indirectness

Lastly, as when evaluating intervention studies, indirectness can be assessed by determining whether the Population, Index test, Comparator/reference test, and Outcomes match those in the clinical question.

Schünemann H, Mustafa RA, Brozek J, Steingart KR, Leeflang M, Murad MH, Bossuyt P, et al. GRADE guidelines 21 pt. 1: Study design, risk of bias, and indirectness in rating the certainty across a body of evidence for test accuracy. J Clin Epidemiol 2020 Feb 12. pii: S0895-4356(19)30673-0. doi: 10.1016/j.jclinepi.2019.12.020. [Epub ahead of print].

Manuscript available here on publisher’s site.

Thursday, May 14, 2020

Research Shorts: Calculating Absolute Effects for Time-to-Event Outcomes

Time-to-event (TTE) data provide information about whether a specific event occurs as well as the amount of time that passes before its occurrence. As such, TTE analyses can be particularly useful in the development of guidelines in fields such as oncology, where various diagnosis and treatment options can change the time-course of a disease and its consequences. A methodological systematic review of cancer-related systematic reviews, however, found that review authors often struggled to appropriately apply TTE data in terms of their absolute effect. A 2019 paper by Skoetz and colleagues provides guidance for applying these types of data to calculate absolute effects in the development of systematic reviews and guidelines.

Direct calculation of absolute effect
If the TTE data come from studies with a fixed length of follow-up and individual participant data, a timepoint at which all participant data are available should be used to create a 2x2 table, and the absolute effect calculated accordingly. Most of the time, however, the absolute effect will not be directly calculable. This is the case with studies that have staggered participant entry, variable lengths of follow-up, and no time point at which all individual participant data are captured. In this scenario, an absolute effect can be estimated from the pooled hazard ratio and an assumed baseline risk, or a simple risk difference can be calculated if events are rare.

Indirect calculation of absolute effect
To estimate baseline risk in the calculation of an absolute effect using a hazard ratio, it is important to use the best estimate of the baseline risk of the population at hand. While data reported in individual clinical trials may be used, consider that they may be either artificially inflated (by enrolling patients at higher-than-average risk) or reduced from the true population risk (by excluding patients with comorbidities). Thus, it is preferable to obtain a baseline risk estimate from large-scale observational studies conducted in the population of interest with a low risk of bias. Using this type of data to estimate baseline risk is also more likely to result in higher certainty of the effect, depending on the size of the study.
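The commonly used conversion from a pooled hazard ratio and an assumed baseline risk to an absolute effect assumes proportional hazards: risk with the intervention = 1 − (1 − baseline risk)^HR. A minimal Python sketch with hypothetical numbers:

```python
def absolute_effect_from_hr(hazard_ratio, baseline_risk):
    """Estimate the intervention-group risk and absolute risk difference
    at a given time point from a pooled hazard ratio and an assumed
    baseline (control-group) risk, assuming proportional hazards."""
    intervention_risk = 1 - (1 - baseline_risk) ** hazard_ratio
    return intervention_risk, intervention_risk - baseline_risk

# Hypothetical example: pooled HR of 0.70 and an assumed 5-year baseline
# mortality risk of 40% in the population of interest
risk, diff = absolute_effect_from_hr(0.70, 0.40)
# diff is negative: roughly 10 percentage points fewer events
# (about 99 fewer per 1,000) with the intervention
```

Note that the resulting absolute effect is tied to the time point at which the baseline risk was estimated, which is one reason the reporting items below stress choosing a single time point consistently.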

If these options are not suitable, data from the survival curves of control groups within studies at low risk of bias may be used. If possible, utilize data from a middle time-point.

No matter how absolute effects are calculated, it is important to clearly and transparently report this information, including:
  • reporting how the baseline risks were estimated
  • using the same numbers consistently – e.g., whether reporting the number of patients with events or those who remain event-free
  • uniformly choosing one specific time point based on the studies used.
Because absolute effects are more easily understood and used within shared decision-making, these estimates should be provided within the abstract as well as the Summary of Findings table or Evidence Profile.

This figure provides an example of how absolute risk based on time-to-event data can be meaningfully communicated in a patient-facing graphic.

The paper provides further guidance on determining the certainty of evidence using TTE data, calculating absolute effects for events such as mortality, providing graphical representations of the absolute effect, and calculating corresponding numbers needed to treat and median survival times to further aid decision-making.

Skoetz, N., Goldkuhle, M., van Dalen, E.C. et al. GRADE guidelines 27: How to calculate absolute effects for time-to-event outcomes in summary of findings tables and Evidence Profiles. J Clin Epidemiol 2020;118:124-131.

Manuscript available from publisher’s website here.  

Monday, May 11, 2020

Adventures in Protocol Publication

As most reading this will know, a systematic review is no small feat. But while the complete project itself can feel intimidating at times, a well-planned systematic review is broken up into enough small parts to make each one feel manageable – and a lot like an accomplishment in itself. One such step that more and more authors are choosing to take is the publication of their protocol in a peer-reviewed journal.

As an Evidence Foundation fellow, I have had the unique opportunity to lead the development of a systematic review and critical appraisal of physical activity guidelines in collaboration with members of the U.S. GRADE Network. After nearly 18 months of work, I’m happy to report that the first draft of the manuscript has been written – but I was given a sweet taste of this accomplishment earlier on when my protocol was published in January. Here’s what I learned through the process.

Reasons to Publish a Systematic Review Protocol (For the Good of Science)
·      Just as with clinical trials, the publication of a protocol for a systematic review alerts other researchers in the field to the work being conducted, thus reducing duplication of efforts.
·      Defining the goals and processes to be used in the systematic review before it’s conducted (a priori) likely reduces bias.
·      According to a 2017 study comparing reviews with and without a published protocol, reviews with published protocols were more likely to be thorough and transparent in their reporting of methods in the resulting review. (However, this may just be because those who are likely to publish a protocol are also more likely to be generally thorough and transparent… but if that’s the case, which side would you like to be on?)

Reasons to Publish a Systematic Review Protocol (For Your Own Good)
·      Set yourself up for success. Submitting a protocol to a peer-reviewed journal gives you an opportunity to resolve any issues and automatically improve the quality of your final review manuscript before you even press “submit.” That means less work at the end of the day, and likely a shorter time window from submission of your final review to its publication. For instance, my reviewers asked that I further elaborate and clarify the history and importance of physical activity guidelines, which ultimately strengthened the introduction to my SR.
·      Save yourself room. Going in-depth in your published protocol means you can spend less space on the methods section of your final review, leaving you with more room for the meat of the paper: the results and discussion sections. Simply discuss your methods more briefly and cite your published protocol for further reading (and, lest I forget to mention, citing yourself is the ultimate power move).
·      Grow your CV. By getting their protocol published, a young researcher can add a precious first-author citation to their vitae. These don’t grow on trees, and publishing a protocol is like a two-for-one deal.
·      Stay accountable. Publishing your protocol for the world to see may be just the motivation you need to finish the task – and quickly, now that everyone’s waiting to see the results!

Reasons Not to Publish a Protocol (and Just Stick to PROSPERO Instead)
·      Financial burden. Publishing is not usually a cheap endeavor, and unless you have additional support, charges and fees may be better spent on the final review.
·      Opportunity cost. Honestly consider how much additional time and psychic bandwidth it may take you to get a protocol published, from the drafting to the revisions and everything in between (like editing every reference with a fine-toothed comb). Is it time that you’d rather spend on working on the review?
·      Longer time to publish. As per the above, it’s possible that the work of publishing a protocol may protract the entire process. That same 2017 study found that the median time from the search to submission of a review for which a protocol had been published was 325 days, and 578 days to publication of the final document. This stands in contrast to the matched reviews for which a protocol was not published, which only took a median of 122 days to submission and 358 days to publication.

A (By All Means Non-Exhaustive) List of Places to Publish a Systematic Review Protocol
·      BMJ Open
·      Cochrane Database of Systematic Reviews
·      Environment International
·      JBI Database of Systematic Reviews and Implementation Reports
·      Medicine
·      Systematic Reviews

If you’re adequately convinced after weighing the costs and benefits, dust off your PRISMA-P checklist (heads up: the journals above will need you to show how you’ve fulfilled each criterion) and get writing.

Wednesday, May 6, 2020

Research Short: Defining Ranges for Certainty of Evidence Ratings of Diagnostic Accuracy

Recently, we reviewed a paper describing the methods by which the evidence of downstream consequences of screening can be linked to evidence of test accuracy via formal and informal modeling. The resulting judgment of the certainty of this evidence will communicate our certainty that a test’s true accuracy lies within a given range. A new paper published earlier this year provides guidance on evaluating the certainty of evidence for diagnostic accuracy.

Ranges for determining the certainty of evidence of test accuracy may be either fully or partially contextualized (meaning the range takes into account some or all of the possible effects of a test strategy and is based on a value judgment of the relative importance of outcomes) or non-contextualized (meaning the range takes into account only the accuracy of the test, without consideration of the relative implications of false positives or negatives).

Non-contextualized judgments assume that outside of differences in accuracy, everything else about two test strategies will have the same impact on outcomes; thus, certainty of evidence is judged based solely on the accuracy data. Contextualized judgments, on the other hand, also take into account the downstream consequences of a test’s accuracy – particularly the potential effects of false positives or negatives. Typically, non-contextualized or partially contextualized ratings are used in systematic reviews or health technology assessments (HTAs), whereas fully contextualized ratings should be used in the formation of guideline recommendations.
Sources of ranges for test accuracy with varying levels of contextualization include:
·      Non-contextualized (systematic review or HTA)
o   Confidence interval: certainty that the true sensitivity or specificity lies within the confidence interval(s) of the tests
-  Does not take precision into account
o   Direction of effect: certainty that there is a true difference between the sensitivity and specificity of two test strategies
-  Requires a determination of what would make a meaningful difference in accuracy
·      Partially contextualized (systematic review or HTA)
o   Specified magnitude: determines whether a difference in accuracy between tests is trivial, small, moderate, or large.
-  The acceptable magnitude of difference will be based at least partially on the importance of the downstream consequences of false positives and negatives
Example of a partly contextualized diagram of downstream consequences of screening of cervical dysplasia using a screen-treat strategy.

·      Fully contextualized (guideline recommendations)
o   Rates the certainty of a test’s sensitivity and specificity based on whether the overall balance between benefits and harms would differ from one end of the range to the other.
-  Ranges are determined by first considering all important and critical downstream consequences of testing.

Hultcrantz M, Mustafa RA, Leeflang MMG, Lavergne V, Estrada-Orozco K, Ansari MT, Izcovich A, et al. Defining ranges for certainty ratings of diagnostic accuracy: A GRADE concept paper. J Clin Epidemiol 2020;117:138-148.

Manuscript available here on publisher's site.

Friday, May 1, 2020

Grey Matters: An Introduction to the Grey Literature and Where to Find It

Within the methods section of many a systematic review, it is common to come across the term "grey literature." Put plainly, grey literature comprises pieces of evidence that are not formally published in a book or peer-reviewed journal article. 

Examples of grey literature that can be valuable to a systematic review include:
  • conference abstracts and proceedings
  • clinical study reports
  • dissertations and theses
  • journal preprints

Searching the "grey lit" has several important benefits:
  • It expands the reach of a systematic review beyond the scope of the databases mined by a search, increasing the chance of finding pieces of evidence that may be helpful to the final synthesis of data.
  • It helps reduce the impact of potential publication bias on the findings of a review.
  • It keeps the review current by including upcoming data from recent conferences, doctoral work, and other yet-to-be-published sources.

Ideally, a search of the grey literature should be used in tandem with other forms of hand-searching, including searching the citations of included articles and of well-known reviews on similar topics.

Where to Find Grey Literature

Below are some resources that list helpful links for exploring the grey literature: