Synthesizing study results in a systematic review

A single study rarely suffices to underpin treatment or policy decisions. This creates a strong imperative for systematic reviews. Authors of reviews need a method to synthesize the results of several studies, regardless of whether or which statistical method is used. In this article, we provide arguments for combining studies in a review. To combine studies, authors should judge the similarity of studies. This judgement should be based on the working mechanism of the intervention or exposure. It should also be assessed if this mechanism is similar for various populations and follow-up times. The same judgement applies to the control interventions. Similar studies can be combined in either a meta-analysis or narrative synthesis. Other methods such as vote counting, levels of evidence synthesis, or best evidence synthesis are better avoided because they may produce biased results. We support our arguments by re-analysing a systematic review. In its original form, the review showed strong evidence of no effect, but our re-analysis concluded there was evidence of an effect. We provide a flowchart to guide authors through the synthesis and assessment process.

The basic idea underlying evidence-based medicine is that better use of evidence from scientific research will increase the quality of healthcare, including prevention (1). Evidence is, however, seldom unequivocal and many topics of interest to practitioners have been evaluated in more than one study with varying results. This creates a clear need for synthesizing the results of multiple studies, such as in systematic reviews. The systematic review has been defined as a review in which bias has been reduced by the systematic identification, appraisal, synthesis, and, if relevant, statistical aggregation of all relevant studies on a specific topic according to a predetermined and explicit method (2). The value of systematic reviews in providing answers to questions relevant to practice is also increasingly recognized in occupational health (3,4).
In the past, rather than providing an answer to a specific question, the purpose of a review was to give an overview of what had been written about a certain topic in the scientific literature. For this traditional "overview type" of review, synthesis of the results in one summary outcome is less necessary. This difference in objectives has created confusion about if, when, and how results of studies in reviews should be synthesized.
Not all types of questions can be answered with systematic reviews. The traditional idea of giving an overview of "the state of the art" can still be useful. It is, however, increasingly recognized that in this respect, too, it would be good to be more systematic. This has led to a new nomenclature for reviews such as "scoping" reviews (5). The objective of a scoping review is to summarize a range of evidence in order to convey the breadth and depth of a field. Such reviews have requirements different from those of systematic reviews as defined above. Results of qualitative studies can also be combined in a synthesis of studies, but the problems here are different from those in quantitative studies (6). Therefore, in this article, we restrict ourselves to systematic reviews of quantitative studies only.
There is sometimes confusion about the difference between a systematic review and a meta-analysis. A systematic review is a review of the literature, but it does not necessarily include a meta-analysis. A meta-analysis is a statistical synthesis of the results of several individual studies in one pooled summary estimate. As such, it is easy to see that a meta-analysis requires a systematic review of the literature. Since a meta-analysis is often included in a systematic review, many use the term meta-analysis as a synonym for systematic review (7). Meta-analysis has a long history in educational and psychological research (8). The statistical technique of combining study results is not difficult and the pooled effect estimate has the charm of simplicity. However, this pooled effect estimate does not have much meaning if it comes from primary studies that vary widely in types of exposures, interventions, or participants. Meta-analysis has therefore been criticized for comparing apples to pears, and authors have been cautioned against combining study results too easily (9).
Regardless of whether a statistical method is used or not, authors will always need a method to synthesize the results of several studies to be able to provide answers to practical questions. The challenge will be to strive for a valid answer that is as concise and succinct as possible. For interventions, we would ultimately like to know how well the intervention works, and for exposures we would like to know to what degree they cause ill-health. The method of combining study results is not a trivial problem as the results of reviews can widely vary depending on the type of study synthesis used. Moreover, the validity of a systematic review has more direct practical implications than a primary study as its results are more likely to be used for policy making or to underpin clinical practice guidelines than the results of a single study.
Therefore, we would like to provide an overview of methods for synthesizing study results in a systematic review and assess their pros and cons.

The review process and decisions on study synthesis
The question how to synthesize study results is important from the very inception of a systematic review. During the process of performing a systematic review, several steps are taken that influence the synthesis of individual studies. In the first phase of the review, during the operationalization of the inclusion criteria, the author must determine whether similar or dissimilar studies are included and thus whether studies can be combined or not (2). How studies can be combined is a problem we will address later. The more important question is which studies have sufficient similarity to give an interpretable pooled estimate of the effect of the intervention or exposure. This will always be a subjective assessment because, in the end, no two studies will be identical. It is self-evident that the similarity of the studies depends on the inclusion criteria of the review. Sometimes, authors have formulated these criteria so broadly that studies can never be sensibly combined unless they are divided into different categories. They state, for example, that they want to study the effect of "interventions" on a certain health problem, meaning that they want to include all possible interventions. Or they state that they want to study the effect of a broad type of intervention such as "behavioral interventions" or "exercise". This is of course not impossible but it actually means that one performs several reviews under the umbrella of one review. This is not always recognized. Authors often state that the included studies were so heterogeneous that they could not be combined into a meta-analysis without noting that this was due to their own broad inclusion criteria (10)(11)(12). Instead of making subcategories and performing a meta-analysis, authors then still combine studies. For the study synthesis, they do not explain how they combined the results or they use a "self-invented" method for synthesis often leading to biased results of the systematic review. 
Ioannidis et al (13) have especially pointed this out, arguing that authors of reviews should better underpin their decisions about heterogeneity and more often make use of meta-analysis. This does not preclude the combination of broad clinical questions into a meta-analysis as shown in several systematic reviews, but there has to be proper argumentation to make the summary estimate credible (14,15).
In the literature, the criteria for combining studies are often referred to as clinical and statistical heterogeneity. Clinical heterogeneity means that any feature of the included studies can be so divergent that it precludes synthesis. Clinical heterogeneity is not an intuitive concept because it is unclear what clinical means in this context. Therefore, we would prefer to use the word "similarity" of studies instead. Statistical heterogeneity is the variation in treatment effect that is due to differences between studies rather than by chance alone (16). Even though studies can be judged similar enough to be combined, the statistical heterogeneity can be so considerable that it does not make sense to combine the results. If, for example, some studies have a large beneficial effect and other studies have a harmful effect, then it does not make sense to combine the results and state that there is no effect. In that case, there are probably differences between the studies that we did not understand or cannot estimate with the data at hand (17).
How to judge if studies are similar?
In figure 1, we provide a flow-chart of the argumentation for combining or not combining studies. Our primary feature of interest in studies is usually the intervention or the exposure, and this should therefore be the point of departure. If the interventions are not similar, then the results of studies should be reported separately, either in separate systematic reviews or separate sections of one systematic review. We advocate judging the similarity of interventions by their mechanism of action, based on which one would expect a similar effect of the intervention. This is a subjective judgment because a general intervention classification does not exist. It is not an easy judgment to make because the meaning of the intervention has to be interpreted by the authors of the systematic review based on the short description that is provided in the report of the primary study (18). With complex interventions such as behavioral or organizational changes, the judgment can be especially complicated. The intervention feature of interest can be only a small part of the whole intervention, and then it is unclear what is being combined. For example, pedometers or step counters to increase physical activity are usually part of a broader package of measures to induce a less sedentary lifestyle. Other features that could be part of an intervention package, such as professional guidance in exercises or working time availability for exercise, can be more crucial and thus make interventions dissimilar (19). Recently, more emphasis has been put on the systematic development of interventions such as with intervention mapping or the use of logic models (20). In addition, articles that report protocols of randomized controlled trials allow more room for an extensive description of the intervention. These developments will enable better judgment of similar interventions.
As a second step, authors should assess the control condition because it is conceivable that a no-intervention control group will have a different effect than a less-intensive intervention control group. Here, at least the following types of control conditions can be discerned: no intervention, a waiting-list control condition that will get the intervention later, a true placebo or sham treatment, alternative interventions, and similar but less intensive interventions. Here too, the judgment of similarity depends on the mechanism by which the effect is brought about. If there is no consensus about the working mechanism, then the effect of different working mechanisms on the conclusions should be examined in a sensitivity analysis.
Interventions can have a different effect on various participants, for example, children or adults (21). It can also be surmised that the intervention would work similarly in various occupations that are subject to the same exposure. For example, we expected the same effect on back pain of training in manual handling of patients and materials among nurses and baggage-handlers because the mechanism was deemed similar and judged to produce similar results (22).
In general, it is not recommendable to combine different study designs such as randomized and nonrandomized studies (23). The idea is that different designs will lead to different types and degrees of bias and that therefore the summary estimate will be difficult to interpret.
Outcomes that are conceptually dissimilar should also not be combined even though it would be technically easy to do. For example, the effect of reduction of exposure for treating occupational asthma can be measured on asthma symptoms and sick leave days due to asthma (24). It can be assumed that these effects would be different and cannot be combined. On the other hand, if the authors are interested in the effect of physical conditioning on sick leave among back pain patients, it makes sense to combine time to return to work and the mean number of sick leave days as outcomes. Both types of outcomes measure the same concept, thus it can be assumed that the intervention has a similar effect on both types of outcomes (17).
For many interventions, such as educational interventions, it would be plausible that they have a differential effect over time. Sometimes there could be a learning period after which a full effect is expected or the effect could wear off over time. Depending on the mechanism that is anticipated, only outcomes at similar follow-up times should be combined. In our view, it does not make sense to split this into too small parts because then there will never be enough studies to combine. This is a specific problem for studies that use back pain as an outcome. Here, the experts expect a differential effect of intervention in the short term after three months follow-up, after a year follow-up, and after longer time periods. There is, however, no empirical evidence that this is a valid categorization (25).
[Figure 1. Flowchart: start from the list of included studies; check the conceptual similarity of items 1 to 7 for the included studies; if all items are deemed similar, combine and perform a meta-analysis; if data are insufficient, perform a narrative synthesis; check and explain statistical heterogeneity; if statistical heterogeneity is present, consider subgroups or meta-regression.]
Once the authors of the systematic review have decided whether the studies' elements are similar enough to be combined, they must then assess whether the data in the original reports are appropriate for a statistical meta-analysis. Statistically, it is only possible to combine study results that are measured in a similar way, such as dichotomous outcomes as odds ratios (OR) or rate ratios, or continuous outcomes as mean differences. However, simple methods exist to transform effect sizes for dichotomous outcomes into effect sizes for continuous outcomes and vice versa. This greatly facilitates the conduct of a meta-analysis. We refer to Borenstein & Cooper for an extensive and didactic overview of these methods (7,8).
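As an illustration of such a transformation, the following minimal sketch converts a log odds ratio into a standardized mean difference and back. It uses the widely described logistic conversion, d = ln(OR) × √3/π, which assumes that the dichotomous outcome reflects an underlying continuous trait with a logistic distribution. Whether this is the specific method intended here is an assumption, and the function names are hypothetical.

```python
import math

# Conversion factor of the logistic method (an assumed choice, see lead-in).
_FACTOR = math.sqrt(3) / math.pi

def log_odds_ratio_to_d(log_or, var_log_or):
    """Convert ln(OR) and its variance to a standardized mean difference d."""
    return log_or * _FACTOR, var_log_or * _FACTOR ** 2

def d_to_log_odds_ratio(d, var_d):
    """Convert a standardized mean difference d and its variance to ln(OR)."""
    return d / _FACTOR, var_d / _FACTOR ** 2

# Hypothetical study result: OR = 2.0 with variance 0.09 on the ln(OR) scale.
d, var_d = log_odds_ratio_to_d(math.log(2.0), 0.09)
print(f"d = {d:.3f}, variance = {var_d:.4f}")
```

The variance is rescaled by the square of the same factor, so the precision of a study, and hence its weight in a meta-analysis, is carried over consistently.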

Meta-analysis and statistical heterogeneity
After authors follow this procedure and have decided that studies may be combined and their data are appropriate, they can proceed with the meta-analysis. Software for meta-analysis is freely available from the Cochrane Collaboration if not used for commercial purposes (Review Manager 5.1, Cochrane Collaboration, Copenhagen, Denmark). Other statistical programs, such as Stata version 9 (StataCorp, College Station, TX, USA), also have sophisticated options for meta-analysis. In a meta-analysis, the study results are weighted according to their precision or variance, where studies with greater precision get a higher weight. The pooled estimate is then calculated based on these weighted study effect sizes. The results can also be presented graphically in a forest plot, which gives an immediate overview of the individual studies and their statistical heterogeneity (26). In figure 2, the weight of the studies is based on the standard error of the log OR, with more precise studies with smaller standard errors having more weight.
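The weighting described above can be sketched as follows. This is a minimal illustration of inverse-variance pooling under a fixed-effect model, not the implementation used in any particular software package. The three study results are hypothetical and the function name is an assumption.

```python
import math

def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance (fixed-effect) pooling of study effect sizes.

    estimates: effect sizes on an additive scale, for example ln(OR)
    standard_errors: the standard error of each estimate
    Returns (pooled estimate, its standard error, relative weights).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    relative_weights = [w / total for w in weights]
    return pooled, pooled_se, relative_weights

# Hypothetical data: ln(OR) and standard error for three studies.
log_ors = [0.60, 0.35, 0.80]
ses = [0.30, 0.20, 0.45]
pooled, pooled_se, rel_w = pool_fixed_effect(log_ors, ses)
low, high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled OR {math.exp(pooled):.2f} "
      f"(95% CI {math.exp(low):.2f}-{math.exp(high):.2f})")
print("relative weights:", [round(w, 2) for w in rel_w])
```

The study with the smallest standard error receives the largest relative weight, and the pooled standard error is smaller than that of any single study.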
High statistical heterogeneity means that the between-study variance is higher than would be expected based on chance alone. When there is high statistical heterogeneity, this should be analyzed, for example by dividing studies into different subgroups (27). Subgroups could show different pooled effect estimates and thus explain the heterogeneity in the whole sample of studies. Ultimately, meta-regression can be used, where the effect sizes are regressed on characteristics of the studies to find out if these explain effect-size variations. Since the sample sizes of studies included in a systematic review are usually small, the power of meta-regression is low. Therefore, it is recommended to use this only as a hypothesis-generating technique (28).
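One common way to quantify whether the variation between studies exceeds what chance would produce is Cochran's Q together with the I² statistic. The text above does not name a particular statistic, so this choice is an assumption, and the sketch below uses hypothetical data and a hypothetical function name.

```python
def heterogeneity(estimates, standard_errors):
    """Cochran's Q, its degrees of freedom, and I-squared (in percent)."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    # Q: weighted sum of squared deviations from the fixed-effect estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # I-squared: share of the variation in excess of chance, floored at zero.
    i_squared = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i_squared

# Hypothetical ln(OR) values, each with a standard error of 0.20.
consistent = heterogeneity([0.50, 0.55, 0.45], [0.20, 0.20, 0.20])
conflicting = heterogeneity([0.90, -0.70, 0.80], [0.20, 0.20, 0.20])
print("consistent studies:  Q=%.2f, df=%d, I2=%.0f%%" % consistent)
print("conflicting studies: Q=%.2f, df=%d, I2=%.0f%%" % conflicting)
```

In the second set, one study points in the opposite direction from the others, which is the situation in which a single pooled estimate is hard to interpret.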

Narrative synthesis
If a statistical combination of studies is not possible, for example because the various study elements are not similar enough, the only alternative is a narrative synthesis. This is not essentially different from the procedure described above up to the point of statistically combining the results. Instead of combining them, the results are simply described as well as possible. This has been elaborated by Rodgers et al (30) for interventions to increase ownership of smoke alarms. The authors independently performed both a narrative synthesis and a meta-analysis for one particular systematic review. They found that the final conclusions were similar but that the narrative synthesis provided more ideas for implications for future research and the meta-analysis more ideas for moderators of the effect of the intervention (29,30).

Alternative synthesis methods
Vote counting. Vote counting is best described as summing up the numbers of studies with statistically significant outcomes and those without significant outcomes. If those with statistically significant outcomes prevail then it is concluded that there is evidence for the effectiveness of an intervention or exposure (31). The main argument against the vote counting method is that, for studies with low statistical power, the approach easily leads to the conclusion that there is no effect while in reality there is an effect.
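The power argument can be made concrete with a small sketch. The five study results below are hypothetical, chosen so that each study on its own is non-significant at P<0.05. Vote counting would then conclude that there is no evidence of an effect, whereas inverse-variance pooling of the same data (a fixed-effect model, assumed here for simplicity) gives a clearly significant result.

```python
import math

Z_CRIT = 1.96  # two-sided critical value for P<0.05

# Hypothetical studies: (ln(OR), standard error).
studies = [(0.45, 0.28), (0.38, 0.25), (0.50, 0.30), (0.30, 0.22), (0.42, 0.26)]

# Vote counting: how many studies are individually significant?
significant_count = sum(1 for y, se in studies if abs(y / se) > Z_CRIT)

# Meta-analysis: pool the same studies with inverse-variance weights.
weights = [1.0 / se ** 2 for _, se in studies]
pooled = sum(w * y for w, (y, _) in zip(weights, studies)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))
pooled_z = pooled / pooled_se

print(f"significant studies: {significant_count} of {len(studies)}")
print(f"pooled OR {math.exp(pooled):.2f}, z = {pooled_z:.2f}")
```

All five studies point in the same direction, but none has enough precision to reach significance alone; only the combined estimate does.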
Levels of evidence. Levels of evidence are best described by the Cochrane Back Review Group in their previous methods guidelines published in 2003 (32). The group followed the same approach as described above for judging the clinical homogeneity. If studies are homogeneous, they are synthesized into a level of evidence for or against the effectiveness of an intervention. The summing depends on the quality of the studies and is summarized as strong, moderate, low, or conflicting evidence (see footnote 1). The levels of evidence method is sometimes also called "qualitative synthesis".
The levels of evidence synthesis should not be confused with the overall judgment of the quality of evidence in a systematic review as proposed by the Grading of Recommendations, Assessment, Development and Evaluations (GRADE) working group (23,33). The GRADE method is an overall judgment of the quality of evidence and not a method to synthesize study results. In addition to a summary measure of effect, such as a pooled relative risk or OR, the quality of the evidence is rated as high, moderate, low, or very low. The working group advocates the use of the following five criteria to judge the quality of the evidence: (i) risk of bias in the included studies, (ii) indirectness of the evidence, (iii) unexplained heterogeneity of the results of the included studies, (iv) imprecision of the results, and (v) probability of publication bias. Thus the quality of the evidence reflects the confidence that the estimate of effect is correct.
Footnote 1: Consistent findings among multiple high-quality randomized trials add up to strong evidence, multiple low-quality randomized trials or only one high-quality trial result in moderate evidence, only one low-quality trial leads to limited evidence. Conflicting evidence is the outcome when there are inconsistent findings.
One of the problems with the levels of evidence synthesis is the definition of a positive or negative outcome. An outcome is usually defined as positive when it is statistically significant in the positive direction at a level of P<0.05, and as negative when it is non-significant. A consistent finding would then be that four out of five trials had a significantly positive outcome.
The advantage of the levels of evidence synthesis is that it saves a lot of work because no laborious data extraction is needed. One only has to know the P-value of the outcome to be able to combine study results. In addition, one can synthesize evidence for or against effectiveness. A serious drawback of the method is that its criteria are not well defined. Ferreira et al (34) compared the application of the levels of evidence synthesis method in different reviews of the same research group. They concluded that there were "markedly different conclusions on treatment efficacy" and they cautioned against its use. Advocates of the levels of evidence method also concluded that the system is sensitive to how the method is interpreted and used (35). However, the main argument against its use is that non-significant results are counted as evidence of no effect even in cases where the confidence intervals are wide and, thus, these studies do not add to the power of the systematic review. This increases the chance of a false-negative result or beta-error of not concluding that an intervention is effective even though in reality it is. More generally, the absence of evidence of an effect of an intervention should not be confused with evidence of the absence of an effect (36,37). The Cochrane Back Review Group has withdrawn the guidance on levels of evidence in its most recent updated guidelines (38).
Best evidence synthesis. Another but similar approach is called best evidence synthesis, which can accommodate studies from a range of disciplines relevant to human health (39). This approach does not differ from the levels of evidence approach described above: study results are synthesized into strong, moderate, partial, or mixed evidence based on the quality and the positive or negative outcome of the study. Slavin, who proposed the method, indeed criticized the prevailing approach of meta-analysis in social sciences at that time, in which study results were combined regardless of methodological quality. He proposed to exclude lower quality evidence in case higher quality evidence is available and thus always base conclusions on the best available evidence. He further proposed to conduct proper meta-analysis of the results of included studies but to finally also comment on and describe more than just the effect-sizes resulting from the meta-analysis.

Worked example
Using an example, we would like to point out that the levels of evidence approach can lead to conclusions that are different from those obtained with a proper meta-analysis. Hartvigsen et al (40) performed a systematic review of the relation between psychosocial factors at work and the presence of back pain. The authors used a system of levels of evidence to assess the association between organizational aspects of work and back pain. They included only prospective cohort studies that compared the occurrence of back pain between workers with high and low levels of exposure to psychosocial factors at work. Based on nine studies, they concluded that there was moderate evidence for no association between organizational stress and low-back pain (41)(42)(43)(44)(45)(46)(47)(48)(49).
We reanalyzed their material with the procedure described above and combined the results in a meta-analysis.
We took the list of nine included studies as a point of departure and re-analyzed them using the decision flowchart provided in figure 1. Two articles reported on the same study and thus we excluded one [personal communication, Gonge et al (49)]. Most studies reported on more than one measure of organizational stress. We used the job-demand-control model of Karasek (50) to group the exposures according to psychological demands, skill discretion, and decision authority (table 1). In all studies, the control condition had a much lower degree of exposure or no exposure with sufficient contrast to bring about a difference in outcome. Participants were not similar across the studies, varying from the general population to construction workers, but we assumed that the effects of stress exposure would be similar. We also assumed that effects would not vary according to gender, but where effect sizes were reported separately for men and women, those for men were used. The study designs were all prospective cohort studies except for the study by Gonge et al (48) that used a case-crossover design. This design is substantially different from the other studies and so we excluded it. The outcome measures were all self-reports of low-back symptoms that we thought would be similarly influenced by organizational stress. Follow-up times varied from 1 to 10 years and were all long enough to bring about an effect of organizational stress. Effect sizes were, however, different across studies. In three studies, back-pain scores were analyzed as continuous variables using multiple regression analysis. Because articles reported only betas and P-values and not standard errors, we could not combine these effect sizes. In four other studies, dichotomous variables were used and analyzed with logistic regression analysis. We combined effect sizes based on dichotomous outcomes using the generic inverse variance method as implemented in RevMan 5.1 (Cochrane Collaboration, Copenhagen, Denmark).
As input in RevMan, we used the natural logarithm of the OR [ln(OR)] and its standard error, which we calculated from the 95% confidence intervals (95% CI) provided in the articles. Because statistical heterogeneity was low, we used a fixed-effects model. We followed the same procedure for psychological demands and skill discretion.
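The calculation of a standard error from a reported confidence interval can be sketched as follows. The sketch assumes that the interval was constructed symmetrically on the log scale, as is usual for ORs from logistic regression, so that the width of the interval on that scale equals 2 × 1.96 standard errors. The input values are hypothetical and the function name is an assumption.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def log_or_and_se_from_ci(odds_ratio, lower, upper):
    """Return ln(OR) and its standard error from an OR and its 95% CI."""
    log_or = math.log(odds_ratio)
    se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    return log_or, se

# Hypothetical reported result: OR 2.0 (95% CI 1.0-4.0).
log_or, se = log_or_and_se_from_ci(2.0, 1.0, 4.0)
print(f"ln(OR) = {log_or:.3f}, SE = {se:.3f}")

# Check: rebuilding the interval from ln(OR) and SE recovers the limits.
rebuilt = (math.exp(log_or - Z_95 * se), math.exp(log_or + Z_95 * se))
print("rebuilt 95%% CI: %.2f-%.2f" % rebuilt)
```

If the rebuilt limits do not match the reported ones, the reported interval was probably not symmetric on the log scale or was rounded heavily, and the derived standard error should be treated with caution.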
For the relation between psychological demands and low-back pain, this resulted in a pooled OR of 1.83 (95% CI 1.25-2.67), which was supported by two studies that used multivariate regression. For decision authority the OR was 1.12 (95% CI 0.73-1.72), which was also supported by two studies that used multivariate regression. For skill discretion, the OR was 1.27 (95% CI 0.89-1.80) supported by two studies that used multivariate regression (figure 2). In contrast to Hartvigsen et al's conclusions, based on these new results, we found evidence that psychological demands at work are related to low-back pain and that there is a possible but uncertain relation between decision authority and skill discretion and low-back pain. This change in conclusion is mainly due to the use of meta-analysis instead of the levels of evidence approach. Better classification of the exposure categories and stricter application of the inclusion criteria did not change the results.

Concluding remarks
Synthesis of studies in systematic reviews requires, above all, judgment on the conceptual similarity of studies. Such a judgment will more often lead to proper meta-analysis or narrative synthesis. Alternatives such as vote counting, levels of evidence synthesis, or best evidence synthesis are better avoided because they may produce biased results of systematic reviews.