1. Introduction
There has been a lengthy debate within the medical literature about the effects of heterogeneity in meta-analyses of randomized controlled trials involving dichotomous outcome measures (DerSimonian and Laird, 1986; Thompson and Pocock, 1986). This debate has mainly concentrated upon what should be done if heterogeneity is detected, but little discussion has taken place as to which is the most appropriate statistical test to use when attempting to assess heterogeneity in a meta-analysis. As a result the statistics which are routinely used are those which authors find easiest to calculate, with the most commonly used being that described by Yusuf et al. (1985) (now often described as the ‘Peto method’), which is the standard technique in the Cochrane Library. Other fairly common methods are those due to Woolf (1955) and DerSimonian and Laird (1986).1
In contrast, the statistical literature contains a very extensive body of work on the most appropriate way to assess heterogeneity in a series of 2×2 tables, which is an equivalent problem, although it is usually considered within the framework of the analysis of sub-strata in retrospective studies of the relationship between disease incidence and exposure to a suspected risk factor (Zelen, 1971; Halperin et al., 1977). Two excellent review papers by Paul and Donner (1989, 1992) describe and compare ten homogeneity statistics,2 six of which require iterative methods and only one of which, the Woolf statistic, overlaps with those which are commonly used in meta-analysis. Of the other two common statistics, the DerSimonian and Laird statistic is based on risk differences, which is inappropriate in the context of retrospective studies, whilst the Peto statistic is actually equivalent to a test proposed by Zelen (1971) which was subsequently shown by Halperin et al. (1977) to be invalid except for the special case where all tables have a common odds ratio of 1.3
In this paper we therefore make use of simulation methods to compare the commonly used homogeneity statistics with those recommended in the statistical literature, in the context of meta-analyses of randomized controlled trials in pain relief. We will extend this analysis to other applications elsewhere. We conclude that none of the standard methods used routinely in meta-analyses for assessing homogeneity give the appropriate levels of significance and all have very low power to detect true heterogeneity. We recommend an alternative non-iterative test statistic, first suggested by Breslow and Day (1980), based on the Mantel–Haenszel estimator (Mantel and Haenszel, 1959) of the odds ratio, although we show that this too lacks power to detect true heterogeneity.
2. Methods
Throughout this paper we will consider homogeneity tests for dichotomous outcome measures. A typical meta-analysis will be assumed to consist of k trials. In each of the k trials we will assume that we have numbers, ni, mi, assigned to the treatment and control groups respectively (in the simulations we will assume that ni=mi, i.e. we have a perfectly balanced design, which is what we usually aim to achieve in randomized controlled trials in pain relief). In the ith trial we will assume that there are ri successes in the treatment group and si successes in the control group (success might mean a patient improves upon treatment; in pain relief we will usually take ‘success’ to mean at least 50% pain relief (McQuay and Moore, 1998), although in cancer studies success is usually replaced by number of deaths, etc.). For each trial this can be written as a 2×2 table, as shown in Table 1, where ti is the total number of successes and Ni=ni+mi is the total number of patients in trial i, i=1,…,k. The problem of assessing homogeneity in a meta-analysis of k trials is therefore identical to assessing the homogeneity of k independent 2×2 tables.
We will consider five test statistics for assessing homogeneity: the three described in Section 1 (the Peto statistic (denoted by QP), the Woolf statistic (QW) and the DerSimonian and Laird statistic (QDL)), together with the score test based on the conditional maximum likelihood estimator of the assumed common odds ratio (Qmle) (Cox, 1972; Liang and Self, 1985; Paul and Donner, 1989, 1992) and the Breslow–Day score statistic based on the Mantel–Haenszel estimator of the assumed common odds ratio (QBD). Brief details of the way in which each of these statistics is calculated are given in Appendix A and full details can be found in the original references.
2.1. Simulations
In order to test the efficacy of the above statistics in detecting heterogeneity we have performed two sets of simulations. The first considers the case of truly homogeneous data with a fixed underlying event rate in the treatment group (which we term the experimental event rate or EER) and in the control group (the control event rate or CER). This allows us to determine whether the five tests give the correct level of statistical significance for truly homogeneous data. Since both event rates are fixed, the effect size is homogeneous on both the log-odds scale and the risk-difference scale, and since we also use perfectly balanced designs in each trial (i.e. equal group sizes), we can conclude that the comparisons we make between the performance of the statistics in our simulations are fair.
The second set of simulations considers the case of heterogeneous data by allowing the underlying event rates to vary randomly. This allows us to assess the power of the five tests to detect increasing levels of heterogeneity in the data.
We attempt to make our simulations mimic as closely as possible the likely data that will occur in meta-analyses in pain studies. We therefore use in all cases a CER value of 0.2 and EER values ranging from 0.2 (i.e. no effect of treatment) up to 0.7 (an extremely powerful analgesic) (McQuay and Moore, 1998). For each pair of values of CER and EER, we then simulate 10 000 meta-analyses with particular numbers of trials, k, in each meta-analysis (we consider the cases k=5, 10, 20 and 50). The number of patients, ni, in each group of a particular trial is assumed to follow a lognormal distribution with mean 50 (SD 25) (again typical of RCTs in pain relief; McQuay and Moore, 1998) as described in Appendix B. In each of the k trials within each of the 10 000 simulated meta-analyses individual patient data are then generated so that (i) for fixed effects they have the same underlying values of the CER and EER as all other trials in the meta-analysis and (ii) for random effects they have underlying values of the CER of 0.2 and the EER of 0.5, but both are allowed to vary randomly about these underlying means. This variation in the random effects model is calculated via a random perturbation on the log-odds scale, as described in Appendix B. The assignment of an individual as a success or failure simply depends on whether a random number generated from a uniform distribution on [0,1] is less than or greater than the underlying event rate for that group.
Once the data have been generated, the five homogeneity statistics are calculated, and the proportion giving statistically significant results is counted and used to create the graphs given in Section 3. The simulation algorithm is given in Appendix B. The most commonly chosen level for statistical significance in homogeneity tests is P=0.1 or 10% and we will therefore use this level of significance throughout this paper (we will refer to this as the ‘nominal significance’ level). This implies that in truly homogeneous trials, we would expect each of our homogeneity tests to give a statistically ‘significant’ result (against homogeneity) in about 10% (or about 1000) of the 10 000 simulated meta-analyses, purely due to random chance. For truly heterogeneous data, we would expect the tests to detect heterogeneity more frequently than in 10% of cases, depending on the degree of heterogeneity present.
3. Results
The results of the simulations for fixed effects are given in Figs. 1 and 2 and those for random effects are given in Figs. 3 and 4. In Fig. 1 we show the percentage of the 10 000 simulated meta-analyses which give a statistically significant result at the 10% (P=0.1) level for meta-analyses containing 5, 10, 20 and 50 trials using a fixed CER of 0.2 and fixed EER values ranging from 0.2 to 0.7 in increments of 0.05 (so that the data is truly homogeneous). It can be seen that only two of the five statistics, QBD and Qmle, come close to maintaining the nominal significance level of 10%. The statistic which performs worst is the Peto statistic, which as expected gives accurate values only for very small treatment effects, but gives gross under-estimates for larger effects. This is because this statistic is identical to that proposed by Zelen (1971) and was shown by Halperin et al. (1977) to follow a χ2 distribution only when there is no effect of treatment compared to control (so that the odds ratio is exactly 1). This statistic should not therefore be used in RCTs in pain research, where typical effect sizes are large.
Of the other two commonly used statistics, QW tends to give too low a percentage of trials with statistically significant heterogeneity, whilst QDL tends to give too high a percentage. The degree of under- and over-estimation increases markedly with the number of trials in each meta-analysis; this is again to be expected since all such statistics follow a χ2 distribution only asymptotically, where ‘asymptotically’ means the number of trials remains fixed, but the number of patients in each group of each trial becomes large (Paul and Donner, 1989). We would therefore expect that both QW and QDL would give closer to nominal levels of significance with larger group sizes. This is confirmed by the further simulations shown in Fig. 2, where we have used lognormally distributed group sizes with mean 200 (SD 50) for the cases of 20 and 50 trials in each meta-analysis; both QW and QDL give much closer to the nominal 10% significance level, QBD continues to be very accurate, whilst QP again gives very poor results except for very small treatment effects (we do not give values for Qmle in Fig. 2 since the iterative procedure necessary takes an inordinate amount of CPU time for such large group sizes).
Figs. 1 and 2 considered simulated data where the true underlying effect sizes were fixed for all trials, i.e. the simulated meta-analyses were all truly homogeneous. In Fig. 3 we consider what happens when the underlying effect sizes are heterogeneous. To do this we choose underlying values of the CER of 0.2 and of the EER of 0.5, which are typical of an average analgesic (McQuay and Moore, 1998). We then allow both event rates to vary randomly about the underlying value, with the fluctuations generated in the log-odds scale, for the reasons given in Appendix B. Also given in Table 2 and Fig. 5 in Appendix B are the means and standard deviations of both the underlying perturbed event rates and the observed event rates generated by our algorithm. For small values of σre (the standard deviation of the random error in the log-odds scale) up to about 0.15, we generate a small random effect and the standard deviations of the observed event rates increase only slightly due to the random effect (see Table 2 in Appendix B). However, as σre continues to increase the observed standard deviations become much larger than would be expected due to binomial variation alone, so that the observed EERs within any particular meta-analysis might cover the complete range of values of known analgesics, as illustrated in the lower panels of Fig. 5 in Appendix B. All trials simulated in Fig. 3 use lognormally distributed group sizes with approximate mean 50 (SD 25) (as in Fig. 1).
It can be seen from the results presented in Fig. 3 that all of the statistics have very low power to detect random effects, unless those effects are large. Whilst the DerSimonian and Laird statistic, QDL, appears to have the greater power, this is probably just a consequence of the over-estimation given by this statistic with fixed effects as shown in Figs. 1 and 2 and similarly for the under-estimation given with the Woolf statistic, QW, and the Peto statistic, QP. The maximum likelihood statistic, Qmle, and the Breslow–Day statistic, QBD, again give very similar results, reinforcing our recommendation of the Breslow–Day statistic.
For all of the tests, the power is a function of the number of trials included in the meta-analysis (this is in agreement with the recent results of Hardy and Thompson (1998) who have shown that the power of homogeneity statistics is a function of the total information available within the meta-analysis). For meta-analyses with numbers of trials between 10 and 20 (typical in pain relief), the Breslow–Day statistic rejects the null hypothesis of homogeneity in 30–40% of simulations with a random effect with σre=0.2 and only in 70–90% of the simulations with a large random effect with σre=0.4. The implications of these results for meta-analyses in pain relief are explored in Section 4.
We also considered the effect of the size of the treatment effect on the power of the homogeneity tests and found, as expected, that it had little influence. We give two examples of such a simulation in Fig. 4, which shows the power of the homogeneity tests as a function of experimental event rate for random effects with SDs σre=0.2 and 0.3, with 20 trials per meta-analysis and lognormally distributed group sizes with mean 50 (SD 25). It is clear that the power of all of the homogeneity tests (except the Peto statistic, QP) remains roughly constant as the treatment effect increases. The slight decreases at the end of the range are likely to be due to the skewing of the distributions of the perturbed event rates in transforming from the log-odds scale back to the probability scale (as described in Appendix B).
4. Discussion
As we mentioned briefly above, all statistical tests of homogeneity depend upon the assumption of the asymptotic normality of whatever measure of effect size we are using – in this paper this is either the risk difference or the log of the odds ratio (an excellent discussion of the derivation of homogeneity statistics is given in Chapter 10 of Fleiss (1981)). As we have shown in Fig. 1, the only non-iterative statistic which gives nominal significance for truly homogeneous data and for group sizes typical of pain studies is the Breslow–Day statistic, and we would therefore recommend this statistic for routine use in meta-analysis of pain studies. Fig. 1 also shows that the DerSimonian and Laird statistic over-estimates the degree of heterogeneity and the Woolf statistic under-estimates it; it seems likely from the results shown in Fig. 2, which use comparatively large group sizes, that this is due to the assumption of asymptotic normality being poor for these statistics when used with the smaller group sizes of Fig. 1. We have shown that the Peto statistic attains the nominal significance level only when there is no treatment effect, although it retains reasonable accuracy (at least for small to medium meta-analyses) for effect sizes up to odds ratios of around 1.5–2 (risk differences of about 0.15). It is therefore unsuitable for use in pain studies, where typical effects are larger, but will give good results in the types of studies for which it was originally introduced, i.e. those with small effect sizes.
In Figs. 3 and 4 we considered the effects of including random effects to simulate heterogeneity between the trials and showed that all of the statistics have very low power to detect such heterogeneity. This has been reported before in the context of heterogeneity in several 2×2 tables (Jones et al., 1989),4 as well as in the context of meta-analyses by Hardy and Thompson (1998).5 These results suggest that in practice homogeneity tests are of very limited use; in a typical pain study (10–20 trials per meta-analysis) which has a strong degree of heterogeneity (SD 0.25 in Fig. 3) the statistical tests are (at best) equally likely to reject or accept the null hypothesis of homogeneity. It is only with extremely heterogeneous data and large numbers of trials in the meta-analysis that the power of the tests is sufficient to detect the heterogeneity most of the time. So, in most practical situations, failure to detect heterogeneity does not allow us to say with any helpful degree of certainty that the data is truly homogeneous.
This leads us into the debate as to whether we should use fixed effects analyses to estimate treatment effects, or random effects analyses, a topic which has already received much attention in the literature (see, for example, DerSimonian and Laird, 1986; Greenland and Salvan, 1990; Hardy and Thompson, 1998). Since in practical situations it will not be possible to demonstrate with any degree of certainty that a set of trials is statistically homogeneous, we would advocate the quantitative combination of results only where the trials contained in a meta-analysis can be shown to be clinically homogeneous. We would propose as a definition of clinical homogeneity that all trials have (i) fixed and clearly defined inclusion criteria and (ii) fixed and clearly defined outcomes or outcome measures. In pain relief, for example, the first of these would be satisfied by all patients having moderate or severe pain, whilst the second would be satisfied by using at least 50% pain relief as the successful outcome measure (Edwards et al., 1999). Provided that the trials are considered to be clinically homogeneous, then we would advocate following the advice of Fleiss (1981, p. 164), who suggests that statistical homogeneity should be tested only at a very conservative level (such as P=0.01). A similarly attractive argument has been put forward by Greenland and Salvan (1990, p. 252), who argue that the choice between fixed and random effects modelling is secondary to the exploration of ‘clinically important’ inter-study differences; where such differences do exist then it is more important to attempt to model and perhaps explain the inter-study differences rather than to attempt to pool the disparate study results in a single summary estimate.
Another balanced and informative discussion of how to deal with heterogeneity is given in the paper by Thompson and Pocock (1986), who suggest that ‘quantitative conclusions [of meta-analyses]…must take into account the practical relevance of the individual studies and the clinical heterogeneity between them’.
Acknowledgements
We would like to thank Dr Richard Stevens for his very helpful advice on the statistical aspects of this paper and an anonymous referee for his comments which have greatly improved the revised draft. We are grateful to the following organizations for their financial support which has enabled us to undertake this research: the Medical Research Council for a Career Development Fellowship (D.J.G.), the European Union Biomed 2 contract BMH4 CT95 0172 (HJM) and the NHS Research and Development Health Technology Assessment Programme 94/11/4.
References
Breslow NE, Day NE. The analysis of case-control studies (chapter 4). Statistical methods in cancer research, Vol. 1. Lyon: International Agency for Research on Cancer; 1980.
Cox DR. Analysis of binary data. London: Methuen; 1972.
DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials. 1986;7:177-188.
Dudewicz EJ, Mishra SN. Modern mathematical statistics. New York: Wiley; 1988.
Edwards JE, Oldham A, Smith L, Carroll D, Wiffen PJ, McQuay HJ, Moore RA. Oral aspirin in post-operative pain. A quantitative systematic review. Pain. 1999;81:289-297.
Fleiss JL. Statistical methods for rates and proportions (chapter 10), 2nd ed. New York: Wiley; 1981.
Greenland S, Salvan A. Bias in the one-step method for pooling study results. Stat Med. 1990;9:247-252.
Halperin M, Ware JH, Byar DP, Mantel N, Brown CC, Koziol J, Gail M, Green SB. Testing for interaction in an I×J×K table. Biometrika. 1977;64:271-275.
Hardy RJ, Thompson SG. Detecting and describing heterogeneity in meta-analysis. Stat Med. 1998;17:844-856.
Jones MP, O'Gorman TW, Lemke JH, Woolson RF. Monte-Carlo investigation of homogeneity tests of the odds ratio under various sample size configurations. Biometrics. 1989;45:171-181.
Liang KY, Self SG. Tests for homogeneity of odds ratio when the data are sparse. Biometrika. 1985;72:353-358.
Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst. 1959;22:719-748.
McQuay HJ, Moore RA. Using numerical results for systematic reviews in clinical practice. Ann Intern Med. 1997;126:712-720.
McQuay HJ, Moore RA.
An evidence-based resource for pain relief. Oxford: Oxford University Press; 1998.
Paul SR, Donner A. A comparison of tests of homogeneity of odds ratios in K 2×2 tables. Stat Med. 1989;8:1455-1468.
Paul SR, Donner A. Small sample performance of tests of homogeneity of odds ratios in K 2×2 tables. Stat Med. 1992;11:159-165.
Thompson SG, Pocock SJ. Can meta-analysis be trusted? Lancet. 1986;338:1127-1130.
Woolf B. On estimating the relation between blood group and disease. Ann Hum Genet. 1955;19:251-253.
Yusuf S, Peto R, Lewis J, Collins R, Sleight P. Beta blockade during and after myocardial infarction: an overview of the randomised trials. Prog Cardiovasc Dis. 1985;5:335-371.
Zelen M. The analysis of several 2×2 tables. Biometrika. 1971;58:129-137.
Appendix A Homogeneity statistics
A.1. The Peto statistic
In the paper of Yusuf et al. (1985), the problem is framed in terms of observed, Oi, and expected, Ei, numbers of successes in the treatment group of trial i, together with the variance, Vi, of Oi (where, in the notation of Table 1, Ei=niti/Ni and Vi=Ei(1−ni/Ni)(Ni−ti)/(Ni−1)). Making use of an approximation to the conditional maximum likelihood estimate of the common log-odds ratio, these authors derive a ‘natural approximate chi-square test for heterogeneity’ given by

$$Q_P = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{V_i} - \frac{\left[\sum_{i=1}^{k}(O_i - E_i)\right]^2}{\sum_{i=1}^{k} V_i},$$

with degrees of freedom one less than the number of non-zero variances (and so usually equal to k−1).
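As an illustration of how QP is assembled from the k tables, here is a minimal Python sketch (the function name and argument layout are ours, not from the original paper):

```python
def peto_q(r, s, n, m):
    """Peto heterogeneity statistic Q_P for k 2x2 tables.

    r, s: successes in the treatment and control groups;
    n, m: the corresponding group sizes (notation of Table 1).
    """
    diffs, variances = [], []
    for ri, si, ni, mi in zip(r, s, n, m):
        N = ni + mi                # total patients in trial i
        t = ri + si                # total successes in trial i
        E = ni * t / N             # expected successes, E_i
        V = E * (1 - ni / N) * (N - t) / (N - 1)
        diffs.append(ri - E)       # O_i - E_i
        variances.append(V)
    # Q_P = sum (O_i-E_i)^2 / V_i - [sum (O_i-E_i)]^2 / sum V_i
    return (sum(d * d / v for d, v in zip(diffs, variances))
            - sum(diffs) ** 2 / sum(variances))
```

For identical tables the statistic is exactly zero, as expected for perfectly homogeneous data.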
A.2. The Woolf statistic
In the notation of Table 1, the Woolf homogeneity statistic (Woolf, 1955) is given in terms of the natural logarithm of the individual estimates of the odds ratio, ψi, from each trial. Letting yi=lnψi, the Woolf homogeneity statistic is given by

$$Q_W = \sum_{i=1}^{k} w_i (y_i - \bar{y})^2, \qquad \bar{y} = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i},$$

where 1/wi = 1/ri + 1/(ni−ri) + 1/si + 1/(mi−si) is the approximate variance of the log-odds ratio (so that the weight wi is its inverse). QW is again taken to follow approximately a χ2 distribution with k−1 degrees of freedom.
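A corresponding Python sketch for QW (again with our own, hypothetical names; note that no continuity correction is applied, so tables with zero cells would need special handling):

```python
import math

def woolf_q(r, s, n, m):
    """Woolf heterogeneity statistic Q_W on the log-odds-ratio scale."""
    y, w = [], []
    for ri, si, ni, mi in zip(r, s, n, m):
        # log odds ratio y_i for trial i
        y.append(math.log((ri / (ni - ri)) / (si / (mi - si))))
        # weight w_i = inverse of the approximate variance of y_i
        w.append(1.0 / (1 / ri + 1 / (ni - ri) + 1 / si + 1 / (mi - si)))
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)  # pooled log odds ratio
    return sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))
```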
A.3. The DerSimonian and Laird statistic
The DerSimonian and Laird (1986) statistic is the only one of the five which is not based on the common odds ratio and instead is based on estimates of the risk difference. Defining rTi=ri/ni and rCi=si/mi to be the proportions of successful patients in the treatment and control groups in trial i and yi=rTi−rCi, the DerSimonian and Laird homogeneity statistic is defined as

$$Q_{DL} = \sum_{i=1}^{k} w_i (y_i - \bar{y})^2, \qquad \bar{y} = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i},$$

where wi=1/si² and si²=rTi(1−rTi)/ni+rCi(1−rCi)/mi is an estimate of the sampling variance in the ith study. QDL is again taken to follow approximately a χ2 distribution with k−1 degrees of freedom. DerSimonian and Laird also considered two further test statistics: the first was similar to QDL described above but with equal weights for each trial and was shown to give poor results; the second was based on the natural logarithm of the relative odds and is equivalent to the Woolf statistic.
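The same weighted-sum-of-squares pattern applies on the risk-difference scale; a minimal sketch (names are ours):

```python
def dersimonian_laird_q(r, s, n, m):
    """DerSimonian-Laird heterogeneity statistic Q_DL (risk-difference scale)."""
    y, w = [], []
    for ri, si, ni, mi in zip(r, s, n, m):
        rT, rC = ri / ni, si / mi                      # event rates r_Ti, r_Ci
        var = rT * (1 - rT) / ni + rC * (1 - rC) / mi  # sampling variance s_i^2
        y.append(rT - rC)                              # risk difference y_i
        w.append(1.0 / var)
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))
```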
A.4. The conditional maximum likelihood score statistic
This is by far the most complex of the test statistics considered and is unlikely to be adopted for routine use. It is included only for comparison with the four non-iterative statistics. The conditional likelihood is the conditional distribution of the observed data assuming that all marginal totals are fixed (Cox, 1972; Breslow and Day, 1980) and is expressed in terms of the assumed common odds ratio ψ and the observed numbers of successes, ri, in each of the treatment groups as

$$L(\psi) = \prod_{i=1}^{k} \frac{\binom{n_i}{r_i}\binom{m_i}{t_i - r_i}\,\psi^{r_i}}{\sum_{u}\binom{n_i}{u}\binom{m_i}{t_i - u}\,\psi^{u}},$$

where the sum in each denominator runs over max(0, ti−mi)≤u≤min(ti, ni). The maximum of this expression as a function of ψ can be found (for example by the Newton–Raphson method) to give the conditional maximum likelihood estimate of the odds ratio, ψ̂. Liang and Self (1985) describe a homogeneity statistic based on ψ̂ given by

$$Q_{mle} = \sum_{i=1}^{k} \frac{(r_i - E_i(\hat{\psi}))^2}{V_i(\hat{\psi})},$$

where Ei(ψ̂) and Vi(ψ̂) are the exact conditional mean and variance of ri. Again this is approximately distributed as a χ2 random variable with k−1 degrees of freedom.
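While the full iterative maximization is beyond a short example, the exact conditional mean and variance of ri for a given odds ratio ψ can be computed directly from the noncentral hypergeometric weights; the following Python sketch (names are ours) illustrates the computation underlying Ei and Vi:

```python
from math import comb

def conditional_moments(n, m, t, psi):
    """Exact mean and variance of r_i under the conditional (noncentral
    hypergeometric) distribution with margins (n, m, t) and odds ratio psi."""
    lo, hi = max(0, t - m), min(t, n)          # support of r_i
    support = range(lo, hi + 1)
    weights = [comb(n, u) * comb(m, t - u) * psi ** u for u in support]
    total = sum(weights)
    mean = sum(u * w for u, w in zip(support, weights)) / total
    second = sum(u * u * w for u, w in zip(support, weights)) / total
    return mean, second - mean ** 2
```

When psi=1 these reduce to the ordinary hypergeometric moments, e.g. mean ni ti/Ni.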
A.5. The Breslow–Day test statistic
Breslow and Day (1980) (see also Paul and Donner, 1992) proposed a homogeneity statistic based on the Mantel–Haenszel estimator (Mantel and Haenszel, 1959) of the odds ratio, which in the notation of Table 1 is

$$\psi_{MH} = \frac{\sum_{i=1}^{k} r_i(m_i - s_i)/N_i}{\sum_{i=1}^{k} s_i(n_i - r_i)/N_i}.$$

The statistic is defined as

$$Q_{BD} = \sum_{i=1}^{k} \frac{(r_i - e_i(\psi_{MH}))^2}{v_i(\psi_{MH})},$$

where ei(ψMH) is the expected value of ri given ψMH, and vi(ψMH) is an estimator of the variance of ri given the value of ψMH and conditional on the value of ti (see Paul and Donner, 1992). Each value of ei can be found by solving the quadratic equation

$$\frac{e_i(m_i - t_i + e_i)}{(n_i - e_i)(t_i - e_i)} = \psi_{MH}$$

and taking the unique root in the interval max(0, ti−mi)≤ei≤min(ti, ni) (Fleiss, 1981). vi is then obtained as

$$v_i = \left[\frac{1}{e_i} + \frac{1}{t_i - e_i} + \frac{1}{n_i - e_i} + \frac{1}{m_i - t_i + e_i}\right]^{-1}.$$
Although this looks complex, in practice it involves finding the root of k quadratic equations, followed by the usual summation to obtain the test statistic, and so is computationally inexpensive and would be easy to implement on a standard spreadsheet.
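This procedure is straightforward to sketch in Python (an illustration only; the names are ours and no special handling is included for degenerate tables whose fitted counts fall on the boundary of the interval):

```python
import math

def breslow_day_q(r, s, n, m):
    """Breslow-Day heterogeneity statistic Q_BD based on the
    Mantel-Haenszel estimator of the common odds ratio."""
    # Mantel-Haenszel common odds ratio, psi_MH
    num = sum(ri * (mi - si) / (ni + mi) for ri, si, ni, mi in zip(r, s, n, m))
    den = sum(si * (ni - ri) / (ni + mi) for ri, si, ni, mi in zip(r, s, n, m))
    psi = num / den
    q = 0.0
    for ri, si, ni, mi in zip(r, s, n, m):
        t = ri + si
        # fitted count e solves psi = e(m-t+e) / ((n-e)(t-e)), i.e. the
        # quadratic (psi-1)e^2 - [psi(n+t)+(m-t)]e + psi*n*t = 0
        a, b, c = psi - 1.0, -(psi * (ni + t) + mi - t), psi * ni * t
        if abs(a) < 1e-12:                       # psi = 1: equation is linear
            e = -c / b
        else:
            disc = math.sqrt(b * b - 4 * a * c)
            roots = ((-b - disc) / (2 * a), (-b + disc) / (2 * a))
            lo, hi = max(0.0, t - mi), min(t, ni)
            e = next(x for x in roots if lo <= x <= hi)  # unique admissible root
        v = 1.0 / (1 / e + 1 / (t - e) + 1 / (ni - e) + 1 / (mi - t + e))
        q += (ri - e) ** 2 / v
    return q
```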
Appendix B The simulation algorithm
B.1. Generation of group sizes
The group sizes for each of the trials in the meta-analysis are generated from a lognormal distribution (to ensure non-negativity) with mean N (SD σN). If a random variable X is normally distributed with mean μX and variance σX2, then Y=exp(X) has a lognormal distribution. It can be shown (see, for example, Dudewicz and Mishra, 1988) that the mean of Y is μY=exp(μX+σX2/2) and the variance of Y is σY2=exp(2μX+σX2)(exp σX2−1). The group sizes for all simulations were therefore generated by first simulating a normal random variable, x say, then obtaining the group size from n=nint[exp(x)], where ‘nint’ denotes the nearest integer. In practice (except for the values shown in Fig. 2) we used μX=3.8 and σX=0.48 to obtain a lognormal distribution with approximate mean N=50 (SD σN=25). For Fig. 2 we used μX=5.25 and σX=0.25 to obtain a lognormal distribution with approximate mean N=200 (SD σN=50).
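A short Python sketch of this generator, assuming the standard library random module (the function name is ours; the default parameters are the values quoted above):

```python
import math
import random

def group_size(mu_x=3.8, sigma_x=0.48, rng=None):
    """One lognormal group size, rounded to the nearest integer.

    The defaults give approximate mean 50 (SD 25), since
    exp(3.8 + 0.48**2 / 2) is roughly 50.2.
    """
    rng = rng or random
    return round(math.exp(rng.gauss(mu_x, sigma_x)))
```

Averaging many draws recovers a mean close to 50, as the formulas above predict.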
B.2. Generation of the EER for the random effects model
In the following simulation algorithm, the underlying CER is taken to be q and the underlying EER is p. If we are considering fixed effects, then we simply choose a value of the CER, q (this is taken to be 0.20 in all simulations), and of the EER, p (this is chosen in the range 0.2 up to 0.7 for fixed effect simulations), and we then follow the algorithm given below for this particular pair (p, q) for 10 000 simulations.
If we are considering random effects then we cannot, for example, simply allow the CER and EER to vary randomly about some mean with, say, a normally distributed random error, since we could then obtain values of p which lie outside the range [0,1] and are therefore unrealistic. We therefore perturb instead on the log-odds scale. We do this by first choosing underlying values of p and q (we use p=0.5 and q=0.2 for all simulations in Fig. 3) and calculating from these the underlying log-odds yp=ln[p/(1−p)] and yq=ln[q/(1−q)], which can take values over the whole real line and are asymptotically normally distributed. We then simulate normally distributed random errors, εp and εq, with mean 0 (SD σre) (we take σre to vary between 0.05 and 0.5 in Figs. 3 and 4) and add these to yp and yq. We then invert this process to obtain the perturbed values pi and qi for trial i from pi=exp(yp+εpi)/(1+exp(yp+εpi)) and qi=exp(yq+εqi)/(1+exp(yq+εqi)).
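This perturbation is easy to express in Python; a minimal sketch (function name ours), using the standard library random module:

```python
import math
import random

def perturbed_rate(rate, sigma_re, rng):
    """Perturb an event rate with N(0, sigma_re) noise on the log-odds scale.

    The back-transformation guarantees the result is a valid probability
    in (0, 1), which additive perturbation of the rate itself would not.
    """
    y = math.log(rate / (1 - rate)) + rng.gauss(0.0, sigma_re)
    return math.exp(y) / (1 + math.exp(y))
```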
To check the range of values that this process generates, we calculated the mean and standard deviation of the ‘perturbed’ pi and qi values generated in each simulation of 10 000 meta-analyses. We also calculated the mean and standard deviation of the ‘observed’ event rates, p̂i=nei/ni and q̂i=nci/ni, where nei and nci are the observed numbers of experimental and control events in trial i, which has ni patients in each group. Examples of the resulting means and standard deviations are given for the case k=20 in Table 2 for each value of σre used in Fig. 3, with underlying CER q=0.2 and EER p=0.5. These values allow us to quantify the effect that increases in σre have on the distribution of the observed event rates: a value of σre=0.15 increases the standard deviation of the observed experimental event rates by only about 10% (a small random effect), whilst values of σre=0.3 and 0.5 increase it by 37 and 78%, respectively.
Note that the mean value for qi is increasingly biased above the underlying CER value of 0.2 with increasing random effect size. This is due to the non-linear nature of the transformation from the perturbed log-odds scale back to the probability scale, which skews the distribution away from 0 in the probability scale. This also accounts for the slightly lower percentage changes in the standard deviation of the observed control event rates for a given value of σre in Table 2.
In Fig. 5 we also give histograms of the observed event rates, p̂i and q̂i, in each of the 200 000 (20×10 000) trials which were simulated to generate Fig. 3 for each value of σre. The case σre=0 corresponds to pure binomial variation (with random, lognormally varying group sizes), and it is clear that a small random effect of σre=0.1 makes little impression on the observed distributions, but for σre=0.3 and 0.5 there is a very marked effect on the observed distributions and we might expect that an effective homogeneity test would allow us to detect random effects of this order with high power.
B.3. Generation of the data for each simulation
In our simulations, each meta-analysis consists of k trials, with ni patients in each group (i.e. we use perfectly balanced designs). The simulation algorithm is then as follows:
I For each of the k trials in the meta-analysis:
- Generate a random number ni (lognormally distributed) which is the number of patients in each of the control and experimental groups.
- If we are considering random effects, generate the perturbed event rates, pi, qi, for each group as described above. If we are considering fixed effects then set pi=p, qi=q, the fixed underlying event rates.
- For each of the ni patients in the control group generate a random number, r say, uniformly distributed between 0 and 1. If r<qi then add 1 to the number of control events. This will result in a simulated value of the total number of control events, nci, and an observed control event rate of q̂i=nci/ni.
- Repeat the previous step for the experimental group (now using r<pi) to obtain the number of experimental events, nei, and the observed experimental event rate p̂i=nei/ni.
II Using the data from the k trials simulated in I, calculate each of the five homogeneity statistics.
III Repeat steps I and II 10 000 times and count the number of simulated trials in which each of the five homogeneity tests detects statistically significant heterogeneity (at the 10% level).
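Steps I–III can be sketched compactly in Python; the function below simulates the data for a single meta-analysis (step I). Names are ours, and the floor of 2 patients per group is our own small safeguard against degenerate trials, not part of the original algorithm:

```python
import math
import random

def simulate_meta_analysis(k, p, q, sigma_re, rng):
    """Simulate one meta-analysis of k balanced trials (step I).

    p, q: underlying EER and CER; sigma_re: SD of the log-odds
    perturbation (sigma_re = 0 gives the fixed-effects case).
    Returns parallel lists (r, s, n, m) in the notation of Table 1.
    """
    def perturb(rate):
        # random effect applied on the log-odds scale (Appendix B.2)
        y = math.log(rate / (1 - rate)) + rng.gauss(0.0, sigma_re)
        return math.exp(y) / (1 + math.exp(y))

    r, s, n, m = [], [], [], []
    for _ in range(k):
        ni = max(2, round(math.exp(rng.gauss(3.8, 0.48))))             # step 1
        pi, qi = (perturb(p), perturb(q)) if sigma_re > 0 else (p, q)  # step 2
        s.append(sum(rng.random() < qi for _ in range(ni)))  # step 3: control
        r.append(sum(rng.random() < pi for _ in range(ni)))  # step 4: treatment
        n.append(ni)
        m.append(ni)                                         # balanced design
    return r, s, n, m
```

Steps II and III then amount to computing the five statistics for each simulated meta-analysis and counting how often each exceeds the 10% critical value of the χ2 distribution with k−1 degrees of freedom.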
1A citation search on BIDS on each of these three papers yielded 1110 citations for the Yusuf et al. (1985) paper, 977 citations for the Woolf (1955) paper and 583 citations for the DerSimonian and Laird (1986) paper.
2We will use the term homogeneity statistics since we are testing whether the null hypothesis of homogeneity holds; if it does not, then we infer that there is evidence of heterogeneity between the trials.
3As we will show it also works adequately for the small effect sizes for which it was originally introduced by Yusuf et al. (1985), but is inappropriate for use with effect sizes typically seen in pain studies. It is worth noting that the associated estimate of the odds ratio proposed by Yusuf et al. (1985) is also severely biased for large effects (Greenland and Salvan, 1990) and is therefore also inappropriate for use in meta-analysis of pain studies. Instead we would recommend the use of the Mantel–Haenszel estimator of the log-odds ratio (Mantel and Haenszel, 1959), or perhaps more usefully from a clinical perspective reporting the effect size in terms of numbers-needed-to-treat or NNTs (McQuay and Moore, 1997).
4This paper also gives an interesting explanation of why homogeneity tests have such low power.
5These authors start from the assumption that the effect measure is normally distributed and then simulate data from an appropriate normal distribution. They do not therefore observe the effect size and group size variation in the performance of the homogeneity statistics that we have demonstrated.
Keywords: Homogeneity; Heterogeneity; Meta-analysis; Pain relief
© 2000 Lippincott Williams & Wilkins, Inc.