Monday, April 6, 2020

Power for inferiority tests in multi-lab replications

In a previous post, I provided a procedure for calculating power for inferiority tests for multi-lab studies. However, the power formula in this post was incorrect. I’m sorry for any confusion this caused (though based on the view count for that post, I gather that the damage was minimal). Here, I provide a corrected procedure, which I’ve validated with Monte Carlo simulations (information on the validation is provided on the bottom of this post).

Where did I go wrong?

So how did I mess this up? The short answer is that I checked my results against the power functions in TOSTER (Lakens, 2017; which are designed to calculate power for equivalence tests), when one bound was set to (negative) infinity. I assumed this was effectively giving power for an inferiority test. But this is not the case. Although an inferiority test is, in effect, just one of the two arms of an equivalence test, the math underlying the calculations for power is somewhat different.

Thankfully, I figured out a way to reasonably calculate power in the presence of heterogeneity. I describe it below. You’ll notice I’ve ripped off some of my old post.

Some basics

Researchers are increasingly interested in conducting multi-lab replications in order to, among other things, achieve high levels of statistical power that would otherwise be unattainable by a single lab. However, several recent multi-lab replications have turned up nonsignificant results (e.g., Many Labs 4; Klein et al, 2019). When the estimated effect across labs is small and nonsignificant, it can be useful to test whether the observed effect is smaller than a specified bound (e.g., the smallest effect that would have been theoretically meaningful; Anvari & Lakens, 2019). Testing against a specified bound to test whether the studied effect is smaller than that bound is known as an inferiority test. (When testing assess whether the studied effect falls within a set of two bounds, this approach is called an equivalence test; for a more thorough discussion, see Lakens, 2017). In a frequentist framework, an inferiority test can be carried out by checking whether the upper limit of a confidence interval excludes the specified inferiority bound – or equivalently, calculating a test statistic and corresponding p-value to assess whether the observed effect is significantly smaller than the bound.

Just as it’s important to conduct power analyses to determine a multi-lab study’s ability to detect effects, it’s important to assess the study’s ability to reject effects. Calculating power for inferiority tests is fairly straightforward, though the methods are less familiar to researchers than power calculations for typical significance tests. However, when conducting an inferiority test with data collected at multiple labs, heterogeneity across the labs can decrease the power of that test. I haven’t found a tutorial on how to calculate power for inferiority tests in this situation, so here’s my (second) attempt at devising a method.

Calculating power for inferiority tests in the presence of heterogeneity

Here, I am specifically focusing on a simple case of comparing two independent groups. Drawing directly from Julious (2004), the formula for power (\(1 - \beta\)) for inferiority tests is as follows:

\(1 - \beta = 1 - pt(t_{1-\alpha, df}, df, \lambda)\)

where \(pt()\) is the probability function of the t distribution, \(df=N_{total}-2\), \(\alpha\) is the specified Type I error rate, and \(\lambda\) is the noncentrality parameter. The formula for \(\lambda\) needs to account for the heterogeneity across labs. To do this, we use the following formula:


where \(n\) is the group size per lab, \(m\) is the number of labs, \(\delta\) is the population effect size (as a standardized mean difference), and \(d_{bound}\) is the inferiorty bound (also as a standardized mean difference), and \(L\) is the heterogeneity across labs (as explained variance). This formula for \(\lambda\) is adapted from the formula Julious provides, as well as the formula for the noncentrality parameter accounting for heterogeneity provided by Jake Westfall in this blog post.

As I noted previously, in many cases, it makes sense to assume \(\delta\) is 0. This represents a situation in which the effect does not exist (on average), but there may be heterogeneity around 0. That is, in some labs, the effect might be above or below zero for reasons other than sampling variation. I’ll expand on this issue later and talk about some situations in which I think this is plausible.

As heterogeneity (\(L\)) increases, power decreases, so selecting a reasonable value for this parameter is important. In the last version of this post, I agrued that 1% is a reasonable starting point, based on the results of Many Labs 2. Jake Westfall has provided a way of converting \(\tau\) into \(L\) (the proportion of lab variance). The formula he provides is \(L = \frac{\tau^2}{1-\tau^2}\). This formula can be useful if you have a good heterogeneity estimate from a meta-analysis. Many Labs 2 found that there was fairly little heterogeneity across labs for most effects they examined. Among effects for which there was some heterogeneity, the \(\tau\) estimates were around .10. Using the equation above, this converts to \(L\) = .01 (1%). However, often we should be more conservative and assume there could be greater heterogeneity, but I think 1% is a good place to start.

Here’s an R function to calculate power using this method:

pwr.inferiority <- function(n, m, d_population = 0, d_bound, L, alpha = .05) {
  ncp <- abs((d_population - d_bound)/(2*(sqrt((1/(2*m*n)) + (L/m)))))
  DF <- 2*(m*n) - 2
  pwr <- 1 - pt(qt(1 - alpha, DF), DF, ncp)

The arguments of this function match the math we were working through above: n is the group size per lab (not the total sample size!), m is the number of labs, d_population (0 by default) is the population effect size, d_bound is the inferiority bound, L is the proportion of lab heterogeneity, and alpha is the desired Type I error rate (.05 by default). The output is an estimate of power, given the specified parameters. To determine a sample size for acheiving a particular level of power, I suggest taking an iterative approach (i.e., try different sample sizes with reasonable values for the other parameters).

As an example can illustrate how the function works, let’s imagine a multi-lab study in which 5 labs each collect N = 200 (n = 100 per group) for a two-group between subjects experiment. First, what happens when there is no heterogeneity?

pwr.inferiority(100, 5, d_population = 0, d_bound = .20, L = 0)
## [1] 0.9351492

It looks like we are very well powered!

Now let’s imagine we have the same design, but there is 1% heterogeneity across labs.

pwr.inferiority(100, 5, d_population = 0, d_bound = .20, L = .01)
## [1] 0.5712866

Suddenly, power drops dramatically. Why? The lab heterogeneity has a huge effect on the precision of the effect size estimates.

One of the things Jake Westfall pointed out in the blog post I linked earlier was that, when there is heterogeneity, increasing the number of labs generally increases power to detect effects more efficiently than increasing the number of participants per lab. This principle applies for rejecting effects as well as detecting them. If we instead have 50 labs each collecting n = 10 per group (still total N = 1000), the power is much greater:

pwr.inferiority(10, 50, d_population = 0, d_bound = .20, L = .01)
## [1] 0.8925013

The larger number of labs is able to better cope with the heterogeneity, and the effect can be more precisely estimated.

Simulations suggest that the approach presented here is fairly accurate (see the bottom of the post for details), but it is limited in its applicability to two-group designs and under some circumstances may be a bit too liberal in its estimates. If you need precise estimates or are planning a more complex design, I suggest taking a simulation-based approach (though that, of course, is considerably more complex and demanding of resources).

An illustration based on the CLIMR project

In the previous version of this post, I illustrated this approach to power calculation using the CLIMR project, a multi-lab replication of key findings in the Construal Level Theory literature. I’m one of the coordinators of this project, along with Erik Mac Giolla, Sofia Calderon, and Karl Ask. Let’s have another look at that project, using the corrected method.

In this project, we’re planning to replicate four experiments, with a total of 64 labs (as of March 24, 2020). Each lab will collect approximately N = 25 participants per experiment (n = 12.5 per group on average). For this demonstration, we'll assume that the true effect of each experiment we replicate is zero but that procedural variations across labs cause the effects to vary in size. I’m not saying this is what will happen or that it’s even likely. But it’s useful to know how informative nonsignficant results can be. Let’s use the function above to create a power curve for inferiority tests. The inferiority bounds are on the x-axis, and power is on the y-axis.

Under these assumptions, we have a considerable amount of power to exclude quite small effect sizes. For instance, we’ll have 85.0% power to reject effects of d = .15 or larger and 97.3% power to reject effects of d = .20 or larger. This puts us in quite good shape for offering a severe test of these effects.

When and why should we care about heterogeneity in inferiority tests?

It’s reasonable to ask whether heterogeneity is even an issue likely to affect inferiority tests in multi-lab studies, but I think it’s important to consider. First, let’s consider why you would test for inferiority in the first place. Presumably, you are interested in assessing whether the observed average effect has been estimated with sufficient precision to reject some specified effect size. That effect size might represent the smallest value that would be of theoretical or practical significance. A well-planned multi-lab test of a theory could identify in advance how small an effect would have to be on average in order for the theory (or at least a crucial part of the theory) to be falsified. Any half-decent theory should at least specify the direction an effect is supposed to go, so typically the concern will be inferiority (“Is the effect smaller than .10?”) rather than equivalance (“Is the effect between -.10 and +.10?”). Thus, a significant inferiority test can be part of evidence useful for decisively demonstrating that an effect does not exist (or at least, that it does not exist in any meaningful way). Planning your sample size and required number of labs for inferiority testing, rather than only typical significance testing, is helpful for reducing the likelihood that you will end up with inconclusive results, like a nonsignificant finding that does not provide evidence against the smallest effect of interest.

So if planning for and testing inferiority is a good idea in principle, we ought to consider some of the factors that affect the power of the test. Heterogeneity is one of them. Variation in effect size might seem like a strange thing to consider when thinking about a situation in which the true effect is zero or very close to it. Why would there be variation in the effect if the effect does not exist? There are at least two situations I can think of in which heterogeneity is plausible even if the true effect is close to or exactly zero.

First, the true effect might occur only under selective circumstances. If the average situation in the multi-lab study was not one in which the phenomenon is likely to occur but sometimes it does (e.g., for some individuals or at some sites), you could end up with a very small mean effect size with nontrivial amounts of heterogeneity. Assuming that the test of the hypothesis was a fair representation of that prediction, this result would still be quite bad for a theory, but it could have important implications for future work.

Second, the effect might not be real at all, but procedural variations might cause “effects” to vary across labs. For instance, perhaps labs sampled differing proportions of non-naive participants who were able to guess the hypothesis. Methodological blemishes like this could result in artifactual heterogeneity. I think it’s quite likely that issues like these regularly arise but are frequently undetected.


Anvari, F., & Lakens, D. (2019). Using Anchor-Based Methods to Determine the Smallest Effect Size of Interest.

Julious, S. A. (2004). Sample sizes for clinical trials with normal data. Statistics in medicine, 23(12), 1921-1986.

Klein, R. A., Cook, C. L., Ebersole, C. R., Vitiello, C., Nosek, B. A., Chartier, C. R., … & Cromar, R. (2019). Many Labs 4: Failure to Replicate Mortality Salience Effect With and Without Original Author Involvement.

Lakens, D. (2017). Equivalence tests: a practical primer for t tests, correlations, and meta-analyses. Social psychological and personality science, 8(4), 355-362.


To validate this procedure for calculating power, I tested it in a series of Monte Carlo simulations. In these simulations, I simulated data with varying amounts of heterogeneity around an effect of 0, and I performed inferiority tests in three ways:

  1. with a one-sided test against an inferiority bound as though there was no heterogeneity (with the TOSTER package; \(\alpha = .05\)),
  2. with a one-sided inferiority test using an estimate from a random effects meta-analysis (with the metafor and TOSTER packages; \(\alpha = .05\)), and
  3. by fitting a linear mixed-effects model (fitting random slopes for the interaction between labs and condition) and checking whether the 90% CI for the effect excludes the inferiority bound (using lme4, calculating the CI using the t-distribution and the Satterthewaite DF).

Then I calculated the mean squared difference between the calculated power to the power simulated from 10,000 tests for 24 different combinations of inferiority bounds, samples sizes, labs, and heterogeneities. For the cases in which there was no heterogeneity, simulated power for the method that assumed there was no heterogeneity matched the calculations almost exactly (mean squared error < .0001). As one would expect, as heterogeneity increased, the simple method performed more poorly and became anti-conservative (mean squared error .05 when heterogeneity was 5%). Simulated power for the meta-analytic approach and the linear mixed model approach were both close to the calculated power overall (mean squared error .007 and .008 respectively). Thus, the power calculations were fairly accurate, though not exact. Generally, the meta-analytic and mixed model power in the simulations was a bit lower than the calculated power. The mixed model approach was especially conservative when there was no heterogeneity (mean squared error .02), possibly because the model’s random slopes were unwarranted and the standard error became inflated. Given these results, I would recommend treating power calculated using this method as a close but relatively liberal estimate.

There are some situations where you may need very precise estimates of power. Again, probably the most accurate and most flexible way to assess power for inferiority tests is to specify your own simulation-based power analysis, based on your specific design. However, using the formula and function I’ve provided here can give reasonable starting values for numbers of participants and labs in custom-built simulations.

Supplemental code

Code to reproduce the figure above

ds <- seq(from = 0, to = .60, by = .01)

dp <- 0

n <- 12.5
m <- 64

L <- .01

inf_power <- pwr.inferiority(n, m, d_population = dp, d_bound = ds, L = L)

inf_data <- data.frame(bound = ds, power = inf_power)

         x = bound,
         y = power
       )) +
    size = 1.5,
    color = "#b91372"
  ) +
    xintercept = .475,
    linetype = "longdash"
  ) +
    xintercept = .20,
    linetype = "dotted"
  ) +
    yintercept = .80,
    linetype = "dashed"
  ) +
    yintercept = .90,
    linetype = "dashed"
  ) +
    yintercept = .95,
    linetype = "dashed"
  ) +
    yintercept = .99,
    linetype = "dashed"
  ) +
    breaks = seq(0, .80, .05)
  ) +
    breaks = c(.10, .33, .66, .80, .90, .95, .99),
    labels = c(".10", ".33", ".66", ".80", ".90", ".95", ".99")
  ) +
    y = "Power",
    x = "Inferiority bound (d)",
    title = "Power for inferiority",
    subtitle = paste("Power for each experiment to reject effect sizes\n", m, 
" labs each collecting N = 25", 
" participants per experiment,\nassuming an average effect of ", 
dp, " and ", 100*L, "% heterogeneity across labs", 
sep = "")
  ) +
    color = FALSE
  ) +
    ylim = c(0, 1)
  ) +