# UPDATE: THE CONCLUSIONS OF THIS POST ARE WRONG

In my efforts to validate this approach with Monte Carlo simulations, I discovered that the formula I used to calculate power in this post is incorrect. DO NOT USE THE POWER CALCULATIONS IN THIS POST.

To view a corrected and updated post on this topic, with an explanation of what went wrong, click here. I’m retaining the original post below, as a record.

# The previous (incorrect) post

Researchers are increasingly interested in conducting multi-lab replications in order to, among other things, achieve high levels of statistical power that would otherwise be unattainable by a single lab. However, several recent multi-lab replications have turned up nonsignificant null results (e.g., Many Labs 4; Klein et al, 2019). When the estimated effect across labs is small and nonsignificant, it can be useful to test whether the observed effect is smaller than a specified bound (e.g., the smallest effect that would have been theoretically meaningful; Anvari & Lakens, 2019). Testing against a specified bound to test whether the studied effect is smaller than that bound is known as an inferiority test. (When testing assess whether the studied effect falls within a set of two bounds, this approach is called an equivalence test; for a more thorough discussion, see Lakens, 2017).

Just as it’s important to conduct power analyses to determine a multi-lab study’s ability to detect effects, it’s important to assess the study’s ability to reject effects. Calculating power for inferiority tests is fairly straightforward, though the methods are less familiar to researchers than power calculations for typical significance tests. Dan Lakens’s R package TOSTER provides handy functions for calculating power for equivalence and (non)-inferiority tests. However, when conducting an inferiority test with data collected at multiple labs, heterogeneity across the labs can decrease the power of that test. I haven’t found a tutorial on how to calculate power for inferiority tests in this situation, so I put this one together.

## A method for estimating power for two-group inferiority tests in multi-lab studies

Here, I’ve borrowed substantially from Jake Westfall (particularly this blog post from a few years ago) and from Dan Lakens’s work on the TOSTER package to offer a method for calculating power for two-group inferiority tests (between-subjects).

Based on the calculations Dan Lakens uses in TOSTER and based on Julious’s (2004) guide to samples size estimation for equivalence tests, this procedure I’m suggesting here assumes that power ($1-\beta$) for an inferiority test is given as

$1 - \beta = 2 pt(t_{1-\alpha, df}, df, \lambda) - 1$

where $pt()$ is the probability function of the t distribution, $df=N_{total}-2$ and $\alpha$ is the specified Type I error rate. The most important part of this equation is the noncentrality parameter $\lambda$. I’ve calculated the noncentrality parameter as

$\lambda=|\frac{\delta-d_{bound}}{2\sqrt{\frac{1}{2nm}+\frac{L}{m}}}|$

where $n$ is the group size per lab, $m$ is the number of labs, $\delta$ is the population effect size, and $d_{bound}$ is the inferiorty bound (i.e., the value you’re testing against to see if your observed effect size is smaller). For many purposes, it may make sense to assume that $\delta$ is zero, to represent a case in which you are studying a null effect. Setting $d_{bound}$ requires theoretical or practical reasoning (see Anvari & Lakens, 2019).

In the formula above, the value $L$ is the proportion of total variance in the effect size explained by the lab by condition interaction. In other words, this value represents the extent to which labs vary in their effect sizes. This is not mere sampling variation; we’re talking about true variation in the effects at different replication sites. In practice, this could be due to procedural differences, cultural differences, or just about anything that could vary across labs.

As I mentioned above, I am borrowing heavily from the work of Jake Westfall, who provided a version of the noncentrality parameter equation which I adapted into this one. He also provides a way of calculating $L$ from a $\tau$ – a common estimator of heterogeneity in meta-analyses: $L = \frac{\tau^2}{1-\tau^2}$

Jake takes .05 (5%) to be a reasonable value for $L$, based on data from Many Labs 1. This seems like a mostly fine assumption to me. But Many Labs 2, which was designed specifically to examine heterogeneity across labs, found that there was relatively little heterogeneity. Among effects for which there was some heterogeneity, the $\tau$ estimates were around .10. Using Jake’s equation, this gives us a reasonable default $L$ of .01 (1%) – but perhaps in some cases, we should be more conservative and assume there could be unusually large amounts of variation in effects across labs.

Let’s go back to the noncentrality parameter. This value is, very roughly speaking, the t value that represents a nonzero effect size in the population, and it specifies the location of the noncentral t distribution. The typical noncentrality parameter is essentially a t value, calculated by the mean difference divided by the standard error. The larger the standard error, the greater the variance. And the larger the variance, the lower the power and precision. This version of the noncentrality parameter we’re using here takes into account additional variance components. That is, it lets us specify how variation other than sampling error – namely lab heterogeneity – play a role in the effect size estimates we will observe in our multi-lab study. The greater the lab heterogeneity, the lower the power.

Here is a simple R function for calculating power using this approach:

pwr.inferiority <- function(n, m, d_population = 0, d_bound, L, alpha = .05) {

ncp <- abs((d_population - d_bound)/(2*(sqrt((1/(2*m*n)) + (L/m)))))

DF <- 2*(m*n) - 2

pwr <- 2*(1 - pt(qt(1 - alpha, DF), DF, ncp)) - 1

return(pwr)

} 

The arguments of this function match the math we were working through above: n is the group size per lab (not the total sample size!), m is the number of labs, d_population (0 by default) is the population effect size, d_bound is the inferiority bound, and L is the proportion of lab heterogeneity. The output is an estimate of power, given the specified parameters. To determine a sample size for acheiving a particular level of power, I suggest taking an iterative approach (i.e., try different sample sizes with reasonable values for the other parameters).

As an example can illustrate how the function works, let’s imagine a multi-lab study in which 5 labs each collect N = 200 (n = 100 per group) for a two-group between subjects experiment. When L is set to 0 (no lab heterogeneity), pwr.inferiority and TOSTER give similar results.

pwr.inferiority(100, 5, d_population = 0, d_bound = .20, L = 0)
##  0.8702984
powerTOSTtwo(alpha = .05, N = 100*5, low_eqbound_d = -Inf, high_eqbound_d = .20)
## The statistical power is 87.08 % for equivalence bounds of -Inf and 0.2 .
## 
##  0.8708419

My procedure is slighly more conservative with its estimates because TOSTER uses the normal distribution to calculate power, and my procedure uses the t-distribution.

Now let’s imagine we have the same design, but there is 1% heterogeneity across labs.

pwr.inferiority(100, 5, d_population = 0, d_bound = .20, L = .01)
##  0.1425731

Suddenly, power drops dramatically. Why? The lab heterogeneity has a huge effect on the precision of the effect size estimates.

One of the things Jake Westfall pointed out in the blog post I linked earlier was that, when there is heterogeneity, increasing the number of labs generally increases power to detect effects more efficiently than increasing the number of participants per lab. This principle applies for rejecting effects as well as detecting them. If we instead have 50 labs each collecting n = 10 per group (still total N = 1000), the power is much greater:

pwr.inferiority(10, 50, d_population = 0, d_bound = .20, L = .01)
##  0.7850025

The larger number of labs is able to better cope with the heterogeneity, and the effect can be more precisely estimated.

Let’s consider a real example: The CLIMR project, a multi-lab replication of several key findings of Construal Level Theory. I’m one of the coordinators of this project, along with Erik Mac Giolla, Sofia Calderon, and Karl Ask. As of March 24, a total of 64 labs had committed to assisting with data collection. We’ve described what this means for our ability to detect effects on our website. What about interpreting nonsignificant effects? Let’s assume that the true effect of each experiment we replicate is zero but that procedural variations across labs cause the effects to vary in size. I’m not saying this is what will happen or that it’s even likely. But it’s useful to know how informative nonsignficant results can be.

So on these assumptions, how much power to reject effects do we have? A pretty considerable amount of power, as it turns out! With this many labs, we have about 95% power to reject effects of .20 or greater. Again, we’re making a lot of assumptions here, but this bodes well for our ability to rule out fairly small effects if the results are nonsignificant.

## A caveat

I think this is a reasoanble approach for estimating power, but I would be more confident after verifying these calculations with Monte Carlo simulations. At present, I haven’t done this validation. If and when I do, I’ll update this post.

## Supplemental code

Code to reproduce the figure above:

ds <- seq(from = 0, to = .60, by = .01)

dp <- 0

n <- 12.5
m <- 64

L <- .01

inf_power <- pwr.inferiority(n, m, d_population = dp, d_bound = ds, L = L)

inf_data <- data.frame(bound = ds, power = inf_power)

ggplot(inf_data,
aes(
x = bound,
y = power
)) +
geom_line(
size = 1.5,
color = "#b91372"
) +
geom_vline(
xintercept = .475,
linetype = "longdash"
) +
geom_vline(
xintercept = .20,
linetype = "dotted"
) +
geom_hline(
yintercept = .80,
linetype = "dashed"
) +
geom_hline(
yintercept = .90,
linetype = "dashed"
) +
geom_hline(
yintercept = .95,
linetype = "dashed"
) +
geom_hline(
yintercept = .99,
linetype = "dashed"
) +
scale_x_continuous(
breaks = seq(0, .80, .05)
) +
scale_y_continuous(
breaks = c(.10, .33, .66, .80, .90, .95, .99),
labels = c(".10", ".33", ".66", ".80", ".90", ".95", ".99")
) +
labs(
y = "Power",
x = "Inferiority bound (d)",
title = "Power for inferiority",
subtitle = paste("Power for each experiment to reject effect sizes\n", m,
" labs each collecting N = 25",
" participants per experiment,\nassuming an average effect of ",
dp, " and ", 100*L, "% heterogeneity across labs",
sep = "")
) +
guides(
color = FALSE
) +
coord_cartesian(
ylim = c(0, 1)
) +
theme_classic()

## Update(s)

2020-03-29: In the original version of this post, I used $\tau$ as the symbol for both the noncentrality parameter (following Julious, 2004) and the meta-analytic heterogeneity estimator. Now, I use $\lambda$ as the noncentrality parameter. I don’t think I’ve ever seen the noncentrality parameter for a t distribution symbolized as $\lambda$, so don’t assume that other sources are going to match the notation here.