Saturday, March 28, 2020

Power for inferiority tests in multi-lab replications

Researchers are increasingly interested in conducting multi-lab replications in order to, among other things, achieve high levels of statistical power that would otherwise be unattainable by a single lab. However, several recent multi-lab replications have turned up nonsignificant null results (e.g., Many Labs 4; Klein et al, 2019). When the estimated effect across labs is small and nonsignificant, it can be useful to test whether the observed effect is smaller than a specified bound (e.g., the smallest effect that would have been theoretically meaningful; Anvari & Lakens, 2019). Testing against a specified bound to test whether the studied effect is smaller than that bound is known as an inferiority test. (When testing assess whether the studied effect falls within a set of two bounds, this approach is called an equivalence test; for a more thorough discussion, see Lakens, 2017).

Just as it’s important to conduct power analyses to determine a multi-lab study’s ability to detect effects, it’s important to assess the study’s ability to reject effects. Calculating power for inferiority tests is fairly straightforward, though the methods are less familiar to researchers than power calculations for typical significance tests. Dan Lakens’s R package TOSTER provides handy functions for calculating power for equivalence and (non)-inferiority tests. However, when conducting an inferiority test with data collected at multiple labs, heterogeneity across the labs can decrease the power of that test. I haven’t found a tutorial on how to calculate power for inferiority tests in this situation, so I put this one together.

A method for estimating power for two-group inferiority tests in multi-lab studies

Here, I’ve borrowed substantially from Jake Westfall (particularly this blog post from a few years ago) and from Dan Lakens’s work on the TOSTER package to offer a method for calculating power for two-group inferiority tests (between-subjects).

Based on the calculations Dan Lakens uses in TOSTER and based on Julious’s (2004) guide to samples size estimation for equivalence tests, this procedure I’m suggesting here assumes that power (\(1-\beta\)) for an inferiority test is given as

\(1 - \beta = 2 pt(t_{1-\alpha, df}, df, \lambda) - 1\)

where \(pt()\) is the probability function of the t distribution, \(df=N_{total}-2\) and \(\alpha\) is the specified Type I error rate. The most important part of this equation is the noncentrality parameter \(\lambda\). I’ve calculated the noncentrality parameter as


where \(n\) is the group size per lab, \(m\) is the number of labs, \(\delta\) is the population effect size, and \(d_{bound}\) is the inferiorty bound (i.e., the value you’re testing against to see if your observed effect size is smaller). For many purposes, it may make sense to assume that \(\delta\) is zero, to represent a case in which you are studying a null effect. Setting \(d_{bound}\) requires theoretical or practical reasoning (see Anvari & Lakens, 2019).

In the formula above, the value \(L\) is the proportion of total variance in the effect size explained by the lab by condition interaction. In other words, this value represents the extent to which labs vary in their effect sizes. This is not mere sampling variation; we’re talking about true variation in the effects at different replication sites. In practice, this could be due to procedural differences, cultural differences, or just about anything that could vary across labs.

As I mentioned above, I am borrowing heavily from the work of Jake Westfall, who provided a version of the noncentrality parameter equation which I adapted into this one. He also provides a way of calculating \(L\) from a \(\tau\) – a common estimator of heterogeneity in meta-analyses: \(L = \frac{\tau^2}{1-\tau^2}\)

Jake takes .05 (5%) to be a reasonable value for \(L\), based on data from Many Labs 1. This seems like a mostly fine assumption to me. But Many Labs 2, which was designed specifically to examine heterogeneity across labs, found that there was relatively little heterogeneity. Among effects for which there was some heterogeneity, the \(\tau\) estimates were around .10. Using Jake’s equation, this gives us a reasonable default \(L\) of .01 (1%) – but perhaps in some cases, we should be more conservative and assume there could be unusually large amounts of variation in effects across labs.

Let’s go back to the noncentrality parameter. This value is, very roughly speaking, the t value that represents a nonzero effect size in the population, and it specifies the location of the noncentral t distribution. The typical noncentrality parameter is essentially a t value, calculated by the mean difference divided by the standard error. The larger the standard error, the greater the variance. And the larger the variance, the lower the power and precision. This version of the noncentrality parameter we’re using here takes into account additional variance components. That is, it lets us specify how variation other than sampling error – namely lab heterogeneity – play a role in the effect size estimates we will observe in our multi-lab study. The greater the lab heterogeneity, the lower the power.

Here is a simple R function for calculating power using this approach:

pwr.inferiority <- function(n, m, d_population = 0, d_bound, L, alpha = .05) {
  ncp <- abs((d_population - d_bound)/(2*(sqrt((1/(2*m*n)) + (L/m)))))
  DF <- 2*(m*n) - 2
  pwr <- 2*(1 - pt(qt(1 - alpha, DF), DF, ncp)) - 1

The arguments of this function match the math we were working through above: n is the group size per lab (not the total sample size!), m is the number of labs, d_population (0 by default) is the population effect size, d_bound is the inferiority bound, and L is the proportion of lab heterogeneity. The output is an estimate of power, given the specified parameters. To determine a sample size for acheiving a particular level of power, I suggest taking an iterative approach (i.e., try different sample sizes with reasonable values for the other parameters).

As an example can illustrate how the function works, let’s imagine a multi-lab study in which 5 labs each collect N = 200 (n = 100 per group) for a two-group between subjects experiment. When L is set to 0 (no lab heterogeneity), pwr.inferiority and TOSTER give similar results.

pwr.inferiority(100, 5, d_population = 0, d_bound = .20, L = 0)
## [1] 0.8702984
powerTOSTtwo(alpha = .05, N = 100*5, low_eqbound_d = -Inf, high_eqbound_d = .20)
## The statistical power is 87.08 % for equivalence bounds of -Inf and 0.2 .
## [1] 0.8708419

My procedure is slighly more conservative with its estimates because TOSTER uses the normal distribution to calculate power, and my procedure uses the t-distribution.

Now let’s imagine we have the same design, but there is 1% heterogeneity across labs.

pwr.inferiority(100, 5, d_population = 0, d_bound = .20, L = .01)
## [1] 0.1425731

Suddenly, power drops dramatically. Why? The lab heterogeneity has a huge effect on the precision of the effect size estimates.

One of the things Jake Westfall pointed out in the blog post I linked earlier was that, when there is heterogeneity, increasing the number of labs generally increases power to detect effects more efficiently than increasing the number of participants per lab. This principle applies for rejecting effects as well as detecting them. If we instead have 50 labs each collecting n = 10 per group (still total N = 1000), the power is much greater:

pwr.inferiority(10, 50, d_population = 0, d_bound = .20, L = .01)
## [1] 0.7850025

The larger number of labs is able to better cope with the heterogeneity, and the effect can be more precisely estimated.

Let’s consider a real example: The CLIMR project, a multi-lab replication of several key findings of Construal Level Theory. I’m one of the coordinators of this project, along with Erik Mac Giolla, Sofia Calderon, and Karl Ask. As of March 24, a total of 64 labs had committed to assisting with data collection. We’ve described what this means for our ability to detect effects on our website. What about interpreting nonsignificant effects? Let’s assume that the true effect of each experiment we replicate is zero but that procedural variations across labs cause the effects to vary in size. I’m not saying this is what will happen or that it’s even likely. But it’s useful to know how informative nonsignficant results can be.

So on these assumptions, how much power to reject effects do we have?

A pretty considerable amount of power, as it turns out! With this many labs, we have about 95% power to reject effects of .20 or greater. Again, we’re making a lot of assumptions here, but this bodes well for our ability to rule out fairly small effects if the results are nonsignificant.

A caveat

I think this is a reasoanble approach for estimating power, but I would be more confident after verifying these calculations with Monte Carlo simulations. At present, I haven’t done this validation. If and when I do, I’ll update this post.


Anvari, F., & Lakens, D. (2019). Using Anchor-Based Methods to Determine the Smallest Effect Size of Interest.

Julious, S. A. (2004). Sample sizes for clinical trials with normal data. Statistics in medicine, 23(12), 1921-1986.

Klein, R. A., Cook, C. L., Ebersole, C. R., Vitiello, C., Nosek, B. A., Chartier, C. R., … & Cromar, R. (2019). Many Labs 4: Failure to Replicate Mortality Salience Effect With and Without Original Author Involvement.

Lakens, D. (2017). Equivalence tests: a practical primer for t tests, correlations, and meta-analyses. Social psychological and personality science, 8(4), 355-362.

Supplemental code

Code to reproduce the figure above:

ds <- seq(from = 0, to = .60, by = .01)

dp <- 0

n <- 12.5
m <- 64

L <- .01

inf_power <- pwr.inferiority(n, m, d_population = dp, d_bound = ds, L = L)

inf_data <- data.frame(bound = ds, power = inf_power)

         x = bound,
         y = power
       )) +
    size = 1.5,
    color = "#b91372"
  ) +
    xintercept = .475,
    linetype = "longdash"
  ) +
    xintercept = .20,
    linetype = "dotted"
  ) +
    yintercept = .80,
    linetype = "dashed"
  ) +
    yintercept = .90,
    linetype = "dashed"
  ) +
    yintercept = .95,
    linetype = "dashed"
  ) +
    yintercept = .99,
    linetype = "dashed"
  ) +
    breaks = seq(0, .80, .05)
  ) +
    breaks = c(.10, .33, .66, .80, .90, .95, .99),
    labels = c(".10", ".33", ".66", ".80", ".90", ".95", ".99")
  ) +
    y = "Power",
    x = "Inferiority bound (d)",
    title = "Power for inferiority",
    subtitle = paste("Power for each experiment to reject effect sizes\n", m,
 " labs each collecting N = 25",
 " participants per experiment,\nassuming an average effect of ", dp,
 " and ", 100*L, "% heterogeneity across labs",
 sep = "")
  ) +
    color = FALSE
  ) +
    ylim = c(0, 1)
  ) +


2020-03-29: In the original version of this post, I used \(\tau\) as the symbol for both the noncentrality parameter (following Julious, 2004) and the meta-analytic heterogeneity estimator. Now, I use \(\lambda\) as the noncentrality parameter. I don’t think I’ve ever seen the noncentrality parameter for a t distribution symbolized as \(\lambda\), so don’t assume that other sources are going to match the notation here.

Monday, February 3, 2020

Yet another decline effect in the deception cue literature

Last year, Lessons from Pinocchio, my paper critically reviewing the literature on deception cues, was published (Luke, 2019). Much of that paper involved reexamining data from the widely-cited meta-analytic review of cues to deception by DePaulo et al. (2003). I won’t summarize the whole paper here, but one of my many observations about the deception literature was that the absolute values of the meta-analytic effect sizes for deception cues were negatively related to the number of studies in which that cue had been studied, as well as the total number of participants with which that cue had been studied (see Bond, Levine, & Hartwig, 2015). In other words, the more a cue is studied, the less strong it tends to be. It is easy to see this pattern when plotting the total sample size used to study each cue against the cues’ effect sizes.

As I argued in Pinocchio, there is really no good reason for meta-analytic effects for deception cues to be distibruted like this. Since this isn’t a funnel plot, the effects should not necessarily, for a priori reasons, be getting more tightly grouped as the sample sizes increase. Once possible explanation for this is that the deception cue literature has been sampling from statistical noise. That would explain why there is a lot of spread in the effects at small sample sizes, and at larger sample sizes, the effects are tightly clumped around zero.

A few months ago, Tim Levine aptly pointed out to me that these estimates include a lot of imputed values for nonsignificant effects that could be calculated precisely. In these cases, DePaulo and her colleagues imputed d = 0 (or d = -.01 or .01 in cases where the direction could be discerned). Thus, the effects would naturally tend to move toward zero when more of these imputed values are added.

If we remove those imputed values, this is what the plot looks like:

You can see the same basic pattern, but it’s substantially less clear-cut than when using the full dataset. For example, two of the most studied cues get very large indeed. It’s important to bear in mind, however, that we’ve lost a substantial amount of information here by excluding effects that were too small to be significant. By only plotting the effects that could be extracted precisely, the effect sizes are all biased upward in this dataset. Nevertheless, we still see a tendency for effects to decline at higher samples, and the largest effects are virtually only observed at smaller samples.

Indeed, if you plot these effect sizes over the “Land of Toys” simulation from Pinocchio that assumes no true cues to deception and no publication bias, we can see that the distribution continues to overlap substantially. For some added perspective, bear in mind that this simulation assumed all researchers were on their very best behavior and reporting everything without bias. There is plenty of evidence that is not the case in the deception cue literature.

Recently, I was poking around in the data from DePaulo et al. (2003) again, and I was fascinated to discover that a pattern similar to what we see above occurs within studies. There is a negative relationship between the number of cues a study measured and the absolute value of its across-cue mean effect size. That is, the more cues a given study measured, the closer the average of all its cue effects was to zero. You can see this in the figure below.

Each point in this figure is a sample from DePaulo et al. (2003). Its mean effect size (calculated across all the cues it measured) is plotted on the horizontal axis, and the number of cues it measured is plotted on the vertical axis. The pattern is striking. With few exceptions, large within-study average effects are only observed when the number of cues is relatively small.

This pattern mostly holds even when we remove data the imputed effect sizes. Again, bear in mind that with this dataset, the effects are virtually guaranteed to be overestimates.

As with the cue-level analysis, there is no a priori reason to expect this relationship between the number of cues in a study and the average effect in that study. The only plausible explanation I can think of for this pattern is that deception researchers have been studying effects that are mostly or entirely statistical noise. That is, if you assume you’re sampling noise, when studying just one or a few cues, it is possible to observe large spurious effects, as a function of sampling error. But as you add cues, the average effect you observe converges on zero, since the random errors cancel each other out in the long run. This pattern in the empirical data isn’t positive evidence that all deception cues are noise (or are, at best, very small), but it is yet another observation that is consistent with that possibility and difficult to explain otherwise.

I didn’t report these patterns in Pinocchio because I hadn’t noticed them until recently. But the previous conclusions hold. In fact, the situation looks even bleaker than I had previously described.


Bond, C. F., Levine, T. R., & Hartwig, M. (2015). New findings in nonverbal lie detection. Deception detection: Current challenges and new directions, 37–58. Wiley Online Library.

DePaulo, B. M., Lindsay, J. J., Malone, B. E., Muhlenbruck, L., Charlton, K., & Cooper, H. (2003). Cues to deception. Psychological Bulletin, 129(1), 74–118.

Luke, T. J. (2019). Lessons from pinocchio: Cues to deception may be highly exaggerated. Perspectives on Psychological Science, 14(4), 646–671. doi: 10.1177/1745691619838258