Thursday, May 17, 2018

Intuitive benchmarks for effect sizes


Seasoned and fledgling researchers alike struggle to understand effect sizes. This difficulty manifests itself in a number of ways. For instance, when planning a study, researchers often find it challenging to select plausible effect sizes for a priori power analysis. Sometimes they choose to power their studies for unreasonably large effect sizes – perhaps out of misplaced optimism about the strength of their manipulations or out of a desire to minimize sample size requirements. Powering for a too-large effect invites numerous problems, such as leaving a study too poorly powered to detect a smaller, more plausible effect and greatly inflating the estimates of effects that do reach significance (Gelman & Carlin, 2014; Yarkoni, 2009). All manner of nastiness ensues when we fail to grasp effect sizes and power.

Most psychological scientists are probably familiar with Jacob Cohen’s (1992) benchmarks for effect sizes. On this scale, a standardized mean difference (d) of .20 is “small,” .50 is “medium,” and .80 is “large.” Another set of benchmarks derives from Richard, Bond, and Stokes-Zoota’s (2003) meta-meta-analysis of social psychology. They found that the overall average effect size in social psychology is d = .43, and they provide average effects for subfields as well (e.g., d = .34 for legal psychology, d = .26 for social influence; detailed data are available here). In his own work, Charlie Bond has used these empirical benchmarks to see how other effects stack up (e.g., lie detection accuracy, Bond & DePaulo, 2006). Simine Vazire has argued that these benchmarks provide a basis for assuming we will all need samples of at least N = 200.

Despite the availability of these examples and guides, researchers may still struggle to develop strong intuitions about effect sizes. Cohen speculated that researchers find the literature on effect sizes and power inaccessible and too mathematically advanced, but I suspect this issue is not (solely) mathematical. Rather, I think researchers are only rarely exposed to the kind of information that would help them develop functional heuristics for thinking about the magnitudes of effects.

My guess is that few psychologists, given d = .20 or .43, could provide concrete examples of effects approximately those sizes. Moreover, my experience is that researchers often don’t have a “feel” for the size of familiar effects – that is, statistical differences we know exist and encounter on a routine basis. For example, what is the effect size for the difference in temperature between July and August where you live? If you live in a climate that has appreciable seasons, you may have an intuitive idea of what those months tend to feel like, and perhaps you even have a rough idea of how they differ numerically. But if I pressed you for a Cohen’s d, you might draw a blank. I did – until I pulled National Weather Service data from the Central Park weather station and calculated it as d = .60*.
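If you want to see the arithmetic behind a number like that, here is a minimal sketch in R. The temperature values below are made-up placeholders, not the actual Central Park data:

```r
# Minimal sketch of a Cohen's d calculation. The temperatures are
# hypothetical placeholders, not the actual NWS Central Park data.
july   <- c(76.8, 78.2, 80.1, 77.5, 79.0)   # hypothetical July monthly means (F)
august <- c(75.9, 76.4, 78.8, 76.1, 77.3)   # hypothetical August monthly means (F)

pooled_sd <- sqrt((var(july) + var(august)) / 2)  # pooled SD for equal group sizes
(mean(july) - mean(august)) / pooled_sd           # Cohen's d
```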

We have an intuitive understanding that some familiar differences are easier to see than others. We know, for example, that people vary substantially in height but that men tend to be taller than women – and that this is a difference that is easy to see just by looking at a handful of people. Having effect size estimates for familiar comparisons can help guide expectations of effect sizes in research because we can compare our hypothesized effects to familiar effects as a kind of “smell test.” Do we expect a planned experimental manipulation to make people more aggressive than a control group with the kind of clearly noticeable magnitude with which men are heavier than women (d = .59; to borrow the example of Simmons, Nelson, and Simonsohn, 2013)? Or should we expect something more subtle?

Below, I’ve compiled some intuitive benchmarks for reference. For simplicity, I’ve restricted the list to standardized mean differences between two groups.

| Difference | d | Source |
| --- | --- | --- |
| A Song of Ice and Fire books are longer than Harry Potter books | 3.05 | Electric Literature infographic** |
| iPhone models after the 5 are bigger (length) | 3.00 | Apple product specs |
| Barack Obama's State of the Union addresses are more easily readable than George W. Bush's*** | 2.76 | Miller Center at UVA transcripts |
| Men are taller than women | 1.85 | Simmons, Nelson, & Simonsohn (2013) |
| iPhone models after the 5 are heavier | 1.39 | Apple product specs |
| Women own more pairs of shoes than men | 1.07 | Simmons, Nelson, & Simonsohn (2013) |
| Ryan Gosling movies are better reviewed than Ryan Reynolds movies | 0.75 | Rotten Tomatoes |
| Men weigh more than women | 0.59 | Simmons, Nelson, & Simonsohn (2013) |
| July is hotter than August in Central Park (since 2000) | 0.39 | National Weather Service (NWS) |
| Second-season episodes of Stranger Things are longer than first-season episodes | 0.27 | Netflix |
| Swedes are taller than Americans | 0.26 | NCD Risk Factor Collaboration |
| Harry Potter books are longer than Lord of the Rings books | 0.22 | Electric Literature infographic |
| 16-year-old girls are taller than 15-year-old girls**** | 0.10 | Centers for Disease Control and Prevention (CDC), Growth Chart for Girls |
| Meryl Streep movies are better reviewed than Tom Hanks movies***** | 0.09 | Rotten Tomatoes |
| Dutch people are taller than Swedes | 0.06 | NCD Risk Factor Collaboration |

In assembling this list, I’ve tried to include differences that (1) are familiar to people or about which people have stereotypes and (2) can be measured with relatively little error. Obviously, this list is totally unsystematic, and many gradations of effect size are not represented (but I would welcome additions to the list!). What are these effect sizes useful for? I am not seriously suggesting that we compare the effect estimates in our research to, say, the difference in critical reception between Ryan Gosling films and Ryan Reynolds films – at least not in a formal way. However, I am suggesting that it is occasionally useful to think in terms of effect sizes when making claims like, “I feel like the episodes were longer in the second season of Stranger Things.” Why? Because it puts numbers to our experiences. If we can characterize familiar differences in terms of effect size, we will be better able to intuitively judge whether effects in research – whether observed or hypothesized – seem too big, too small, or just right. In pseudo-Bayesian terms, this kind of intuitive knowledge can improve our informal priors. I humbly suggest using intuitive benchmarks like these as a tool for teaching about effect sizes, both to help students develop adaptive heuristics and because they’re kind of funny.

Researchers often fuss over effect sizes when making power calculations, whether designing new studies or assessing completed ones. Power lends itself well to visualization; power curves, for example, can be highly informative. Like formal effect size benchmarks, however, these visualizations may not be easily digested. To develop intuitions about power, it can be useful to see what power looks like in a less formal way.
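For the formal side, base R’s power.t.test() can trace out one of those curves. Here is a small sketch using the d = .43 “average” social psychology effect mentioned above (an illustration for this post, not part of the original analyses):

```r
# Sketch: power of a two-sample t test as a function of per-group n,
# using d = .43 (the average social psychology effect noted above).
n_per_group <- seq(10, 400, by = 10)
pwr <- sapply(n_per_group, function(n)
  power.t.test(n = n, delta = 0.43, sd = 1, sig.level = 0.05)$power)

plot(n_per_group, pwr, type = "l", xlab = "n per group", ylab = "Power",
     ylim = c(0, 1))
abline(h = 0.80, lty = 2)   # conventional 80% power threshold
```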

Imagine there are two subspecies of rabbits, and although individual rabbits vary in size, one group is systematically bigger than the other. If you saw a colony of each of these subspecies, how would they look? The appearance of each group would, of course, depend on the magnitude of their difference in size. Larger effects would be more apparent. How many rabbits would you need in each observed colony before you could see the difference? Thankfully, there is no need to simply imagine. I’ve written a simple Shiny app that illustrates two colonies containing a specified number of rabbits drawn from populations that differ in size at a specified d. The app displays bunnies from each colony spread out over plots in no particular pattern******. This simulates the experience of simply eyeballing groups in order to get an impressionistic understanding of them. Although looking at bundles of bunnies is no replacement for proper power calculations, this tool can provide an intuitive visualization of effect sizes and their detectability at different sample sizes.
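For the curious, the idea boils down to a few lines of R. The sketch below is a simplified stand-in rather than the app’s actual code (that lives at the OSF link at the end of the post): two samples are drawn from normal populations that differ by d, scattered at random positions, and plotted with point size standing in for rabbit size.

```r
# Simplified stand-in for the rabbit colony app (not the app's actual code).
set.seed(1)
n <- 40      # rabbits per colony
d <- 0.8     # population standardized mean difference

size_a <- rnorm(n, mean = 0)   # colony A body sizes (SD units)
size_b <- rnorm(n, mean = d)   # colony B body sizes, shifted by d

# Map body sizes onto positive point sizes for plotting
all_sizes <- c(size_a, size_b)
cex_a <- 1 + 2 * (size_a - min(all_sizes)) / diff(range(all_sizes))
cex_b <- 1 + 2 * (size_b - min(all_sizes)) / diff(range(all_sizes))

# Positions come from normal distributions centered at 0 with lots of variance
par(mfrow = c(1, 2))
plot(rnorm(n, 0, 5), rnorm(n, 0, 5), cex = cex_a, pch = 16, col = "grey40",
     axes = FALSE, xlab = "", ylab = "", main = "Colony A")
plot(rnorm(n, 0, 5), rnorm(n, 0, 5), cex = cex_b, pch = 16, col = "tomato",
     axes = FALSE, xlab = "", ylab = "", main = "Colony B")

# Observed (sample) effect size, which will wobble around the true d
(mean(size_b) - mean(size_a)) / sqrt((var(size_a) + var(size_b)) / 2)
```

Playing with n and d in a sketch like this gives a quick feel for when a true difference stops being visible to the naked eye.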

Alfred North Whitehead famously said that “[c]ivilization advances by extending the number of important operations we can perform without thinking about them.” I believe this quip applies to working with effect sizes as well. We will obviously still have to do the math – but I suspect researchers would be better off with handy shortcuts for thinking about effect sizes.

A collateral benefit is that researchers can argue about whether it’s plausible to expect an effect larger than the difference between Drive and Definitely, Maybe.

Update (2018-05-19): I have updated the rabbit colony app with some additional help text and a button to print the observed difference in size between the two colonies. Looking at the observed effect size (and its deviation from the population effect size) can give you a sense of how sampling variation looks, particularly when power is low.

Notes

* This is the difference between the monthly average temperatures for July and August. If d = .60 seems too high to all you New Yorkers, that might be because the gap has narrowed: restricting the data to the year 2000 and later gives d = .39. July and August have been more similar in temperature lately, it seems.


*** Barack Obama did not give a State of the Union address in his first year in office, but he did address a joint session of Congress, so I counted that speech too. Readability was calculated using Microsoft Word.

**** I’ve used this comparison elsewhere, and I’m pretty sure I stole it from Jacob Cohen (or maybe someone else) – but I can’t seem to find the original source.

Update (2018-05-19): Thanks to Erik Mac Giolla for referring me to p. 26 of Jacob Cohen's (1988) power analysis book. Cohen describes d = .20 as the difference in height between 16- and 15-year-old girls, assuming a mean difference of .5 inches and SD = 2.1. When calculating the effect size as d = .10, I assumed the mean height of 15-year-olds to be 162 cm, the mean of 16-year-olds to be 162.5 cm, and the SD to be 5 cm.
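In other words, the arithmetic is just the mean difference divided by the SD:

```r
(162.5 - 162) / 5   # = 0.1
```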

***** Meryl Streep and Tom Hanks have been in several films together, so this comparison violates the non-independence assumption.

****** Well, the pattern is determined by two vectors of random numbers drawn from normal distributions positioned at 0 with lots of variance.

References

Bond, C. F., Jr., & DePaulo, B. M. (2006). Accuracy of deception judgments. Personality and Social Psychology Review, 10, 214-234.
Centers for Disease Control and Prevention: https://www.cdc.gov/growthcharts/data/set2clinical/cj41c072.pdf
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.
Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science, 9, 641-651.
Miller Center at University of Virginia: https://millercenter.org/president
National Weather Service: https://www.weather.gov/okx/centralparkhistorical
NCD-RisC: http://ncdrisc.org/data-downloads.html
Richard, F. D., Bond, C. F., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331-363.
Simmons, J., Nelson, L., & Simonsohn, U. (2013). Life after P-Hacking. Meeting of the Society for Personality and Social Psychology, New Orleans, LA, 17-19 January 2013. Available at SSRN: https://ssrn.com/abstract=2205186
Yarkoni, T. (2009). Big correlations in little studies: Inflated fMRI correlations reflect low statistical power—Commentary on Vul et al. (2009). Perspectives on Psychological Science, 4, 294-298.

Data assembled for the effect size calculations and code for the rabbit colony app can be found here: https://osf.io/kzavb/