Seasoned and fledgling researchers alike struggle to understand
effect sizes. This difficulty manifests itself in a number of ways. For
instance, when planning a study, researchers often find it challenging to select
plausible effect sizes for a priori
power analysis. Sometimes they choose to power their studies unreasonably large
effect sizes – perhaps out of misplaced optimism about the strength of their
manipulations or out of a desire to minimize sample size requirements. Powering
for a too-large effect invites numerous problems, such as rendering a study too
poorly powered to detect a smaller, more plausible effect and greatly inflating
significant effects (Gelman & Carlin, 2014; Yarkoni, 2009). All manner of
nastiness ensues when we fail to grasp effect sizes and power.
Most psychological scientists are probably familiar with
Jacob Cohen’s (1992) benchmarks for effect sizes. On this scale, a standardized
mean difference (d) of .20 is “small,”
.50 is “medium,” and .80 is “large.” Another set of benchmarks derives from
Richard, Bond, and Stokes-Zoota’s (2003) meta-meta-analysis of social psychology.
They found that the overall average effect size in social psychology is d = .43, and they provide average effects
for subfields as well (e.g., d = .34
for legal psychology, d = .26 for
social influence; detailed data are available here). In his own work, Charlie
Bond has used these empirical benchmarks to see how other effects stack up
(e.g., lie detection accuracy, Bond & DePaulo, 2006). Simine
Vazire has argued that these benchmarks provide a basis for assuming we will
all need samples of at least N = 200.
Despite the availability of these examples and guides,
researchers may still struggle to develop strong intuitions about effect sizes.
Cohen speculated that researchers find the literature on effect sizes and power
inaccessible and too mathematically advanced, but I suspect this issue is not (solely)
mathematical. Rather, I think researchers are only rarely exposed to the kind
of information that would help them develop functional heuristics for thinking
about the magnitudes of effects.
My guess is that few psychologists, given d = .20 or .43, could provide concrete
examples of effects of approximately those sizes. Moreover, my experience is that
researchers often don’t have a “feel” for the size of familiar effects – that
is, statistical differences we know exist and encounter on a routine basis. For
example, what is the effect size for the difference in temperature between July
and August where you live? If you live in a climate that has appreciable
seasons, you may have an intuitive idea of what those months tend to feel like, and perhaps you even have a rough
idea of how they differ numerically. But if I pressed you for a Cohen’s d, you might draw a blank. I did – until
I pulled data from the National Weather Service's Central Park weather station and calculated it as d = .60*.
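For anyone who wants to run this kind of check on their own local data, here is a minimal sketch of the calculation – a pooled-standard-deviation Cohen's d – with made-up temperature values standing in for the NWS Central Park monthly averages (the real data are in the OSF repository linked at the end of this post).

```r
# A minimal sketch of a pooled-SD Cohen's d; the temperature values below are
# made up for illustration (the real NWS data are in the OSF repository).
july   <- c(76.8, 78.1, 77.5, 74.9, 79.2)   # hypothetical July monthly means (deg F)
august <- c(75.6, 76.9, 74.2, 76.1, 77.0)   # hypothetical August monthly means (deg F)

pooled_sd <- sqrt(((length(july) - 1) * var(july) +
                   (length(august) - 1) * var(august)) /
                  (length(july) + length(august) - 2))
cohens_d <- (mean(july) - mean(august)) / pooled_sd
cohens_d
```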
We have an intuitive understanding that some familiar differences
are easier to see than others. We know, for example, that people vary substantially
in height but that men tend to be taller than women – and that this is a
difference that is easy to see just by looking at a handful of people. Having effect
size estimates for familiar comparisons can help guide expectations of effect
sizes in research because we can compare our hypothesized effects to familiar
effects as a kind of “smell test.” Do we expect a planned experimental
manipulation to make people more aggressive than a control group with the kind
of clearly noticeable magnitude with which men are heavier than women (d = .59; to borrow the example of Simmons,
Nelson, and Simonsohn, 2013)? Or should we expect something more subtle?
Below, I’ve compiled some intuitive benchmarks for reference.
For simplicity, I’ve restricted the list to standardized mean differences
between two groups.
| Difference | d | Source |
|---|---|---|
| A Song of Ice and Fire books are longer than Harry Potter books | 3.05 | Electric Literature infographic** |
| iPhone models after the 5 are longer | 3.00 | Apple product specs |
| Barack Obama's State of the Union addresses are more easily readable than George W. Bush's*** | 2.76 | Miller Center at UVA transcripts |
| Men are taller than women | 1.85 | Simmons, Nelson, & Simonsohn (2013) |
| iPhone models after the 5 are heavier | 1.39 | Apple product specs |
| Women own more pairs of shoes than men | 1.07 | Simmons, Nelson, & Simonsohn (2013) |
| Ryan Gosling movies are better reviewed than Ryan Reynolds movies | 0.75 | Rotten Tomatoes |
| Men weigh more than women | 0.59 | Simmons, Nelson, & Simonsohn (2013) |
| July is hotter than August in Central Park (since 2000) | 0.39 | National Weather Service (NWS) |
| Second-season episodes of Stranger Things are longer than first-season episodes | 0.27 | Netflix |
| Swedes are taller than Americans | 0.26 | NCD Risk Factor Collaboration |
| Harry Potter books are longer than Lord of the Rings books | 0.22 | Electric Literature infographic |
| 16-year-old girls are taller than 15-year-old girls**** | 0.10 | Centers for Disease Control and Prevention (CDC), Growth Chart for Girls |
| Meryl Streep movies are better reviewed than Tom Hanks movies***** | 0.09 | Rotten Tomatoes |
| Dutch people are taller than Swedes | 0.06 | NCD Risk Factor Collaboration |
In assembling this list, I’ve tried to include differences
that (1) are familiar to people or about which people have stereotypes and (2)
can be measured with relatively little error. Obviously, this list is
totally unsystematic, and many gradations of effect size are not represented
(but I would welcome additions to the list!). What are these effect
sizes useful for? I am not seriously suggesting that we compare our effect
estimates in our research to, say, the difference in critical reception of Ryan
Gosling films and Ryan Reynolds films – at least not in a formal way. However,
I am suggesting that it is occasionally useful to think in terms of effect sizes
when making claims like, “I feel like the episodes were longer in the second
season of Stranger Things.” Why?
Because it puts numbers to our experiences. If we are able to characterize familiar
differences in terms of effect size, we will be better able to intuitively judge
whether effects in research – whether observed or hypothesized – seem too big,
too small, or just right. In pseudo-Bayesian terms, this kind of intuitive
knowledge can improve our informal priors. I humbly suggest using intuitive
benchmarks like these as a tool for teaching about effect sizes – they can help students
develop adaptive heuristics, and they're kind of funny.
Researchers often fuss about effect sizes when making power
calculations, designing studies, or assessing completed studies. Power
lends itself well to visualizations, such as curves, which can be highly
informative. Like formal benchmarks of effect sizes, however, these
visualizations may not be easily digested. To develop intuitions about power, it
can be useful to know what power looks like in a less formal manner.
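For comparison with the informal approach below, here is a minimal sketch of that formal picture – conventional power curves for a two-sample t test, computed with R's built-in power.t.test (the effect sizes are chosen for illustration).

```r
# Power for a two-sample t test across per-group sample sizes,
# at a few illustrative effect sizes.
n_per_group <- seq(10, 300, by = 10)
ds <- c(0.2, 0.43, 0.8)  # Cohen's "small", the Richard et al. average, Cohen's "large"

power_mat <- sapply(ds, function(d)
  sapply(n_per_group, function(n)
    power.t.test(n = n, delta = d, sd = 1, sig.level = .05)$power))

matplot(n_per_group, power_mat, type = "l", lty = 1, lwd = 2,
        xlab = "n per group", ylab = "Power",
        main = "Power curves for a two-sample t test")
abline(h = 0.80, lty = 2)  # conventional 80% power reference line
legend("bottomright", legend = paste("d =", ds), col = 1:3, lty = 1, lwd = 2)
```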
Imagine there are two subspecies of rabbits, and although
individual rabbits vary in size, one group is systematically bigger than the
other. If you saw a colony of each of these subspecies, how would they look? The
appearance of each group would, of course, depend on the magnitude of their difference
in size. Larger effects would be more apparent. How many rabbits would you need
in each observed colony before you could see the difference? Thankfully, there
is no need to simply imagine. I’ve written a simple Shiny app that
illustrates two colonies containing a specified number of rabbits drawn from
populations that differ in size by a specified d. The app displays bunnies from each colony spread out over plots
in no particular pattern******. This simulates the experience of simply
eyeballing groups in order to get an impressionistic understanding of them. Although
looking at bundles of bunnies is no replacement for proper power calculations,
this tool can provide an intuitive visualization of effect sizes and their
detectability at different sample sizes.
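The app does the work for you, but for the curious, here is a rough sketch of the underlying idea (not the app's actual code, and the numbers are assumptions for illustration): draw n "rabbits" per colony from populations separated by d, scatter them at random positions, and let point size stand in for body size.

```r
# A rough sketch of the colony plots (not the Shiny app's code): two samples
# separated by d, scattered at random positions, with point size ~ body size.
set.seed(42)
n <- 30    # rabbits per colony (assumed for illustration)
d <- 0.6   # population standardized mean difference (assumed)

colony_a <- rnorm(n, mean = 0, sd = 1)
colony_b <- rnorm(n, mean = d, sd = 1)

# Rescale both colonies jointly to positive point sizes so the contrast shows
all_sizes <- c(colony_a, colony_b)
cex_all <- 1 + 2 * (all_sizes - min(all_sizes)) / diff(range(all_sizes))

par(mfrow = c(1, 2))
plot(runif(n), runif(n), cex = cex_all[1:n], pch = 16,
     xlab = "", ylab = "", axes = FALSE, main = "Colony A")
plot(runif(n), runif(n), cex = cex_all[(n + 1):(2 * n)], pch = 16,
     xlab = "", ylab = "", axes = FALSE, main = "Colony B")
```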
Alfred North Whitehead famously said that “[c]ivilization
advances by extending the number of important operations we can perform without
thinking about them.” I believe this quip applies to working with effect sizes
as well. We will obviously still have to do the math – but I suspect
researchers would be better off with handy shortcuts for thinking about effect
sizes.
A collateral benefit is that researchers can argue about
whether it’s plausible to expect an effect larger than the difference between Drive and Definitely,
Maybe.
Update (2018-05-19): I have updated the rabbit colony app with some additional help text and a button to print the observed difference in size between the two colonies. Looking at the observed effect size (and its deviation from the population effect size) can give you a sense of how sampling variation looks, particularly when power is low.
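If you want to see that sampling variation directly, here is a small simulation sketch (again, not the app's code; the population d and sample size are assumptions for illustration) showing how much the observed d wobbles around the population d when samples are small.

```r
# Observed d across many small samples from populations with a true d of 0.4
set.seed(1)
pop_d <- 0.4
n <- 20   # per group (small on purpose, to mimic a low-powered study)

observed_d <- replicate(1000, {
  a <- rnorm(n, 0, 1)
  b <- rnorm(n, pop_d, 1)
  (mean(b) - mean(a)) / sqrt((var(a) + var(b)) / 2)
})

hist(observed_d, breaks = 30,
     main = "Observed d across 1,000 samples (n = 20 per group)",
     xlab = "Observed d")
abline(v = pop_d, lwd = 2)  # the population effect size
```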
Notes
* This is the difference between the monthly average
temperatures for July and August. If d =
.60 seems too high to all you New Yorkers,
that might be because, if you restrict the data to the year 2000 and later, the difference
is d = .39. July and August have been
more similar in temperature lately, it seems.
** Infographic can be found here: https://electricliterature.com/infographic-word-counts-of-famous-books-161f025a6b09
*** Barack Obama did not give a State of the Union address
in his first year in office, but he did address a joint session of Congress, so
I counted that speech too. Readability was calculated using Microsoft Word.
**** I’ve used this comparison elsewhere,
and I’m pretty sure I stole it from Jacob Cohen (or maybe someone else) – but I
can’t seem to find the original source.
Update (2018-05-19): Thanks to Erik Mac Giolla for referring me to p.26 of Jacob Cohen's (1988) power analysis book. Cohen describes d = .20 as the difference in height between 16 and 15 year old girls, assuming mean difference = .5 inches, SD = 2.1. When calculating the effect size as d = .10, I assumed the mean of 15-year olds to be 162cm, the mean of 16-year olds to be 162.5, and the SD to be 5cm.
***** Meryl Streep and Tom Hanks have been in several films
together, so this comparison violates the non-independence assumption.
****** Well, the pattern is determined by two vectors of
random numbers drawn from normal distributions positioned at 0 with lots of
variance.
References
Bond, C. F. Jr, & DePaulo, B. M. (2006). Accuracy of deception judgments. Personality and Social Psychology Review, 10, 214-234.
Centers for Disease Control and Prevention (CDC): https://www.cdc.gov/growthcharts/data/set2clinical/cj41c072.pdf
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.
Gelman, A., & Carlin, J. (2014). Beyond power calculations: assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9, 641-651.
Miller Center at University of Virginia: https://millercenter.org/president
National Weather Service: https://www.weather.gov/okx/centralparkhistorical
NCD-RisC: http://ncdrisc.org/data-downloads.html
Richard, F. D., Bond, C. F., & Stokes-Zoota, J. J. (2003). One Hundred Years of Social Psychology Quantitatively Described. Review of General Psychology, 7, 331–363.
Simmons, J., Nelson, L., & Simonsohn, U. (2013). Life after P-Hacking. Meeting of the Society for Personality and Social Psychology, New Orleans, LA, 17-19 January 2013. Available at SSRN: https://ssrn.com/abstract=2205186
Yarkoni, T. (2009). Big correlations in little studies: Inflated fMRI correlations reflect low statistical power—Commentary on Vul et al. (2009). Perspectives on Psychological Science, 4, 294-298.
Data assembled for the effect size calculations and code for the rabbit colony app can be found here: https://osf.io/kzavb/