Criterion-Based
Content Analysis (CBCA) is a component of Statement Validity Assessment (SVA),
which is a tool used to evaluate the credibility of statements – especially claims
of abuse. In brief, CBCA is used to distinguish between credible and fabricated
statements. CBCA comprises a set of 19 criteria on which a verbal statement is
evaluated, which I’ve summarized in a table below. Truthful statements are
expected to contain more of each criterion.
Criterion | Description |
1 | Logical structure |
2 | Unstructured production |
3 | Quantity of details |
4 | Contextual embedding |
5 | Description of interactions |
6 | Reproduction of conversations |
7 | Unexpected complications |
8 | Unusual details |
9 | Superfluous details |
10 | Details misunderstood |
11 | External associations |
12 | Subjective mental state |
13 | Others' mental states |
14 | Spontaneous corrections |
15 | Admitting lack of memory |
16 | Self-doubt |
17 | Self-deprecation |
18 | Pardoning the perpetrator |
19 | Details characteristic of the event |
CBCA is more
thoroughly described elsewhere, and for a better introduction, I direct the
interested reader to the several summaries that exist (e.g., Vrij, 2015). Admittedly,
I am not an expert on the application of CBCA. I only became particularly
interested in it as I was working on a statistical research project on the
potential for effect size inflation in the literature on cues to deception.
As part of my
literature review for that project, I took a closer look at a meta-analysis of
CBCA criterion validity in adults by Amado, Arce, Fariña, and Vilariño (2016,
hereafter AAFV)*. My interest was primarily in assessing the potential for
overestimation of composite measures designed to improve deception detection,
but as I dug deeper into the methods of the meta-analysis and the literature it
reviewed, I became increasingly concerned that the review and primary studies
had serious methodological and statistical flaws.
Causes for concern
One of the things that initially alarmed me about the effect estimates
in AAFV was that, in many of the studies they included, criteria had been left
unreported or minimally reported. That is, sometimes researchers said they
coded certain criteria but omitted them from their report because they didn’t
significantly discriminate between truthful and fabricated statements or
because they occurred so infrequently that they were virtually useless. There
are at least two problems with this, one statistical and one practical.
The statistical problem is that if researchers are in the habit of
selectively reporting criteria like this, it will not be possible to extract
estimates for these underreported criteria, since they are more likely to be
reported if they are significant and “useful.” This means long-run meta-analytic estimates will chronically overestimate the effects because they regularly incorporate artificially inflated estimates from primary studies.
The related practical problem is that a practitioner applying CBCA to an individual case cannot necessarily (or generally) know whether to ignore a criterion because it isn’t useful or appropriate in that context.
Ignoring specific criteria in a study is a decision made from the privileged
position of being able to examine numerous statements. In a real case, the
absence of a criterion does not occur against the backdrop of absences in other
statements drawn from the same population, so its ignorability cannot be
inferred. Thus, if a criterion fails to distinguish between truthful and
fabricated statements in a given sample, ignoring it in that sample will
overstate the detectability of deceit using CBCA.
Take “reproduction of conversations” for example. This criterion is
sometimes purposefully ignored in experiments in which the stimulus material
described in the sampled statements does not contain any conversations (so it
isn’t sensible to expect conversations to be reproduced). Under real world
conditions, however, a CBCA practitioner could not, by definition, know whether
there were actually conversations in the alleged event, so he or she could not
make an informed decision about the applicability of the criterion. Sometimes
criteria are ignored in studies in which the criterion just so happened to be very
rare or totally absent in the sampled statements, even when they could have
plausibly occurred. This causes similar problems of generalizability to the
real world.
There are ways of attempting to compensate for failures to report statistics in a form that allows exact effect sizes to be calculated. In their
meta-analytic review of cues to deception, DePaulo and her colleagues (2003)
attempted to correct for minimally reported cue effects (e.g., cues that were
reported as nonsignificant without sufficient statistics to calculate an effect
size) by imputing 0 for those effects (or, when it was possible to determine a
direction of the effect, .01 or -.01). As far as I can tell, AAFV didn’t
implement any similar imputation to account for minimally reported or unreported
CBCA criteria. I’ll return to this later.
Other aspects of AAFV’s methods troubled me. For instance, I noticed
some of the reports marked for inclusion did not seem to fit the stated
inclusion criteria. For example, they report having included Evans et al.
(2013) – but the studies reported in that document don’t use statements as the
unit of analysis. Rather, they report the results of multiple judgments of a
large pool of statements made by participant observers.
Another issue concerned the establishment of ground truth in the primary
studies. In order to draw inferences about the diagnostic value of the CBCA
criteria, we need to know with a high degree of certainty that the statements
classified as true and fabricated are in fact respectively true and fabricated.
In experiments, this tends to be fairly straightforward, since the researchers
are typically in direct control of participants’ experiences and have
assigned people to tell the truth or lie. In field studies, establishing ground
truth can be exceedingly challenging, if not outright impossible. The field
studies included in AAFV employed a variety of methods to establish ground
truth, some of which strike me as quite problematic.
Several field studies relied on documented forensic evidence or court
decisions. This seems like a reasonable approach to establishing ground truth
in the field (indeed, perhaps the only way), but it may be subject to error and
selection biases. There are obvious ways in which these decisions could go
wrong. Forensic evidence can be non-diagnostic, and court decisions are
sometimes incorrect. Selection biases are more subtle: The only cases that can
be included are those in which there are apparently firm decisions and/or
extensive documentation. That is, the included cases may represent the easiest
of cases in which to assess credibility. Perhaps these are the very cases in
which truthful claims look the way CBCA expects them to look, to the exclusion
of truthful claims in more ambiguous cases that may not be rich in CBCA
criteria. Thus, statements drawn from them may overestimate the diagnosticity
of CBCA criteria. Of course, we have no way of knowing whether this possibility
obtains in reality.
Other field studies took alternate approaches. Rassin and van der Sleen
(2005) asked police officers to code statements from their own cases using
a provided checklist. Ground truth was established by the officers’ judgments
and court decisions. This presents a variation on the problem described above,
as it adds another potential source of selection bias in the officers’
discretion. Ternes (2009) solicited statements from a sample of incarcerated
people, without instructing them to lie or tell the truth. Statements were
classified as credible and non-credible using
CBCA. The effect sizes contrasting the credible and non-credible statements
were, of course, tautological. The non-credible statements were classified as
such precisely because they exhibited fewer CBCA criteria. There was no chance
that the effect size estimates could fail to be in the expected direction.
I don’t mean to take individual researchers to task unfairly for the
compromises they had to make in order to conduct their work – particularly for
field studies, which impose hefty logistical challenges. Depending on their
respective research questions, it may have made perfect sense to have done
things the way they did.
That said, all this seemed to me like a recipe for overestimation.
Wanting to get a better understanding of what was going on, I wrote to the
corresponding author of AAFV to request the data**. After getting no response,
I decided the best option was to try to redo the meta-analysis myself. So I
did.
Replicating/reproducing AAFV
General approach
My intention was never to do a proper, original meta-analysis of the
CBCA literature. Rather, my goal was to assess the validity of the conclusions
and accuracy of the criterion estimates in AAFV. For this reason, I didn’t
attempt to exhaustively search the literature and instead just attempted to
obtain the documents included in AAFV. I obtained 38 of the 39 documents
included in AAFV. I didn’t get my hands on an unpublished dataset the original
authors had (i.e., Critchlow, 2011).
I registered my plan on OSF***.
Registration seemed particularly important here because I was approaching this
with a self-known bias: I already thought there were problems with the
estimates. Although I believe my registration proved useful, I still made
several judgment calls that were outside the coverage of the registration. For
instance, I didn’t register a rule about how to deal with dependent effect
sizes (e.g., several groups of fabricated statements compared against a single
truthful control group). Ultimately, in cases where there were multiple
dependent estimates, I extracted just one that seemed most appropriate (and
recorded which one that was in my notes). I did this to avoid over-weighting
any given sample.
Effect size extraction and imputation
When effect sizes were reported, I recorded them or converted them into d. More often, effect sizes had to be
manually calculated. I used the compute.es package in R to do these
calculations. Usually, this entailed entering means, standard deviations, and
group ns into the mes() function or
proportions and group ns into the propes()
function, but when these data were unavailable I sometimes converted t or F
statistics using tes() and fes() respectively.
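To make the calculation concrete, here is a minimal sketch of the kind of call involved; the summary statistics below are invented for illustration and are not values from any particular study.

```r
# Sketch only: the numbers below are made up for illustration.
library(compute.es)

# From group means, standard deviations, and sample sizes
# (e.g., truthful vs. fabricated statements scored on one criterion):
mes(m.1 = 3.2, m.2 = 2.7, sd.1 = 1.1, sd.2 = 1.0, n.1 = 40, n.2 = 40)

# From a reported t statistic and group sizes:
tes(t = 2.1, n.1 = 40, n.2 = 40)

# From a reported one-way F statistic and group sizes:
fes(f = 4.4, n.1 = 40, n.2 = 40)
```

Each call prints d alongside other effect size metrics and their variances.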
I only extracted effect sizes for the original 19 CBCA criteria. AAFV
also examined some additional criteria, but these are very rarely studied, and
as far as I know, not often used in practice. I also excluded reports of “total
CBCA score” because my issues with those scores are best addressed elsewhere.
Because of the concerns I described above, I wanted to account for the
missing and excluded criterion estimates and decided to take an approach
similar to the one taken by DePaulo et al. (2003). Each time a study reported excluding
a criterion, I coded the stated reason for the exclusion. I had registered a
plan to impute 0’s for effects excluded for several of the possible stated
reasons (e.g., because the behavior rarely occurred in the sample).
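As a rough sketch of what that imputation looks like in practice – with invented data, and exclusion reasons chosen purely for illustration rather than the exact set in my registration – the logic is simply to fill in a 0 wherever a criterion was coded but dropped from the report for one of the designated reasons:

```r
# Sketch with invented data: one row per criterion-by-study effect,
# with NA for d wherever a criterion was coded but left out of the report.
effects <- data.frame(
  study     = c("A", "A", "B", "B"),
  criterion = c(1, 6, 1, 6),
  d         = c(0.40, NA, 0.10, NA),
  reason    = c(NA, "rare in sample", NA, "absent from stimulus")
)

# Exclusion reasons that trigger imputation (illustrative only)
impute_reasons <- c("rare in sample", "nonsignificant")

effects$d_imputed <- ifelse(
  is.na(effects$d) & effects$reason %in% impute_reasons,
  0, effects$d
)
```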
We should be clear about the limitations of this conservative approach
to imputing effect sizes. Its effectiveness, I believe, pivots on the
representativeness of the literature (cf. Fiedler, 2011). This is a problem of
ecological validity in the Brunswikian sense of the term (Brunswik, 1955). If
we want to estimate the effects that occur out in the wild, these imputations
will improve the accuracy of the estimates assuming that the situations sampled
in the literature adequately represent the situations in which CBCA is applied.
That is, assuming that underreported criteria are excluded because they cannot
plausibly distinguish between truthful and deceptive statements in a given
situation, weighting underreported criteria toward zero will improve the
estimates, to the extent that situations that render those criteria useless
actually occur in reality. However, non-representative design in the literature
will bias the observed estimates (likely overestimating them, by oversampling
situations in which the criteria are generally more effective), but it will
also change the effect of the imputations. The imputations will underestimate
the criterion effect sizes if the empirical literature has, for example,
oversampled situations in which underreported criteria are ineffective relative
to their actual effect in the population****.
Because this approach is fraught with potential problems, I report the
results both with and without imputations below.
Results
All the data I extracted from the documents, my coding notes, and
scripts for the analyses and visualizations can be found here: osf.io/4w7tr/
Below, I have tabulated the effect size estimates for each criterion. The
table presents results from AAFV and the results from my replication, with and
without the imputations.
Criterion | AAFV d | AAFV k | Replication d | Replication k | Replication with imputations d | Replication with imputations k
1 | 0.48 | 30 | 0.15 | 22 | 0.11 | 31
2 | 0.53 | 27 | 0.41 | 20 | 0.29 | 27
3 | 0.55 | 35 | 0.31 | 27 | 0.25 | 33
4 | 0.19 | 29 | 0.19 | 19 | 0.11 | 31
5 | 0.27 | 29 | 0.21 | 23 | 0.15 | 31
6 | 0.34 | 34 | 0.38 | 21 | 0.27 | 29
7 | 0.25 | 29 | 0.17 | 21 | 0.13 | 27
8 | 0.31 | 35 | 0.19 | 23 | 0.14 | 29
9 | 0.14 | 27 | 0.29 | 21 | 0.19 | 29
10 | 0.22 | 5 | -0.01 | 7 | 0.00 | 26
11 | 0.26 | 22 | 0.19 | 15 | 0.12 | 26
12 | 0.18 | 28 | 0.17 | 21 | 0.11 | 30
13 | 0.09 | 31 | 0.18 | 22 | 0.13 | 31
14 | 0.16 | 29 | 0.14 | 22 | 0.09 | 29
15 | 0.25 | 34 | 0.23 | 22 | 0.18 | 26
16 | 0.20 | 26 | 0.23 | 18 | 0.16 | 24
17 | 0.04 | 13 | 0.27 | 11 | 0.11 | 25
18 | -0.02 | 8 | 0.08 | 7 | 0.00 | 24
19 | 0.28 | 5 | 0.29 | 3 | 0.04 | 23
You will notice that the numbers are not the same across the different
methods. There are wide discrepancies between my numbers (both the effect
estimates and the number of effects) and the original AAFV numbers.
AAFV reported a total of 476 effect estimates. I extracted 345. This is
a difference of 131. It is possible I missed some estimates. Some of the
missing estimates presumably came from the one unpublished dataset I didn’t obtain,
and I excluded some documents that AAFV included (e.g., Evans et al., 2013). Perhaps
AAFV included dependent effect sizes, which I excluded. I also made some
admittedly ad hoc judgment calls. For
example, in one case, I excluded a set of effect sizes because the standard
deviations for the criterion measures were implausibly small (e.g., M = 1.85, SD = .05) and would therefore have generated effect estimates that were
outrageously large. However, it’s not clear to me that these differences
account for all the discrepancies. Without examining AAFV’s original data, we
can’t say why there’s such a wide discrepancy.
Most of my estimates for the criterion effects are lower than those of
AAFV, but a few are higher. Some are quite different. For example, AAFV estimated
Criterion 1 (logical structure) as d
= .48, whereas I estimated it at d =
.15 (without imputations). AAFV estimated Criterion 10 (accurate details
misunderstood) as d = .22, whereas I
estimated it at d = -.01 (without
imputations). These differences attest to the fact that differing methods of
reviewing the literature can result in widely discrepant estimates.
Intuitively, we might be inclined to trust the effect sizes from the
larger number of estimates – that is, AAFV’s estimates. This intuition may be
misguided, however, if AAFV included many dependent effect sizes that I didn’t.
For example, Ternes (2009; the one with the tautological effects) reported data
to compute several sets of dependent effect sizes. I only extracted one set of
effects (i.e., one estimate per reported criterion) from that document, but if
AAFV extracted several sets, they may have given too much weight to poor
estimates (which had no chance of failing to support the hypotheses). I don’t
know what they did.
If we look at the estimates with the imputations, they are, of course,
generally lower than those without the imputations (as would be expected if you
added a bunch of 0’s to the data). But for many criteria, the imputations had
quite stark effects, suggesting that many criteria were excluded quite
frequently. For example, Criterion 2 (unstructured production) dropped from d = .41 to d = .29. Criterion 19 (details specific to the event) dropped from d = .29 to d = .04. Which estimates are more accurate is, of course, debatable.
To get an overall picture of all the criteria, I created a funnel plot
for all the effect sizes I extracted. This plot uses the data without
imputations. Vertical reference lines are drawn at 0 and the weighted mean
effect size for all criteria (d = .22
– approximately the difference in length between Harry Potter and Lord of the Rings books), with vertical lines drawn for the 95% confidence bounds for the
meta-analytic effect size. The funnel guide lines are drawn for the 95% and 99%
confidence levels.
You can see there are many outlying effect estimates – well outside the
99% confidence bounds – especially positive outliers at lower standard errors.
The positive outliers tend to show larger absolute effects than the negative
outliers. It’s possible this is a symptom of publication bias. You can also see
that the effect sizes seem to center closer to zero at higher standard errors –
but these estimates come from just a few samples, so we should be cautious
about overinterpreting them.
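For anyone who wants to reproduce a plot along these lines from the posted data, a minimal sketch of the general approach is below; the data frame here is invented and stands in for my extraction spreadsheet (d = effect size, vi = its sampling variance).

```r
# Sketch with a made-up data frame standing in for the real extraction data.
library(metafor)

set.seed(1)
dat <- data.frame(d  = rnorm(60, mean = 0.2, sd = 0.3),
                  vi = runif(60, min = 0.01, max = 0.2))

res <- rma(yi = d, vi = vi, data = dat)   # random-effects meta-analysis

funnel(res,
       level   = c(95, 99),               # 95% and 99% funnel guide lines
       shade   = c("white", "gray90"),
       refline = 0)                       # center the funnel at 0
abline(v = coef(res), lty = 2)            # weighted mean effect size
```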
A common approach to compensating for publication bias is the trim and
fill technique. I applied this technique using the trimfill() function of the metafor
package for R, specifying imputation of missing effects on the negative side*****.
The trim and fill estimated that a total of 59 effects were suppressed by
publication bias, and the adjusted weighted mean effect size was a much more
modest d = .06 (approximately the difference in review quality between
Meryl Streep movies and Tom Hanks movies). Although 59 estimates might seem like a lot,
bear in mind that CBCA comprises 19 criteria. Thus, this is similar to an
imputation of about three studies finding negative effects for all criteria.
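Continuing the sketch above, the adjustment itself is a one-liner in metafor; side = "left" tells trimfill() to impute the potentially suppressed effects on the negative side of the funnel.

```r
# Continuing the sketch above: trim and fill on the fitted model,
# imputing potentially suppressed effects on the negative (left) side.
res_tf <- trimfill(res, side = "left")
summary(res_tf)   # adjusted estimate plus the number of imputed effects
funnel(res_tf)    # funnel plot with the imputed ("filled") effects added
```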
Experiments and field studies
I had registered a plan to test whether effect sizes varied as a
function of different methods of establishing ground truth, but there were only
a handful of field studies, so I opted not to run these analyses, as I doubt
they would be meaningful.
However, you’ll notice in the table below that criterion effect
estimates from experiments tend to be substantially lower than estimates from
field studies. This occurred both in my replication and in AAFV (see their
moderator analyses). The experimental effects were quite modest and are
generally in line with the rest of the deception literature. The median effect
size for cues to deception in DePaulo et al. (2003) is |.10|. Without
imputations, the median effect size for experiments is .12 (the mean is .13). Field
studies, in contrast, had much larger effects. Without imputations, the median
effect size for field studies is .46 (the weighted mean is .35). All the
estimates in the table below are from my replication, not from AAFV. AAFV
estimated the mean effects for experiments and field studies to be higher, at
.25 and .75 respectively.
Criterion | Experiments d | Experiments k | Experiments with imputations d | Experiments with imputations k | Field studies d | Field studies k | Field studies with imputations d | Field studies with imputations k
1 | 0.04 | 16 | 0.02 | 23 | 0.69 | 3 | 0.42 | 5
2 | 0.19 | 13 | 0.12 | 19 | 0.80 | 4 | 0.63 | 5
3 | 0.28 | 20 | 0.21 | 25 | 0.12 | 4 | 0.10 | 5
4 | 0.13 | 13 | 0.08 | 23 | 0.16 | 3 | 0.09 | 5
5 | 0.17 | 17 | 0.12 | 23 | 0.18 | 3 | 0.10 | 5
6 | 0.27 | 15 | 0.18 | 21 | 0.64 | 3 | 0.41 | 5
7 | 0.04 | 15 | 0.03 | 19 | 0.52 | 3 | 0.33 | 5
8 | 0.16 | 18 | 0.13 | 21 | 0.30 | 3 | 0.20 | 5
9 | 0.27 | 15 | 0.16 | 22 | 0.51 | 3 | 0.38 | 4
10 | -0.12 | 5 | -0.04 | 19 | 0.69 | 1 | 0.12 | 4
11 | 0.02 | 10 | 0.02 | 19 | 0.60 | 3 | 0.38 | 5
12 | 0.08 | 15 | 0.05 | 23 | 0.22 | 3 | 0.17 | 4
13 | 0.17 | 15 | 0.11 | 23 | -0.12 | 4 | -0.08 | 5
14 | 0.04 | 15 | 0.03 | 21 | 0.29 | 4 | 0.23 | 5
15 | 0.26 | 16 | 0.21 | 19 | 0.07 | 4 | 0.07 | 4
16 | 0.12 | 13 | 0.09 | 17 | 0.65 | 3 | 0.47 | 4
17 | -0.06 | 5 | -0.02 | 18 | 0.67 | 3 | 0.49 | 4
18 | -0.28 | 2 | -0.03 | 17 | -0.11 | 3 | -0.08 | 4
19 | 0.00 | 1 | 0.00 | 16 | 0.18 | 1 | 0.06 | 4
A prosaic explanation for this apparent difference between experiments
and field studies is ordinary sampling variation. We might expect there to be
greater variation in estimates with fewer studies. It is also possible that
there is a genuine systematic difference between effects in the lab and effects
in the field.
It might be tempting to think that CBCA is more effective in real cases than in artificially produced statements obtained under controlled laboratory conditions. This is possible. Bear in mind, however, that field
studies are also confounded with their method of establishing ground truth,
which is necessarily inferior to experimentally established ground truth. For
the reasons I noted earlier, it’s possible for selection biases to produce
overestimated effects under suboptimal conditions for establishing ground
truth.
No matter the cause, it seems that the larger effects in the literature
are primarily driven by the field studies, which produce dramatically larger
estimates. Seven of them are over .60. Another two are above .50. For intuitive
comparison, d = .59 represents the
difference in weight between men and women (Simmons, Nelson, & Simonsohn,
2013). One wonders if it’s plausible for any criterion to have a true effect
that is quite this large, let alone seven. Anything is possible, but this
strikes me as unlikely, given the overall weakness of cues to deception (DePaulo
et al., 2003).
What does all this mean?
You should not trust my results – at least not blindly. I was operating
alone, with no one to check my work and no one with whom to establish the
reliability of my coding of exclusion justifications. Curious or skeptical
readers should look at my notes and my data and check for errors. I was, of
course, trying to arrive at accurate estimates – but that doesn’t mean anyone
should take my word for it. That said, I believe there are two general lessons
to take away from what happened here. The first is a lesson about meta-analytic
review methods in general and in CBCA in particular. The second is a lesson
about the validity of CBCA criteria.
Little differences, big differences
In extracting effect sizes from the literature, I tried to follow the
methods of AAFV reasonably closely (though I almost certainly applied their
inclusion criteria differently for some documents). Some estimates were fairly
close to those of AAFV, but for other estimates, I wasn’t even remotely close
to theirs.
How is it possible for relatively minor variations in review process to
cause large discrepancies in results? The explanation is, I believe, that the
informational value of individual CBCA studies has been relatively low, due to
low power and selective reporting. With small sample sizes, individual studies
exhibit high degrees of variation in their results. If CBCA studies were
routinely high powered, their estimates would be more precise, and we would expect
there to be less heterogeneity of effect sizes. However, we can see in the
funnel plot above that there is massive heterogeneity of effect sizes at lower
standard errors – that is, with smaller sample sizes. Because the individual
estimates are so heterogeneous, even small changes in the inclusion criteria or
method of effect size extraction can cause shifts in the meta-analytic
estimates.
Meta-analysis is a powerful tool for getting a bird’s-eye view of a
research literature, but it can be finicky. Because minor variations in
methodology can lead to different conclusions, it is especially important that
the meta-analytic methods are transparent and reproducible.
Do the CBCA criteria discriminate between truthful and deceptive statements?
Although my estimates of the validity of the CBCA criteria (without imputations) were often substantially lower than those of AAFV, I don’t take this as particularly good evidence that the CBCA criteria don’t work as hypothesized. However, given what we see here, we
should be skeptical of claims that the CBCA criteria do work. If my reasoning about criterion exclusions is correct, we
should be concerned about the way researchers have routinely taken a selective
approach to CBCA, as this is quite likely to inflate individual estimates (and
as a consequence, uncorrected long run estimates as well).
I employed two methods to attempt to correct for reporting biases: the
imputation of 0 effect sizes and the trim and fill method. Both these
techniques suggest that the actual ability of CBCA criteria to distinguish
between true and false statements is much more modest than AAFV concluded. We
can and should question the appropriateness of both these approaches to
correcting the effect sizes. However, they point toward the same conclusion:
the validity of CBCA criteria may be quite overstated. Ultimately, I think the
way to resolve the question of the validity of the criteria is with
high-powered research that is transparently conducted and reported.
Indeed, many of the problems I’ve bemoaned throughout this post could
have been ameliorated with transparent, open science practices******. More complete reporting, for example, could have produced less biased and more precise effect estimates. To the extent that publication bias has suppressed
studies with null or negative effects, disclosure of the findings would have
helped create a more accurate picture of the literature. Much of the CBCA
research was conducted and reported before open science was conveniently
facilitated by online repositories and other resources. For this research, one
can easily understand the unavailability of supplementary material and original
data. However, for recent research – and for AAFV’s review itself – there is
little excuse. Many of the questions whose answers have eluded me here could
have been resolved swiftly and confidently if the original and meta-analytic
data were openly available.
It is commonly claimed that CBCA is empirically supported, even if it is
not unerringly accurate (see, e.g., Vrij, 2015). However, much of the supposed
support may come from distorted effect estimates and may be artifacts of various
biases and selective reporting. What we see here provides a different, less
encouraging, view of CBCA. As is the case in many areas of science, a lack of
transparency has planted landmines on the path to discovery.
Notes
* CBCA was originally intended for use with children’s statements. The
focus of my larger project, however, is cues to deception in adults, and that
focus has carried over into this side project.
** AAFV reports there are supplementary materials available on the
ResearchGate profile for the corresponding author, but I couldn’t find anything
there. There will be egg on my face if the meta-analytic data are in fact
hosted there somewhere, but I did check.
*** One could reasonably ask, if I care so much about rigor and whatnot,
why didn’t I write this up for peer review and publication in a proper journal?
There are a few reasons. First, I did all this work alone, and if it were to be
done really properly, I think I should have worked with others to recheck my
calculations, establish a reliable coding system, and resolve disputes about
inclusions/exclusions. Although I think collaborating would have improved the
quality of this work, I mostly did this to investigate my hunch that something
was off about the CBCA literature and as a relatively minor supplement to my
larger project examining effect size inflation in the deception literature. Moreover,
I didn’t do an exhaustive search of the CBCA literature and simply relied on
the documents included in AAFV. This limits the ability of this analysis to
speak generally about the validity of CBCA criteria (though I think it is
sufficient nevertheless to make important points about the CBCA literature in
general). I think registration and making my work openly available offsets many
of the problems of working on this project as a lone ranger, but I think there
are still serious drawbacks to the way I approached this, such that I’m not
comfortable submitting it to a journal. Second, writing in blog format rather
than journal format is in many ways liberating. The structure and style of the
post is totally at my discretion, for instance. Third, I think blogging is a
legitimate method of scientific communication. It has many disadvantages
relative to peer-reviewed publications, but the mere fact that something is a
blog does not in itself undermine the quality of the information therein. If
the t-distribution were published on
Student’s blog, we still should have taken it seriously.
**** Representative design is not the most intuitive thing in the world,
and it’s hard to explain. When I reread this paragraph, I thought, “I think
this is accurate, but it’s hard as hell to understand.” Here’s a more
approachable illustration…
Imagine there are two bakeries: Camilla’s Credible Crusts and Fiona’s
Fabulous Fabrications. A group of six researchers gets into a heated argument
about the qualities that characterize the baked goods from each shop. One of
the researchers, Udo, hypothesizes that goods from Camilla’s are generally
higher on the following qualities: sweetness, firmness, lightness of color, and
freshness of smell. Together, the other five agree to test Udo’s hypothesis by
each obtaining a sample of baked goods from each bakery and measuring the four
qualities of each item, to see if they can classify which shop each baked good
came from.
For simplicity, let’s assume they all acquire the same number of items,
so the sample sizes are all equal. But they don’t randomly sample baked goods
from each shop. Rather, they each individually and non-systematically get items
from Camilla’s and Fiona’s.
One researcher, Gunther, is off to a good start: He gets an assortment
of cakes and cookies from each shop. He is able to calculate the extent to
which each shop tends to differ on all four attributes. Sharon has similar luck
with her sample.
Others in the group run into methodological oddities. Alicia ends up
with eight different kinds of meringues and nothing else. She measures the sweetness,
color, and smell, but she decides to ignore the firmness, since all the
meringues have indistinguishable textures. Other members of the team run into
similar kinds of issues but for different variables. All of Mary’s macaroons
were the same color, for instance, and all of Veronica’s pies were equally
sweet. When Alicia, Mary, and Veronica report back to the rest of the group,
they each leave out the variable that seemed safe to ignore in their respective
samples.
Let’s assume we no longer have access to the baked goods (because they
ate them all) or the original data from each person (because it’s the ‘80s and
there are no online data repositories). We only have each person’s summary and
conclusions.
Here’s the problem: If we meta-analyze the results of the five reports,
how are the excluded variables going to affect the assessment of Udo’s
hypothesis? And what do we do about it?
It’s going to matter how well each sample of baked goods represents the
inventory of the bakeries. If meringues are only a tiny fraction of the
inventory at Camilla’s and Fiona’s, then all the estimates might be thrown off
by the fact that meringues make up an entire fifth of the observations. If
meringues actually constitute a fifth of the bakery inventories (or close to
it), then one viable strategy for dealing with Alicia’s missing measure of
firmness is to impute a 0 for the lost effect size. After all, the diagnostic
value of firmness can be expected to be close to 0 for that part of the
population, so imputing a 0 for the missing value makes sense – and it will be
weighted accordingly in the meta-analysis.
Thus, if the samples of baked goods give a good picture of the bakeries’
actual inventories, then imputing 0’s for the missing values is a good idea
because it will bring the estimates closer in line with what you can expect the
values to be in the general population of baked goods. But if the samples don’t
represent the inventories well, the imputations are going to throw off the
estimates. Of course, it’s not all or nothing. It’s the extent of the
representativeness that matters.
In this whimsical example, you could just walk into Camilla’s and
Fiona’s bakeries and check their display cases to see how well the samples
represent the population. In the real world, we don’t know exactly how the
population looks.
Another problem (indeed one of the core problems of representative
design) occurs if the researchers selected baked goods they thought would
support Udo’s hypothesis, rather than attempting to represent the actual bakery
inventories. That is, if they picked items they had a good reason to think
would actually differ on at least some of the four criteria of sweetness,
firmness, color, and smell, then it would bias the estimates upward relative to
the actual difference in the broader population of baked goods. Each
researcher’s report would seem to suggest that the criteria discriminate
between the bakeries to a larger extent than they actually did. If another
person had a cookie and wanted to figure out which bakery it came from, they
might be misled into thinking those criteria could help them make a good guess
about whether it was from Camilla’s or Fiona’s. This problem is difficult to
detect and difficult to correct.
***** I did not register this analysis. Interpret it with caution.
****** Some of the problems would not be solved by open science. The
issue of low-power and low precision in estimation, for instance, would not be
solved by increased transparency. Solving that problem requires lots more data.
References
Resources and registration for this project are available here: https://osf.io/4w7tr/. In case I need to make updates to this post, a copy of the original
version is hosted there as well.
Amado, B. G., Arce, R., Fariña, F., & Vilariño, M. (2016).
Criteria-Based Content Analysis (CBCA) reality criteria in adults: A
meta-analytic review. International Journal of Clinical and Health
Psychology, 16, 201-210.
Brunswik, E. (1955). Representative design and probabilistic theory in a
functional psychology. Psychological Review, 62,
193-217.
Critchlow, N. (2011). A field
validation of CBCA when assessing authentic police rape statements: Evidence
for discriminant validity to prescribe veracity to adult narrative.
Unpublished raw data.
DePaulo, B. M., Lindsay, J. J., Malone, B. E., Muhlenbruck, L.,
Charlton, K., & Cooper, H. (2003). Cues to Deception. Psychological
Bulletin, 129, 74-118.
Evans, J. R., Michael, S. W., Meissner, C. A., & Brandon, S. E.
(2013). Validating a new assessment method for deception detection: Introducing
a Psychologically Based Credibility Assessment Tool. Journal of Applied
Research in Memory and Cognition, 2, 33-41.
Rassin, E., &
van der Sleen, J. (2005). Characteristics
of true versus false allegations of sexual offences. Psychological Reports, 97, 589-598.
Simmons, J., Nelson, L., & Simonsohn, U. (2013). Life after P-Hacking. Meeting of the
Society for Personality and Social Psychology, New Orleans, LA, 17-19 January
2013. Available at SSRN: https://ssrn.com/abstract=2205186
Ternes, M. (2009). Verbal credibility assessment of incarcerated
violent offenders' memory reports. Doctoral dissertation, University
of British Columbia.
Vrij, A. (2015). Verbal lie detection tools: Statement validity
analysis, reality monitoring and scientific content analysis. In P.A.
Granhag, A. Vrij, B. Verschuere (eds.), Detecting Deception: Current Challenges
and Cognitive Approaches (pp.
3-35). John Wiley & Sons.