Thursday, June 7, 2018

An attempt to reproduce a meta-analytic review of Criteria-Based Content Analysis


Criteria-Based Content Analysis (CBCA) is a component of Statement Validity Assessment (SVA), a tool used to evaluate the credibility of statements – especially claims of abuse. In brief, CBCA is used to distinguish between credible and fabricated statements. CBCA comprises a set of 19 criteria on which a verbal statement is evaluated, which I’ve summarized in a table below. Truthful statements are expected to contain more of each criterion.

Criterion Description
1 Logical structure
2 Unstructured production
3 Quantity of details
4 Contextual embedding
5 Description of interactions
6 Reproduction of conversations
7 Unexpected complications
8 Unusual details
9 Superfluous details
10 Details misunderstood
11 External associations
12 Subjective mental state
13 Others' mental states
14 Spontaneous corrections
15 Admitting lack of memory
16 Self-doubt
17 Self-deprecation
18 Pardoning the perpetrator
19 Details characteristic of the event

CBCA is more thoroughly described elsewhere, and for a better introduction, I direct the interested reader to the several summaries that exist (e.g., Vrij, 2015). Admittedly, I am not an expert on the application of CBCA. I only became particularly interested in it as I was working on a statistical research project on the potential for effect size inflation in the literature on cues to deception.

As part of my literature review for that project, I took a closer look at a meta-analysis of CBCA criterion validity in adults by Amado, Arce, Fariña, and Vilariño (2016, hereafter AAFV)*. My interest was primarily in assessing the potential for overestimation of composite measures designed to improve deception detection, but as I dug deeper into the methods of the meta-analysis and the literature it reviewed, I became increasingly concerned that the review and the primary studies had serious methodological and statistical flaws.

Causes for concern

One of the things that initially alarmed me about the effect estimates in AAFV was that, in many of the studies they included, criteria had been left unreported or minimally reported. That is, sometimes researchers said they coded certain criteria but omitted them from their report because they didn’t significantly discriminate between truthful and fabricated statements or because they occurred so infrequently that they were virtually useless. There are at least two problems with this, one statistical and one practical.

The statistical problem is that when researchers selectively report criteria in this way, meta-analysts cannot extract estimates for the underreported criteria, and the estimates that are available are disproportionately the significant and “useful” ones. This means long run meta-analytic estimates will chronically overestimate the effects, because they regularly incorporate artificially inflated estimates from primary studies.

The related practical problem is that a practitioner attempting to use CBCA in an individual case cannot generally know whether a criterion should be ignored because it isn’t useful or appropriate in that context. Ignoring specific criteria in a study is a decision made from the privileged position of being able to examine numerous statements. In a real case, the absence of a criterion does not occur against the backdrop of absences in other statements drawn from the same population, so its ignorability cannot be inferred. Thus, if a criterion fails to distinguish between truthful and fabricated statements in a given sample, ignoring it in that sample will overstate the detectability of deceit using CBCA.

Take “reproduction of conversations” for example. This criterion is sometimes purposefully ignored in experiments in which the stimulus material described in the sampled statements does not contain any conversations (so it isn’t sensible to expect conversations to be reproduced). Under real world conditions, however, a CBCA practitioner could not, by definition, know whether there were actually conversations in the alleged event, so he or she could not make an informed decision about the applicability of the criterion. Sometimes a criterion is ignored in a study simply because it happened to be very rare or totally absent in the sampled statements, even when it could plausibly have occurred. This causes similar problems of generalizability to the real world.

There are ways of attempting to compensate for failures to report statistics in a form that allows exact effect sizes to be calculated. In their meta-analytic review of cues to deception, DePaulo and her colleagues (2003) attempted to correct for minimally reported cue effects (e.g., cues that were reported as nonsignificant without sufficient statistics to calculate an effect size) by imputing 0 for those effects (or, when it was possible to determine a direction of the effect, .01 or -.01). As far as I can tell, AAFV didn’t implement any similar imputation in order to account for minimally reported or unreported CBCA criteria. I’ll return to this later.

Other aspects of AAFV’s methods troubled me. For instance, I noticed that some of the reports marked for inclusion did not seem to fit the stated inclusion criteria. For example, they report having included Evans et al. (2013) – but the studies reported in that document don’t use statements as the unit of analysis. Rather, they report the results of multiple judgments of a large pool of statements, made by participant observers.

Another issue concerned the establishment of ground truth in the primary studies. In order to draw inferences about the diagnostic value of the CBCA criteria, we need to know with a high degree of certainty that the statements classified as true and fabricated are in fact respectively true and fabricated. In experiments, this tends to be fairly straightforward, since the researchers typically have direct control over participants’ experiences and have assigned people to tell the truth or lie. In field studies, establishing ground truth can be exceedingly challenging, if not outright impossible. The field studies included in AAFV employed a variety of methods to establish ground truth, some of which strike me as quite problematic.

Several field studies relied on documented forensic evidence or court decisions. This seems like a reasonable approach to establishing ground truth in the field (indeed, perhaps the only way), but it may be subject to error and selection biases. There are obvious ways in which these decisions could go wrong. Forensic evidence can be non-diagnostic, and court decisions are sometimes incorrect. Selection biases are more subtle: The only cases that can be included are those in which there are apparently firm decisions and/or extensive documentation. That is, the included cases may represent the easiest of cases in which to assess credibility. Perhaps these are the very cases in which truthful claims look the way CBCA expects them to look, to the exclusion of truthful claims in more ambiguous cases that may not be rich in CBCA criteria. Thus, statements drawn from them may overestimate the diagnosticity of CBCA criteria. Of course, we have no way of knowing whether this possibility obtains in reality.

Other field studies took alternate approaches. Rassin and van der Sleen (2005) asked police officers to code statements from their own cases using a provided checklist. Ground truth was established by the officers’ judgments and court decisions. This presents a variation on the problem described above, as it adds another potential source of selection bias in the officers’ discretion. Ternes (2009) solicited statements from a sample of incarcerated people, without instructing them to lie or tell the truth. Statements were classified as credible and non-credible using CBCA. The effect sizes contrasting the credible and non-credible statements were, of course, tautological. The non-credible statements were classified as such precisely because they exhibited fewer CBCA criteria. There was no chance that the effect size estimates could fail to be in the expected direction.

I don’t mean to take individual researchers to task unfairly for the compromises they had to make in order to conduct their work – particularly for field studies, which impose hefty logistical challenges. Depending on their respective research questions, it may have made perfect sense to have done things the way they did.

That said, all this seemed to me like a recipe for overestimation. Wanting to get a better understanding of what was going on, I wrote to the corresponding author of AAFV to request the data**. After getting no response, I decided the best option was to try to redo the meta-analysis myself. So I did.

Replicating/reproducing AAFV

General approach

My intention was never to do a proper, original meta-analysis of the CBCA literature. Rather, my goal was to assess the validity of the conclusions and accuracy of the criterion estimates in AAFV. For this reason, I didn’t attempt to exhaustively search the literature and instead just attempted to obtain the documents included in AAFV. I obtained 38 of the 39 documents included in AAFV. I didn’t get my hands on an unpublished dataset the original authors had (i.e., Critchlow, 2011).

I registered my plan on OSF***. Registration seemed particularly important here because I was approaching this with a self-known bias: I already thought there were problems with the estimates. Although I believe my registration proved useful, I still made several judgment calls that were outside the coverage of the registration. For instance, I didn’t register a rule about how to deal with dependent effect sizes (e.g., several groups of fabricated statements compared against a single truthful control group). Ultimately, in cases where there were multiple dependent estimates, I extracted just one that seemed most appropriate (and recorded which one that was in my notes). I did this to avoid over-weighting any given sample.

Effect size extraction and imputation

When effect sizes were reported, I recorded them or converted them into d. More often, effect sizes had to be manually calculated. I used the compute.es package in R to do these calculations. Usually, this entailed entering means, standard deviations, and group ns into the mes() function or proportions and group ns into the propes() function, but when these data were unavailable I sometimes converted t or F statistics using tes() and fes() respectively.
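To make this concrete, here is an illustrative sketch of the kinds of compute.es calls described above. The numbers are invented for illustration and are not taken from any study; my actual extraction scripts are on the OSF page linked below.

```r
## Illustrative effect size conversions with compute.es (made-up values)
library(compute.es)

# From means, standard deviations, and group ns (truthful vs. fabricated statements)
mes(m.1 = 3.2, m.2 = 2.8, sd.1 = 1.1, sd.2 = 1.0, n.1 = 40, n.2 = 40)

# From the proportion of statements in each group containing the criterion
propes(p1 = 0.55, p2 = 0.40, n.ab = 40, n.cd = 40)

# From test statistics when descriptives were unavailable
tes(t = 2.1, n.1 = 40, n.2 = 40)   # independent-samples t
fes(f = 4.4, n.1 = 40, n.2 = 40)   # one-way F comparing two groups
```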

I only extracted effect sizes for the original 19 CBCA criteria. AAFV also examined some additional criteria, but these are very rarely studied, and as far as I know, not often used in practice. I also excluded reports of “total CBCA score” because my issues with those scores are best addressed elsewhere.

Because of the concerns I described above, I wanted to account for the missing and excluded criterion estimates and decided to take an approach similar to the one taken by DePaulo et al. (2003). Each time a study reported excluding a criterion, I coded the stated reason for the exclusion. I had registered a plan to impute 0’s for effects excluded for several of the possible stated reasons (e.g., because the behavior rarely occurred in the sample).
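In code, the registered imputation rule amounts to something like the sketch below. The object and column names here are hypothetical stand-ins (the actual data and scripts are on the OSF page): effects holds the extracted effect sizes, and exclusions holds criteria that studies coded but did not report, along with the stated reason and the group sizes.

```r
## A minimal sketch of the 0-imputation rule (hypothetical object and column names)
imputable <- c("rarely occurred in sample", "did not discriminate")  # assumed reason labels

to_impute <- subset(exclusions, reason %in% imputable)
to_impute$d  <- 0                                      # impute a null effect
to_impute$vd <- 1 / to_impute$n1 + 1 / to_impute$n2    # sampling variance of d when d = 0

effects_imp <- rbind(effects[, c("study", "criterion", "d", "vd")],
                     to_impute[, c("study", "criterion", "d", "vd")])
```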

We should be clear about the limitations of this conservative approach to imputing effect sizes. Its effectiveness, I believe, pivots on the representativeness of the literature (cf. Fiedler, 2011). This is a problem of ecological validity in the Brunswikian sense of the term (Brunswik, 1955). If we want to estimate the effects that occur out in the wild, these imputations will improve the accuracy of the estimates assuming that the situations sampled in the literature adequately represent the situations in which CBCA is applied. That is, assuming that underreported criteria are excluded because they cannot plausibly distinguish between truthful and deceptive statements in a given situation, weighting underreported criteria toward zero will improve the estimates, to the extent that situations that render those criteria useless actually occur in reality. Non-representative design in the literature, however, will not only bias the observed estimates (likely overestimating them by oversampling situations in which the criteria are generally more effective); it will also change the effect of the imputations. The imputations will underestimate the criterion effect sizes if the empirical literature has, for example, oversampled situations in which underreported criteria are ineffective relative to their actual effect in the population****.

Because this approach is fraught with potential problems, I report the results both with and without imputations below.

Results

All the data I extracted from the documents, my coding notes, and scripts for the analyses and visualizations can be found here: osf.io/4w7tr/

Below, I have tabulated the effect size estimates for each criterion. The table presents the results from AAFV and the results from my replication, with and without the imputations.

                  AAFV           Replication    Replication with imputations
Criterion         d       k      d       k      d       k
1                 0.48    30     0.15    22     0.11    31
2                 0.53    27     0.41    20     0.29    27
3                 0.55    35     0.31    27     0.25    33
4                 0.19    29     0.19    19     0.11    31
5                 0.27    29     0.21    23     0.15    31
6                 0.34    34     0.38    21     0.27    29
7                 0.25    29     0.17    21     0.13    27
8                 0.31    35     0.19    23     0.14    29
9                 0.14    27     0.29    21     0.19    29
10                0.22     5    -0.01     7     0.00    26
11                0.26    22     0.19    15     0.12    26
12                0.18    28     0.17    21     0.11    30
13                0.09    31     0.18    22     0.13    31
14                0.16    29     0.14    22     0.09    29
15                0.25    34     0.23    22     0.18    26
16                0.20    26     0.23    18     0.16    24
17                0.04    13     0.27    11     0.11    25
18               -0.02     8     0.08     7     0.00    24
19                0.28     5     0.29     3     0.04    23

You will notice that the numbers are not the same across the different methods. There are wide discrepancies between my numbers (both the effect estimates and the number of effects) and the original AAFV numbers.

AAFV reported a total of 476 effect estimates. I extracted 345. This is a difference of 131. It is possible I missed some estimates. Some of the missing estimates presumably came from the one unpublished dataset I didn’t obtain, and I excluded some documents that AAFV included (e.g., Evans et al., 2013). Perhaps AAFV included dependent effect sizes, which I excluded. I also made some admittedly ad hoc judgment calls. For example, in one case, I excluded a set of effect sizes because the standard deviations for the criterion measures were implausibly small (e.g., M = 1.85, SD = .05) and would therefore have generated effect estimates that were outrageously large. However, it’s not clear to me that these differences account for all the discrepancies. Without examining AAFV’s original data, we can’t say why there’s such a wide discrepancy.

Most of my estimates for the criterion effects are lower than those of AAFV, but a few are higher. Some are quite different. For example, AAFV estimated Criterion 1 (logical structure) as d = .48, whereas I estimated it at d = .15 (without imputations). AAFV estimated Criterion 10 (accurate details misunderstood) as d = .22, whereas I estimated it at d = -.01 (without imputations). These differences attest to the fact that differing methods of reviewing the literature can result in widely discrepant estimates.

Intuitively, we might be inclined to trust the effect sizes from the larger number of estimates – that is, AAFV’s estimates. This intuition may be misguided, however, if AAFV included many dependent effect sizes that I didn’t. For example, Ternes (2009; the one with the tautological effects) reported data to compute several sets of dependent effect sizes. I only extracted one set of effects (i.e., one estimate per reported criterion) from that document, but if AAFV extracted several sets, they may have given too much weight to poor estimates (which had no chance of failing to support the hypotheses). I don’t know what they did.

If we look at the estimates with the imputations, they are, of course, generally lower than those without the imputations (as would be expected if you added a bunch of 0’s to the data). But for many criteria, the imputations had quite stark effects, suggesting that many criteria were excluded quite frequently. For example, Criterion 2 (unstructured production) dropped from d = .41 to d = .29. Criterion 19 (details specific to the event) dropped from d = .29 to d = .04. Which estimates are more accurate is, of course, debatable.

To get an overall picture of all the criteria, I created a funnel plot for all the effect sizes I extracted. This plot uses the data without imputations. Vertical reference lines are drawn at 0 and at the weighted mean effect size for all criteria (d = .22 – approximately the difference in length between Harry Potter and Lord of the Rings books), along with additional vertical lines marking the 95% confidence bounds for the meta-analytic effect size. The funnel guide lines are drawn at the 95% and 99% confidence levels.



You can see there are many outlying effect estimates – well outside the 99% confidence bounds – especially positive outliers at lower standard errors. The positive outliers tend to be larger in absolute value than the negative outliers. It’s possible this is a symptom of publication bias. You can also see that the effect sizes seem to center closer to zero at higher standard errors – but these estimates come from just a few samples, so we should be cautious about overinterpreting them.
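For readers who want to recreate something like this plot, here is a minimal sketch using the metafor package. The data frame and column names (d for the effect size, vd for its sampling variance) are hypothetical stand-ins, and the model specification is illustrative rather than a record of exactly what I ran; the actual script is in the OSF repository.

```r
## A minimal funnel plot sketch with metafor (hypothetical column names)
library(metafor)

res <- rma(yi = d, vi = vd, data = effects_rep)   # weighted mean effect size

funnel(res, level = c(95, 99), shade = c("white", "gray85"),
       refline = coef(res))                       # funnel centered on the mean estimate
abline(v = 0, lty = 2)                            # reference line at zero
abline(v = c(res$ci.lb, res$ci.ub), lty = 3)      # 95% CI bounds for the mean effect
```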

A common approach to compensating for publication bias is the trim and fill technique. I applied this technique using the trimfill() function of the metafor package for R, specifying imputation of missing effects on the negative side*****. The trim and fill estimated that a total of 59 effects were suppressed by publication bias, and the adjusted weighted mean effect size was a much more modest d = .06 (approximately the difference in review quality between Meryl Streep movies and Tom Hanks movies). Although 59 estimates might seem like a lot, bear in mind that CBCA comprises 19 criteria. Thus, this is similar to an imputation of about three studies finding negative effects for all criteria.
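In code, the adjustment looks roughly like this; res is the model object from the funnel plot sketch above, and side = "left" tells trimfill() to impute the missing effects on the negative side.

```r
## Trim and fill, imputing presumed-missing effects on the negative (left) side
tf <- trimfill(res, side = "left")
summary(tf)    # adjusted estimate and number of filled-in effects
funnel(tf)     # funnel plot with the filled-in effects displayed
```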

Experiments and field studies

I had registered a plan to test whether effect sizes varied as a function of different methods of establishing ground truth, but there were only a handful of field studies, so I opted not to run these analyses, as I doubt they would be meaningful.

However, you’ll notice in the table below that criterion effect estimates from experiments tend to be substantially lower than estimates from field studies. This occurred both in my replication and in AAFV (see their moderator analyses). The experimental effects were quite modest and are generally in line with the rest of the deception literature. The median effect size for cues to deception in DePaulo et al. (2003) is |.10|. Without imputations, the median effect size for experiments is .12 (the mean is .13). Field studies, in contrast, had much larger effects. Without imputations, the median effect size for field studies is .46 (the weighted mean is .35). All the estimates in the table below are from my replication, not from AAFV. AAFV estimated the mean effects for experiments and field studies to be higher, at .25 and .75 respectively.

            Experiments      Experiments with    Field studies    Field studies with
                             imputations                          imputations
Criterion   d       k        d       k           d       k        d       k
1           0.04    16       0.02    23          0.69    3        0.42    5
2           0.19    13       0.12    19          0.80    4        0.63    5
3           0.28    20       0.21    25          0.12    4        0.10    5
4           0.13    13       0.08    23          0.16    3        0.09    5
5           0.17    17       0.12    23          0.18    3        0.10    5
6           0.27    15       0.18    21          0.64    3        0.41    5
7           0.04    15       0.03    19          0.52    3        0.33    5
8           0.16    18       0.13    21          0.30    3        0.20    5
9           0.27    15       0.16    22          0.51    3        0.38    4
10         -0.12     5      -0.04    19          0.69    1        0.12    4
11          0.02    10       0.02    19          0.60    3        0.38    5
12          0.08    15       0.05    23          0.22    3        0.17    4
13          0.17    15       0.11    23         -0.12    4       -0.08    5
14          0.04    15       0.03    21          0.29    4        0.23    5
15          0.26    16       0.21    19          0.07    4        0.07    4
16          0.12    13       0.09    17          0.65    3        0.47    4
17         -0.06     5      -0.02    18          0.67    3        0.49    4
18         -0.28     2      -0.03    17         -0.11    3       -0.08    4
19          0.00     1       0.00    16          0.18    1        0.06    4
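For completeness, here is a rough sketch of how the experiment versus field-study summaries described above could be computed, again with hypothetical names: a study_type column coded "experiment" or "field" in the same effects_rep data frame used earlier.

```r
## Subgroup summaries by study type (hypothetical column names)
tapply(effects_rep$d, effects_rep$study_type, median)   # median d per subgroup

# Inverse-variance weighted mean within each subgroup
res_exp   <- rma(yi = d, vi = vd, data = subset(effects_rep, study_type == "experiment"))
res_field <- rma(yi = d, vi = vd, data = subset(effects_rep, study_type == "field"))
```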

A prosaic explanation for this apparent difference between experiments and field studies is ordinary sampling variation. We might expect there to be greater variation in estimates with fewer studies. It is also possible that there is a genuine systematic difference between effects in the lab and effects in the field.

It might be tempting to think that CBCA is more effective in real cases than in artificially produced statements obtained under controlled laboratory conditions. This is possible. Bear in mind, however, that field studies are also confounded with their method of establishing ground truth, which is necessarily inferior to experimentally established ground truth. For the reasons I noted earlier, it’s possible for selection biases to produce overestimated effects under suboptimal conditions for establishing ground truth.

No matter the cause, it seems that the larger effects in the literature are primarily driven by the field studies, which produce dramatically larger estimates. Seven of them are at or above .60. Another two are above .50. For intuitive comparison, d = .59 represents the difference in weight between men and women (Simmons, Nelson, & Simonsohn, 2013). One wonders whether it’s plausible for any criterion to have a true effect that is quite this large, let alone seven. Anything is possible, but this strikes me as unlikely, given the overall weakness of cues to deception (DePaulo et al., 2003).

What does all this mean?

You should not trust my results – at least not blindly. I was operating alone, with no one to check my work and no one with whom to establish the reliability of my coding of exclusion justifications. Curious or skeptical readers should look at my notes and my data and check for errors. I was, of course, trying to arrive at accurate estimates – but that doesn’t mean anyone should take my word for it. That said, I believe there are two general lessons to take away from what happened here. The first is a lesson about meta-analytic review methods in general and in CBCA in particular. The second is a lesson about the validity of CBCA criteria.

Little differences, big differences

In extracting effect sizes from the literature, I tried to follow the methods of AAFV reasonably closely (though I almost certainly applied their inclusion criteria differently for some documents). Some estimates were fairly close to those of AAFV, but for other estimates, I wasn’t even remotely close to theirs.

How is it possible for relatively minor variations in the review process to cause large discrepancies in results? The explanation is, I believe, that the informational value of individual CBCA studies has been relatively low, due to low power and selective reporting. With small sample sizes, individual studies exhibit high degrees of variation in their results. If CBCA studies were routinely high powered, their estimates would be more precise, and we would expect less heterogeneity of effect sizes. However, we can see in the funnel plot above that there is massive heterogeneity of effect sizes at lower standard errors – that is, with smaller sample sizes. Because the individual estimates are so heterogeneous, even small changes in the inclusion criteria or method of effect size extraction can cause shifts in the meta-analytic estimates.

Meta-analysis is a powerful tool for getting a birds-eye view of a research literature, but it can be finicky. Because minor variations in methodology can lead to different conclusions, it is especially important that the meta-analytic methods are transparent and reproducible.

Do the CBCA criteria discriminate between truthful and deceptive statements?

Although my estimates for the validity of the CBCA criteria (without imputations) were often substantially lower than the estimates of AAFV, I don’t take this as particularly good evidence that CBCA criteria don’t work as hypothesized. However, given what we see here, we should be skeptical of claims that the CBCA criteria do work. If my reasoning about criterion exclusions is correct, we should be concerned about the way researchers have routinely taken a selective approach to CBCA, as this is quite likely to inflate individual estimates (and as a consequence, uncorrected long run estimates as well).

I employed two methods to attempt to correct for reporting biases: the imputation of 0 effect sizes and the trim and fill method. Both these techniques suggest that the actual ability of CBCA criteria to distinguish between true and false statements is much more modest than AAFV concluded. We can and should question the appropriateness of both these approaches to correcting the effect sizes. However, they point toward the same conclusion: the validity of CBCA criteria may be quite overstated. Ultimately, I think the way to resolve the question of the validity of the criteria is with high-powered research that is transparently conducted and reported.

Indeed, many of the problems I’ve bemoaned throughout this post could have been ameliorated with transparent, open science practices******. More complete reporting, for example, would have allowed less biased and more precise effect estimates. To the extent that publication bias has suppressed studies with null or negative effects, disclosure of the findings would have helped create a more accurate picture of the literature. Much of the CBCA research was conducted and reported before open science was conveniently facilitated by online repositories and other resources. For this research, one can easily understand the unavailability of supplementary material and original data. However, for recent research – and for AAFV’s review itself – there is little excuse. Many of the questions whose answers have eluded me here could have been resolved swiftly and confidently if the original and meta-analytic data were openly available.

It is commonly claimed that CBCA is empirically supported, even if it is not unerringly accurate (see, e.g., Vrij, 2015). However, much of the supposed support may come from distorted effect estimates and may be an artifact of various biases and selective reporting. What we see here provides a different, less encouraging, view of CBCA. As is the case in many areas of science, a lack of transparency has planted landmines on the path to discovery.

Notes

* CBCA was originally intended for use with children’s statements. The focus of my larger project, however, is cues to deception in adults, and that focus has carried over into this side project.

** AAFV reports there are supplementary materials available on the ResearchGate profile for the corresponding author, but I couldn’t find anything there. There will be egg on my face if the meta-analytic data are in fact hosted there somewhere, but I did check.

*** One could reasonably ask, if I care so much about rigor and whatnot, why didn’t I write this up for peer review and publication in a proper journal? There are a few reasons. First, I did all this work alone, and if it were to be done really properly, I think I should have worked with others to recheck my calculations, establish a reliable coding system, and resolve disputes about inclusions/exclusions. Although I think collaborating would have improved the quality of this work, I mostly did this to investigate my hunch that something was off about the CBCA literature and as a relatively minor supplement to my larger project examining effect size inflation in the deception literature. Moreover, I didn’t do an exhaustive search of the CBCA literature and simply relied on the documents included in AAFV. This limits the ability of this analysis to speak generally about the validity of CBCA criteria (though I think it is sufficient nevertheless to make important points about the CBCA literature in general). I think registration and making my work openly available offset many of the problems of working on this project as a lone ranger, but I think there are still serious drawbacks to the way I approached this, such that I’m not comfortable submitting it to a journal. Second, writing in blog format rather than journal format is in many ways liberating. The structure and style of the post is totally at my discretion, for instance. Third, I think blogging is a legitimate method of scientific communication. It has many disadvantages relative to peer-reviewed publications, but the mere fact that something is a blog does not in itself undermine the quality of the information therein. If the t-distribution were published on Student’s blog, we should still have taken it seriously.

**** Representative design is not the most intuitive thing in the world, and it’s hard to explain. When I reread this paragraph, I thought, “I think this is accurate, but it’s hard as hell to understand.” Here’s a more approachable illustration…

Imagine there are two bakeries: Camilla’s Credible Crusts and Fiona’s Fabulous Fabrications. A group of six researchers gets into a heated argument about the qualities that characterize the baked goods from each shop. One of the researchers, Udo, hypothesizes that goods from Camilla’s are generally higher on the following qualities: sweetness, firmness, lightness of color, and freshness of smell. Together, the other five agree to test Udo’s hypothesis by each obtaining a sample of baked goods from each bakery and measuring the four qualities of each item, to see if they can classify which shop each baked good came from.

For simplicity, let’s assume they all acquire the same number of items, so the sample sizes are all equal. But they don’t randomly sample baked goods from each shop. Rather, they each individually and non-systematically get items from Camilla’s and Fiona’s.

One researcher, Gunther, is off to a good start: He gets an assortment of cakes and cookies from each shop. He is able to calculate the extent to which each shop tends to differ on all four attributes. Sharon has similar luck with her sample.

Others in the group run into methodological oddities. Alicia ends up with eight different kinds of meringues and nothing else. She measures the sweetness, color, and smell, but she decides to ignore the firmness, since all the meringues have indistinguishable textures. Other members of the team run into similar kinds of issues but for different variables. All of Mary’s macaroons were the same color, for instance, and all of Veronica’s pies were equally sweet. When Alicia, Mary, and Veronica report back to the rest of the group, they each leave out the variable that seemed safe to ignore in their respective samples.

Let’s assume we no longer have access to the baked goods (because they ate them all) or the original data from each person (because it’s the ‘80s and there are no online data repositories). We only have each person’s summary and conclusions.

Here’s the problem: If we meta-analyze the results of the five reports, how are the excluded variables going to affect the assessment of Udo’s hypothesis? And what do we do about it?

It’s going to matter how well each sample of baked goods represents the inventory of the bakeries. If meringues are only a tiny fraction of the inventory at Camilla’s and Fiona’s, then all the estimates might be thrown off by the fact that meringues make up an entire fifth of the observations. If meringues actually constitute a fifth of the bakery inventories (or close to it), then one viable strategy for dealing with Alicia’s missing measure of firmness is to impute a 0 for the lost effect size. After all, the diagnostic value of firmness can be expected to be close to 0 for that part of the population, so imputing a 0 for the missing value makes sense – and it will be weighted accordingly in the meta-analysis.

Thus, if the samples of baked goods give a good picture of the bakeries’ actual inventories, then imputing 0’s for the missing values is a good idea because it will bring the estimates closer in line with what you can expect the values to be in the general population of baked goods. But if the samples don’t represent the inventories well, the imputations are going to throw off the estimates. Of course, it’s not all or nothing. It’s the extent of the representativeness that matters.

In this whimsical example, you could just walk into Camilla’s and Fiona’s bakeries and check their display cases to see how well the samples represent the population. In the real world, we don’t know exactly how the population looks.

Another problem (indeed one of the core problems of representative design) occurs if the researchers selected baked goods they thought would support Udo’s hypothesis, rather than attempting to represent the actual bakery inventories. That is, if they picked items they had a good reason to think would actually differ on at least some of the four criteria of sweetness, firmness, color, and smell, then it would bias the estimates upward relative to the actual difference in the broader population of baked goods. Each researcher’s report would seem to suggest that the criteria discriminate between the bakeries to a larger extent than they actually did. If another person had a cookie and wanted to figure out which bakery it came from, they might be misled into thinking those criteria could help them make a good guess about whether it was from Camilla’s or Fiona’s. This problem is difficult to detect and difficult to correct.

***** I did not register this analysis. Interpret it with caution.

****** Some of the problems would not be solved by open science. The issue of low power and low precision in estimation, for instance, would not be solved by increased transparency. Solving that problem requires lots more data.

References

Resources and registration for this project are available here: https://osf.io/4w7tr/. In case I need to make updates to this post, a copy of the original version is hosted there as well.

Amado, B. G., Arce, R., Fariña, F., & Vilariño, M. (2016). Criteria-Based Content Analysis (CBCA) reality criteria in adults: A meta-analytic review. International Journal of Clinical and Health Psychology, 16, 201-210.

Brunswik, E. (1955). Representative design and probabilistic theory in a functional psychology. Psychological Review, 62, 193-217.

Critchlow, N. (2011). A field validation of CBCA when assessing authentic police rape statements: Evidence for discriminant validity to prescribe veracity to adult narrative. Unpublished raw data.

DePaulo, B. M., Lindsay, J. J., Malone, B. E., Muhlenbruck, L., Charlton, K., & Cooper, H. (2003). Cues to deception. Psychological Bulletin, 129, 74-118.

Evans, J. R., Michael, S. W., Meissner, C. A., & Brandon, S. E. (2013). Validating a new assessment method for deception detection: Introducing a Psychologically Based Credibility Assessment Tool. Journal of Applied Research in Memory and Cognition, 2, 33-41.

Rassin, E., & van der Sleen, J. (2005). Characteristics of true versus false allegations of sexual offences. Psychological Reports, 97, 589-598.

Simmons, J., Nelson, L., & Simonsohn, U. (2013). Life after p-hacking. Meeting of the Society for Personality and Social Psychology, New Orleans, LA, 17-19 January 2013. Available at SSRN: https://ssrn.com/abstract=2205186

Ternes, M. (2009). Verbal credibility assessment of incarcerated violent offenders' memory reports. Doctoral dissertation, University of British Columbia.

Vrij, A. (2015). Verbal lie detection tools: Statement validity analysis, reality monitoring and scientific content analysis. In P. A. Granhag, A. Vrij, & B. Verschuere (Eds.), Detecting Deception: Current Challenges and Cognitive Approaches (pp. 3-35). John Wiley & Sons.