Thursday, February 28, 2019

There might be problems with the automated scoring of linguistic concreteness

This blog post is a collaboration by Karl Ask, Sofia Calderon, Erik Mac Giolla, RabbitSnore (Timothy Luke), and Lara Warmelink – collectively, the Puddle-Ducks.

Figure 1. Jemima Puddle-Duck

Introduction

Listen, children, and you shall hear the tale of the linguistic concreteness measures with questionable construct validity...

Once there was a team of researchers (that’s us) interested in human deception. We wanted to know whether truthful and deceptive statements about the future differ in the use of abstract/concrete language. Specifically, we thought lies about what you plan to do might incorporate more abstract words than telling the truth about what you plan to do. If truths and lies differ in this way, it’s plausible that it reflects differences in the way that liars and truth-tellers mentally represent future behavior. Perhaps false intentions are represented more vaguely and abstractly and true intentions are represented with more concrete detail. We think this is theoretically interesting – but it isn’t the primary topic of our story today. Instead, we want to talk about what happened when we tried to measure linguistic concreteness.

Linguists and psychologists have devised several ways of measuring the abstractness/concreteness of language. We began with a method proposed by Brysbaert, Warriner, and Kuperman (BWK, 2014). They had around 4,000 people provide ratings of the concreteness for about 40,000 English words on a 1-5 scale (1 = most abstract, 5 = most concrete). Thus, they produced a dictionary of words and their respective average concreteness ratings – ranging from “essentialness" (M = 1.04) to “waffle iron" (M = 5.00). This dictionary can be used to code the linguistic concreteness of text data (see below for details on how this works). We call this system “folk concreteness" to contrast it with other systems, since it represents a measure of laypeople’s perceptions of linguistic concreteness, rather than a measure defined a priori by theoretical or philosophical considerations.

The coding of folk concreteness scores is quite simple. Each word in a text is matched to its value in the BWK dictionary; if a word is not in the dictionary, it does not receive a score. You then simply take the mean of the concreteness values for all the matched words. Thus, the folk concreteness score is an average-based composite of the scores of the individual words appearing in the text. Automatic scoring for this system is pretty straightforward in R (https://osf.io/z9mq6/).
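
To give a flavor of how simple the scoring is, here is a minimal sketch in R – not the preregistered script linked above – which assumes the BWK norms have been loaded into a data frame called bwk with columns word and concreteness (our own illustrative names).

```r
# Minimal folk concreteness scorer: average the BWK ratings of all dictionary words in a text.
# Assumes a data frame 'bwk' with columns 'word' and 'concreteness' (illustrative names).
folk_concreteness <- function(text, bwk) {
  words  <- tolower(unlist(strsplit(text, "[^A-Za-z']+")))  # crude tokenizer
  scores <- bwk$concreteness[match(words, bwk$word)]        # NA for words not in the dictionary
  mean(scores, na.rm = TRUE)                                # unmatched words are simply ignored
}

folk_concreteness("The waffle iron sat on the kitchen counter.", bwk)
```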

To test our hypothesis about statements of true and false intention, we assembled text data from seven experiments in which participants lied and told the truth about some future activity. In total, we had N = 6,599 truthful and deceptive statements. We preregistered our approach and analysis code (https://osf.io/y48cz/), coded the texts using the folk concreteness system, and ran our planned analyses. Full presentation of the results is forthcoming (and will be part of the PhD thesis of Sofia Calderon), but in short, we found no evidence that truthful and deceptive intention statements differ in linguistic concreteness with this coding system.

After we performed the folk concreteness analyses, we found another system for coding concreteness in text data, so we decided to cross-validate the results with the alternative method. And that’s where the trouble started...

Finding nothing where there should be something

Seih, Beier, and Pennebaker (SBP, 2017) recently provided a method for computer-automated coding according to the linguistic category model (LCM). In brief, LCM is a system for classifying words according to their abstractness/concreteness in the context of social behavior and communication (Semin & Fiedler, 1988, 1991). In the SBP system, texts receive LCM scores that vary from 1 (most concrete) to 5 (most abstract). Unlike the folk concreteness approach, LCM is theory-driven and uses specified weights for categories of words. Traditional LCM scoring is performed manually, by human coders. The central contribution of SBP was the automation of the coding, which is of course much more efficient, better for the sanity of research assistants, and potentially more reliable.

SBP used a dictionary of approximately 8,000 verbs, tagged with their LCM classification: descriptive action verbs (DAV), interpretive action verbs (IAV), and state verbs (SV). In addition to these 8,000 verbs, SBP propose using automated classification of parts of speech to count the number of adjectives and nouns in a text. To do this, they used TreeTagger, a free Perl-based utility that uses predictive models to automatically identify parts of speech (POS) in text.

Once the LCM verbs, adjectives, and nouns have been counted, the sums are entered into an equation to obtain an LCM score for that text. SBP propose the following formula:

\begin{equation} LCM_{SBP} = \frac {DAV + 2*IAV + 3*SV + 4*adjectives + 5*nouns} {DAV + IAV + SV + adjectives + nouns} \end{equation}

Each of the weights assigned to the different word classes reflects the hypothesized abstractness of that word class. Thus, nouns are considered the most abstract, and descriptive action verbs are considered the most concrete.
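
The formula translates directly into a tiny R function operating on the category counts; the counts in the example call are made up.

```r
# LCM score per the SBP formula: weighted mean of the word-category counts,
# bounded by 1 (all descriptive action verbs) and 5 (all nouns).
lcm_sbp <- function(dav, iav, sv, adj, noun) {
  (dav + 2 * iav + 3 * sv + 4 * adj + 5 * noun) /
    (dav + iav + sv + adj + noun)
}

lcm_sbp(dav = 12, iav = 8, sv = 5, adj = 10, noun = 25)  # made-up counts; returns about 3.47
```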

We obtained the SBP dictionary and retested our hypotheses with the LCM system, to see whether the results lined up with those of the folk concreteness approach. Following SBP, we used TreeTagger for part-of-speech tagging. Whereas SBP used Linguistic Inquiry and Word Count (LIWC) to conduct their analyses, we used R for all analyses, implementing TreeTagger via the koRpus package. Our code is available on OSF (https://osf.io/gx8mj/).
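
For readers who want to set up a similar pipeline, the sketch below shows roughly how a text can be tagged and the relevant word classes counted with koRpus and TreeTagger. It is an illustration rather than our exact script (linked above): the TreeTagger path and file name are placeholders, and in recent koRpus versions English support comes from the separate koRpus.lang.en package.

```r
# Tag one text with TreeTagger (via koRpus) and count the word classes that
# enter the LCM formula. Paths and file names are placeholders.
library(koRpus)
library(koRpus.lang.en)  # English language support (separate package in newer koRpus)

tagged <- treetag("statement.txt",
                  treetagger = "manual",
                  lang       = "en",
                  TT.options = list(path = "/path/to/treetagger", preset = "en"))

tokens <- taggedText(tagged)                 # one row per token, with POS information
n_adj  <- sum(tokens$wclass == "adjective")  # adjective count from the tagger
n_noun <- sum(tokens$wclass == "noun")       # noun count from the tagger
# Verbs are then matched against the SBP dictionary to obtain the DAV, IAV, and SV counts.
```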

We registered a plan, scored the deception texts, and compared the LCM scores of truthful and deceptive statements. As expected, LCM told the same story as folk concreteness: No difference between truthful and deceptive statements. At first, this seemed like straightforward corroboration of our previous results. But the best laid plans of mice and researchers go oft awry...

As part of our exploratory analyses, we calculated a simple correlation between the folk concreteness scores and the LCM scores. Because both these coding systems are supposed to measure linguistic concreteness, we would expect a fairly strong correlation (specifically a negative correlation, since the scale poles have opposite meanings). However, the values were almost perfectly uncorrelated, r = 0.025 (see Figure 2). These coding systems are supposed to measure the same thing. How could this possibly be the case?

Figure 2. Folk concreteness and LCM scores in the true and false intention data

This is more or less what you’d expect to see if two continuous variables were totally unrelated. You’ll notice that the LCM scores seem to clump at whole and half numbers. Because all the LCM weights are whole numbers, the possible scores for very short texts are constrained (for example, a statement whose only counted words are nouns can only receive a score of exactly 5).

“Jemima became much alarmed"

If folk concreteness and the automated LCM scores both measure linguistic concreteness, the scores should be correlated. To be fair, folk concreteness and LCM aren’t necessarily supposed to measure the same kind of concreteness. LCM is purported to capture psychological processes associated with social communication, whereas folk concreteness is simply intended to index the perceived concreteness of the individual words used in a given text. So the correlation might well be weaker than we originally expected – but it was still surprising to see it this close to zero.

The near-zero correlation was a cause for serious concern, so we performed a few informal tests of the two coding systems, to see how well they distinguished between texts we were pretty sure would differ in linguistic concreteness. Our reasoning was simple: If these coding systems are working the way they are supposed to, the scores ought to easily distinguish between obviously different texts.

Figure 3. A fox manually coding text according to the linguistic category model (LCM).

What kinds of texts should a good measure of linguistic concreteness be able to distinguish between? One approach would be to test substantive psychological theory using the various measures, to see if the predictions bear out according to the scores. But that has the disadvantage of relying on the validity of the theory in question. We opted instead to test almost painfully obvious predictions. We unsystematically collected some texts: an eclectic batch of philosophical texts, children’s stories, and song lyrics – things that should differ from each other substantially. We scored them using both folk concreteness and LCM and compared the results.

Below is a list of the texts we collected, followed by a table of their folk concreteness and LCM scores.

Short name – Description
Carl Gustaf – The Wikipedia article for the current king of Sweden
Jemima Puddle-Duck – Beatrix Potter’s children’s story about a duck who struggles to care for her eggs
Peter Rabbit – Beatrix Potter’s children’s story about a mischievous rabbit
Drywall – A tutorial on how to repair damaged drywall
Association of Ideas – James Mill’s chapter on "The Association of Ideas" in Analysis of the Phenomena of the Human Mind
On Denoting – Bertrand Russell’s classic text "On Denoting"
Judgment of Taste – Immanuel Kant’s chapter "Investigation of the question whether in the judgement of taste the feeling of pleasure precedes or follows the judging of the object" in Critique of Judgment
Oops, I Did It Again – The lyrics for every song on Britney Spears’s album Oops, I Did It Again
Songs of Love and Hate – The lyrics for every song on Leonard Cohen’s album Songs of Love and Hate
Their Finest Hour – Winston Churchill’s “Finest Hour” speech

Text                     Folk concreteness   LCM
Carl Gustaf              2.35                4.25
Jemima Puddle-Duck       2.71                3.73
Peter Rabbit             2.73                3.67
Drywall                  2.87                3.71
Association of Ideas     2.22                3.76
On Denoting              2.14                3.90
Judgment of Taste        2.06                3.94
Oops, I Did It Again     2.67                3.84
Songs of Love and Hate   2.79                3.72
Their Finest Hour        2.31                3.68

Folk concreteness fairly reliably scored philosophical texts as more abstract than other texts, and it predictably scored a tutorial on how to fix drywall as the most concrete of the texts. In short, folk concreteness seemed to work reasonably well at matching our own intuitive feeling about how concrete a text is.

You might notice that the range of scores was constrained between 2 and 3. This restriction of range is not surprising. Folk concreteness scores for any given text can in principle vary from 1 to 5. However, because any natural language text of substantial length contains a diverse set of words, we would not expect many (or any) such texts to receive scores near the upper or lower bound of the scale. Any text longer than a few words will land close to the midpoint, since it is simply implausible (if not impossible) to say or write anything meaningful without using words that vary substantially in their concreteness – and those words tend to balance each other out and pull the score toward the middle.

In contrast to the folk concreteness system, the automated LCM coding had some trouble. For starters, as can be seen in the table of results, LCM could not meaningfully distinguish between James Mill and Jemima Puddle-Duck, assigning them both scores of around 3.7. Indeed, the automated LCM coding didn’t seem to score any of the texts quite as one might expect. For instance, LCM inexplicably scored the Wikipedia article on the king of Sweden as extremely abstract – even more abstract than Bertrand Russell’s “On Denoting.” And the lyrics of the Britney Spears album Oops, I Did It Again were ranked by LCM as only slightly less abstract than the writings of Russell and Kant (and, in fact, a little more abstract than Mill). Now, we would not want to malign Potter and Spears, but we do feel that their texts (directed as they are at children and teenagers, respectively) are more concrete than texts by three philosophers, whose writing was not necessarily intended to be concrete (or readable for a general audience).

This pattern of scores is potentially indicative of a serious problem. One possibility is that the automated LCM coding wasn’t measuring what it was intended to measure. Maybe it doesn’t measure anything at all and is essentially noise derived from arbitrarily weighting the counts of different categories of words. Figure 2 certainly looks the way one would expect it to look if one or both variables were nearly or entirely noise.

Another possibility is that the automated coding only works for specific types of texts (e.g., descriptions of interpersonal behavior), and perhaps our texts do not conform to the model’s assumptions. Although it is somewhat unclear what kinds of language would be inappropriate for coding, the LCM coding manual suggests that, within a given text, language describing persons should be coded, but descriptions of situations should not be. The automated coding, however, does not discriminate between interpersonal and impersonal descriptions. As such, perhaps the automated coding only works for texts composed entirely of interpersonal language. Maybe we are misusing the coding system in this informal test – but that does not explain the near-zero correlation in the true and false intention data, which ought to be largely or exclusively composed of text that is appropriate for LCM coding, as the statements entail descriptions of planned activities.

Whatever the cause, our results do not inspire confidence in the use of this automated LCM coding system. Further investigation was warranted. To see whether these results were just flukes, we wanted to feed the coding systems many more texts and see whether the same patterns would emerge with other data. Specifically, we planned to feed them (1) several more texts with fairly obvious relative levels of concreteness and (2) large samples of texts that should give us good estimates of the correlation between the scores produced by the two coding systems.

Before we look at more data, let’s consider an oddity in the automated LCM coding system...

LCM scoring, revised

One of the strange things about SBP’s system of coding is that nouns are so heavily weighted as the most abstract word category. In the LCM coding manual (Coenen, Hedebouw, & Semin, 2006), nouns were only considered when they were used as a qualifier (viz. performing a function similar to an adjective; e.g., “She’s a nature girl."). As support for their weighting of nouns, SBP cite Carnaghi et al. (2008), who report a series of studies in which they found that nouns were more abstract than adjectives. But rather than dealing with all nouns, Carnaghi et al. were focused exclusively on person perception and were considering nouns that, for example, described a group of people (e.g., athletes).

SBP’s approach struck us as problematic, since the automated coding procedure indiscriminately counts all nouns as highly abstract, not just the ones that describe or qualify people. Rocks, string cheese, and pitbulls (all considered highly concrete in the folk system) would all be considered maximally abstract under the SBP formula. On its face, that seems like nonsense*.

Thus, we considered a simple fix: remove nouns from consideration entirely. The formula would therefore become as follows:

\begin{equation} LCM_{PD} = \frac {DAV + 2*IAV + 3*SV + 4*adjectives} {DAV + IAV + SV + adjectives} \end{equation}

This revision of LCM score can vary from 1 to 4 – so the bounds of the scale are a bit more constrained than SBP’s version. We call this modification LCM-PD, for “Puddle-Ducks."

We modified our automated LCM code to calculate an additional score according to the formula above. And we started collecting and scoring texts.
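
In the same style as before, a sketch of the revised score (example counts again made up):

```r
# LCM-PD: the SBP formula with the noun terms removed, bounded by 1 and 4.
lcm_pd <- function(dav, iav, sv, adj) {
  (dav + 2 * iav + 3 * sv + 4 * adj) / (dav + iav + sv + adj)
}

lcm_pd(dav = 12, iav = 8, sv = 5, adj = 10)  # same made-up counts as before, minus nouns; about 2.37
```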

Are the measures correlated?

Although folk concreteness and LCM are intended to measure similar constructs, the correlation between them might be relatively weak. For that reason, we wanted loads of data to estimate the correlation.

To get lots and lots of naturally occurring texts, we scraped Amazon product reviews for Daniel Kahneman’s book (n = 2,000), potato chips (n = 427), instant ramen noodles (n = 951), sriracha sauce (n = 668), a reading workbook for school children (n = 1,184), and a vibrator (n = 2,746). This resulted in a total of N = 7,966 texts, with a mean length of M = 35.92 words (SD = 78.13, median = 20).

Does folk concreteness correlate with LCM-SBP in these data? Even the most cursory inspection of Figure 4 suggests the answer is no (or more precisely, r = .0055).

Figure 4. Folk concreteness and LCM-SBP scores in Amazon review data

A whole lot of nothing.

However, LCM-PD scores fared better. As can be seen in Figure 5, the revised scores did correlate with folk concreteness in the expected direction, though relatively weakly, r = -.1055.

Figure 5. Folk concreteness and LCM-PD scores in Amazon review data

A small correlation in the expected direction.

One could reasonably object that the content of these texts conflicts with the stated purpose of LCM – that is, to code concreteness in communication with people and about people. Many, if not most, of those product reviews are likely to be quite impersonal. Fortunately, there is no shortage on the internet of reviews involving other people. We scraped two kinds of Yelp reviews that are presumably relatively personal and social: reviews for therapists (N = 1,213) and for strip clubs (N = 2,692). We selected these review topics because we expected one to prompt people to describe deeply psychologically important human connections and the other to prompt descriptions of therapists.

Figure 6. Folk concreteness and LCM-SBP and -PD scores in Yelp reviews for therapists

Figure 7. Folk concreteness and LCM-SBP and -PD scores in Yelp reviews for strip clubs

One can see in Figures 6 and 7 that the Yelp data tell the same story as the Amazon product data: LCM-SBP is nearly uncorrelated with folk concreteness, r = -.053 for therapists and r = -.020 for strip clubs, and there is a small correlation in the expected direction for LCM-PD and folk concreteness, r = -.299 for therapists and r = -.246 for strip clubs. It’s also easy to see from the scatterplots that the range of scores in these datasets is much more constrained than the scores in the deception data and the Amazon review data. This is because these reviews were longer (for therapists M = 148.30 words, SD = 127.02; for strip clubs, M = 148.00 words, SD = 140.65). When there is more substance, there is less variance.

Consistently, it looks like LCM-PD performs better – that is, more in line with what you’d expect – than LCM-SBP. Perhaps adding nouns to the LCM formula simply adds noise to the measurement. But all these analyses so far have just explored the correlations between the measures. We haven’t yet checked again to see how well the measures distinguish between abstract and concrete texts...

Philosophy vs. concrete

“You can use a piece of PVC pipe and double-sided tape to make your holes part of your mold or you can drill out the holes later." - a tutorial on making a concrete countertop

“His philosophy is chiefly vitiated, to my mind, by this fallacy, and by the uncritical assumption that a metrical coordinate system can be set up independently of any axioms as to space-measurement.” - Bertrand Russell, Foundations of Geometry

We figured that philosophical texts should be less concrete than texts that are literally about concrete. This seemed like a pretty safe bet. We collected five philosophical texts and six texts about concrete** (see the table below) and scored them with folk concreteness, LCM-SBP, and LCM-PD.

Short name – Description
Concrete (Wikipedia) – Wikipedia article on concrete
Types of concrete – Wikipedia article on types of concrete
Concrete patio – Tutorial on concrete patios
Concrete flowers – Tutorial on concrete flowers
Concrete countertop – Tutorial on concrete countertops
Home repair – Idiot’s Guide to Home Repair**
Analysis of Mind – James Mill’s Analysis of Mind
Foundations of Geometry – Bertrand Russell’s Foundations of Geometry
External World – Bertrand Russell’s Our Knowledge of the External World
Scientific Discovery – Karl Popper’s Logic of Scientific Discovery
Problems of Philosophy – Bertrand Russell’s The Problems of Philosophy

Obviously, this is a fairly small sample of texts, but for folk concreteness and LCM-PD the pattern is strikingly in line with expectations (see the table below for results, and the code sketch after it). Folk concreteness reliably distinguished between the philosophical texts (M = 2.18, SD = .02) and the texts about concrete (M = 2.65, SD = .06), t(5.88) = 17.32, p << .001, d = 10.59, 95% CI [5.30, 15.87]. Although LCM-SBP scores were higher for the philosophical texts (M = 3.84, SD = .14) than for the concrete texts (M = 3.77, SD = .08), this difference was much smaller than that for folk concreteness, t(6.35) = 1.00, p = .35, d = 0.60, 95% CI [-0.80, 2.00]***. LCM-PD, however, did seem to effectively distinguish between philosophy (M = 2.61, SD = .05) and concrete (M = 2.35, SD = .13), t(6.74) = 4.44, p = .003, d = 2.71, 95% CI [0.82, 4.61]. Oddly, although the pattern of LCM-PD scores closely mirrored the folk concreteness scores, LCM-PD scored the two Wikipedia articles on concrete as quite abstract. This might be because those articles are full of fairly complex descriptions of the technical properties of concrete, which LCM may classify as quite abstract.

Text                      Folk concreteness   LCM-SBP   LCM-PD   Length (words)
Concrete (Wikipedia)      2.61                3.76      2.50     7348
Types of concrete         2.59                3.79      2.52     3464
Concrete countertop       2.62                3.77      2.28     5795
Concrete flowers          2.68                3.62      2.26     958
Concrete patio            2.62                3.84      2.31     1315
Home repair               2.76                3.84      2.21     89539
Analysis of Mind          2.21                3.75      2.52     89961
Foundations of Geometry   2.20                3.85      2.65     79461
External World            2.18                3.77      2.62     71974
Scientific Discovery      2.16                4.08      2.64     206502
Problems of Philosophy    2.18                3.76      2.59     43879

The top six rows are texts about concrete. The bottom five rows are philosophical texts.
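
For anyone who wants to check the arithmetic behind these comparisons, here is a sketch of a Welch t-test and a standardized mean difference in R, using the rounded folk concreteness values from the table above (so the output will only approximately match the statistics reported in the text):

```r
# Folk concreteness scores from the table: texts about concrete vs. philosophical texts
concrete   <- c(2.61, 2.59, 2.62, 2.68, 2.62, 2.76)
philosophy <- c(2.21, 2.20, 2.18, 2.16, 2.18)

welch <- t.test(concrete, philosophy)   # Welch's t-test (unequal variances) is R's default
welch

# One common conversion from t to Cohen's d for two independent groups
d <- unname(welch$statistic) * sqrt(1 / length(concrete) + 1 / length(philosophy))
d
```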

Conclusions

What does all this mean? Folk concreteness consistently behaved as expected. This increases our confidence in its ability to measure linguistic concreteness (and also gives us more confidence in the null results of the deception study that sent us down this rabbit hole). As for LCM – it certainly doesn’t bode well that there is a near-zero correlation between LCM-SBP and folk concreteness across several datasets, in addition to LCM-SBP being unable to effectively distinguish between philosophical texts and texts about concrete. We haven’t found any data in which LCM-SBP performs the way it is expected to. Given that the revised scale, LCM-PD, seemed to work slightly better, and given that, as we mentioned earlier, the only difference between the SBP and PD measures is the inclusion of nouns, it could be that nouns severely dilute LCM scores with error.

Folk concreteness and LCM-PD performed similarly and correlate with each other, but not strongly. Given this, and given that the two measures were created in very different ways, we suspect they are not measuring exactly the same thing. We don’t have the data here to draw strong conclusions about the differences between the measures. However, it seems clear that the folk measure – perhaps predictably, considering it is derived from lay judgments – is much more in line with people’s expectations. One might argue that the LCM measure is tapping into a different aspect of concreteness, but neither the data nor the stated theoretical background of LCM gives us an idea of what that aspect would be. We have several candidate explanations (ease of reading, the frequency of occurrence of the words in the dictionary, etc.). However, this blog post is already twice as long as the ideal blog post (1,600 words; Lee, 2014), and our bosses are beginning to suspect that all this sciency-looking typing is in fact nothing of the sort. So we will sadly have to leave you here, at the bottom of a rabbit hole, wondering what will happen next...

Contributions

This project grew out of work related to Sofia’s doctoral thesis. The group collectively conceptualized and planned the project. RabbitSnore wrote the initial draft of the post, wrote most of the code, and performed most of the statistical analyses. Each member of the team helped interpret the results and revise the post. Everyone contributed to the mayhem.
If you would like to get involved in the puddle-ducks’ concrete mayhem (especially if you have been in this concreteness measure rabbit hole yourself and can throw us a rope to get out), please let us know.

Figure 8. An artist’s representation of the blog authors.

Puddle-Ducks (2019). There might be problems with the automated scoring of linguistic concreteness. Rabbit Tracks. Retrieved from https://www.rabbitsnore.com/2019/02/there-might-be-problems-with-automated.html

Open data and code

The data presented here, the original texts we analyzed, and the code used to perform the scoring can be found here: https://osf.io/nf54b/

Notes

*Automatically coding adjectives as abstract is arguably also problematic. Many adjectives are not abstract (e.g., red, small), and scoring them according to the LCM-SBP formula can make a text appear more abstract when, in fact, the expression is concrete. Excluding adjectives from the formula did not, however, improve the scoring.

**The Idiot’s Guide to Home Repair isn’t exclusively about concrete, but it is about similarly physical and immediate objects and tasks.

***The effect size for LCM-SBP’s ability to discriminate between the philosophical texts and the concrete texts is in the correct direction, and by the standards of psychology, it is considerable (d = .60). However, this pales in comparison to the other effect sizes (d = 10.59 for folk concreteness and d = 2.71 for LCM-PD). This makes sense if nouns are adding noise to LCM-SBP, which is effectively removed in LCM-PD. The noise would attenuate LCM-SBP’s ability to distinguish between abstract and concrete texts. Considering the extent of the attenuation, it looks like nouns are adding an enormous amount of noise indeed.

References

Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904-911.

Carnaghi, A., Maass, A., Gresta, S., Bianchi, M., Cadinu, M., & Arcuri, L. (2008). Nomina sunt omina: On the inductive potential of nouns and adjectives in person perception. Journal of Personality and Social Psychology, 94, 839-859.

Coenen, L. H., Hedebouw, L., & Semin, G. R. (2006). The linguistic category model (LCM) manual. Unpublished manuscript. Amsterdam, NL: Free University Amsterdam.

Lee, K. (2014). The ideal length of everything online, backed by research. https://blog.bufferapp.com/the-ideal-length-of-everything-online-according-to-science

Seih, Y.-T., Beier, S., & Pennebaker, J. W. (2017). Development and examination of the linguistic category model in a computerized text analysis method. Journal of Language and Social Psychology, 36(3), 343-355.

Semin, G. R., & Fiedler, K. (1991). The linguistic category model, its bases, applications and range. European Review of Social Psychology, 2(1), 1-30.

Semin, G. R., & Fiedler, K. (1988). The cognitive functions of linguistic categories in describing persons: Social cognition and language. Journal of Personality and Social Psychology, 54, 558-568.

Thursday, June 7, 2018

An attempt to reproduce a meta-analytic review of Criterion-Based Content Analysis


Criterion-Based Content Analysis (CBCA) is a component of Statement Validity Assessment (SVA), which is a tool used to evaluate the credibility of statements – especially claims of abuse. In brief, CBCA is used to distinguish between credible and fabricated statements. CBCA comprises a set of 19 criteria on which a verbal statement is evaluated, which I’ve summarized in a table below. Truthful statements are expected to contain more of each criterion.

Criterion Description
1 Logical structure
2 Unstructured production
3 Quantity of details
4 Contextual embedding
5 Description of interactions
6 Reproduction of conversations
7 Unexpected complications
8 Unusual details
9 Superfluous details
10 Details misunderstood
11 External associations
12 Subjective mental state
13 Others' mental states
14 Spontaneous corrections
15 Admitting lack of memory
16 Self-doubt
17 Self-deprecation
18 Pardoning the perpetrator
19 Details characteristic of the event

CBCA is more thoroughly described elsewhere, and for a better introduction, I direct the interested reader to the several summaries that exist (e.g., Vrij, 2015). Admittedly, I am not an expert on the application of CBCA. I only became particularly interested in it as I was working on a statistical research project on the potential for effect size inflation in the literature on cues to deception.

As part of my literature review for that project, I took a closer look at a meta-analysis of CBCA criterion validity in adults by Amado, Arce, Fariña, and Vilariño (2016, hereafter AAFV)*. My primary interest was in assessing the potential for overestimation of composite measures designed to improve deception detection, but as I dug deeper into the methods of the meta-analysis and the literature it reviewed, I became increasingly concerned that the review and the primary studies had serious methodological and statistical flaws.

Causes for concern

One of the things that initially alarmed me about the effect estimates in AAFV was that, in many of the studies they included, criteria had been left unreported or minimally reported. That is, sometimes researchers said they coded certain criteria but omitted them from their report because they didn’t significantly discriminate between truthful and fabricated statements or because they occurred so infrequently that they were virtually useless. There are at least two problems with this, one statistical and one practical.

The statistical problem is that if researchers are in the habit of selectively reporting criteria like this, it will often not be possible to extract estimates for the underreported criteria – and the estimates that can be extracted will disproportionately be the ones that turned out to be significant and “useful.” This means long-run meta-analytic estimates will chronically overestimate the effects, because they regularly include artificially inflated estimates from primary studies.

The related practical problem is that a practitioner attempting to use CBCA in an individual case cannot necessarily (or generally) know whether to ignore a criterion because it isn’t useful or appropriate in this context. Ignoring specific criteria in a study is a decision made from the privileged position of being able to examine numerous statements. In a real case, the absence of a criterion does not occur against the backdrop of absences in other statements drawn from the same population, so its ignorability cannot be inferred. Thus, if a criterion fails to distinguish between truthful and fabricated statements in a given sample, ignoring it in that sample will overstate the detectability of deceit using CBCA.

Take “reproduction of conversations” for example. This criterion is sometimes purposefully ignored in experiments in which the stimulus material described in the sampled statements does not contain any conversations (so it isn’t sensible to expect conversations to be reproduced). Under real world conditions, however, a CBCA practitioner could not, by definition, know whether there were actually conversations in the alleged event, so he or she could not make an informed decision about the applicability of the criterion. Sometimes criteria are ignored in studies in which the criterion just so happened to be very rare or totally absent in the sampled statements, even when they could have plausibly occurred. This causes similar problems of generalizability to the real world.

There are ways of attempting to compensate for failures to report statistics in a form that allows exact effect sizes to be calculated. In their meta-analytic review of cues to deception, DePaulo and her colleagues (2003) attempted to correct for minimally reported cue effects (e.g., cues that were reported as nonsignificant without sufficient statistics to calculate an effect size) by imputing 0 for those effects (or, when it was possible to determine a direction of the effect, .01 or -.01). As far as I can tell, AAFV didn’t implement any similar imputation to account for minimally reported or unreported CBCA criteria. I’ll return to this later.

Other aspects of AAFV’s methods troubled me. For instance, I noticed that some of the reports marked for inclusion did not seem to fit the stated inclusion criteria. For example, they report having included Evans et al. (2013) – but the studies reported in that document don’t use statements as the unit of analysis. Rather, they report the results of multiple judgments of a large pool of statements, made by participant observers.

Another issue concerned the establishment of ground truth in the primary studies. In order to draw inferences about the diagnostic value of the CBCA criteria, we need to know with a high degree of certainty that the statements classified as true and fabricated are in fact respectively true and fabricated. In experiments, this tends to be fairly straightforward, since the researchers are typically under direct control of participants’ experiences and have assigned people to tell the truth or lie. In field studies, establishing ground truth can be exceedingly challenging, if not outright impossible. The field studies included in AAFV employed a variety of methods to establish ground truth, some of which strike me as quite problematic.

Several field studies relied on documented forensic evidence or court decisions. This seems like a reasonable approach to establishing ground truth in the field (indeed, perhaps the only way), but it may be subject to error and selection biases. There are obvious ways in which these decisions could go wrong. Forensic evidence can be non-diagnostic, and court decisions are sometimes incorrect. Selection biases are more subtle: The only cases that can be included are those in which there are apparently firm decisions and/or extensive documentation. That is, the included cases may represent the easiest of cases in which to assess credibility. Perhaps these are the very cases in which truthful claims look the way CBCA expects them to look, to the exclusion of truthful claims in more ambiguous cases that may not be rich in CBCA criteria. Thus, statements drawn from them may overestimate the diagnosticity of CBCA criteria. Of course, we have no way of knowing whether this possibility obtains in reality.

Other field studies took alternate approaches. Rassin and van der Sleen (2005) requested police officers to code statements from their own cases using a provided checklist. Ground truth was established by the officers’ judgments and court decisions. This presents a variation on the problem described above, as it adds another potential source of selection bias in the officers’ discretion. Ternes (2009) solicited statements from a sample of incarcerated people, without instructing them to lie or tell the truth. Statements were classified as credible and non-credible using CBCA. The effect sizes contrasting the credible and non-credible statements were, of course, tautological. The non-credible statements were classified as such precisely because they exhibited fewer CBCA criteria. There was no chance that the effect size estimates could fail to be in the expected direction.

I don’t mean to take individual researchers to task unfairly for the compromises they had to make in order to conduct their work – particularly for field studies, which impose hefty logistical challenges. Depending on their respective research questions, it may have made perfect sense to have done things the way they did.

That said, all this seemed to me like a recipe for overestimation. Wanting to get a better understanding of what was going on, I wrote to the corresponding author of AAFV to request the data**. After getting no response, I decided the best option was to try to redo the meta-analysis myself. So I did.

Replicating/reproducing AAFV

General approach

My intention was never to do a proper, original meta-analysis of the CBCA literature. Rather, my goal was to assess the validity of the conclusions and accuracy of the criterion estimates in AAFV. For this reason, I didn’t attempt to exhaustively search the literature and instead just attempted to obtain the documents included in AAFV. I obtained 38 of the 39 documents included in AAFV. I didn’t get my hands on an unpublished dataset the original authors had (i.e., Critchlow, 2011).

I registered my plan on OSF***. Registration seemed particularly important here because I was approaching this with a self-known bias: I already thought there were problems with the estimates. Although I believe my registration proved useful, I still made several judgment calls that were outside the coverage of the registration. For instance, I didn’t register a rule about how to deal with dependent effect sizes (e.g., several groups of fabricated statements compared against a single truthful control group). Ultimately, in cases where there were multiple dependent estimates, I extracted just one that seemed most appropriate (and recorded which one that was in my notes). I did this to avoid over-weighting any given sample.

Effect size extraction and imputation

When effect sizes were reported, I recorded them or converted them into d. More often, effect sizes had to be manually calculated. I used the compute.es package in R to do these calculations. Usually, this entailed entering means, standard deviations, and group ns into the mes() function or proportions and group ns into the propes() function, but when these data were unavailable I sometimes converted t or F statistics using tes() and fes() respectively.
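
To give a sense of what these calculations look like, here is a minimal compute.es sketch with made-up placeholder numbers (not values from any CBCA study):

```r
library(compute.es)

# From group means, SDs, and ns (e.g., truthful vs. fabricated statements):
mes(m.1 = 3.2, m.2 = 2.7, sd.1 = 1.1, sd.2 = 1.0, n.1 = 40, n.2 = 40)

# From a reported t statistic and group sizes:
tes(t = 2.1, n.1 = 40, n.2 = 40)
```

Both calls print a panel of effect size conversions, including d and its variance.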

I only extracted effect sizes for the original 19 CBCA criteria. AAFV also examined some additional criteria, but these are very rarely studied, and as far as I know, not often used in practice. I also excluded reports of “total CBCA score” because my issues with those scores are best addressed elsewhere.

Because of the concerns I described above, I wanted to account for the missing and excluded criterion estimates and decided to take an approach similar to the one taken by DePaulo et al. (2003). Each time a study reported excluding a criterion, I coded the stated reason for the exclusion. I had registered a plan to impute 0’s for effects excluded for several of the possible stated reasons (e.g., because the behavior rarely occurred in the sample).
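
In code, that imputation step looks roughly like the sketch below; the data frame and column names are illustrative stand-ins for the actual coding sheet.

```r
# Illustrative coding sheet: one row per criterion effect per study (names are ours).
effects <- data.frame(
  study            = c("A", "A", "B"),
  criterion        = c(1, 10, 10),
  d                = c(0.35, NA, NA),
  exclusion_reason = c(NA, "rarely occurred", "not coded")
)

impute_reasons <- c("rarely occurred", "did not discriminate")  # qualifying reasons (illustrative)

to_impute <- is.na(effects$d) & effects$exclusion_reason %in% impute_reasons
effects$d[to_impute] <- 0   # row 2 gets d = 0; row 3 stays NA because its reason does not qualify
```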

We should be clear about the limitations of this conservative approach to imputing effect sizes. Its effectiveness, I believe, pivots on the representativeness of the literature (cf. Fiedler, 2011). This is a problem of ecological validity in the Brunswikian sense of the term (Brunswik, 1955). If we want to estimate the effects that occur out in the wild, these imputations will improve the accuracy of the estimates assuming that the situations sampled in the literature adequately represent the situations in which CBCA is applied. That is, assuming that underreported criteria are excluded because they cannot plausibly distinguish between truthful and deceptive statements in a given situation, weighting underreported criteria toward zero will improve the estimates, to the extent that situations that render those criteria useless actually occur in reality. However, non-representative design in the literature will bias the observed estimates (likely overestimating them, by oversampling situations in which the criteria are generally more effective), but it will also change the effect of the imputations. The imputations will underestimate the criterion effect sizes if the empirical literature has, for example, oversampled situations in which underreported criteria are ineffective relative to their actual effect in the population****.

Because this approach is fraught with potential problems, I report the results both with and without imputations below.

Results

All the data I extracted from the documents, my coding notes, and scripts for the analyses and visualizations can be found here: osf.io/4w7tr/

Below, I have tabled the effect size estimates for each criterion. The table presents results from AAFV and the results from my replication, with and without the imputations.

            AAFV            Replication     Replication with imputations
Criterion   d       k       d       k       d       k
1           0.48    30      0.15    22      0.11    31
2           0.53    27      0.41    20      0.29    27
3           0.55    35      0.31    27      0.25    33
4           0.19    29      0.19    19      0.11    31
5           0.27    29      0.21    23      0.15    31
6           0.34    34      0.38    21      0.27    29
7           0.25    29      0.17    21      0.13    27
8           0.31    35      0.19    23      0.14    29
9           0.14    27      0.29    21      0.19    29
10          0.22    5       -0.01   7       0.00    26
11          0.26    22      0.19    15      0.12    26
12          0.18    28      0.17    21      0.11    30
13          0.09    31      0.18    22      0.13    31
14          0.16    29      0.14    22      0.09    29
15          0.25    34      0.23    22      0.18    26
16          0.20    26      0.23    18      0.16    24
17          0.04    13      0.27    11      0.11    25
18          -0.02   8       0.08    7       0.00    24
19          0.28    5       0.29    3       0.04    23

You will notice that the numbers are not the same across the different methods. There are wide discrepancies between my numbers (both the effect estimates and the number of effects) and the original AAFV numbers.

AAFV reported a total of 476 effect estimates. I extracted 345 – a difference of 131. It is possible I missed some estimates. Some of the missing estimates presumably came from the one unpublished dataset I didn’t obtain, and I excluded some documents that AAFV included (e.g., Evans et al., 2013). Perhaps AAFV included dependent effect sizes, which I excluded. I also made some admittedly ad hoc judgment calls. For example, in one case, I excluded a set of effect sizes because the standard deviations for the criterion measures were implausibly small (e.g., M = 1.85, SD = .05) and would therefore have generated effect estimates that were outrageously large. However, it’s not clear to me that these differences account for all the discrepancies. Without examining AAFV’s original data, we can’t say why there’s such a wide discrepancy.

Most of my estimates for the criterion effects are lower than those of AAFV, but a few are higher. Some are quite different. For example, AAFV estimated Criterion 1 (logical structure) at d = .48, whereas I estimated it at d = .15 (without imputations). AAFV estimated Criterion 10 (accurate details misunderstood) at d = .22, whereas I estimated it at d = -.01 (without imputations). These differences attest to the fact that differing methods of reviewing the literature can result in widely discrepant estimates.

Intuitively, we might be inclined to trust the effect sizes from the larger number of estimates – that is, AAFV’s estimates. This intuition may be misguided, however, if AAFV included many dependent effect sizes that I didn’t. For example, Ternes (2009; the one with the tautological effects) reported data to compute several sets of dependent effect sizes. I only extracted one set of effects (i.e., one estimate per reported criterion) from that document, but if AAFV extracted several sets, they may have given too much weight to poor estimates (which had no chance of failing to support the hypotheses). I don’t know what they did.

If we look at the estimates with the imputations, they are, of course, generally lower than those without the imputations (as would be expected if you added a bunch of 0’s to the data). But for many criteria, the imputations had quite stark effects, suggesting that many criteria were excluded quite frequently. For example, Criterion 2 (unstructured production) dropped from d = .41 to d = .29. Criterion 19 (details specific to the event) dropped from d = .29 to d = .04. Which estimates are more accurate is, of course, debatable.

To get an overall picture of all the criteria, I created a funnel plot for all the effect sizes I extracted. This plot uses the data without imputations. Vertical reference lines are drawn at 0 and the weighted mean effect size for all criteria (d = .22 – approximately the difference in length between Harry Potter and Lord of the Rings books), with vertical lines drawn for the 95% confidence bounds for the meta-analytic effect size. The funnel guide lines are drawn for the 95% and 99% confidence levels.



You can see there are many outlying effect estimates – well outside the 99% confidence bounds – especially positive outliers at lower standard errors. The positive outliers tend to have larger absolute effects than the negative outliers. It’s possible this is a symptom of publication bias. You can also see that the effect sizes seem to center closer to zero at higher standard errors – but these estimates come from just a few samples, so we should be cautious about overinterpreting them.

A common approach to compensating for publication bias is the trim and fill technique. I applied this technique using the trimfill() function of the metafor package for R, specifying imputation of missing effects on the negative side*****. The trim and fill estimated that a total of 59 effects were suppressed by publication bias, and the adjusted weighted mean effect size was a much more modest d = .06 (approximately the difference in review quality between Meryl Streep movies and Tom Hanks movies). Although 59 estimates might seem like a lot, bear in mind that CBCA comprises 19 criteria. Thus, this is similar to an imputation of about three studies finding negative effects for all criteria.
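
For reference, here is a minimal metafor sketch of this kind of analysis, assuming a data frame dat with an effect size column yi (d) and its sampling variance vi; it mirrors the calls described above rather than reproducing the exact analysis script.

```r
library(metafor)

res <- rma(yi, vi, data = dat)       # random-effects meta-analysis of the extracted effects
funnel(res)                          # funnel plot like the one above

tf <- trimfill(res, side = "left")   # trim and fill, imputing missing effects on the negative side
summary(tf)                          # adjusted weighted mean effect size
funnel(tf)                           # funnel plot including the filled-in estimates
```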

Experiments and field studies

I had registered a plan to test whether effect sizes varied as a function of different methods of establishing ground truth, but there were only a handful of field studies, so I opted not to run these analyses, as I doubt they would be meaningful.

However, you’ll notice in the table below that criterion effect estimates from experiments tend to be substantially lower than estimates from field studies. This occurred both in my replication and in AAFV (see their moderator analyses). The experimental effects were quite modest and are generally in line with the rest of the deception literature. The median effect size for cues to deception in DePaulo et al. (2003) is |.10|. Without imputations, the median effect size for experiments is .12 (the mean is .13). Field studies, in contrast, had much larger effects. Without imputations, the median effect size for field studies is .46 (the weighted mean is .35). All the estimates in the table below are from my replication, not from AAFV. AAFV estimated the mean effects for experiments and field studies to be higher, at .25 and .75 respectively.

            Experiments     Experiments with    Field studies   Field studies with
                            imputations                         imputations
Criterion   d       k       d       k           d       k       d       k
1           0.04    16      0.02    23          0.69    3       0.42    5
2           0.19    13      0.12    19          0.80    4       0.63    5
3           0.28    20      0.21    25          0.12    4       0.10    5
4           0.13    13      0.08    23          0.16    3       0.09    5
5           0.17    17      0.12    23          0.18    3       0.10    5
6           0.27    15      0.18    21          0.64    3       0.41    5
7           0.04    15      0.03    19          0.52    3       0.33    5
8           0.16    18      0.13    21          0.30    3       0.20    5
9           0.27    15      0.16    22          0.51    3       0.38    4
10          -0.12   5       -0.04   19          0.69    1       0.12    4
11          0.02    10      0.02    19          0.60    3       0.38    5
12          0.08    15      0.05    23          0.22    3       0.17    4
13          0.17    15      0.11    23          -0.12   4       -0.08   5
14          0.04    15      0.03    21          0.29    4       0.23    5
15          0.26    16      0.21    19          0.07    4       0.07    4
16          0.12    13      0.09    17          0.65    3       0.47    4
17          -0.06   5       -0.02   18          0.67    3       0.49    4
18          -0.28   2       -0.03   17          -0.11   3       -0.08   4
19          0.00    1       0.00    16          0.18    1       0.06    4

A prosaic explanation for this apparent difference between experiments and field studies is ordinary sampling variation. We might expect there to be greater variation in estimates with fewer studies. It is also possible that there is a genuine systematic difference between effects in the lab and effects in the field.

It might be tempting to think that CBCA is more effective in real cases than with artificially produced statements obtained under controlled laboratory conditions. This is possible. Bear in mind, however, that field studies are also confounded with their method of establishing ground truth, which is necessarily inferior to experimentally established ground truth. For the reasons I noted earlier, it’s possible for selection biases to produce overestimated effects under suboptimal conditions for establishing ground truth.

No matter the cause, it seems that the larger effects in the literature are primarily driven by the field studies, which produce dramatically larger estimates. Seven of them are over .60. Another two are above .50. For intuitive comparison, d = .59 represents the difference in weight between men and women (Simmons, Nelson, & Simonsohn, 2013). One wonders if it’s plausible for any criterion to have a true effect that is quite this large, let alone seven. Anything is possible, but this strikes me as unlikely, given the overall weakness of cues to deception (DePaulo et al, 2003).

What does all this mean?

You should not trust my results – at least not blindly. I was operating alone, with no one to check my work and no one with whom to establish the reliability of my coding of exclusion justifications. Curious or skeptical readers should look at my notes and my data and check for errors. I was, of course, trying to arrive at accurate estimates – but that doesn’t mean anyone should take my word for it. That said, I believe there are two general lessons to take away from what happened here. The first is a lesson about meta-analytic review methods in general and in CBCA in particular. The second is a lesson about the validity of CBCA criteria.

Little differences, big differences

In extracting effect sizes from the literature, I tried to follow the methods of AAFV reasonably closely (though I almost certainly applied their inclusion criteria differently for some documents). Some estimates were fairly close to those of AAFV, but for other estimates, I wasn’t even remotely close to theirs.

How is it possible for relatively minor variations in review process to cause large discrepancies in results? The explanation is, I believe, that the informational value of individual CBCA studies has been relatively low, due to low power and selective reporting. With small sample sizes, individual studies exhibit high degrees of variation in their results. If CBCA studies were routinely high powered, their estimates would be more precise, and we would expect there to be less heterogeneity of effect sizes. However, we can see in the funnel plot above that there is massive heterogeneity of effect sizes at lower standard errors – that is, with smaller sample sizes. Because the individual estimates are so heterogeneous, even small changes in the inclusion criteria or method of effect size extraction can cause shifts in the meta-analytic estimates.

Meta-analysis is a powerful tool for getting a birds-eye view of a research literature, but it can be finicky. Because minor variations in methodology can lead to different conclusions, it is especially important that the meta-analytic methods are transparent and reproducible.

Do the CBCA criteria discriminate between truthful and deceptive statements?

Although my estimates for the validity of the CBCA criteria (without imputations) were often substantially lower than the estimates of AAFV, I don’t take this as particularly good evidence that the CBCA criteria don’t work as hypothesized. However, given what we see here, we should be skeptical of claims that the CBCA criteria do work. If my reasoning about criterion exclusions is correct, we should be concerned about the way researchers have routinely taken a selective approach to CBCA, as this is quite likely to inflate individual estimates (and, as a consequence, uncorrected long-run estimates as well).

I employed two methods to attempt to correct for reporting biases: the imputation of 0 effect sizes and the trim and fill method. Both these techniques suggest that the actual ability of CBCA criteria to distinguish between true and false statements is much more modest than AAFV concluded. We can and should question the appropriateness of both these approaches to correcting the effect sizes. However, they point toward the same conclusion: the validity of CBCA criteria may be quite overstated. Ultimately, I think the way to resolve the question of the validity of the criteria is with high-powered research that is transparently conducted and reported.

Indeed, many of the problems I’ve bemoaned throughout this post could have been ameliorated with transparent, open science practices******. More complete reporting, for example, would have allowed effect estimates to be corrected with greater precision. To the extent that publication bias has suppressed studies with null or negative effects, disclosure of those findings would have helped create a more accurate picture of the literature. Much of the CBCA research was conducted and reported before open science was conveniently facilitated by online repositories and other resources. For that research, one can easily understand the unavailability of supplementary material and original data. However, for recent research – and for AAFV’s review itself – there is little excuse. Many of the questions whose answers have eluded me here could have been resolved swiftly and confidently if the original and meta-analytic data were openly available.

It is commonly claimed that CBCA is empirically supported, even if it is not unerringly accurate (see, e.g., Vrij, 2015). However, much of the supposed support may come from distorted effect estimates and may be artifacts of various biases and selective reporting. What we see here provides a different, less encouraging, view of CBCA. As is the case in many areas of science, a lack of transparency has planted landmines on the path to discovery.

Notes

* CBCA was originally intended for use with children’s statements. The focus of my larger project, however, is cues to deception in adults, and that focus has carried over into this side project.

** AAFV reports there are supplementary materials available on the ResearchGate profile for the corresponding author, but I couldn’t find anything there. There will be egg on my face if the meta-analytic data are in fact hosted there somewhere, but I did check.

*** One could reasonably ask, if I care so much about rigor and whatnot, why didn’t I write this up for peer review and publication in a proper journal? There are a few reasons. First, I did all this work alone, and if it were to be done really properly, I think I should have worked with others to recheck my calculations, establish a reliable coding system, and resolve disputes about inclusions/exclusions. Although I think collaborating would have improved the quality of this work, I mostly did this to investigate my hunch that something was off about the CBCA literature and as a relatively minor supplement to my larger project examining effect size inflation in the deception literature. Moreover, I didn’t do an exhaustive search of the CBCA literature and simply relied on the documents included in AAFV. This limits the ability of this analysis to speak generally about the validity of CBCA criteria (though I think it is sufficient nevertheless to make important points about the CBCA literature in general). I think registration and making my work openly available offsets many of the problems of working on this project as a lone ranger, but I think there are still serious drawbacks to the way I approached this, such that I’m not comfortable submitting it to a journal. Second, writing in blog format rather than journal format is in many ways liberating. The structure and style of the post is totally at my discretion, for instance. Third, I think blogging is a legitimate method of scientific communication. It has many disadvantages relative to peer-reviewed publications, but the mere fact that something is a blog does not in itself undermine the quality of the information therein. If the t-distribution were published on Student’s blog, we still should have taken it seriously.

**** Representative design is not the most intuitive thing in the world, and it’s hard to explain. When I reread this paragraph, I thought, “I think this is accurate, but it’s hard as hell to understand.” Here’s a more approachable illustration…

Imagine there are two bakeries: Camilla’s Credible Crusts and Fiona’s Fabulous Fabrications. A group of six researchers gets into a heated argument about the qualities that characterize the baked goods from each shop. One of the researchers, Udo, hypothesizes that goods from Camilla’s are generally higher on the following qualities: sweetness, firmness, lightness of color, and freshness of smell. Together, the other five agree to test Udo’s hypothesis by each obtaining a sample of baked goods from each bakery and measuring the four qualities of each item, to see if they can classify which shop each baked good came from.

For simplicity, let’s assume they all acquire the same number of items, so the sample sizes are all equal. But they don’t randomly sample baked goods from each shop. Rather, they each individually and non-systematically get items from Camilla’s and Fiona’s.

One researcher, Gunther, is off to a good start: He gets an assortment of cakes and cookies from each shop. He is able to calculate the extent to which each shop tends to differ on all four attributes. Sharon has similar luck with her sample.

Others in the group run into methodological oddities. Alicia ends up with eight different kinds of meringues and nothing else. She measures the sweetness, color, and smell, but she decides to ignore the firmness, since all the meringues have indistinguishable textures. Other members of the team run into similar kinds of issues but for different variables. All of Mary’s macaroons were the same color, for instance, and all of Veronica’s pies were equally sweet. When Alicia, Mary, and Veronica report back to the rest of the group, they each leave out the variable that seemed safe to ignore in their respective samples.

Let’s assume we no longer have access to the baked goods (because they ate them all) or the original data from each person (because it’s the ‘80s and there are no online data repositories). We only have each person’s summary and conclusions.

Here’s the problem: If we meta-analyze the results of the five reports, how are the excluded variables going to affect the assessment of Udo’s hypothesis? And what do we do about it?

It’s going to matter how well each sample of baked goods represents the inventory of the bakeries. If meringues are only a tiny fraction of the inventory at Camilla’s and Fiona’s, then all the estimates might be thrown off by the fact that meringues make up an entire fifth of the observations. If meringues actually constitute a fifth of the bakery inventories (or close to it), then one viable strategy for dealing with Alicia’s missing measure of firmness is to impute a 0 for the lost effect size. After all, the diagnostic value of firmness can be expected to be close to 0 for that part of the population, so imputing a 0 for the missing value makes sense – and it will be weighted accordingly in the meta-analysis.

Thus, if the samples of baked goods give a good picture of the bakeries’ actual inventories, then imputing 0’s for the missing values is a good idea because it will bring the estimates closer in line with what you can expect the values to be in the general population of baked goods. But if the samples don’t represent the inventories well, the imputations are going to throw off the estimates. Of course, it’s not all or nothing. It’s the extent of the representativeness that matters.

In this whimsical example, you could just walk into Camilla’s and Fiona’s bakeries and check their display cases to see how well the samples represent the population. In the real world, we don’t know exactly how the population looks.

Another problem (indeed one of the core problems of representative design) occurs if the researchers selected baked goods they thought would support Udo’s hypothesis, rather than attempting to represent the actual bakery inventories. That is, if they picked items they had a good reason to think would actually differ on at least some of the four criteria of sweetness, firmness, color, and smell, then it would bias the estimates upward relative to the actual difference in the broader population of baked goods. Each researcher’s report would seem to suggest that the criteria discriminate between the bakeries to a larger extent than they actually did. If another person had a cookie and wanted to figure out which bakery it came from, they might be misled into thinking those criteria could help them make a good guess about whether it was from Camilla’s or Fiona’s. This problem is difficult to detect and difficult to correct.

***** I did not register this analysis. Interpret it with caution.

****** Some of the problems would not be solved by open science. The issue of low-power and low precision in estimation, for instance, would not be solved by increased transparency. Solving that problem requires lots more data.

References

Resources and registration for this project are available here: https://osf.io/4w7tr/ . In case I need to make updates to this post, a copy of the original version is hosted there as well.

Amado, B. G., Arce, R., Fariña, F., & Vilariño, M. (2016). Criteria-Based Content Analysis (CBCA) reality criteria in adults: A meta-analytic review. International Journal of Clinical and Health Psychology, 16, 201-210.

Brunswik, E. (1955). Representative design and probabilistic theory in a functional psychology. Psychological Review, 62, 193-217.

Critchlow, N. (2011). A field validation of CBCA when assessing authentic police rape statements: Evidence for discriminant validity to prescribe veracity to adult narrative. Unpublished raw data.

DePaulo, B. M., Lindsay, J. J., Malone, B. E., Muhlenbruck, L., Charlton, K., & Cooper, H. (2003). Cues to deception. Psychological Bulletin, 129, 74-118.

Evans, J. R., Michael, S. W., Meissner, C. A., & Brandon, S. E. (2013). Validating a new assessment method for deception detection: Introducing a Psychologically Based Credibility Assessment Tool. Journal of Applied Research in Memory and Cognition, 2, 33-41.

Rassin, E., & van der Sleen, J. (2005). Characteristics of true versus false allegations of sexual offences. Psychological Reports, 97, 589-598.

Simmons, J., Nelson, L., & Simonsohn, U. (2013). Life after P-Hacking. Meeting of the Society for Personality and Social Psychology, New Orleans, LA, 17-19 January 2013. Available at SSRN: https://ssrn.com/abstract=2205186

Ternes, M. (2009). Verbal credibility assessment of incarcerated violent offenders' memory reports. Doctoral dissertation, University of British Columbia.

Vrij, A. (2015). Verbal Lie Detection tools: Statement validity analysis, reality monitoring and scientific content analysis. In P.A. Granhag, A. Vrij, B. Verschuere (eds.), Detecting Deception: Current Challenges and Cognitive Approaches (pp. 3-35). John Wiley & Sons.