Thursday, February 28, 2019

There might be problems with the automated scoring of linguistic concreteness

This blog post is a collaboration by Karl Ask, Sofia Calderon, Erik Mac Giolla, RabbitSnore (Timothy Luke), and Lara Warmelink – collectively, the Puddle-Ducks.

Figure 1. Jemima Puddle-Duck

Introduction

Listen, children, and you shall hear the tale of the linguistic concreteness measures with questionable construct validity...

Once there was a team of researchers (that’s us) interested in human deception. We wanted to know whether truthful and deceptive statements about the future differ in the use of abstract/concrete language. Specifically, we thought lies about what you plan to do might incorporate more abstract words than telling the truth about what you plan to do. If truths and lies differ in this way, it’s plausible that it reflects differences in the way that liars and truth-tellers mentally represent future behavior. Perhaps false intentions are represented more vaguely and abstractly and true intentions are represented with more concrete detail. We think this is theoretically interesting – but it isn’t the primary topic of our story today. Instead, we want to talk about what happened when we tried to measure linguistic concreteness.

Linguists and psychologists have devised several ways of measuring the abstractness/concreteness of language. We began with a method proposed by Brysbaert, Warriner, and Kuperman (BWK, 2014). They had around 4,000 people rate the concreteness of about 40,000 English words on a 1-5 scale (1 = most abstract, 5 = most concrete). Thus, they produced a dictionary of words and their respective average concreteness ratings – ranging from “essentialness” (M = 1.04) to “waffle iron” (M = 5.00). This dictionary can be used to code the linguistic concreteness of text data (see below for details on how this works). We call this system “folk concreteness” to contrast it with other systems, since it represents a measure of laypeople’s perceptions of linguistic concreteness, rather than a measure defined a priori by theoretical or philosophical considerations.

The coding of folk concreteness scores is quite simple. Each word in a text is matched against the BWK dictionary and assigned that word’s average rating. If a word is not in the dictionary, it does not receive a score. You then take the mean of the concreteness values for all the matched words. Thus, the folk concreteness score is an average-based composite of the scores of the individual words appearing in the text. Automatic scoring for this system is pretty straightforward in R (https://osf.io/z9mq6/).
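To illustrate the idea (this is a minimal sketch, not our preregistered OSF code), the scoring can be done in a few lines of R, assuming the BWK norms have been loaded into a data frame `bwk` with columns `Word` and `Conc.M`:

# Minimal sketch of folk concreteness scoring (not the preregistered OSF code).
# Assumes `bwk` is the Brysbaert et al. (2014) norms as a data frame with
# columns `Word` (lowercase word) and `Conc.M` (mean concreteness rating).
folk_concreteness <- function(text, bwk) {
  words  <- unlist(strsplit(tolower(text), "[^a-z']+"))  # crude tokenizer
  words  <- words[words != ""]
  scores <- bwk$Conc.M[match(words, bwk$Word)]           # NA for words not in the dictionary
  mean(scores, na.rm = TRUE)                             # unmatched words are simply dropped
}

folk_concreteness("The waffle iron sat on the kitchen counter", bwk)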

To test our hypothesis about statements of true and false intention, we assembled text data from seven experiments in which participants lied and told the truth about some future activity. In total, we had N = 6,599 truthful and deceptive statements. We preregistered our approach and analysis code (https://osf.io/y48cz/), coded the texts using the folk concreteness system, and ran our planned analyses. Full presentation of the results is forthcoming (and will be part of the PhD thesis of Sofia Calderon), but in short, we found no evidence that truthful and deceptive intention statements differ in linguistic concreteness with this coding system.

After we performed the folk concreteness analyses, we found another system for coding concreteness in text data, so we decided to cross-validate the results with the alternative method. And that’s where the trouble started...

Finding nothing where there should be something

Seih, Beier, and Pennebaker (SBP, 2017) recently provided a method for computer-automated coding according to the linguistic category model (LCM). In brief, LCM is a system for classifying words according to their abstractness/concreteness in the context of social behavior and communication (Semin & Fiedler, 1988, 1991). In the SBP system, texts receive LCM scores that vary from 1 (most concrete) to 5 (most abstract). Unlike the folk concreteness approach, LCM is theory-driven and uses specified weights for categories of words. Traditional LCM scoring is performed manually, by human coders. The central contribution of SBP was the automation of the coding, which is of course much more efficient, better for the sanity of research assistants, and potentially more reliable.

SBP used a dictionary of approximately 8,000 verbs, tagged with their LCM classification: descriptive action verbs (DAV), interpretive action verbs (IAV), and state verbs (SV). In addition to these 8,000 verbs, SBP proposed using automated classification of parts of speech to count the number of adjectives and nouns in a text. To do this, they used TreeTagger, a freely available utility that uses probabilistic decision-tree models to automatically identify parts of speech (POS) in text.

Once the LCM verbs, adjectives, and nouns have been counted, the sums are entered into an equation to obtain an LCM score for that text. SBP propose the following formula:

\begin{equation} LCM_{SBP} = \frac{DAV + 2 \times IAV + 3 \times SV + 4 \times adjectives + 5 \times nouns}{DAV + IAV + SV + adjectives + nouns} \end{equation}

Each of the weights assigned to the different word classes reflects the hypothesized abstractness of that word class. Thus, nouns are considered the most abstract, and descriptive action verbs are considered the most concrete.
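As a worked example, the formula is just arithmetic over the category counts. The sketch below (with made-up counts, not SBP’s code) shows the calculation in R:

# Minimal sketch (not SBP's implementation): LCM-SBP score from category counts.
lcm_sbp <- function(dav, iav, sv, adj, noun) {
  (dav + 2 * iav + 3 * sv + 4 * adj + 5 * noun) /
    (dav + iav + sv + adj + noun)
}

# Hypothetical text with 10 DAVs, 5 IAVs, 3 SVs, 8 adjectives, and 20 nouns:
lcm_sbp(dav = 10, iav = 5, sv = 3, adj = 8, noun = 20)  # = 161 / 46, about 3.5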

We obtained the SBP dictionary and retested our hypotheses with the LCM system, to see whether the results lined up with those of the folk concreteness approach. Following SBP, we used TreeTagger for the POS tagging. Whereas SBP conducted their analyses with Linguistic Inquiry and Word Count (LIWC), we used R for all analyses, implementing TreeTagger via the koRpus package. Our code is available on OSF (https://osf.io/gx8mj/).
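For readers who want to try this themselves, here is a minimal sketch of the tagging step. It assumes a local TreeTagger installation; the input file name, the installation path, and the counting of nouns and adjectives from Penn Treebank tags are our illustrative assumptions, not a reproduction of our OSF code:

# Minimal sketch of POS tagging with koRpus + TreeTagger (not our OSF code).
library(koRpus)
library(koRpus.lang.en)  # English support; a separate package in recent koRpus versions

tagged <- treetag(
  "statement.txt",                       # hypothetical input file
  treetagger = "manual",
  lang = "en",
  TT.options = list(path = "~/TreeTagger", preset = "en")  # path to local TreeTagger install
)

tokens <- taggedText(tagged)                 # one row per token, with Penn Treebank tags
n_nouns <- sum(grepl("^NN", tokens$tag))     # NN, NNS, NNP, NNPS
n_adjs  <- sum(grepl("^JJ", tokens$tag))     # JJ, JJR, JJS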

We registered a plan, scored the deception texts, and compared the LCM scores of truthful and deceptive statements. As expected, LCM told the same story as folk concreteness: No difference between truthful and deceptive statements. At first, this seemed like straightforward corroboration of our previous results. But the best laid plans of mice and researchers go oft awry...

As part of our exploratory analyses, we calculated a simple correlation between the folk concreteness scores and the LCM scores. Because both coding systems are supposed to measure linguistic concreteness, we would expect a fairly strong correlation (specifically a negative one, since the scale poles have opposite meanings). However, the values were almost perfectly uncorrelated, r = .025 (see Figure 2). Two measures of the same construct should not behave this way. How could this possibly be the case?

Figure 2. Folk concreteness and LCM scores in the true and false intention data

This is more or less what you’d expect to see if two continuous variables were totally unrelated. You’ll notice that the LCM scores seem to clump at whole and half numbers. Because all LCM weights are whole numbers, the possible scores for very short texts are constrained (a text with only one scored word, for instance, can only receive one of the integer weights).

“Jemima became much alarmed”

If folk concreteness and the automated LCM scores both measure linguistic concreteness, the scores should be correlated. Granted, folk concreteness and LCM aren’t necessarily supposed to measure the same kind of concreteness. LCM purports to capture psychological processes associated with social communication, whereas folk concreteness simply reflects the perceived concreteness of the individual words used in a given text. So the correlation might reasonably be weaker than we originally thought, but it was surprising that it was this weak.

The near-zero correlation was a cause for serious concern, so we performed a few informal tests of the two coding systems, to see how well they distinguished between texts we were pretty sure would differ in linguistic concreteness. Our reasoning was simple: If these coding systems are working the way they are supposed to, the scores ought to easily distinguish between obviously different texts.

Figure 3. A fox manually coding text according to the linguistic category model (LCM).

What kinds of texts should a good measure of linguistic concreteness be able to distinguish between? One approach would be to test substantive psychological theory using the various measures, to see if the predictions bear out according to the scores. But that has the disadvantage of relying on the validity of the theory in question. We opted instead to test almost painfully obvious predictions. We unsystematically collected some texts: an eclectic batch of philosophical texts, children’s stories, and song lyrics – things that should differ from each other substantially. We scored them using both folk concreteness and LCM and compared the results.

Below is a list of the texts we collected, followed by a table of their folk concreteness and LCM scores.

Short name Description
Carl Gustaf The Wikipedia article for the current king of Sweden
Jemima Puddle-Duck Beatrix Potter’s children’s story about a duck who struggles to care for her eggs
Peter Rabbit Beatrix Potter’s children’s story about a mischievous rabbit
Drywall A tutorial on how to repair damaged drywall
Association of Ideas James Mill’s chapter on "The Association of Ideas" in Analysis of the Phenomena of the Human Mind
On Denoting Bertrand Russell’s classic text "On Denoting"
Judgment of Taste Immanuel Kant’s chapter "Investigation of the question whether in the judgement of taste the feeling of pleasure precedes or follows the judging of the object" in Critique of Judgment
Oops, I Did It Again The lyrics for every song on Britney Spears’s album Oops, I Did It Again
Songs of Love and Hate The lyrics for every song on Leonard Cohen’s album Songs of Love and Hate
Their Finest Hour Winston Churchill’s “Finest Hour” speech

Text Folk concreteness LCM
Carl Gustaf 2.35 4.25
Jemima Puddle-Duck 2.71 3.73
Peter Rabbit 2.73 3.67
Drywall 2.87 3.71
Association of Ideas 2.22 3.76
On Denoting 2.14 3.90
Judgment of Taste 2.06 3.94
Oops, I Did It Again 2.67 3.84
Songs of Love and Hate 2.79 3.72
Their Finest Hour 2.31 3.68

Folk concreteness fairly reliably scored philosophical texts as more abstract than other texts, and it predictably scored a tutorial on how to fix drywall as the most concrete of the texts. In short, folk concreteness seemed to work reasonably well at matching our own intuitive feeling about how concrete a text is.

You might notice that the range of scores was constrained between 2 and 3. This restriction of range is not surprising. Folk concreteness scores for any given text can in principle vary from 1 to 5. However, because any natural language text almost invariably contains a diverse set of words, we would not expect many (or any) texts of substantial length to receive scores close to the upper or lower bounds of the scale. Any text longer than a few words will score close to the middle of the scale: it is simply implausible (if not impossible) to say or write anything meaningful without using words that vary substantially in concreteness, and averaging over those words pulls the scores toward the midpoint.

In contrast to the folk concreteness system, the automated LCM coding had some trouble. For starters, as can be seen in the table of results, LCM could not meaningfully distinguish between James Mill and Jemima Puddle-Duck, assigning them both scores of around 3.7. Indeed, the automated LCM coding didn’t seem to score any of the texts quite as one might expect. For instance, LCM inexplicably scored the Wikipedia article on the king of Sweden as extremely abstract – even more abstract than Bertrand Russell’s “On Denoting.” And the lyrics of the Britney Spears album Oops, I Did It Again were ranked by LCM as only slightly less abstract than the writings of Russell and Kant (and, in fact, a little more abstract than Mill). Now, we would not want to malign Potter and Spears, but we do feel that their texts (directed as they are at children and teenagers, respectively) are more concrete than texts by three philosophers, whose writing was not necessarily intended to be concrete (or readable for a general audience).

This pattern of scores is potentially indicative of a serious problem. One possibility is that the automated LCM coding isn’t measuring what it is intended to measure. Maybe it doesn’t measure anything at all and is essentially noise derived from arbitrarily weighting the counts of different categories of words. Figure 2 certainly looks the way one would expect it to look if one or both variables were nearly or entirely noise.

Another possibility is that the automated coding only works for specific types of texts (e.g., descriptions of interpersonal behavior), and perhaps our texts do not conform to the model’s assumptions. Although it is somewhat unclear what kinds of language would be inappropriate for coding, the LCM coding manual suggests that, within a given text, language describing persons should be coded, but descriptions of situations should not be. The automated coding, however, does not discriminate between interpersonal and impersonal descriptions. As such, perhaps the automated coding only works for texts entirely composed of interpersonal language. Maybe we are misusing the coding system in this informal test – but that does not explain the near-zero correlation in the true and false intention data, which ought to be largely or exclusively composed of text that is appropriate for LCM coding, since those statements entail descriptions of planned activities.

Whatever the cause, our results do not inspire confidence in this automated LCM coding system, and further investigation was warranted. To see whether these results were just flukes, we wanted to feed the coding systems many more texts and check whether the same patterns emerged with other data. Specifically, we planned to feed them (1) several more texts with fairly obvious relative levels of concreteness and (2) large samples of texts that should offer good estimates of the correlation between the scores given by the two coding systems.

Before we look at more data, let’s consider an oddity in the automated LCM coding system...

LCM scoring, revised

One of the strange things about SBP’s system of coding is that nouns are so heavily weighted as the most abstract word category. In the LCM coding manual (Coenen, Hedebouw, & Semin, 2006), nouns were considered only when used as a qualifier (i.e., performing a function similar to an adjective; e.g., “She’s a nature girl.”). As support for their weighting of nouns, SBP cite Carnaghi et al. (2008), who report a series of studies in which nouns were found to be more abstract than adjectives. But rather than dealing with all nouns, Carnaghi et al. focused exclusively on person perception and considered nouns that, for example, describe a group of people (e.g., athletes).

SBP’s approach struck us as problematic, since the automated coding procedure indiscriminately counts all nouns as highly abstract, not just the ones that describe or qualify people. Rocks, string cheese, and pitbulls (all considered highly concrete in the folk system) would all be considered maximally abstract under the SBP formula. On its face, that seems like nonsense*.

Thus, we considered a simple fix: remove nouns from consideration altogether. The formula then becomes:

\begin{equation} LCM_{PD} = \frac{DAV + 2 \times IAV + 3 \times SV + 4 \times adjectives}{DAV + IAV + SV + adjectives} \end{equation}

This revised LCM score can vary from 1 to 4 – so the bounds of the scale are a bit more constrained than in SBP’s version. We call this modification LCM-PD, for “Puddle-Ducks.”
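Mirroring the earlier sketch, the revised score simply drops the noun terms (again with made-up counts, not our OSF code):

# Minimal sketch of the nouns-removed variant (LCM-PD); not our OSF code.
lcm_pd <- function(dav, iav, sv, adj) {
  (dav + 2 * iav + 3 * sv + 4 * adj) / (dav + iav + sv + adj)
}

# Same hypothetical counts as before, minus the 20 nouns:
lcm_pd(dav = 10, iav = 5, sv = 3, adj = 8)  # = 61 / 26, about 2.35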

We modified our automated LCM code to calculate an additional score according to the formula above. And we started collecting and scoring texts.

Are the measures correlated?

Although folk concreteness and LCM are intended to measure similar constructs, the correlation between them might be relatively weak. For that reason, we wanted loads of data to estimate the correlation.

To get lots and lots of naturally occurring texts, we scraped Amazon product reviews for Daniel Kahneman’s book (n = 2,000), potato chips (n = 427), instant ramen noodles (n = 951), sriracha sauce (n = 668), a reading workbook for school children (n = 1,184), and a vibrator (n = 2,746). This resulted in a total of N = 7,966 texts, with a mean length of 35.92 words (SD = 78.13, median = 20).

Does folk concreteness correlate with LCM-SBP in these data? Even the most cursory inspection of Figure 4 suggests the answer is no (or more precisely, r = .0055).

Figure 4. Folk concreteness and LCM-SBP scores in Amazon review data

A whole lot of nothing.

However, LCM-PD scores fared better. As can be seen in Figure 5, the revised scores did correlate with folk concreteness in the expected direction, though relatively weakly, r = -.1055.

Figure 5. Folk concreteness and LCM-PD scores in Amazon review data

A small correlation in the expected direction.

One could reasonably object that the content of these texts conflicts with the stated purpose of LCM – that is, to code concreteness in communication with people and about people. Many, if not most, of those product reviews are likely to be quite impersonal. Fortunately, there is no shortage on the internet of reviews involving other people. We scraped two kinds of Yelp reviews that are presumably relatively personal and social: reviews for therapists (N = 1,213) and for strip clubs (N = 2,692). We selected these review topics because we expected one to prompt people to describe deeply psychologically important human connections and the other to prompt descriptions of therapists.

Figure 6. Folk concreteness and LCM-SBP and -PD scores in Yelp reviews for therapists

Figure 7. Folk concreteness and LCM-SBP and -PD scores in Yelp reviews for strip clubs

One can see in Figures 6 and 7 that the Yelp data tell the same story as the Amazon product data: LCM-SBP is nearly uncorrelated with folk concreteness, r = -.053 for therapists and r = -.020 for strip clubs, and there is a small correlation in the expected direction for LCM-PD and folk concreteness, r = -.299 for therapists and r = -.246 for strip clubs. It’s also easy to see from the scatterplots that the range of scores in these datasets is much more constrained than in the deception data and the Amazon review data. This is because these reviews were longer (for therapists, M = 148.30 words, SD = 127.02; for strip clubs, M = 148.00 words, SD = 140.65). When there is more substance, there is less variance.

Consistently, it looks like LCM-PD performs better – that is, more in line with what you’d expect – than LCM-SBP. Perhaps adding nouns to the LCM formula simply adds noise to the measurement. But all these analyses so far have just explored the correlations between the measures. We haven’t yet checked again to see how well the measures distinguish between abstract and concrete texts...

Philosophy vs. concrete

“You can use a piece of PVC pipe and double-sided tape to make your holes part of your mold or you can drill out the holes later." - a tutorial on making a concrete countertop

“His philosophy is chiefly vitiated, to my mind, by this fallacy, and by the uncritical assumption that a metrical coordinate system can be set up independently of any axioms as to space-measurement." - Bertrand Russell, Foundations of Geometry

We figured that philosophical texts should be less concrete than texts that are literally about concrete. This seemed like a pretty safe bet. We collected five philosophical texts and six texts about concrete** (see the table below) and scored them with folk concreteness, LCM-SBP, and LCM-PD.

Short name Description
Concrete (Wikipedia) Wikipedia article on concrete
Types of concrete Wikipedia article on types of concrete
Concrete patio Tutorial on concrete patio
Concrete flowers Tutorial on concrete flowers
Concrete countertop Tutorial on concrete countertops
Home repair Idiot’s Guide to Home Repair**
Analysis of Mind James Mill’s Analysis of Mind
Foundations of Geometry Bertrand Russell’s Foundations of Geometry
External World Bertrand Russell’s Our Knowledge of the External World
Scientific Discovery Karl Popper’s Logic of Scientific Discovery
Problems of Philosophy Bertrand Russell’s The Problems of Philosophy

Obviously, this is a fairly small sample of texts, but for folk concreteness and LCM-PD the pattern is strikingly in line with expectations (see the table below for results). Folk concreteness reliably distinguished between the philosophical texts (M = 2.18, SD = .02) and texts about concrete (M = 2.65, SD = .06), t (5.88) = 17.32, p << .001, d = 10.59, 95% CI [5.30, 15.87]. Although LCM-SBP scores were higher for the philosophical texts (M = 3.84, SD = .14) than for the concrete texts (M = 3.77, SD = .08), this difference was much smaller than that for folk concreteness, t (6.35) = 1.00, p = .35, d = 0.60, 95% CI [-0.80, 2.00]***. However, LCM-PD did seem to effectively distinguish between philosophy (M = 2.61, SD = .05) and concrete (M = 2.35, SD = .13), t (6.74) = 4.44, p = .003, d = 2.71, 95% CI [0.82, 4.61]. Oddly, although the pattern of LCM-PD scores closely mirrored the folk concreteness scores, LCM-PD scored the two Wikipedia articles on concrete as quite abstract. This might be because those articles are full of fairly complex, technical descriptions of the properties of concrete, which LCM may treat as abstract.

Text Folk concreteness LCM-SBP LCM-PD Length (words)
Concrete (Wikipedia) 2.61 3.76 2.50 7348
Types of concrete 2.59 3.79 2.52 3464
Concrete countertop 2.62 3.77 2.28 5795
Concrete flowers 2.68 3.62 2.26 958
Concrete patio 2.62 3.84 2.31 1315
Home repair 2.76 3.84 2.21 89539
Analysis of Mind 2.21 3.75 2.52 89961
Foundations of Geometry 2.20 3.85 2.65 79461
External World 2.18 3.77 2.62 71974
Scientific Discovery 2.16 4.08 2.64 206502
Problems of Philosophy 2.18 3.76 2.59 43879

The top six rows are texts about concrete. The bottom five rows are philosophical texts.
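If you want to rerun the group comparisons reported above on the scores in this table, a minimal sketch using base R’s Welch t-test is below; the data frame and column names (`scores`, `folk`, `lcm_sbp`, `lcm_pd`, `type`) are our illustrative assumptions, not our OSF code:

# Minimal sketch: Welch t-tests comparing philosophical texts with texts about
# concrete. Assumes a data frame `scores` with one row per text and hypothetical
# columns `folk`, `lcm_sbp`, `lcm_pd`, and `type` ("philosophy" or "concrete").
t.test(folk ~ type, data = scores)      # Welch's correction is the default
t.test(lcm_sbp ~ type, data = scores)
t.test(lcm_pd ~ type, data = scores)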

Conclusions

What does all this mean? Folk concreteness consistently behaved as expected. This raises our confidence in its ability to measure linguistic concreteness (and also gives us more confidence in the null results of the deception study that sent us down this rabbit hole). As for LCM – it certainly doesn’t bode well that LCM-SBP correlates near zero with folk concreteness across several datasets, and that it cannot effectively distinguish between philosophical texts and texts about concrete. We haven’t found any data in which LCM-SBP performs the way it is expected to. Given that the revised scale, LCM-PD, seemed to work somewhat better, and that the only difference between the SBP and PD measures is the inclusion of nouns, it could be that nouns severely dilute LCM scores with error.

Folk concreteness and LCM-PD performed similarly and are correlated with each other, but only weakly. Given this, and the fact that the two measures were created in very different ways, we suspect they aren’t measuring exactly the same thing. We don’t have the data here to draw strong conclusions about the differences between the measures. However, it seems clear that the folk measure, perhaps predictably given that it is derived from lay judgments, is much more in line with intuitive expectations. One might argue that the LCM measure taps into a different aspect of concreteness. However, neither the data nor the stated theoretical background of LCM gives us an idea of what that aspect would be. We have several ideas (ease of reading, the frequency of occurrence of the words in the dictionary, etc.). However, this blog post is already twice as long as the ideal blog post (1,600 words; Lee, 2014), and our bosses are beginning to suspect that all this sciency-looking typing is in fact nothing of the sort. So we will sadly have to leave you here, at the bottom of a rabbit hole, wondering what will happen next...

Contributions

This project grew out of work related to Sofia’s doctoral thesis. The group collectively conceptualized and planned the project. RabbitSnore wrote the initial draft of the post, wrote most of the code, and performed most of the statistical analyses. Each member of the team helped interpret the results and revise the post. Everyone contributed to the mayhem.
If you would like to get involved in the Puddle-Ducks’ concrete mayhem (especially if you have been down this concreteness-measure rabbit hole yourself and can throw us a rope to get out), please let us know.

Figure 8. An artist’s representation of the blog authors.

Puddle-Ducks (2019). There might be problems with the automated scoring of linguistic concreteness. Rabbit Tracks. Retrieved from https://www.rabbitsnore.com/2019/02/there-might-be-problems-with-automated.html

Open data and code

The data presented here, the original texts we analyzed, and the code used to perform the scoring can be found here: https://osf.io/nf54b/

Notes

*Automatic coding of adjectives as abstract is arguably also problematic. Many adjectives are not abstract (e.g., red, small), and scoring them according to the LCM-SBP formula might make a text score as more abstract when, in fact, the expression is concrete. Excluding adjectives from the formula did not, however, improve the scoring.

**The Idiot’s Guide to Home Repair isn’t exclusively about concrete, but it is about similarly physical and immediate objects and tasks.

***The effect size for LCM-SBP’s ability to discriminate between the philosophical texts and the concrete texts is in the expected direction, and by the standards of psychology, it is considerable (d = 0.60). However, it pales in comparison to the other effect sizes (d = 10.59 for folk concreteness and d = 2.71 for LCM-PD). This makes sense if nouns are adding noise to LCM-SBP, noise which is effectively removed in LCM-PD. The noise would attenuate LCM-SBP’s ability to distinguish between abstract and concrete texts. Considering the extent of the attenuation, it looks like nouns are adding an enormous amount of noise indeed.

References

Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904-911.

Carnaghi, A., Maass, A., Gresta, S., Bianchi, M., Cadinu, M., & Arcuri, L. (2008). Nomina sunt omina: On the inductive potential of nouns and adjectives in person perception. Journal of Personality and Social Psychology, 94, 839-859.

Coenen, L. H., Hedebouw, L., & Semin, G. R. (2006). The linguistic category model (LCM) manual. Unpublished manuscript. Amsterdam, NL: Free University Amsterdam.

Lee, K. (2014). The ideal length of everything online, backed by research. https://blog.bufferapp.com/the-ideal-length-of-everything-online-according-to-science

Seih, Y.-T., Beier, S., & Pennebaker, J. W. (2017). Development and examination of the linguistic category model in a computerized text analysis method. Journal of Language and Social Psychology, 36(3), 343-355.

Semin, G. R., & Fiedler, K. (1991). The linguistic category model, its bases, applications and range. European Review of Social Psychology, 2(1), 1-30.

Semin, G. R., & Fiedler, K. (1988). The cognitive functions of linguistic categories in describing persons: Social cognition and language. Journal of Personality and Social Psychology, 54, 558-568.
