A Bias of Averaging

Παν μέτρον άριστον [Everything in moderation] ~ Kleoboulos of Lindos, attributed, 6th century BCE

The ability to average over noisy data is essential for effective cognition and decision-making. Students introduced to the Gaussian error distribution are spoiled because this distribution is not only normal but also beautiful. The fact that there are different measures of ‘central tendency’ has not hit home yet, because with Gauss, they are all the same: the average (the arithmetic mean), the mode (the peak), and the median (the 50th percentile). When skew is introduced, the three part ways. For a negatively skewed distribution (with the thin tail on the left) and with numbers rising from the left, the mode is greater than the median, which is greater than the average. 
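The ordering is easy to check numerically. Here is a minimal sketch in Python, using a made-up set of scores with a thin tail on the left (the values are an assumption chosen for illustration, not data from any study):

```python
from statistics import mean, median, mode

# Hypothetical scores with a thin tail on the left (negative skew).
scores = [2, 5, 6, 7, 7, 8, 8, 8, 9]

m_mode = mode(scores)      # the peak: the most frequent value
m_median = median(scores)  # the 50th percentile
m_mean = mean(scores)      # the arithmetic mean

print(m_mode, m_median, round(m_mean, 2))  # 8 7 6.67
```

With these numbers, the mode exceeds the median, which exceeds the mean, as described for a negatively skewed distribution.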

When researchers present participants with a series of numbers and ask them to estimate the average, they do well to explain the three types of central tendency and make clear which of them they seek. Often researchers seem to assume that asking for an ‘average’ will be understood as ‘arithmetic mean,’ and when average estimates depart from true averages, the researchers conclude that something interesting is going on.

If average estimates always hit the true averages well, there would not be much of a psychology (Peterson & Beach, 1967). Discrepancies raise questions about what people actually do to solve the task and how to model that. Parducci (1965) presented a simple and elegant account of averaging. According to his range-frequency theory (RFT), estimates of averages arise from a compromise between a range principle and a rank principle. The range principle takes the halfway point between the smallest and the largest observed value, and the rank principle takes the median. If the two differ, split the difference. RFT has good success predicting human performance in averaging tasks in a wide variety of contexts (Wedell & Parducci, 2000).

From time to time, researchers try to reinvent RFT or improve on it – with limited success. In an earlier essay (Krueger, 2018), I described the efforts of a Harvard team to introduce a new concept of category expansion, only to find that RFT describes the data well without requiring a newfangled psychological process, let alone ‘bias.’

Now, researchers at Yale and Cornell tell us about a binary bias, a purported averaging heuristic that yields systematic error (Fisher & Keil, 2018; Fisher et al., 2018). The psychological sin du jour is dichotomization. Averaging is hard, and respondents are thought to divide the range of observed values into a left half and a right half (recall the range principle), and to then estimate the number of observations in each half and subtract one count from the other to arrive at an imbalance score. This sounds very much like RFT because it picks up on the range principle (by using the half-range as the criterion of dichotomization) and the rank principle (by using variations in distributional skew). Indeed, the critical dependent measure, the imbalance score, predicts estimates of the average over the entire range. Surprisingly, though, the computational model for the binary bias is mute on how the imbalance score translates into an estimated average; it only predicts that the two are correlated over pairs of distributions.

A tale of 2 menus

Source: J. Krueger

To test the binary bias hypothesis, the authors construct pairs of distributions where the two means are the same but the skew is different. Now, skew affects both the imbalance score and the median, thereby confounding the two. Consider the example of the two menus (shown in the first inserted figure). There are 10 items on each menu. Prices range from $12 to $20 in menu 1, and from $10 to $17 in menu 2. Thus, the midrange is $16 in menu 1 and $13.50 in menu 2. In menu 1, 7 items are cheaper than the midrange price, and 3 are more expensive. This yields an imbalance score of 4 (7 – 3). In menu 2, 2 items are cheaper than the midrange and 8 are more expensive. This yields an imbalance score of -6 (2 – 8). The prediction is that respondents will estimate a lower average price for menu 1 than for menu 2, and indeed they do. Et voilà, bias yields error.
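The imbalance score can be sketched in a few lines of Python. The price lists below are hypothetical: the menus appear only in the figure, so these values were chosen merely to match the ranges and the scores of 4 and -6 given in the text, and to have equal means. The function name and the sign convention (items below the midrange minus items above it, as in 7 – 3 for menu 1) are likewise assumptions made for illustration.

```python
def imbalance_score(prices):
    """Count of items below the midrange minus count of items above it."""
    midrange = (min(prices) + max(prices)) / 2
    cheaper = sum(p < midrange for p in prices)
    dearer = sum(p > midrange for p in prices)
    return cheaper - dearer

# Hypothetical price lists (not the actual menus from the figure).
menu1 = [12, 13, 13, 14, 14, 14, 15, 18, 19, 20]  # range 12-20, midrange 16
menu2 = [10, 12, 15, 16, 16, 16, 16, 17, 17, 17]  # range 10-17, midrange 13.5

score1 = imbalance_score(menu1)
score2 = imbalance_score(menu2)
mean1 = sum(menu1) / len(menu1)
mean2 = sum(menu2) / len(menu2)

print(score1, score2)  # 4 -6
print(mean1, mean2)    # 15.2 15.2
```

The two lists have the same mean, yet their imbalance scores point in opposite directions, which is the structure the binary bias account relies on.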

Yet, the median shows the same inequality. The price distribution for menu 1 is positively skewed (with most dishes being cheap), whereas the distribution for menu 2 is negatively skewed (with most dishes being expensive). The median price in menu 1 is $14, and the median price in menu 2 is $16. This part of RFT is doing well. If, however, respondents were to give the median and the midrange price equal weight when estimating averages, the estimated average for menu 1 ($15) would be slightly higher than the estimated average for menu 2 ($14.75).
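The equal-weight compromise can be worked out from the medians and midranges given in the text. The following Python sketch is only an illustration of that arithmetic; the function name and the weight parameter `w` are assumptions, not part of Parducci's formal model.

```python
def rft_estimate(midrange, median, w=0.5):
    """Weighted compromise between the range principle (midrange)
    and the rank principle (median); w = 0.5 splits the difference."""
    return w * midrange + (1 - w) * median

# Midranges and medians as given in the text for the two menus.
estimate1 = rft_estimate(midrange=16.0, median=14.0)
estimate2 = rft_estimate(midrange=13.5, median=16.0)

print(estimate1, estimate2)  # 15.0 14.75
```

With equal weights, the compromise reverses the ordering implied by the medians alone, so the weight given to each principle matters for what RFT predicts here.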

The possibility that respondents simply take the median when estimating the average looms as a plausible psychological alternative. The authors repeatedly note the confound between binary bias and median-driven judgment, but do little to break it. The most direct test is found in study 7 of Fisher et al. (2018). Here, we find 3 types of distribution pairs. In all three pairs, the distribution with the positive skew has a slightly lower mean than the distribution with the negative skew. Since numbers represent value in this experiment, all respondents should choose from the latter distribution; yet most do not, which is consistent with the binary bias. The findings are virtually the same when the 5 bins are labeled from ‘very poor’ to ‘very good.’ Here, the half-range coincides with the neutral label. In the third condition, however, respondents find a univalent scale running from ‘fair’ (1) to ‘extremely good’ (5). About half of these respondents still prefer the distribution with the lower mean but positive skew. The authors conclude that if skew were the source of error, introducing labels should not matter.

This is an astonishing claim and an almost bizarre attempt to separate competing hypotheses. The introduction of labels from ‘fair’ (1) to ‘extremely good’ (5) generates new competition for both the binary bias and the skew account. In this condition, the semantically suggested category boundary has moved from 3 down to 1.5. There is now a strong demand to cluster all the ratings containing the word ‘good.’ And as it turns out, the distribution with the lower average has fewer ‘fair’ items than the distribution with the higher average. This test is not strong because it settles for the idea that any significant effect refutes the hypothesis predicting no effect (Krueger & Heck, 2017). By deploying a strong, demand-suggesting manipulation, the deck is stacked. With significance in hand, it is easily overlooked that even under these urgent circumstances, most responses were similar to rather than different from the responses in the other two conditions.

Let there be steak!

Source: J. Krueger

Though this test may not be terrible, it must be considered weak when it is asked to do all the work. It is not hard to find another, complementary way to pit the binary bias hypothesis against the skew hypothesis. Let us return to the menu paradigm and add an expensive item (ribeye steak for $30) to each list. The second figure shows that the averages have gone up, and that the second list retains a higher median price. Critically, the imbalance scores are now the same for both menus (in each, the 10 original items fall below the new midrange and only the steak lies above it), so the binary bias predicts no difference. RFT, by using both half-range and rank information, predicts a tiny difference.
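The steak test can be sketched in Python as follows. The price lists are hypothetical stand-ins for the menus in the figures, chosen only to match the ranges and medians given in the text; the function names and the equal weighting of midrange and median are assumptions made for illustration.

```python
from statistics import median

def midrange(prices):
    return (min(prices) + max(prices)) / 2

def imbalance_score(prices):
    """Count of items below the midrange minus count of items above it."""
    mid = midrange(prices)
    return sum(p < mid for p in prices) - sum(p > mid for p in prices)

def rft_estimate(prices, w=0.5):
    """Equal-weight compromise between midrange and median (assumed w)."""
    return w * midrange(prices) + (1 - w) * median(prices)

# Hypothetical price lists (not the actual menus from the figures).
menu1 = [12, 13, 13, 14, 14, 14, 15, 18, 19, 20]
menu2 = [10, 12, 15, 16, 16, 16, 16, 17, 17, 17]
STEAK = 30  # the added ribeye

results = {}
for name, menu in (("menu 1", menu1), ("menu 2", menu2)):
    with_steak = menu + [STEAK]
    results[name] = (
        imbalance_score(with_steak),
        median(with_steak),
        rft_estimate(with_steak),
    )
    print(name, results[name])
# menu 1 (9, 14, 17.5)
# menu 2 (9, 16, 18.0)
```

Under these assumptions, the imbalance scores are identical, so the binary bias account predicts no difference between the menus, while the medians still differ by $2 and the equal-weight RFT estimates differ by 50 cents.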

With a bit more research, then, we might learn whether we need the novel concept of binary bias. Part of the – misleading – appeal of this research as it stands is that it uses the noncontroversial observation that people spontaneously categorize continuous stimuli (Krueger & Clement, 1994; Tajfel, 1969) to claim that this tendency compromises cross-categorical cognition such as grand averaging.

Fisher, M., & Keil, F. C. (2018). The binary bias: A systematic distortion in the integration of information. Psychological Science. DOI:10.1177/09567718792256

Fisher, M., Newman, G. E., & Dhar, R. (2018). Seeing stars: How the binary bias distorts the interpretation of customer ratings. Journal of Consumer Research. DOI: 10.1093/jcr/ucy017

Krueger, J. I. (2018, Jul 16). Social problems and human cognition. Psychology Today Online. https://www.psychologytoday.com/us/blog/one-among-many/201807/social-pro…

Krueger, J., & Clement, R. W. (1994). Memory-based judgments about multiple categories: A revision and extension of Tajfel’s accentuation theory. Journal of Personality and Social Psychology, 67, 35-47.

Krueger, J. I., & Heck, P. R. (2017). The heuristic value of p in inductive statistical inference. Frontiers in Psychology: Educational Psychology [Research Topic: Epistemological and ethical aspects of research in the social sciences]. https://doi.org/10.3389/fpsyg.2017.00908

Parducci, A. (1965). Category judgment: A range-frequency model. Psychological Review, 72, 407-418.

Peterson, C. R., & Beach, L. R. (1967). Man as an intuitive statistician. Psychological Bulletin, 68, 29–46.

Tajfel, H. (1969). Cognitive aspects of prejudice. Journal of Social Issues, 25, 79–97.

Wedell, D. H., & Parducci, A. (2000). Social Comparison. In J. Suls & L. Wheeler (Eds.), Handbook of social comparison: Theory and research (pp. 223-252). New York: Plenum/Kluwer.


