Why Quantitative Variables Should Not Be Recoded as Categorical
1. Introduction
Imagine a political scientist wants to estimate the effect of income, measured as continuous yearly revenue, on partisanship. Before performing data analyses, she decides to split income into three levels: low, medium, and high. Similarly, suppose a physician wants to examine the effect of age on the likelihood of developing coronary heart disease. Before running the model, she recodes age into four groups. In this article, we address some of the adverse consequences of dichotomizing quantitative variables. Technically, categorization always implies a loss of information, and it usually leads to misleading results [1] [2] [3] [4]. To make our case, we reproduce data from [5] and [6]. In addition, we employ basic simulation to show how dichotomization generates inefficiency and bias. To increase transparency [7] [8] [9], we report all computational scripts used to generate the statistical analyses.
Our target audience is graduate students in the early stages of training and scholars with a minimal mathematical background. For this reason, we minimized algebraic applications to facilitate the understanding of the original content. In particular, the paper fills a gap in the political methodology literature. We reviewed 24 articles on dichotomization published in 20 journals from 1983 to 2017, and none of them appeared in political science journals (see Appendix Table A1). Since the categorization of quantitative variables is a common practice not only in the Social Sciences but also in the Health Sciences [10] [11], we believe that considerable progress in our understanding of data analysis can occur if scholars follow the recommendations presented in this article.
The remainder of the paper is structured as follows. The next section reviews the literature on categorization. The second section replicates data from different studies to show how the transformation of quantitative variables into categories may lead to wrong conclusions. The third section uses basic simulation to highlight the shortcomings of dichotomization, focusing on both bias and efficiency. The final section concludes.
2. What Is the Problem?
Information loss, inefficiency, and bias: concisely, these are the main problems generated by the categorization of quantitative variables [12]. Despite the widespread use of the practice, the scholarly literature has accumulated systematic evidence on why scholars should avoid dichotomization. Discretization reduces measurement accuracy, underestimates the magnitude of the coefficients of bivariate relationships, and lowers statistical power [2] [13]. Also, the artificial transformation of quantitative measures into groups may lead to biased coefficients and unreliable standard errors in multivariate models [13] [14].
Methodological pleas against dichotomization are not new. For example, [15] showed that dichotomizing one of the variables at its mean reduces the population correlation coefficient by 20% on average. [16] estimated the effects of dichotomization in the context of analysis of variance (ANOVA). Similarly, [1] argues that dichotomization leads to a loss of one-fifth to two-thirds of the variance that may be accounted for in the original variables. [17] showed that the transformation of quantitative measures into categories underestimates both effect sizes and statistical power. Table 1 summarizes scholarly work against dichotomization.
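The 20% figure reported by [15] is easy to verify numerically. The sketch below is our own illustration (not code from the replication materials): it draws a large bivariate normal sample, splits X at its mean, and compares the two correlations. For a normal variable split at its mean, the theoretical attenuation factor is sqrt(2/π) ≈ 0.798, i.e., roughly a 20% reduction.

```python
import numpy as np

rng = np.random.default_rng(42)
n, rho = 1_000_000, 0.5

# Draw a large bivariate normal sample with correlation rho.
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Dichotomize X at its mean.
x_dich = (x > x.mean()).astype(float)

r_full = np.corrcoef(x, y)[0, 1]
r_dich = np.corrcoef(x_dich, y)[0, 1]

# The ratio of the two correlations should be close to the theoretical
# attenuation factor sqrt(2/pi) for a mean split of a normal variable.
print(round(r_dich / r_full, 3))
print(round(np.sqrt(2 / np.pi), 3))  # 0.798
```

Any correlation value and any sample size large enough to tame sampling noise would show the same attenuation, since the factor does not depend on rho.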
Table 1. Literature against dichotomization
Note: We reviewed 24 papers published in 20 journals from 1983 to 2017.
Another criticism of dichotomization comes from the measurement literature [1] [5]1. According to [1], “dichotomizing adds errors of discreteness. That is, the amount of unmeasured true scores variance for the cases at each of the points of the dichotomy is necessarily greater than it would be for cases at each of the multiple points in the original scale” (p. 249). Similarly, [5] argue that the categorization of quantitative variables into groups is equivalent to adding measurement error to the variable. Therefore, dichotomization increases the difference between true scores and measured values, which is likely to produce unreliable estimates. Figure 1 shows the relationship between dichotomization and measurement error2.
Note: image from [21]. Figure 1 exemplifies a typical problem with dichotomization. A horizontal line depicts variable X, which has a sufficient number of cases; the closer the cases are to one another, the more similar they are. The letters A, B, C, and D (shown inside a triangle) represent four different cases. Case A is distant from case B, just as C is from D. Cases B and C are nearer to each other, meaning they are more similar (a). If some arbitrary cut point between B and C is chosen (b) to transform the continuous variable X into a dichotomized one (c), the similar cases B and C will end up in two separate groups, while more dissimilar pairs will be in the same group.
Figure 1. Measurement of individual differences before and after dichotomization.
B and C have similar scores when X is measured continuously. However, dichotomization leads to an inefficient aggregation of A with B vis-à-vis C with D. Comparatively, the least harmful procedure is to split a normal variable at its mean, which reduces the variance of the original variable by 20% on average. However, it is doubtful that one will find perfect normal distributions in practice. Therefore, depending on the shape of the distribution, categorization will lead to even greater information loss [1] [19]. In short, the categorization of quantitative variables will always generate information loss, which in turn reduces the efficiency of estimates. In some cases, in addition to inefficiency, dichotomization can lead to biased estimates, as we will show in the next section.
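The intuition behind Figure 1 can be made concrete with a toy example. The numeric values for A, B, C, and D below are hypothetical choices of ours, not taken from the figure:

```python
# Four hypothetical cases on a continuous scale X, with a cut point between B and C.
x = {"A": 1.0, "B": 4.8, "C": 5.2, "D": 9.0}
cut = 5.0

# Dichotomization assigns each case to a group (0 = below cut, 1 = above cut).
groups = {case: int(value > cut) for case, value in x.items()}

# B and C differ by only 0.4 on the original scale, yet land in different
# groups; A and B differ by 3.8 yet share a group.
print(round(abs(x["B"] - x["C"]), 1))  # 0.4
print(groups)                          # {'A': 0, 'B': 0, 'C': 1, 'D': 1}
```

Once the cases are recoded, every within-group distance is treated as zero and every between-group distance as identical, which is exactly the error of discreteness described above.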
3. Replication
In this section, we replicate two secondary datasets to show some of the adverse consequences of dichotomizing quantitative variables. The first example comes from [5]. They created a hypothetical example to represent the relationship between the number of errors made in a cognitive laboratory (X1), the speed of response during the task (X2), and the score on a standardized ability test (Y). Figure 2 shows the Pearson correlation coefficients among those variables.
Source: authors using data from [5].
Figure 2. Correlation among X1, X2, and Y.
To explore the impact of categorization, [5] dichotomized both independent variables at their respective medians (13). Then, they estimated a 2 × 2 ANOVA, which revealed an effect of X1 and X2 on the mean of Y. According to [5], “the bivariate dichotomization of X1, and X2 has led to a situation in which the estimated effects of X1 and X2 on Y are biased” (p. 183). In a simple linear regression, the effect of X2 on Y vanishes after we control for X1. In short, these results indicate that categorization may lead to misleading results.
The second example comes from [6] . He simulated five different scatterplots that yield an identical fourfold table when X and Y are dichotomized at cut point 0, misleadingly suggesting no association between the variables. Figure 3 replicates data from [6] .
Dichotomization leads us to overlook the true nature of the relationship between X and Y. According to [6] , “simply dichotomizing continuous variables without previously referring to the original distributions by plotting them and checking consequences of dichotomization is a bad idea and should be discouraged” (p. 3). These two examples show how dichotomization can lead scholars to wrong inferences.
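The point made by [6] is easy to reproduce in spirit. The sketch below is our own construction, not the data behind Figure 3: it builds a strong but purely nonlinear relationship, and dichotomizing both variables yields a roughly balanced fourfold table that misleadingly suggests no association.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000

# A strong but nonlinear relationship: Y depends on X only through X**2.
x = rng.uniform(-1, 1, n)
y = x**2 + rng.normal(0, 0.05, n)

# Dichotomize X at 0 and Y at its median, as one would for a fourfold table.
x_d = x > 0
y_d = y > np.median(y)

# The four cells are nearly balanced, suggesting "no association"...
table = np.array([[np.sum(~x_d & ~y_d), np.sum(~x_d & y_d)],
                  [np.sum(x_d & ~y_d), np.sum(x_d & y_d)]])
print(table)

# ...even though Y is almost perfectly predictable from X (corr close to 1).
print(round(np.corrcoef(x**2, y)[0, 1], 2))
```

Plotting the raw scatter before recoding, as [6] recommends, immediately reveals the U-shaped relationship that the fourfold table hides.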
4. Simulation
To underscore our distrust of dichotomization, we employ basic simulation to show how the transformation of quantitative variables into categories produces inefficiency. First, we generate two normal variables (X and Y) correlated at 0.6 for a sample size of 300 cases. Then, we recode X at its mean (0) into two groups, below the average and above the average, to produce a dummy variable (0 or 1). Figure 4 shows the distribution of X and its dichotomization cutpoint at 0.
Figure 5 shows the correlation between X and Y, and between the categorized X and Y, for all cases (n = 300) and for a small sample of observations (n = 30).
Source: authors using data from [6].
Figure 3. Different relationships but the same fourfold table when X and Y are dichotomized at 0.
Source: authors.
Figure 4. X dichotomized at 0.
Source: authors.
Figure 5. Correlation between X and Y (n = 300) and (n = 30). (a) r = 0.600; (b) r = 0.475; (c) r = 0.465; (d) r = 0.357.
The true correlation coefficient is 0.600. By dichotomizing X at its mean, we observe a linear association of 0.475, which represents a 20.83% difference from the known parameter. For a small sample size (n = 30), the Pearson correlation using the original variables is 0.465, which is closer to the true parameter value than the estimate from the dichotomized model (0.357). In short, regardless of the sample size, dichotomization leads to information loss, which decreases the efficiency of estimates. Table 2 shows the estimates of two linear regression models.
Considering all cases (n = 300), the standard error of the dichotomized model is twice as large as that of the model using the original variables. For a bivariate linear regression, the coefficient of determination is the square of the Pearson correlation coefficient (0.6), which is 36%. In the dichotomized model, we observe an r2 close to 23%, which underestimates the goodness of fit of the model. For n = 30, the categorization of the independent variable leads to the incorrect retention of the null hypothesis at the 5% level (p-value = 0.052). Although our simulation deals with only two variables, the same reasoning applies to multiple linear regression, which is widely used in empirical research in both the Human and Natural Sciences [23].
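A minimal version of this simulation can be written as follows. This is our own sketch of the procedure described above (the seed and the OLS helper are our choices), not the replication script from the OSF repository:

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 300, 0.6

# Generate X and Y with a true correlation of 0.6, as in the text.
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
x_d = (x > x.mean()).astype(float)  # X dichotomized at its mean

def ols_slope_se(x, y):
    """Slope and its standard error for a bivariate OLS fit."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1], se

b_cont, se_cont = ols_slope_se(x, y)
b_dich, se_dich = ols_slope_se(x_d, y)

print(round(np.corrcoef(x, y)[0, 1], 2))    # close to the true 0.6
print(round(np.corrcoef(x_d, y)[0, 1], 2))  # attenuated, around 0.48 in expectation
print(round(se_dich / se_cont, 2))          # roughly 2, as reported in Table 2
```

The doubling of the standard error is not an artifact of the seed: the dummy has less variance than the original X, and the residual variance grows, so the slope is estimated with less precision for any draw.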
Now let’s consider a slightly more complicated case. We simulate the following model:
Y = β0 + β1X1 + β2X2 + ε (1)
where X1 follows a normal distribution (0, 1), X2 follows an exponential distribution (λ = 2), and ε has mean zero and standard deviation 1 for a population of 100 observations. Table 3 compares the results of a linear regression using the original variables to a model in which both independent variables are dichotomized at their means.
Table 2. How dichotomization leads to inefficiency.
Note: we estimated two linear regression models. The first was estimated with both variables at their original (continuous) level of measurement. The second used X dichotomized at its mean (0).
Source: authors.
Table 3. Linear regression (original vs. dichotomized variables).
Source: authors.
Figure 6. Residual diagnostics.
The dichotomized model displays a lower r2 and F statistic, suggesting poor goodness of fit. When the variables are used at their original level of measurement, the regression coefficients are unbiased estimates of the population parameters. However, when both variables are dichotomized at their means, X2 is no longer statistically significant, which would lead us to incorrectly retain the null hypothesis of no effect. In public policy, the conclusion might be to cut resources; in medical research, the inference would be that the treatment has no impact on health. Figure 6 depicts the residual diagnostics of the dichotomized model.
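To make the example concrete, here is a sketch of a simulation in the spirit of Equation (1). The intercept and slopes (all set to 1) and the seed are our own illustrative choices, since the text does not report the coefficients it used:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100

# Simulate Equation (1): Y = b0 + b1*X1 + b2*X2 + e, with illustrative
# coefficients b0 = b1 = b2 = 1 (the paper does not report its values).
x1 = rng.standard_normal(n)                # X1 ~ Normal(0, 1)
x2 = rng.exponential(scale=1 / 2, size=n)  # X2 ~ Exponential(lambda = 2)
y = 1 + x1 + x2 + rng.standard_normal(n)   # e ~ Normal(0, 1)

def fit(columns, y):
    """OLS fit: returns the t statistic of each coefficient and the r-squared."""
    X = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta / se, r2

# Model 1: original variables. Model 2: both predictors split at their means.
t_orig, r2_orig = fit([x1, x2], y)
t_dich, r2_dich = fit([(x1 > x1.mean()).astype(float),
                       (x2 > x2.mean()).astype(float)], y)

print(round(r2_orig, 2), round(r2_dich, 2))              # the dichotomized model fits worse
print(round(abs(t_orig[2]), 1), round(abs(t_dich[2]), 1))  # t statistics for X2
```

With a skewed predictor such as the exponential X2, a mean split is especially damaging, because most observations fall on one side of the cut and the dummy carries little of the original variation.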
5. Conclusions
Despite criticisms from the scholarly community, dichotomization is still a common practice in empirical research. Unfortunately, many researchers categorize quantitative variables before running data analyses. This is true from Biology to Psychology, from Medical research to Sociology. Before the development of statistical software and computers, categorization played an essential role in science by simplifying mathematical modeling. That is no longer the case. Since we have more appropriate tools to deal with reality, there is no reason to transform quantitative measures into categories. More than 30 years ago, [24] argued that “scientific questions are better decided by empirical evidence than by methodological default” (p. 833).
Categorization usually leads to misleading results. It can deceive us by increasing inefficiency and affecting the probability of type I and type II errors. Dichotomization also generates biased coefficients, since it can hide the correct functional form of the observed relationship. In some cases, when two or more independent variables are dichotomized, a truly null effect is likely to reach statistical significance. The artificial transformation of quantitative variables into groups reduces the power of statistical tests and increases errors of discreteness. What happens if both the independent and the dependent variables are categorized? Double dichotomization using the mean as the cutpoint is equivalent to losing almost half of the sample cases [1]. In short, dichotomization leads to a systematic loss of information, which has detrimental effects on the reliability of statistical estimates.
In sum, recoding quantitative variables as categorical is a poor methodological strategy, and scholars should stay away from it. Dichotomization undoubtedly simplifies data analysis, but the costs are too high to bear. Today, categorization is neither appropriate nor justifiable. Continuous variables are good as they are. Let’s be cool about it and leave quantitative variables alone.
Appendix
Source: authors (2018).
NOTES
*Authors are listed in alphabetical order. This work was partially supported by FACEPE, Capes, and CNPq. We thank the Berkeley Initiative for Transparency in the Social Sciences (BITSS) and the Project Teaching Integrity in Empirical Research (TIER) for financial support. We thank the Political Science Research Methods Group from the Federal University of Pernambuco for generous feedback. Also, we thank Justin Esarey and Umberto Mignozzetti for helpful comments. Replication materials are available at: https://osf.io/7tsgx/.
1In this paper we adopt the definition of measurement proposed by [22] : “measurement consists of rules for assigning symbols to objects so as to (1) represent quantities of attributes numerically (scaling) or (2) define whether the objects fall in the same or different categories with respect with a given attribute (classification)” (p. 1).
2Measurement error can be either random or systematic, and each type creates different problems. Measurement error can also plague the dependent variable, the independent variable, or both. In general, random error leads to inefficiency, while systematic error leads to biased estimates.