R-Factor Analysis of Data Based on Population Models Comprising R- and Q-Factors Leads to Biased Loading Estimates
1. Introduction
The factor model [1] [2] allows for the investigation of measurement models in psychology and several areas of the social sciences. Researchers can choose between several estimation methods for exploratory and confirmatory factor analysis [1] [3] [4] [5]. Although a very large number of studies are based on the factor model, real-world phenomena may not correspond exactly to this model. [6] emphasized that the factor model may not fit real population data perfectly. The possible difference between the factor model and real-world population data has been termed “model error” [6] [7]. Accordingly, when a factor analysis is performed on a real-world data sample, misfit of the factor model might be due to sampling error, to model error, or to both. The modeling of common and unique factors together with a large number of minor factors has successfully been used in order to generate more realistic data containing model error in simulation studies (e.g., [8]). However, other types of model error that are not based on a large number of minor factors may also be relevant for the fit of the factor model to real data. As other types of model error have not yet been investigated, their effect on the estimation of model parameters remains unknown. The research problem addressed in the present study is therefore the effect of another type of model error on the results of factor analysis.
The model error considered here is that the covariances between observed variables are affected by covariances between individuals. In psychology and social sciences, factor analysis is mainly performed in order to identify latent variables explaining the covariation between variables that are observed in samples of individuals. However, covariances of variables imply a pattern of covariances of individuals as shown in the following example (Table 1, Example 1). The perfect correlation of variables x1 and x2 may be caused by a common factor and the correlation of variables x3 and x4 may be caused by another common factor. As the scores of individuals i1 and i2 have a zero variance, the corresponding inter-correlations of individuals are zero. Only a perfect negative inter-correlation between individuals i3 and i4 occurs. Whereas the inter-correlations of variables can be explained by two uncorrelated factors, the corresponding inter-correlations of individuals cannot be explained by two uncorrelated factors. In Example 2, perfect negative correlations between individuals i1 and i2 and between i3 and i4 and moderate inter-correlations between the remaining individuals occur, which considerably modifies the inter-correlations of variables when compared to Example 1. The examples demonstrate the mutual inter-relation of inter-correlations of variables and inter-correlations of individuals. In order to elucidate this relationship, the effect of latent factors explaining the common variance of individuals on the common variance of variables is investigated in the present study.
Table 1. Mean-centered scores of four individuals (i1 - i4) on four observed variables (x1 - x4), inter-correlations of variables without inter-correlation between individuals and with superimposed inter-correlation between individuals.
Note. Standard deviations are given behind the slash.
Factor analysis of the covariances or correlations between variables that are observed across many individuals is often termed R-factor analysis whereas factor analysis of the covariances or correlations between individuals observed across many variables is termed Q-factor analysis [9] [10] . A data matrix of individuals for Q-factor analysis is obtained when the matrix of observed variables used for R-factor analysis is transposed. Note that the empirical data used for R- and Q-factor analysis may be the same, although the number of observed variables will typically be larger than the number of individuals in Q-factor analysis whereas the number of individuals will typically be larger than the number of observed variables in R-factor analysis. Moreover, there are other preferences for factor extraction and rotation in Q-factor analysis [11] [12] than in R-factor analysis. Nevertheless, there is consensus that Q-factor analysis may be useful for the investigation of subjective individual views [12] and Q-factor analysis is sometimes preferred over R-factor analysis in the context of questionnaire development (e.g., [13] ).
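The relation between the two perspectives can be illustrated with a minimal Python sketch. The toy sizes, the random data, and the variable names are assumptions made only for illustration; the sketch merely shows that the same scores yield a p × p correlation matrix of variables and, after transposition, an n × n correlation matrix of individuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 4                          # toy sizes, chosen only for illustration
scores = rng.normal(size=(n, p))     # rows = individuals, columns = variables

# R-perspective: correlations between variables (columns), a p x p matrix
r_corr = np.corrcoef(scores, rowvar=False)

# Q-perspective: transpose the data so that individuals become the columns,
# then correlate them, which gives an n x n matrix
q_corr = np.corrcoef(scores.T, rowvar=False)

print(r_corr.shape, q_corr.shape)
```

In applied Q-factor analysis the number of variables would typically exceed the number of individuals, which is not the case in this toy example.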
The similarities and differences of R- and Q-factor analysis have primarily been discussed from the perspective of factor analysis as a tool for data analysis [14] [15] . In consequence, the effects of the R- and Q-factor model as data generating population models on the results of R- or Q-factor analysis have rarely been compared. It is therefore widely unknown what happens when data that are based on a population model comprising R- as well as Q-factors are submitted to R-factor analysis. As models are never true [16] , it is not the fact that model error occurs that is important here, but the question of whether the loading estimates from R-factor analysis are substantially biased when a combined R- and Q-factor model holds. Therefore, and because most studies perform R-factor analysis, the focus of the present study is on the effect of a combined R- and Q-factor model as a population model on subsequent R-factor analysis. It is, however, acknowledged that a combined R- and Q-factor population model might also be a source of error for Q-factor analysis.
An example of R-factors in a context where Q-factors may also be relevant is the analysis of personality types in the context of personality traits [17] , although the robustness of the results has been challenged [18] . [18] also noted that only 42% of the sample was associated with the proposed personality types indicating that the types are probably of moderate relevance. Although [17] used cluster-methodology (Gaussian mixture models) for the identification of types, similarities of individuals have also been investigated by means of Q-factor analysis [9] . Thus, personality research shows that relevant similarities of variables as well as relevant similarities of individuals may co-occur. This does not imply that Q-factors yield a superior representation of personality variance or that they allow for improved predictions of outcomes like, for example, social adjustment or job achievement [19] . For the present study, it is only important to acknowledge that Q-factors may also be relevant for a complete description of the data. However, if we accept the idea that Q-factors may co-occur with R-factors, the consequences of a population model based on a combination of R- and Q-factors for the estimation of model parameters of R-factor analysis should be investigated. This has until now not been done as similarities of individuals have often been investigated by means of cluster analysis [17] [18] , latent class analysis [20] , or factor mixture models [21] . The achievements of these approaches for the analysis of typological variance are not questioned here. The focus of the present study is on the effect of population Q-factors co-occurring with population R-factors on the loading estimates of R-factor analysis which does not take into account the Q-factors.
After some definitions, the effects of population models based on R- and Q-factors on the covariance and correlation of observed variables and the resulting effects on the estimation of R-factor loadings are described for the population. Then, a simulation study is performed in order to give an account of the effect of population models comprising R- and Q-factors on loading estimates of R-factor analysis. Finally, a method indicating whether a data set contains a relevant amount of Q-factor variance is proposed and demonstrated by means of simulated data sets.
2. Definitions
2.1. Separate R- and Q-Factor Models
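Let $\mathbf{X}_R$ be a p × n matrix of p variables observed for n individuals [2]. The R-factor model can then be written as
$$\mathbf{X}_R = \boldsymbol{\Lambda}_R \mathbf{f}_R + \boldsymbol{\Psi}_R \mathbf{e}_R, \qquad (1)$$
where $\mathbf{f}_R$ is a qR × n matrix of normally distributed common R-factor scores, $\boldsymbol{\Lambda}_R$ is a p × qR matrix of common R-factor loadings, $\mathbf{e}_R$ is a p × n matrix of normally distributed linearly independent unique R-factor scores, and $\boldsymbol{\Psi}_R$ is a p × p diagonal positive definite matrix of unique R-factor loadings. It is furthermore assumed that $E(\mathbf{f}_R) = \mathbf{0}$, $E(\mathbf{e}_R) = \mathbf{0}$, $E(n^{-1}\mathbf{f}_R\mathbf{f}_R') = \mathbf{I}$, $E(n^{-1}\mathbf{e}_R\mathbf{e}_R') = \mathbf{I}$, and $E(n^{-1}\mathbf{f}_R\mathbf{e}_R') = \mathbf{0}$, so that
$$\boldsymbol{\Sigma}_R = E(n^{-1}\mathbf{X}_R\mathbf{X}_R') = \boldsymbol{\Lambda}_R\boldsymbol{\Lambda}_R' + \boldsymbol{\Psi}_R^2. \qquad (2)$$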
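The covariance structure implied by the R-factor model can be checked numerically. The following Python sketch rests on assumptions that are not taken from the text: a simple-structure loading matrix with an illustrative salient loading of 0.70, unit-variance variables, and a very large n so that sampling error is small. It compares the sample covariance of generated data with the sum of the common part (loadings times transposed loadings) and the squared unique loadings.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q_r, n = 6, 2, 200_000            # illustrative sizes (assumptions)
lam = 0.7                            # illustrative salient loading (assumption)

# p x q_r loading matrix with simple structure: three salient loadings per factor
loadings = np.zeros((p, q_r))
loadings[:3, 0] = lam
loadings[3:, 1] = lam

# diagonal unique loadings chosen so that every variable has unit variance
unique = np.diag(np.sqrt(1.0 - np.sum(loadings**2, axis=1)))

f = rng.normal(size=(q_r, n))        # common factor scores
e = rng.normal(size=(p, n))          # unique factor scores
x = loadings @ f + unique @ e        # p x n data matrix

sample_cov = x @ x.T / n
implied_cov = loadings @ loadings.T + unique @ unique
max_abs_diff = np.max(np.abs(sample_cov - implied_cov))
print(max_abs_diff)
```

With this many cases the largest absolute deviation should be close to zero; it shrinks further as n grows.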
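Let $\mathbf{X}_Q$ be an n × p matrix of n individuals for which p variables were observed. The Q-factor model can then be written as
$$\mathbf{X}_Q = \boldsymbol{\Lambda}_Q \mathbf{f}_Q + \boldsymbol{\Psi}_Q \mathbf{e}_Q, \qquad (3)$$
where $\mathbf{f}_Q$ is a qQ × p matrix of normally distributed common Q-factor scores, $\boldsymbol{\Lambda}_Q$ is an n × qQ matrix of common Q-factor loadings, $\mathbf{e}_Q$ is an n × p matrix of normally distributed linearly independent unique Q-factor scores, and $\boldsymbol{\Psi}_Q$ is an n × n diagonal positive definite matrix of unique Q-factor loadings. It is furthermore assumed that $E(\mathbf{f}_Q) = \mathbf{0}$, $E(\mathbf{e}_Q) = \mathbf{0}$, $E(p^{-1}\mathbf{f}_Q\mathbf{f}_Q') = \mathbf{I}$, $E(p^{-1}\mathbf{e}_Q\mathbf{e}_Q') = \mathbf{I}$, and $E(p^{-1}\mathbf{f}_Q\mathbf{e}_Q') = \mathbf{0}$, so that
$$\boldsymbol{\Sigma}_Q = E(p^{-1}\mathbf{X}_Q\mathbf{X}_Q') = \boldsymbol{\Lambda}_Q\boldsymbol{\Lambda}_Q' + \boldsymbol{\Psi}_Q^2. \qquad (4)$$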
It is assumed that the observed variables
and
are statistically independent with
(5)
for
and with
,
,
, and
.
2.2. A Combined Model of R- and Q-Factors
The data in the following section are assumed to be analyzed from the perspective of R-factor analysis whereas the observed variables $\mathbf{X}$ are based on an aggregation of variables resulting from R- and Q-factors. This can be written as
$$\mathbf{X} = \mathbf{X}_R + \mathbf{X}_Q'\mathbf{C}, \qquad (6)$$
where $\mathbf{X}_R$ represents the part of the observed variables based on R-factors and $\mathbf{X}_Q'$ is the transposed matrix of observed individuals based on Q-factors. Although adding $\mathbf{X}_R$ and $\mathbf{X}_Q'$ is only possible for n = p, it should be noted that, in the combined model of R- and Q-factors, only $\mathbf{X}$ is observed whereas $\mathbf{X}_R$ and $\mathbf{X}_Q$ are parts of the assumed population model. Therefore, not all R- and Q-factors need to be well represented by the observed variables in $\mathbf{X}$ when R-factor analysis is performed. For a complete description of the population model n = p is nevertheless assumed in the following. Moreover, as the row means of $\mathbf{X}_Q'$ are not necessarily zero, there is the symmetric and idempotent centering matrix $\mathbf{C} = \mathbf{I} - n^{-1}\mathbf{1}\mathbf{1}'$, based on the n × n identity matrix $\mathbf{I}$ and the n × 1 column unit-vector $\mathbf{1}$, for row mean centering of $\mathbf{X}_Q'$ on the right side of Equation (6). It has been noted by [15] and others that mean centering of $\mathbf{X}_Q'$ implies that the variance that would be based on a single common factor (qQ = 1) in $\mathbf{X}_Q$ would be eliminated in R-factor analysis of $\mathbf{X}$. Therefore, only the condition qQ > 1 is considered here. It follows from Equation (6) that the covariances of $\mathbf{X}$
are
(7)
with
and
. The element in the first row and first column of
is computed as
. As
and
are mean-centered, symmetrically distributed and as
all elements in the brackets are from the normal product distribution [22] [23] , which is symmetric so that
. This holds for all elements of
so that
. Therefore, Equation (7) can be written as
(8)
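The properties of the centering matrix used in this section can be verified with a short Python sketch. The sketch assumes the standard form of the centering matrix (identity matrix minus the outer product of the unit vector divided by n); the toy sizes and the equal Q-factor loading of 0.90 in the last step are illustrative assumptions. It checks symmetry and idempotency, shows that right-multiplication removes row means, and shows that a single common Q-factor with equal loadings for all individuals is eliminated by row mean centering.

```python
import numpy as np

n = 5
ones = np.ones((n, 1))
centering = np.eye(n) - ones @ ones.T / n     # C = I - (1/n) 1 1'

is_symmetric = np.allclose(centering, centering.T)
is_idempotent = np.allclose(centering @ centering, centering)

rng = np.random.default_rng(2)
a = rng.normal(loc=3.0, size=(3, n))          # three rows with non-zero means
row_means_after = (a @ centering).mean(axis=1) # right-multiplication centers rows

# One common Q-factor with equal loadings (assumed 0.90) for all n individuals:
# the transposed common part is constant within each row and is wiped out by C.
q_scores = rng.normal(size=(1, 3))            # one Q-factor score per variable
q_loadings = np.full((n, 1), 0.90)            # n x 1 loading matrix
single_factor_part = (q_loadings @ q_scores).T  # 3 x n, rows are constant
residual = single_factor_part @ centering

print(is_symmetric, is_idempotent, np.abs(residual).max())
```

With unequal loadings a single Q-factor would not vanish completely, so the last step illustrates only the equal-loading case.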
2.3. Bias of Estimated R-Factor Loadings
For
being mean centered (
) Equation (8) implies that the variance of the elements in
is also affected by Q-factors. For
the numerator of the variance of the elements in
is
(9)
where SSQ denotes the sum of squares. It follows from
,
,
, and
that the eigen-decomposition of
, where
is a n × n matrix containing the eigenvalues in the main diagonal in descending order with
. According to [24] (p. 248) the trace of the power of a positive semidefinite square matrix is equal to the trace of the power of the eigenvalues of the matrix so that
. (10)
When all unique Q-factors and all common Q-factors account for the same amount of variance of each observed variable (
), the right-hand side of Equation (10) can be written as
. (11)
It follows from
that
introduces variability into the elements of
. For
the numerator of the variance of
is
(12)
It follows from
that the eigen-decomposition of
, where
is a n × n diagonal matrix with qQ non-zero eigenvalues in decreasing order and
. The numerator of the variance of the elements of
is
(13)
which implies that the variance of the elements in
is greater than zero. When all unique Q-factors and all common Q-factors account for the same amount of variance of each observed variable (
), the right-hand side of Equation (13) can be written as
(14)
It follows from Equations (14) and (10) and for n > qQ that
, i.e., that common Q-factors introduce n/qQ times more variability into the elements of
than unique Q-factors. More generally, Equations (10) and (11) imply that some variability in the elements of
is introduced by the common and unique Q-factors. To sum up, Q-factors tend to enhance the variance of the covariances of observed variables (Equations (10) and (11)). However, the abovementioned analyses do not inform on the size of the respective effects and which amount of Q-factor variance might substantially distort an R-factor solution.
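The step from sums of squares to eigenvalues that is used in this section can be illustrated numerically. The following Python sketch uses an arbitrary positive semidefinite matrix (an assumption for illustration, not one of the matrices of the model) and checks that the sum of squares of all elements of a symmetric matrix equals the trace of its square, which in turn equals the sum of the squared eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
b = rng.normal(size=(6, 4))
a = b @ b.T                              # symmetric positive semidefinite, rank <= 4

eigenvalues = np.linalg.eigvalsh(a)      # eigenvalues of a symmetric matrix
trace_of_square = np.trace(a @ a)        # tr(A^2)
sum_sq_eigen = np.sum(eigenvalues**2)    # sum of squared eigenvalues
sum_sq_elements = np.sum(a**2)           # SSQ of all elements of A

print(trace_of_square, sum_sq_eigen, sum_sq_elements)
```

Because only a few eigenvalues are non-zero for a low-rank matrix, the same total variance concentrated in fewer eigenvalues yields a larger sum of squares, which is the intuition behind the larger contribution of common Q-factors.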
3. Simulation Study on the Effect of Q-Factors on R-Factor Loadings
3.1. Conditions and Specifications
A simulation was performed in order to give an account of the bias of R-factor loadings that is due to Q-factors when the data are based on R- as well as Q-factors. As the number of individuals or cases n is part of the Q-factor model, the finite population of the simulation study has to comprise a large number of samples of a given n. The first population was based on 2000 samples of n = 300 cases, the second population comprised 2000 samples of n = 600 cases, and the third population comprised 2000 samples of n = 900 cases. Accordingly, the conditions of the simulation study were qR = 3, qQ = 3, and p = 15. To investigate the effect of Q-factors on the variability of R-factor loading estimates, the salient loading sizes were set equal within each population model. The size of salient loadings in the common R-factor loading matrices
was λR ∈ {0.50, 0.70} and the size of salient loadings in common Q-factor loading matrices
was λQ = 0.90. The non-salient loadings were zero in all population models. According to Equations (1), (3) and (6), the R- and Q-factor loadings were combined in order to generate the observed variables. This can be written as
(15)
Although the relative effect of R- and Q-factors can be determined by the size of the respective common and unique R- and Q-factor loadings, it is helpful to control for the relative effect of R- and Q-factors more directly by means of
(16)
with
and
, which is needed to standardize the transposed part of the observed variables based on Q-factors. The usual metric of standardized factor loadings was maintained in the population with
and
. The observed variables were computed from Equation (16) by means of qR common factor scores
, p unique factor scores
, n/qQ common factor scores
, and n unique factor scores
, which were generated from normal distributions with μ = 0 and σ = 1 by the Mersenne twister random number generator integrated in IBM SPSS, Version 26.0.
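The data generation can be sketched in Python under strong simplifications. The sketch is not the SPSS procedure of the study: it omits the weighting of Equation (16) and simply adds an R-factor part and a transposed, row-mean-centered Q-factor part. The helper names `simple_structure` and `factor_part`, the seed, and the block-wise assignment of salient loadings are hypothetical choices. NumPy's MT19937 bit generator is used only because the text mentions a Mersenne twister; the resulting random numbers will not match those of SPSS.

```python
import numpy as np

def simple_structure(n_rows, n_factors, salient):
    """Loading matrix with equal salient loadings and zero non-salient loadings."""
    loadings = np.zeros((n_rows, n_factors))
    block = n_rows // n_factors
    for k in range(n_factors):
        loadings[k * block:(k + 1) * block, k] = salient
    return loadings

def factor_part(loadings, n_cols, rng):
    """Common plus unique part; unique loadings give each row unit variance."""
    n_rows, n_factors = loadings.shape
    unique = np.sqrt(1.0 - np.sum(loadings**2, axis=1))
    common = loadings @ rng.standard_normal((n_factors, n_cols))
    return common + unique[:, None] * rng.standard_normal((n_rows, n_cols))

rng = np.random.Generator(np.random.MT19937(2024))   # hypothetical seed
p, n, q_r, q_q = 15, 300, 3, 3

x_r = factor_part(simple_structure(p, q_r, 0.50), n, rng)   # p x n, R-factor part
x_q = factor_part(simple_structure(n, q_q, 0.90), p, rng)   # n x p, Q-factor part

# transpose the Q-factor part and remove the row means (row mean centering)
x_q_centered = x_q.T - x_q.T.mean(axis=1, keepdims=True)    # p x n

x = x_r + x_q_centered        # unweighted aggregation, weights are omitted here
print(x.shape)
```

The rows of `x` hold the p observed variables and the columns the n cases, so `x.T` would be the usual cases-by-variables layout for R-factor analysis.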
For
and
Equation (16) yields a conventional R-factor model. For
and
, half of the unique R-factor variance is replaced by common and unique Q-factor variance. Four levels of
(1.00, 0.75, 0.50, and 0.25) with the corresponding
were combined with two levels of λR and three sample sizes n, which leads to 4 × 2 × 3 = 24 populations, each comprising 2000 samples.
Each set of p observed variables was submitted to R-factor analysis. The dependent variables of the simulation study were the mean and standard deviation of the estimated loadings
resulting from principal-axis R-factor analysis of the sample data with subsequent orthogonal target-rotation [25] of the estimated R-factor loadings
towards the R-factor loadings
of the population model based on R- and Q-factors. Therefore, differences between the means of
and cannot be due to different rotations of the factors.
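Orthogonal target rotation can be sketched with the closed-form singular value decomposition solution of the orthogonal Procrustes problem. It is an assumption that this corresponds to the procedure of [25] as implemented in the study; the function name `orthogonal_target_rotation` and the toy loading matrices are hypothetical.

```python
import numpy as np

def orthogonal_target_rotation(loadings, target):
    """Rotate `loadings` by the orthogonal matrix T that minimizes the
    Frobenius norm of (loadings @ T - target)."""
    u, _, vt = np.linalg.svd(loadings.T @ target)
    rotation = u @ vt
    return loadings @ rotation, rotation

# Population-like target: 15 variables, 3 factors, salient loadings of 0.50
target = np.zeros((15, 3))
target[:5, 0] = 0.50
target[5:10, 1] = 0.50
target[10:, 2] = 0.50

# Apply an arbitrary orthogonal transformation, then rotate back to the target
rng = np.random.default_rng(7)
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
unrotated = target @ q.T
rotated, rotation = orthogonal_target_rotation(unrotated, target)

print(np.abs(rotated - target).max())
```

Because the rotation is orthogonal, the communalities of the variables are unchanged; only the position of the factor axes is aligned with the target.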
3.2. Results
The most important result of the simulation study is that the standard deviation of the salient loadings increases with decreasing
(Table 2). The results of
show the standard deviations of the loadings that are only due to sampling error, as rotational variation of loadings was excluded by means of orthogonal target-rotation towards the population loadings. Especially, the results of
show that the standard deviation of the loading estimates was about twice as large as the variation due to sampling error, when there was a substantial amount of Q-factor variance. This additional loading variation is a bias of the loading estimates as there was no salient loading variation in the population.
In order to show the possible effect of the loading variation (comprising salient and non-salient loadings) on factor identification, a scatterplot of the target-rotated loadings of factors 1 and 2 is presented for λR = 0.50 in Figure 1 and for λR = 0.70 in Figure 2. Obviously, the overlap of salient and non-salient loadings for samples of n = 300 cases is substantial for λR = 0.50 and
and might be an obstacle for factor identification (Figure 1). In contrast, salient and
Table 2. Mean and standard deviation of target-rotated salient loading estimates of R-factor analysis for λR = 0.50, 0.70 for
= 0.25, 0.50, 0.75, and 1.00 (n = 300, 600, 900).
Note. Standard deviations are given behind the slash.
Figure 1. Scatterplot of R-factor loading estimates
of factor 1 and 2 based on 2000 samples (n = 300, 600, 900) drawn from populations based on λR = 0.50, qR = 3 R-factors (
= 1.00) and from populations comprising qR = 3 R- and qQ = 3 Q-factors (
= 0.25).
non-salient loadings can clearly be separated for n = 300 cases, λR = 0.50 and
or for sample sizes of n = 600 and n = 900. For all conditions based on λR = 0.70, the overlap of salient and non-salient loadings was small, indicating that factor identification would be possible (Figure 2). To sum up, when a substantial amount of Q-factor variance is expected, large sample sizes should be analyzed or very large R-factor loadings should be expected as a basis for successful factor identification.
Figure 2. Scatterplot of R-factor loading estimates
of factor 1 and 2 based on 2000 samples (n = 300, 600, 900) drawn from populations based on λR = 0.70, qR = 3 R-factors (
= 1.00) and from populations comprising qR = 3 R- and qQ = 3 Q-factors (
= 0.25).
4. An Indicator of Q-Factor Variance
As R-factor analysis of data from a population based on a relevant amount of Q-factor variance may result in biased R-factor loadings, it is interesting to know whether there is a relevant amount of Q-factor variance in a data set. Note that a population model based on an additive combination of R- and Q-factors implies that a row-centered matrix of individual R-factor scores is combined with a row-and-column-centered matrix of individual Q-factor scores (Equations (15) and (16)). [26] demonstrated that the eigenvalues of R- and Q-factor analysis of a row-and-column-centered matrix are identical, so that a high similarity of eigenvalues should be expected for combined R- and Q-factor models, even when the resulting matrix is not perfectly column-centered. Therefore, Q-factor analysis will yield a number of substantial eigenvalues, even when the data can perfectly be described by R-factor analysis. Thus, the eigenvalues of Q-factor analysis do not inform unambiguously on the amount of Q-factor variance.
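The identity of eigenvalues for a row-and-column-centered matrix can be illustrated with a short Python sketch. The sketch uses random data and compares the eigenvalues of the two cross-product matrices; this is an illustration of the algebraic fact, not a reproduction of the analysis in [26]. For covariance or correlation matrices the eigenvalues would additionally depend on the scaling that is applied.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 8, 5                               # toy sizes (assumptions)
data = rng.normal(size=(n, p))            # rows = individuals, columns = variables

# double centering: remove row means and column means, add back the grand mean
centered = (data
            - data.mean(axis=1, keepdims=True)
            - data.mean(axis=0, keepdims=True)
            + data.mean())

# eigenvalues in descending order
r_eigen = np.linalg.eigvalsh(centered.T @ centered)[::-1]   # p x p, variables
q_eigen = np.linalg.eigvalsh(centered @ centered.T)[::-1]   # n x n, individuals

k = min(n, p) - 1                         # maximum rank after double centering
print(r_eigen[:k], q_eigen[:k])
```

The first k eigenvalues coincide and all remaining eigenvalues are zero up to rounding error, so the eigenvalues alone cannot separate R-factor variance from Q-factor variance.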
It is therefore proposed to consider the bivariate scatterplot of observed variables in order to ascertain whether between-subject variance that could be due to R-factors is combined with a substantial amount of within-subject variance that could be due to Q-factors. Different within-subject profiles that might be caused by qQ > 1 Q-factors imply that not all differences between two observed z-standardized variables z1 and z2 are equal. For qQ = 2, for example, there could be one group of participants with z1 − z2 > 0 and a second group with z1 − z2 < 0. It follows that the variance of the z-score differences d, σd², is greater than zero for qQ ≥ 2. According to [27] (p. 64) the correlation can be written as
$$r_{z_1 z_2} = 1 - \frac{\sigma_d^2}{2}. \qquad (17)$$
As qQ ≥ 2 implies σd > 0, it follows from Equation (17) that $r_{z_1 z_2} < 1$. An example for n = 145 cases and qQ = 3 is given for
in Figure 3 (dots). The concentration of points on three lines is extreme for qQ = 3, so that the bivariate distribution is quite different from the bivariate distribution for the same correlation and qQ = 0 (Figure 3, crosses). For qQ = 0 there is a bivariate normal distribution, which is clearly not the case for qQ = 3. As the distributions in Figure 3 are not skewed, only tests of the multivariate kurtosis were performed with the macro provided by [28] at α = 0.05. Srivastava’s [29] test for multivariate kurtosis (β2,p = 2.26, N(β2,p) = −2.59, p < 0.01), Small’s [30] test of multivariate kurtosis (Q2 = 298.95, df = 2, p < 0.01), and Mardia’s [31] test indicate a significant departure from multivariate normal kurtosis (β2,p = 6.36, N(β2,p) = −2.47, p < 0.05).
The example shows that a bivariate distribution clearly based on qQ = 3 may result in a platykurtic departure from the kurtosis of the bivariate normal distribution. Even when different reasons for platykurtic multivariate distributions are possible, tests of the multivariate kurtosis may also indicate that qQ > 1. Visual inspection of scatterplots may be performed when significant departures from the multivariate normal distribution occur because a pattern with separable clouds of points will provide further evidence for the presence of Q-factors.
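The relation between the correlation and the variance of z-score differences can be checked numerically. The sketch assumes that Equation (17) has the form r = 1 − σd²/2 with z-scores and σd² based on the population formula (division by n); the simulated variables are arbitrary and do not reproduce the data of Figure 3.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 145
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=n)    # arbitrary correlated variable

def zscore(v):
    """z-standardize with the population standard deviation (ddof = 0)."""
    return (v - v.mean()) / v.std()

z1, z2 = zscore(x1), zscore(x2)
d = z1 - z2                                 # z-score differences

r = np.corrcoef(x1, x2)[0, 1]               # Pearson correlation
r_from_d = 1.0 - d.var() / 2.0              # correlation from difference variance

print(r, r_from_d)
```

Any spread of the differences therefore lowers the correlation below one, irrespective of whether the spread stems from random error or from distinct within-subject profiles.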
In order to investigate the usefulness of tests for the kurtosis of the multivariate normal distribution as indicators for qQ > 1, tests were performed for qR = qQ = 3 and p = 15. The tests were based on 2000 samples with n = 300, 600 and 900 with λQ = 0.90, λR = 0.50 and 0.70,
= 0.10, 0.25, 0.50, 1.00, and α-levels of 0.05, 0.10, and 0.20. As the test is employed in order to evaluate conditions for R-factor analysis, an alpha-level beyond the conventional 0.05-level might be justified. Note that the
condition is a condition without any effect of Q-factors, so that no detection rate beyond chance level should be expected for this condition. Overall, the highest detection rates for data with substantial Q-factor variance were found for Mardia’s coefficient (see Table 3). However,
Figure 3. Scatterplot of z1 and z2, n = 145, qQ = 3, and
.
Table 3. Percentage of p-values of tests of kurtosis indicating significant departures from multivariate normality at α = 0.05, 0.10, and 0.20, for n = 300, 600, and 900, p = 15, λR = 0.50 and 0.70, λQ = 0.90.
for n = 300 and
the rate of false positives is slightly above chance for Mardia’s coefficient. As the power for the identification of substantial Q-factor variance was sufficiently high for Srivastava’s and Small’s tests without substantial false positives
, these tests might be recommended.
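A test of multivariate kurtosis can be sketched in Python for Mardia's coefficient. The sketch uses the common textbook form, with expected value p(p + 2) and variance 8p(p + 2)/n under normality, and is not the macro of [28]; the function name `mardia_kurtosis` and the simulated data are hypothetical. As a plausibility check, the standardized value reported for the bivariate example (coefficient 6.36, n = 145) is recomputed from this formula.

```python
import numpy as np

def mardia_kurtosis(data):
    """Mardia's multivariate kurtosis and its normal-approximation z statistic.
    data: n x p array with cases in rows."""
    n, p = data.shape
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / n                  # covariance with divisor n
    inv_cov = np.linalg.inv(cov)
    # squared Mahalanobis distance of every case from the centroid
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    b2p = np.mean(d2**2)
    z = (b2p - p * (p + 2)) / np.sqrt(8.0 * p * (p + 2) / n)
    return b2p, z

rng = np.random.default_rng(8)
b2p_normal, z_normal = mardia_kurtosis(rng.normal(size=(5000, 2)))
b2p_uniform, z_uniform = mardia_kurtosis(rng.uniform(size=(5000, 2)))

# standardized value for the reported bivariate example (p = 2, n = 145)
z_reported = (6.36 - 2 * (2 + 2)) / np.sqrt(8.0 * 2 * (2 + 2) / 145)

print(b2p_normal, z_normal, b2p_uniform, z_uniform, round(z_reported, 2))
```

A clearly negative z value indicates a platykurtic distribution, as for the uniform data here; the sign convention matches the negative standardized values reported for the bivariate example.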
5. Discussion
As R-factor analysis of variables observed for a large number of individuals is the dominant form of factor analysis in several areas of social sciences, it might happen that R-factor analysis is routinely performed even when the population model comprises R- and Q-factors. For example, in the domain of personality research, it has been assumed that Q-factors or type-factors may be relevant in addition to the well-known R-factors (e.g., [17] [32] [33] ). This leads to the question of whether performing an R-factor analysis of data from a population model comprising R- and Q-factors may result in biased loading estimates. R-factor analysis of data from population models comprising R- and Q-factors was therefore investigated.
It was shown that R-factor analysis of data based on a population model comprising R- and Q-factors leads to biased R-factor loading estimates. For such data R-factor analysis introduces variability into the loading estimates. Thus, when the observed variables have equal R-factor loadings in a population model comprising R- and Q-factors, the loading estimates resulting from R-factor analysis of the observed variables will have variability beyond chance level. This bias of R-factor loading estimates and the variation of R-factor loading estimates beyond chance level were also shown in a simulation study. It was illustrated in the simulation study that the additional loading variability may hamper factor identification. These results show that effects of model error beyond the effect of minor factors [7] may be of relevance for factor analysis. The variability of R-factor loadings beyond chance level caused by Q-factors implies that significance testing of R-factor loadings cannot protect completely against erroneous conclusions when the data are drawn from populations comprising R- and Q-factors. Although the terminology of the present study was based on the distinction of variables and individuals, which is important in the social sciences, the present results are of relevance whenever the common variance of scores is combined with the common variance of transposed scores in a two-way array of scores.
From an applied perspective, the present results imply that the reproducibility of R-factor loadings may not only be hampered by sampling error, insufficient reliability of variables and an insufficient number of variables per factor, but also by the presence of Q-factors. The reproducibility crisis [34] resulted in a stronger focus on statistical power, stronger research designs, preregistration, and replication studies. The present study shows that different forms of model error and, more specifically, Q-factors may also be considered as a reason for insufficient reproducibility of research results when results are based on R-factor analysis.
As the use of R-factor analysis for data drawn from a population based on R- and Q-factors may result in biased R-factor loading estimates, it might be of interest to detect Q-factor variance in observed variables as a prerequisite of R-factor analysis. As eigenvalues of correlation matrices may be ambiguous and because Q-factor variance leads to platykurtic multivariate distributions of observed scores, it was proposed to use tests of multivariate kurtosis as indicators of Q-factor variance. In a simulation study, Mardia’s test of the multivariate kurtosis was more sensitive for the detection of relevant Q-factor variance than Srivastava’s and Small’s tests. However, a slight tendency towards false positive results was also found with Mardia’s test, so that Srivastava’s and Small’s tests might also be recommended. As different reasons are possible for departures of the kurtosis from the kurtosis of the multivariate normal distribution, an inspection of scatterplots is recommended when a test of the multivariate kurtosis of the data is significant. The inspection of scatterplots may be combined with pairwise tests of the bivariate kurtosis in order to eliminate observed variables with substantial Q-factor variance from R-factor analysis. For the same reason, it may be considered to normalize the data (e.g., [35]) before tests for the departure of multivariate kurtosis from normality are applied: normalization may reduce departures from normality due to outliers and other reasons, whereas the Q-factor related scatterplot structure of parallel clouds of points is unlikely to be affected by normalization. The possibility of improving the specificity of tests of multivariate kurtosis for the identification of Q-factor patterns by means of normalization may be investigated in future simulation studies.
To sum up, the present paper is a caveat that R-factor analysis of data from population models comprising R- and Q-factors will result in biased R-factor loading estimates. The bias is due to the fact that the model of R-factor analysis does not correspond exactly to the population model comprising R- and Q-factors. Tests of the multivariate kurtosis might be used for the detection of Q-factor variance as a prerequisite for R-factor analysis. Further research should compare the effect of model error due to Q-factor variance on the results of R-factor analysis with the effect of model error based on minor factors as has been discussed by [7]. Another avenue of future research would be the investigation of the combined effect of R- and Q-factors in the context of parallel factor analysis (PARAFAC, [36]), where several two-way arrays of data are analyzed. It might be interesting to enter a two-way array for R-factor analysis as well as its transposition into PARAFAC in order to investigate whether a simultaneous estimation of R- and Q-factors allows for a reduction of the bias of factor loadings. The combined effect of model error based on minor factors and model error based on Q-factor variance might also be investigated in future research, as both types of model error may occur in real data.