Learn the difference between PCA and factor analysis, and when to use each, with sample Python and R code
PCA, short for Principal Component Analysis, and factor analysis are two statistical methods that are often covered together in multivariate statistics courses.
In this article, you will discover the mathematical and practical differences between the two methods.
Multivariate statistics is a group of statistical methods that studies multiple variables together, with particular attention to the variation these variables have in common.
Multivariate statistics deals with the treatment of data sets with a large number of dimensions.
Therefore, its goals differ from supervised modelling, but also from segmentation and clustering models.
There are many models in the multivariate statistics family. In this article, I will focus on the difference between PCA and factor analysis, two commonly used multivariate models.
Goal: Reduce the number of dimensions in a data set
Modern data sets often have a large number of variables. This makes it difficult, as a practical matter, to examine each of the variables individually, since the human mind cannot easily process data on such a large scale.
When a data set contains a large number of variables, there is often a large amount of overlap between those variables.
PCA is a statistical technique that allows you to "regroup" your variables into a smaller number of variables called components. This regrouping is based on the common variation shared by multiple variables.

The components found by PCA are ordered from the highest information content to the lowest: the first component contains the greatest variation, the second component contains the second-largest variation, and so on. The last component logically contains the smallest variation.
Thanks to this arrangement of the components, it is possible to keep only a part of the newly created components, while maintaining the maximum amount of variation. So we can use the components instead of the original variables for data exploration.
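As a sketch of this idea, the share of variation carried by each component can be read off the eigenvalues of the correlation matrix. Here is a numpy-only illustration; the data set and noise level are invented for the example:

```python
import numpy as np

# Synthetic example: 4 variables that share one common source of variation
# (the data and noise level are invented for this illustration).
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base + 0.3 * rng.normal(size=(100, 1)) for _ in range(4)])

# Correlation matrix of the variables, then its eigenvalues.
corr = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]   # sorted from largest to smallest

# Share of total variance per component, and the cumulative share
# retained if we keep only the first few components.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)
print(np.round(cumulative, 3))
```

Because the four variables are strongly correlated here, the first component already carries most of the variation, so keeping one or two components loses little information.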
Mathematical model: Maximize variance of new components
The mathematical definition of the PCA problem is to find a linear combination of the original variables with the maximum variance.
This means that we will create a new component; let's call it z. It is calculated from our original variables (x1, x2, ...), each multiplied by a weight (u1, u2, ...).

This can be written as z = Xu.

The mathematical goal is to find the values of u that maximize the variance of z, subject to a unit-length constraint on u. Mathematically, this is a constrained optimization problem solved with a Lagrange multiplier, but in practice we use computers to perform all PCA operations at once.
This can also be described as applying a matrix decomposition to the correlation matrix of the original variables.
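To make that link concrete, here is a small numpy-only sketch, with synthetic data invented for the illustration: the weight vector u that maximizes the variance of z = Xu turns out to be the top eigenvector of the correlation matrix, and the variance of z equals the corresponding eigenvalue.

```python
import numpy as np

# Synthetic correlated data (invented for this illustration).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3)) @ np.array([[1.0, 0.8, 0.2],
                                          [0.0, 0.6, 0.3],
                                          [0.0, 0.0, 0.5]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized variables

R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)       # eigenvalues in ascending order
u = eigvecs[:, -1]                         # eigenvector of the largest eigenvalue
z = Z @ u                                  # the first component, z = Xu

print(np.var(z), eigvals[-1])              # the two values agree
```

This is exactly what a PCA routine does for all components at once: the decomposition of the correlation matrix yields every weight vector and every variance in one step.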
The PCA is efficient in finding the components that maximize the variance. This is great when we're interested in reducing the number of variables while maintaining maximum variance.
However, sometimes we're not just interested in maximizing variance: we might want to give our newly defined dimensions the most useful interpretations. And that's not always easy with the solution that a PCA finds. In that case we can apply factor analysis: a slightly more flexible alternative to PCA.
Purpose: find latent variables in a data set
Like PCA, factor analysis is a model that allows information to be reduced from a larger number of variables to a smaller number of variables. In factor analysis we call these “latent variables”.
Factor analysis attempts to find latent variables that make sense to us. We can run the solution until we find latent variables that have a clear interpretation and "make sense".
Factor analysis is based on a model called the common factor model. It is assumed that there are multiple factors in a data set and that each of the measured variables captures a portion of one or more of those factors.
An example of a factor analysis is given in the scheme below. Imagine many students in a school. They all get grades in many subjects. One might imagine that these different scores are partially correlated: a more intellectually gifted student would have higher overall scores. This would be an example of a latent variable.
But we can also imagine students who are good at languages but bad at technical subjects. In this case we could try to find a latent variable for language ability and a second latent variable for technical ability.
We now have latent variables that measure a student's overall ability for language and for technical subjects. But it is still possible that a student who is generally good at languages is bad at German. That's why the common factor model includes specific factors: they capture the variance that is unique to each measured variable. We could call this one "ability to learn German, taking into account general language-learning ability".
Mathematical model: The common factor model
As mentioned earlier, the mathematical model in factor analysis is much more conceptual than the PCA model. While the PCA model is more of a pragmatic approach, in factor analysis we assume that there are latent variables.
In a case with two latent variables, we can decompose each of our original X variables into a part explained by the first common latent variable (let's call it k1), a part explained by the second common latent variable (k2), and a part explained by a specific factor (unique to that variable; called d).
In a case with 4 original variables, the factor analysis model would look like this:
X1 = c11 * k1 + c12 * k2 + d1
X2 = c21 * k1 + c22 * k2 + d2
X3 = c31 * k1 + c32 * k2 + d3
X4 = c41 * k1 + c42 * k2 + d4
The c values are the entries of the coefficient matrix; these are the values we need to estimate.
To solve this, the same mathematical approach as PCA is used, with one small difference. In PCA, we apply matrix decomposition to the correlation matrix. In factor analysis, we apply matrix decomposition to a correlation matrix in which each diagonal entry is replaced by 1 − var(d), one minus the variance of that variable's specific factor.
In summary, the mathematical difference between PCA and factor analysis is the use of specific factors for each original variable.
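The common factor model can be simulated directly. The following numpy-only sketch (the loadings and noise levels are invented for the illustration) generates data from two common factors plus specific factors, then builds the reduced correlation matrix with 1 − var(d) on the diagonal:

```python
import numpy as np

# Generate data from the common factor model X = k @ C.T + d
# (loadings C and specific-factor standard deviations are invented).
rng = np.random.default_rng(1)
n = 1000
k = rng.normal(size=(n, 2))              # two common factors
C = np.array([[0.9, 0.1],                # coefficient (loading) matrix
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.8]])
d_sd = np.array([0.3, 0.4, 0.3, 0.4])    # specific-factor std deviations
X = k @ C.T + rng.normal(size=(n, 4)) * d_sd

# The reduced correlation matrix factor analysis decomposes:
# ordinary correlations off the diagonal, 1 - var(d) on the diagonal.
R = np.corrcoef(X, rowvar=False)
specific_var = d_sd**2 / X.var(axis=0)   # share of variance that is specific
R_reduced = R.copy()
np.fill_diagonal(R_reduced, 1 - specific_var)
print(np.round(R_reduced, 2))
```

The diagonal entries (the communalities) are now below 1: only the variance shared with the common factors is left in the matrix that gets decomposed.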
Let's go back to the example of students in a school taking 4 exams: two language exams and two technical exams. We expect two underlying factors: language skills and technical skills.
- PCA does not estimate specific effects, but simply finds the mathematical definition of the "best" components (variance-maximizing components). This can be language and technical skills, but also something else.
- Factor analysis will also estimate the components, but for now let's call them common factors. In addition, it also estimates the specific factors. Hence we get two common factors (language and technique) and four specific factors (skills in Test 1, Test 2, Test 3 and Test 4 that are not explained by language or technical skills).
While we don't pay much attention to the specific factors, the fact that they were estimated gives us a different definition of the common factors/components.
Because of this mathematical difference, there is also a big difference in how PCA and factor analysis are applied. In PCA, there is a fixed ordering that ranks the components from highest explanatory value to lowest.
In factor analysis, we can apply rotations to our solution, which allows us to find a solution that has a more coherent business explanation for each of the factors identified.
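As an illustration of what a rotation does, here is a minimal numpy-only varimax implementation, a sketch of the standard SVD-based algorithm rather than a production routine, applied to an invented loading matrix. Because the rotation is orthogonal, the fit of the model is unchanged while the loadings become easier to read:

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Kaiser varimax rotation via the standard SVD-based algorithm (sketch)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # SVD-based update of the orthogonal rotation matrix.
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3
                          - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p))
        rotation = u @ vt
        if s.sum() < criterion * (1 + tol):
            break
        criterion = s.sum()
    return loadings @ rotation

# Invented unrotated loadings: four tests on two factors.
L = np.array([[0.8,  0.5],
              [0.7,  0.6],
              [0.6, -0.5],
              [0.7, -0.6]])
L_rot = varimax(L)
print(np.round(L_rot, 2))
```

After rotation, each test loads mainly on one factor, which is exactly the "coherent explanation" we are after: the rotated factors are easier to name than the raw variance-maximizing solution.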
The ability to apply rotation to a factor analysis makes it a great tool to tackle studies on multivariate questionnaires in marketing and psychology.
The fact that factor analysis is much more flexible for interpretation makes it a great tool for research and interpretation. Two examples:
- In marketing: summarize the product evaluation questionnaire with many questions on few latent factors for product improvement.
- In psychology: reduce very long answers in personality tests to a few personality traits.
PCA, on the other hand, allows us to find the components that contain the maximum amount of information in the smallest number of variables. This makes it a great tool for dimension reduction: we keep as much variation in as few variables as possible, for example to simplify further analyses. PCA is also commonly used in data preparation for machine learning tasks, where we want to support the machine learning model by "summarizing" the data in a more digestible way.
To get started with PCA and factor analysis in Python or R, here are some quick tips for the useful libraries you'll need. PCA implementations are just as good in Python as they are in R. However, factor analysis in Scikit Learn is not very mature and it might be useful to review the R alternative.
PCA in Python
To get started with PCA in Python, you can use the PCA class from sklearn.decomposition.

It can be used as follows:

from sklearn.decomposition import PCA
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

my_pca = PCA(n_components=2)
my_pca.fit(X)
print(my_pca.explained_variance_ratio_)
print(my_pca.singular_values_)
Factor Analysis in Python
To get started with factor analysis in Python, you can use the FactorAnalysis class from sklearn.decomposition.

Rotation is not yet available in scikit-learn version 0.23, but varimax rotation is coming soon and is already available in the version 0.24 nightly build.

It can be used as follows:

from sklearn.decomposition import FactorAnalysis
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

my_fa = FactorAnalysis(n_components=2)
# in scikit-learn 0.24 and later:
# my_fa = FactorAnalysis(n_components=2, rotation='varimax')
X_transformed = my_fa.fit_transform(X)
PCA in R
Several implementations of PCA are available in R, but prcomp is a good option that can be used like this:

data(iris)
iris_data <- iris[, 1:4]
my_pca <- prcomp(iris_data, center = TRUE, scale. = TRUE)
print(my_pca)
Factor Analysis in R
Factor analysis is much more developed in the R psych package compared to the Scikit Learn implementation. It can be used as follows:

install.packages("psych")
library(psych)

data(iris)
iris_data <- iris[, 1:4]
iris_cor <- cor(iris_data)

my_fa <- fa(r = iris_cor, nfactors = 2)
print(my_fa)

# or the alternative with varimax rotation:
my_fa <- fa(iris_data, nfactors = 2, rotate = "varimax")
print(my_fa)

# or the alternative with promax rotation:
my_fa <- fa(iris_data, nfactors = 2, rotate = "promax")
print(my_fa)
Another great R library for PCA and factor analysis is FactoMineR. It has a very wide range of additional options to go further in multivariate statistics.
In conclusion, let's look at the difference between PCA and factor analysis in three points:
Different goal
First of all, the goal is different. PCA aims to define new variables ordered by explained variance: the first new variable captures the most variance, the second the next most, and so on.
FA aims to define new variables that we can understand and interpret in a commercial/practical way.
Different mathematical model
Consequently, while the mathematics behind the two methods are close, they are not exactly the same. Although both methods use decomposition, they differ in details.
This also gives factor analysis an additional opportunity to rotate the final solution, while PCA does not.
Different applications
Because factor analysis is more flexible in interpretation due to the possibility of solution rotation, it is very valuable in marketing and psychology studies.
The advantage of PCA is that it allows for dimensionality reduction while retaining a maximum amount of information in a data set. This is often used to facilitate exploratory analysis or to prepare data for machine learning pipelines.
I hope this article was useful to you. Thanks for reading and don't hesitate to stay tuned for more!