Set of Assumptions for Factor and Principal Component Analysis

Description: Tests for the Kaiser-Meyer-Olkin (KMO) measure and for communalities in a dataset. It provides a final sample by removing variables iteratively while keeping track of the variables that were removed at each step.

*Factor Analysis* and *Principal Components Analysis* (PCA) rest on some precautions and assumptions that should be checked before the analysis (Hair et al. 2018).

The first one is the KMO (Kaiser-Meyer-Olkin) measure, which estimates the proportion of variance among the variables that may be common variance, also called systematic variance. KMO ranges from 0 to 1. Low values (close to 0) indicate that the partial correlations are large in comparison to the sum of the correlations, that is, the correlations among the variables are predominantly of a kind that is problematic for factor/principal component analysis. Hair et al. (2018) suggest removing variables whose individual KMO is smaller than 0.5 from the factor/principal component analysis. Once every remaining variable has an individual KMO above 0.5, the overall KMO of the remaining variables is also greater than 0.5.
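As a compact restatement of the paragraph above, the usual textbook definition of the measure can be written as follows. The symbols are introduced here for illustration only and are not the package's notation: \(r_{ij}\) is the correlation between variables \(i\) and \(j\), and \(p_{ij}\) is their partial correlation after controlling for all other variables.

\[
KMO = \frac{\sum_{j}\sum_{i \neq j} r_{ij}^2}{\sum_{j}\sum_{i \neq j} r_{ij}^2 + \sum_{j}\sum_{i \neq j} p_{ij}^2},
\qquad
KMO_j = \frac{\sum_{i \neq j} r_{ij}^2}{\sum_{i \neq j} r_{ij}^2 + \sum_{i \neq j} p_{ij}^2}
\]

An individual \(KMO_j > 0.5\) means that \(\sum_{i \neq j} r_{ij}^2 > \sum_{i \neq j} p_{ij}^2\) for variable \(j\). Summing this inequality over all remaining variables gives \(KMO > 0.5\) for the overall measure, which is why filtering on the individual values is sufficient.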

The second assumption of a valid factor analysis or PCA concerns the communalities of the variables in the rotated solution. The communalities indicate the variance that each variable shares with the extracted factors/components. A greater communality indicates that a greater amount of the variance in the variable was extracted by the factor/principal component solution. For a better measurement in factor/principal component analysis, communalities should be 0.5 or greater (Hair et al. 2018).
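For a solution based on the correlation matrix, the communality can be written as a sum of squared loadings. The notation below is ours and is only a sketch: \(\lambda_{jk}\) is the loading of variable \(j\) on factor/component \(k\), and \(m\) is the number of factors/components retained.

\[
h_j^2 = \sum_{k=1}^{m} \lambda_{jk}^2,
\qquad
u_j^2 = 1 - h_j^2
\]

These correspond to the `h2` and `u2` columns of the final output in this document. For example, the `A3` row there has loadings -0.02, 0.21, 0.75, 0.11, 0.01, 0.00 and -0.02, so

\[
h_{A3}^2 = 0.0004 + 0.0441 + 0.5625 + 0.0121 + 0.0001 + 0 + 0.0004 = 0.6196 \approx 0.62,
\]

which agrees with the reported `h2` of 0.62 and `u2` of 0.38 up to the rounding of the printed loadings.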

First, we will load the example dataset `bfi` from `psych` and load the package `FactorAssumptions`:

```
# Load FactorAssumptions; `bfi` is the example dataset from psych
library(FactorAssumptions, quietly = TRUE, verbose = FALSE)
bfi_data <- bfi
# Keep only complete cases (rows without missing values)
bfi_data <- bfi_data[complete.cases(bfi_data), ]
head(bfi_data)
```

```
## A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4
## 61623 6 6 5 6 5 6 6 6 1 3 2 1 6 5 6 3 5 2 2 3 4 3 5 6
## 61629 4 3 1 5 1 3 2 4 2 4 3 6 4 2 1 6 3 2 6 4 3 2 4 5
## 61634 4 4 5 6 5 4 3 5 3 2 1 3 2 5 4 3 3 4 2 3 5 3 5 6
## 61640 4 5 2 2 1 5 5 5 2 2 3 4 3 6 5 2 4 2 2 3 5 2 5 5
## 61661 1 5 6 5 6 4 3 2 4 5 2 1 2 5 2 2 2 2 2 2 6 1 5 5
## 61664 2 6 5 6 5 3 5 6 3 6 2 2 4 6 6 4 4 4 6 6 6 1 5 6
## O5 gender education age
## 61623 1 2 3 21
## 61629 3 1 2 19
## 61634 3 1 1 21
## 61640 5 1 1 17
## 61661 2 1 5 68
## 61664 1 2 2 27
```

First, we will check the \(KMO > 0.5\) assumption for every individual variable in the dataset with the `kmo_optimal_solution` function, storing the result as `kmo_bfi`:

`## Final Solution Achieved!`

Note that `kmo_optimal_solution` outputs a list:

- the final solution as `df`
- removed variables with individual \(KMO < 0.5\) as `removed`
- Anti-image covariance matrix as `AIS`
- Anti-image correlation matrix as `AIR`
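The iterative removal described above can be sketched in base R. This is an illustration under stated assumptions, not the code of `kmo_optimal_solution`: the helper names `individual_kmo` and `drop_low_kmo` are hypothetical, the data are simulated, and dropping only the single lowest variable per iteration is our assumption about the procedure.

```r
# Individual KMO for each column, from correlations and partial correlations
individual_kmo <- function(data) {
  r <- cor(data)                       # correlation matrix
  r_inv <- solve(r)                    # inverse correlation matrix
  d <- 1 / sqrt(diag(r_inv))
  p <- -r_inv * outer(d, d)            # partial correlations given all others
  diag(r) <- 0                         # keep off-diagonal terms only
  diag(p) <- 0
  colSums(r^2) / (colSums(r^2) + colSums(p^2))
}

# Drop the variable with the lowest individual KMO until all reach the cutoff
drop_low_kmo <- function(data, cutoff = 0.5) {
  removed <- character(0)
  repeat {
    kmo <- individual_kmo(data)
    # Stop when every variable passes, or when too few variables remain
    if (min(kmo) >= cutoff || ncol(data) <= 2) break
    worst <- names(kmo)[which.min(kmo)]
    removed <- c(removed, worst)       # record the order of removal
    data <- data[, names(data) != worst, drop = FALSE]
  }
  list(df = data, removed = removed)
}

# Simulated stand-in data: four indicators of one factor plus pure noise
set.seed(1)
n <- 300
f <- rnorm(n)
toy <- data.frame(
  x1 = f + rnorm(n, sd = 0.5),
  x2 = f + rnorm(n, sd = 0.5),
  x3 = f + rnorm(n, sd = 0.5),
  x4 = f + rnorm(n, sd = 0.5),
  noise = rnorm(n)
)

kmo_result <- drop_low_kmo(toy)
names(kmo_result$df)   # variables kept
kmo_result$removed     # variables dropped, in order of removal
```

The returned list mirrors only the `df` and `removed` elements listed above; the anti-image matrices are left out to keep the sketch short.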

In our case, no variables were removed due to low individual KMO values, so the `removed` element is empty:

`## NULL`

The parallel analysis of the `bfi` data suggests seven factors, so we will next check the \(communality > 0.5\) assumption for every individual variable, with the argument `nfactors` set to 7.
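The parallel analysis itself is not shown in this document. The `psych` package offers `fa.parallel` for it; the base R sketch below only illustrates the idea for principal components, on simulated data. The function name `parallel_components`, the use of the mean random eigenvalue as the threshold, and the toy data are our assumptions.

```r
# Simulated stand-in data: two latent factors with three indicators each
set.seed(123)
n <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- data.frame(
  a1 = f1 + rnorm(n), a2 = f1 + rnorm(n), a3 = f1 + rnorm(n),
  b1 = f2 + rnorm(n), b2 = f2 + rnorm(n), b3 = f2 + rnorm(n)
)

# Count the leading eigenvalues that exceed those of random data of equal size
parallel_components <- function(data, n_sims = 100) {
  observed <- eigen(cor(data), symmetric = TRUE, only.values = TRUE)$values
  # One column of eigenvalues per simulated uncorrelated dataset
  simulated <- replicate(n_sims, {
    noise <- matrix(rnorm(nrow(data) * ncol(data)), nrow = nrow(data))
    eigen(cor(noise), symmetric = TRUE, only.values = TRUE)$values
  })
  threshold <- rowMeans(simulated)     # mean random eigenvalue per rank
  # cumprod stops the count at the first eigenvalue below its threshold
  sum(cumprod(observed > threshold))
}

n_keep <- parallel_components(toy)
n_keep   # suggested number of components for the toy data
```

The suggested number would then be passed as `nfactors`, as is done with 7 for the `bfi` data.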

For the `type` argument we can use either the `principal` or the `fa` function from the `psych` package, as desired:

- `principal` will perform a *Principal Component Analysis* (PCA)
- `fa` will perform a *Factor Analysis*

*Note*: we are using the `df` generated by the `kmo_optimal_solution` function.

*Note 2*: the default rotation employed by `communalities_optimal_solution` is `varimax`. You can change it if you want.

`comm_bfi <- communalities_optimal_solution(kmo_bfi$df, type = "principal", nfactors = 7, squared = FALSE)`

`## There is still an individual communality value below 0.5: A4 - 0.423382853387628`

`## There is still an individual communality value below 0.5: O4 - 0.473944505255499`

`## There is still an individual communality value below 0.5: C1 - 0.494613330049183`

Note that `communalities_optimal_solution` outputs a list:

- the final solution as `df`
- removed variables with individual \(communality < 0.5\) as `removed`
- A table with the communalities of the variables in the final iteration as `loadings`
- Results of the final iteration of either the `principal` or `fa` function from the `psych` package as `results`
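The same kind of loop can be sketched for communalities in base R. Again this is a hedged illustration rather than the package's implementation: `pca_communalities` and `drop_low_communalities` are hypothetical names, the components are unrotated principal components of the correlation matrix (an orthogonal rotation such as varimax leaves communalities unchanged), and the data are simulated.

```r
# Communalities of the first `nfactors` principal components
pca_communalities <- function(data, nfactors) {
  e <- eigen(cor(data), symmetric = TRUE)
  keep <- seq_len(nfactors)
  # Loadings = eigenvectors scaled by the square roots of the eigenvalues
  loadings <- e$vectors[, keep, drop = FALSE] %*%
    diag(sqrt(e$values[keep]), nrow = nfactors)
  h2 <- rowSums(loadings^2)            # sum of squared loadings per variable
  names(h2) <- names(data)
  h2
}

# Drop the variable with the lowest communality until all reach the cutoff
drop_low_communalities <- function(data, nfactors, cutoff = 0.5) {
  removed <- character(0)
  repeat {
    h2 <- pca_communalities(data, nfactors)
    if (min(h2) >= cutoff || ncol(data) <= nfactors) break
    worst <- names(h2)[which.min(h2)]
    removed <- c(removed, worst)       # record the order of removal
    data <- data[, names(data) != worst, drop = FALSE]
  }
  list(df = data, removed = removed, communalities = h2)
}

# Simulated stand-in data: two factors, three indicators each, plus noise
set.seed(42)
n <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- data.frame(
  a1 = f1 + rnorm(n), a2 = f1 + rnorm(n), a3 = f1 + rnorm(n),
  b1 = f2 + rnorm(n), b2 = f2 + rnorm(n), b3 = f2 + rnorm(n),
  noise = rnorm(n)
)

comm_result <- drop_low_communalities(toy, nfactors = 2)
comm_result$removed                    # variables dropped, in order of removal
round(comm_result$communalities, 2)    # communalities of the final solution
```

Recomputing the solution after every removal matters because the communalities of the remaining variables change each time a variable leaves the correlation matrix.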

In our case, three variables were removed iteratively due to low individual communality values. They are listed in the order in which they were removed, starting with the lowest communality, until an optimal solution was reached:

`## [1] "A4" "O4" "C1"`

Finally, we arrive at the rotated matrix of our final principal components analysis, stored in the `results` element. You can export it as a CSV with `write.csv` or `write.csv2`.

```
## Principal Components Analysis
## Call: principal(r = df, nfactors = nfactors, scores = T)
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC2 RC1 RC5 RC4 RC3 RC6 RC7 h2 u2 com
## A1 0.13 0.13 -0.51 0.09 0.22 0.46 -0.22 0.60 0.40 3.2
## A2 0.03 0.15 0.69 0.14 -0.05 -0.21 0.08 0.57 0.43 1.4
## A3 -0.02 0.21 0.75 0.11 0.01 0.00 -0.02 0.62 0.38 1.2
## A5 -0.17 0.30 0.67 0.07 0.05 0.09 0.02 0.59 0.41 1.6
## C2 0.11 -0.04 0.17 0.72 -0.08 0.14 -0.05 0.59 0.41 1.3
## C3 -0.02 -0.01 0.13 0.72 0.08 0.06 0.09 0.55 0.45 1.1
## C4 0.22 -0.13 0.04 -0.70 0.22 0.22 -0.06 0.66 0.34 1.8
## C5 0.29 -0.21 0.01 -0.67 0.03 0.12 0.06 0.59 0.41 1.7
## E1 -0.01 -0.74 -0.12 0.10 0.13 0.22 0.00 0.63 0.37 1.3
## E2 0.22 -0.75 -0.16 -0.07 0.06 0.02 -0.06 0.65 0.35 1.3
## E3 0.03 0.51 0.43 0.09 -0.18 0.31 -0.09 0.59 0.41 3.1
## E4 -0.15 0.64 0.41 0.07 0.15 0.07 -0.10 0.64 0.36 2.1
## E5 0.09 0.57 0.17 0.34 -0.14 0.16 0.15 0.55 0.45 2.5
## N1 0.83 0.09 -0.18 -0.05 0.08 0.03 0.01 0.73 0.27 1.1
## N2 0.82 0.07 -0.17 -0.03 0.00 -0.04 0.00 0.71 0.29 1.1
## N3 0.79 -0.07 0.01 -0.07 0.01 -0.03 -0.07 0.65 0.35 1.1
## N4 0.63 -0.42 0.04 -0.18 -0.03 0.09 0.06 0.62 0.38 2.0
## N5 0.61 -0.20 0.15 -0.03 0.17 -0.20 -0.12 0.52 0.48 1.9
## O1 0.01 0.15 0.19 0.12 -0.47 0.49 0.03 0.54 0.46 2.6
## O2 0.15 -0.02 0.12 -0.08 0.72 0.04 -0.03 0.57 0.43 1.2
## O3 0.06 0.26 0.27 0.04 -0.57 0.34 0.02 0.59 0.41 2.7
## O5 0.04 -0.01 -0.02 -0.02 0.76 0.05 -0.03 0.59 0.41 1.0
## gender 0.19 0.16 0.19 0.11 0.01 -0.66 -0.02 0.55 0.45 1.5
## education 0.00 -0.03 0.04 -0.01 -0.07 0.06 0.77 0.60 0.40 1.0
## age -0.08 0.06 0.03 0.06 -0.01 -0.08 0.77 0.61 0.39 1.1
##
## RC2 RC1 RC5 RC4 RC3 RC6 RC7
## SS loadings 3.09 2.69 2.47 2.23 1.90 1.38 1.32
## Proportion Var 0.12 0.11 0.10 0.09 0.08 0.06 0.05
## Cumulative Var 0.12 0.23 0.33 0.42 0.50 0.55 0.60
## Proportion Explained 0.20 0.18 0.16 0.15 0.13 0.09 0.09
## Cumulative Proportion 0.20 0.38 0.55 0.69 0.82 0.91 1.00
##
## Mean item complexity = 1.7
## Test of the hypothesis that 7 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.06
## with the empirical chi square 4194.08 with prob < 0
##
## Fit based upon off diagonal values = 0.92
```
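Exporting the rotated matrix, as mentioned before the output, might look like the sketch below. It uses a small stand-in matrix with a few loadings copied from the output above so that it runs on its own. With the objects from this document one would presumably start from `unclass(comm_bfi$results$loadings)` instead, which assumes that `results` is the object returned by `psych::principal`. The file names are placeholders.

```r
# Stand-in for the rotated loadings (two rows and columns from the output above)
rotated <- matrix(
  c( 0.83, 0.09,
    -0.02, 0.21),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("N1", "A3"), c("RC2", "RC1"))
)

# Placeholder paths in the session's temporary directory
out_file  <- file.path(tempdir(), "rotated_loadings.csv")
out_file2 <- file.path(tempdir(), "rotated_loadings_semicolon.csv")

write.csv(rotated, out_file)     # "," as separator, "." as decimal mark
write.csv2(rotated, out_file2)   # ";" as separator, "," as decimal mark

# Read the first file back to check the export
exported <- read.csv(out_file, row.names = 1)
exported
```

`write.csv2` is the variant to prefer in locales where the comma is the decimal mark.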

Hair, Joseph F., William C. Black, Barry J. Babin, and Rolph E. Anderson. 2018. *Multivariate Data Analysis*. 8th ed. Cengage Learning.