In this tutorial we will be covering the basics behind Principal Component Analysis (PCA) and Factor Analysis (FA) in R. As mentioned in class, although the techniques are similar, PCA is a descriptive model, while FA is a structural model. Broadly speaking, PCA attempts to account for the entire variance of the dispersion matrix (be it correlation or covariance), while FA accounts only for the common variance.
We will first look at PCA, and see how it can be used in financial modeling. Thereafter we will look at FA in broad terms.
PCA allows the modeler a means of identifying structure in the covariance / correlation matrix to locate low-dimensional subspaces containing most of the variation in the data @ruppert.
There are several very useful functions that can be used to extract the PCs in R. We will do so in four ways:
By hand (doing the math - good for intuition)
Using R’s prcomp function
Briefly looking at the psych package, which allows rotation of the principal components
Considering the FactoMineR and factoextra packages for adding supplementary information
If you want to replicate the results found in @ruppert, you could use the following package: fEcofin:
See the documentation for fEcofin to get an idea of what the datafile contains.
This is awesome for you to explore different techniques with real (and easily available) data.
Let’s load a dataset a bit closer to home: the SA sector TRIs.
First we calculate the log returns. We do so to remove the scaling effect of price levels, and because log returns are additive over time and are approximately normally distributed when prices are log-normal, which suits statistical analysis.
# Sectors Considered:
data <-
data_Raw %>%
group_by(Tickers) %>%
arrange(date) %>%
mutate(Returns = ((log(Price)-log(lag(Price))))) %>%
filter(date > first(date)) %>%
ungroup() %>% mutate(Tickers = gsub(" Index", "", Tickers)) %>%
filter(date > ymd(20100101)) %>%
filter(!Tickers %in% c("J205LCTR", "J430TR", "J433TR", "JGROWTR", "JVALUTR", "TOP40TR", "JSHR40TR", "JCAP40TR"))
Now let’s plot the returns series in order to check for significant outliers first:
Clearly there are a few periods where incredibly high daily returns will bias our analysis (typically you’d interrogate this carefully of course, but for now let’s fix it).
Let’s fix this by constraining returns to be capped at +/- 25% for daily returns:
To achieve this, we use a vectorised conditional: ifelse.
data <-
data %>% mutate(Returns = ifelse(Returns > 0.25, 0.25,
ifelse(Returns < -0.25, -0.25, Returns)))
# Now plotting:
ggplot(data) + geom_line(aes(x = date,
y = Returns,
color = Tickers,
alpha = 0.9)) +
labs(title = "Sector Log Returns: SA") +
guides(alpha=FALSE) +
fmxdat::theme_fmx() +
fmxdat::fmx_cols()
Next, we also need to mean-center our data before doing PCA (although the returns do seem largely homogeneously distributed, so it is less of an issue here).
Let’s nonetheless demean (mean-center) our data:
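A minimal sketch of one way to do this, demeaning per ticker on the long data object from above:
data <-
  data %>%
  group_by(Tickers) %>%
  # Subtract each ticker's mean return (mean-centering):
  mutate(Returns = Returns - mean(Returns, na.rm = TRUE)) %>%
  ungroup()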
Remember that PCA is done via an eigen decomposition of a square matrix, which in financial econometrics is either a covariance matrix or a correlation matrix of asset returns (data in our example).
The eigen decomposition splits this square dispersion matrix into its corresponding eigenvectors (the loadings) and eigenvalues, where each eigenvalue is the variance of the principal component associated with the corresponding eigenvector.
For now, let’s get the PCs directly by hand.
Remember: any symmetric square matrix (such as a covariance matrix) can be decomposed into eigenvalues and eigenvectors.
For simplicity, let’s use the wide data frame format:
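Something like the following sketch gets us there (the prose below refers to the wide returns matrix as rtn; here it is called data_wide to match the factanal calls later in this tutorial, and the reshaping step is assumed):
# Wide matrix of returns: one column per sector, one row per date
# (assuming no missing values remain after the wrangling above)
data_wide <-
  data %>%
  select(date, Tickers, Returns) %>%
  tidyr::pivot_wider(names_from = Tickers, values_from = Returns) %>%
  select(-date)

covmat <- cov(data_wide)   # dispersion (covariance) matrix
eig    <- eigen(covmat)    # eigen decomposition
evec   <- eig$vectors      # eigenvectors (loadings)
eval   <- eig$values       # eigenvalues (variances of the PCs)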
Note here we used the built-in eigen function to extract the eigenvectors and eigenvalues of the covariance matrix of the returns, rtn.
But what did it do mathematically? Remember from the theory session:
For a symmetric square matrix A (here the covariance matrix $\Sigma$ of rtn), each eigenvector $e_i$ and its eigenvalue $\lambda_i$ satisfy:

$$A e_i = \lambda_i e_i$$

Now, notice that this implies that PCA can decompose the matrix $A$ into orthogonal statistical factors. Stacking the eigenvectors into the matrix $E$ and the eigenvalues into the diagonal matrix $\Lambda$:

$$E^{T} \Sigma E = \Lambda$$
We can check this using:
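One way such a check could look, using the covmat, evec and eval objects defined above:
# Verify A E = E Lambda (up to numerical error):
all.equal(covmat %*% evec, evec %*% diag(eval), check.attributes = FALSE)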
## [1] TRUE
This is very useful, as we can now see how much of the variation in the dispersion matrix is due to each orthogonal (independent) factor, or PC.
Note that as we have 10 unique series, we will have 10 unique Principal Components (or unique sources of variation).
To calculate the proportion of variance attributed to each orthogonal source of risk, we note that (for standardized data, where the dispersion matrix is a correlation matrix) the eigenvalues sum to the number of series:

$$\sum_{i=1}^{N}\lambda_i = N$$
So if we want to see what % of variation in A is due to each orthogonal PC, we can use:
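For example (this is the prop object used in the barplot further down; eval holds the eigenvalues from above):
prop <- eval / sum(eval)   # each eigenvalue's share of total variance
prop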
## [1] 0.6001351322 0.1395411775 0.0976162481 0.0667933925 0.0452434436
## [6] 0.0230445986 0.0148134184 0.0102298436 0.0020340428 0.0005487025
Let’s show a barplot of the % of variation due to each component:
prop <-
  tibble(Loadings = prop) %>%
  mutate(PC = paste0("PC_", row_number())) %>%
  # Fix the factor levels so the PCs plot in order (PC_1, PC_2, ...):
  mutate(PC = factor(PC, levels = PC))
g <-
prop %>%
ggplot() + geom_bar(aes(PC, Loadings), stat = "identity", fill = "steelblue") +
fmxdat::theme_fmx(axis.size.title = fmxdat::ggpts(38), axis.size = fmxdat::ggpts(35), title.size = fmxdat::ggpts(42),
CustomCaption = T) +
scale_y_continuous(breaks = scales::pretty_breaks(10), labels = scales::percent_format(accuracy = 1)) +
labs(x = "Principal Components", y = "Loadings", title = 'Eigenvalue proportions', caption = "Source: Fmxdat Package")
g
Note two things:
Eigenvalues are always ordered from largest to smallest
The eigenvalues sum to the number of columns (i.e. 100% of the variation) when the dispersion matrix is a correlation matrix (standardized data)
The plot thus says: nearly 60% of the variation in the sectors is explained by a single component. It does not tell us what this component or factor might be, BUT it tells us how this component is calculated linearly. Refer again to the eigen decomposition above: multiplying the (centered) returns matrix by the first eigenvector gives the first PC’s return series.
Note that the eigenvectors can now be interpreted as the loadings, or the weights, of each PC. Looking at the first two PCs’ loadings we can make an interesting observation:
names <- covmat %>% colnames
evecdf <- as_tibble(data.frame(evec))   # eigenvectors as a tibble, one column per PC
evecdf <-
evecdf %>% purrr::set_names( c(paste0("Eigenv_", 1:ncol(evecdf)) )) %>%
mutate(Names = names)
gg1 <-
evecdf %>%
ggplot() +
geom_bar(aes(Names, Eigenv_1), stat = "identity", fill = "steelblue") +
fmxdat::theme_fmx(axis.size.title = fmxdat::ggpts(38), axis.size = fmxdat::ggpts(35),
title.size = fmxdat::ggpts(42), CustomCaption = T) +
scale_y_continuous(breaks = scales::pretty_breaks(10), labels = scales::percent_format(accuracy = 1)) +
labs(x = "Principal Components", y = "Loadings: Eigenvector 1", title = 'Eigenvector proportions\n', caption = "Source: Fmxdat Package")
gg2 <-
evecdf %>%
ggplot() +
geom_bar(aes(Names, Eigenv_2), stat = "identity", fill = "steelblue") +
labs(x = "Principal Components", y = "Loadings: Eigenvector 2", title = 'Eigenvector proportions\n', caption = "Source: Fmxdat Package") +
fmxdat::theme_fmx(axis.size.title = fmxdat::ggpts(38), axis.size = fmxdat::ggpts(35),
title.size = fmxdat::ggpts(42), CustomCaption = T) +
scale_y_continuous(breaks = scales::pretty_breaks(10), labels = scales::percent_format(accuracy = 1)) +
geom_label(aes(Names, Eigenv_2, label = glue::glue("{Names}: {round(Eigenv_2, 2)*100}%")), alpha = 0.5, size = fmxdat::ggpts(12))
fmxdat::finplot(gg1, x.vert = T)
fmxdat::finplot(gg2, x.vert = T)
So we noted from the eigenvalues that a unique linear combination of all the sector returns can explain roughly 60% of the variation in all the return series. From the first barplot above, we note that this combination has almost equally weighted inputs for all the series. From the second barplot, we note that the second eigenvector (whose eigenvalue explains about 14% of the total variation) loads nearly equally on Telecoms and Resi, but in opposite directions.
This implies (loosely) that holding an equal long Telecoms, short Resi position explains a sizeable part of the total variation in these sectors’ overall returns.
While doing PCA by hand is informative, a more production-ready approach is to use the prcomp command in R. Below follows an example of this approach. Let’s ensure our data are mean-centered, and consider scaling, when doing the PCA (this matters because returns data are typically skewed, non-normal and do not have zero means; a log transformation, scaling and mean-centering are therefore standard practice).
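A hedged sketch of such a call (the object name pca is assumed; the str output further down suggests the series were already demeaned and that scale. was left at FALSE in the original):
pca <- prcomp(data_wide, center = TRUE, scale. = FALSE)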
prcomp does not itself rotate components; for rotation after fitting, the psych package (below) is useful. If you want to add supplementary information, I suggest using FactoMineR, discussed later.
Quick digression:
When R gives you an object (and not a data.frame) as a result, as e.g. given above - treat it as a little box, whose content you can extract as follows:
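For example, assuming the fitted object above is called pca:
str(pca)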
## List of 5
## $ sdev : num [1:10] 0.03225 0.01555 0.01301 0.01076 0.00886 ...
## $ rotation: num [1:10, 1:10] 0.368 0.287 0.308 0.472 0.301 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:10] "FINI15TR" "INDI25TR" "JALSHTR" "JMOTETR" ...
## .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:10] 1.07e-19 1.27e-19 6.36e-20 7.80e-20 1.01e-19 ...
## ..- attr(*, "names")= chr [1:10] "FINI15TR" "INDI25TR" "JALSHTR" "JMOTETR" ...
## $ scale : logi FALSE
## $ x : num [1:2644, 1:10] -0.004794 0.000138 0.000813 -0.02051 0.022231 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
Now we can use the $-operator to extract what we want from this object, e.g.:
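For example (comparing with the eval object from the by-hand decomposition earlier):
pca$sdev                     # component standard deviations
all.equal(pca$sdev^2, eval)  # squared sdevs equal the eigenvalues from before
pca$rotation                 # the loadings (eigenvectors)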
## [1] 0.0322539134 0.0155528177 0.0130082521 0.0107603065 0.0088559614
## [6] 0.0063203694 0.0050674041 0.0042110693 0.0018777507 0.0009752734
## [1] TRUE
## PC1 PC2 PC3 PC4 PC5 PC6
## FINI15TR 0.3675187 0.06891435 0.32968726 0.22551232 0.32151847 -0.71418519
## INDI25TR 0.2874321 0.08602660 -0.05176250 -0.03401930 0.54444671 0.54660137
## JALSHTR 0.3081486 0.19925627 -0.05405935 0.01681976 0.21379793 0.13940827
## JMOTETR 0.4722423 -0.72101309 -0.43031931 0.22424509 -0.14236473 -0.02296100
## JNCCGTR 0.3009887 -0.20497697 0.25424786 -0.87774653 -0.09680977 -0.06142506
## JSAPYTR 0.2410986 0.02022523 0.54481579 0.33719782 -0.44025822 0.32387720
## JSHRALTR 0.3146571 0.13261673 0.02973097 0.04798887 0.26518509 0.11691076
## JSMLCTR 0.1672359 0.02772952 0.18494091 0.01604428 -0.29244376 0.17524816
## MIDCAPTR 0.2458327 0.10441068 0.20058137 0.05243034 -0.11631452 -0.01903880
## RESI20TR 0.3580656 0.59716411 -0.51355979 -0.08926075 -0.40205894 -0.13338885
## PC7 PC8 PC9 PC10
## FINI15TR 0.013282841 -0.20495927 0.212116755 -0.0042916433
## INDI25TR 0.031602772 -0.05761235 0.546306920 -0.0904494000
## JALSHTR -0.001288173 -0.05725225 -0.426603398 0.7821265070
## JMOTETR 0.012689686 0.02312145 0.006718183 0.0114500238
## JNCCGTR 0.135980606 -0.02432682 -0.008972556 0.0010448556
## JSAPYTR 0.474742819 -0.07985421 0.021314244 -0.0005572392
## JSHRALTR -0.030190017 0.05122516 -0.654645029 -0.6034318831
## JSMLCTR -0.779373765 -0.45954750 0.034249823 -0.0280530714
## MIDCAPTR -0.354131335 0.85192252 0.131368596 0.0579017328
## RESI20TR 0.145075047 -0.06566351 0.164298395 -0.1080577924
You will use this type of information extraction often in R… so take note!
Further, dissecting the prcomp command’s output, we see it provides a simple means of plotting the individual components’ importance:
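For example, a scree plot of the component variances:
screeplot(pca, type = "lines")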
It also gives us a summary of the PCA results:
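For instance:
summary(pca)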
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 0.03225 0.01555 0.01301 0.01076 0.008856 0.00632
## Proportion of Variance 0.60014 0.13954 0.09762 0.06679 0.045240 0.02304
## Cumulative Proportion 0.60014 0.73968 0.83729 0.90409 0.949330 0.97237
## PC7 PC8 PC9 PC10
## Standard deviation 0.005067 0.004211 0.001878 0.0009753
## Proportion of Variance 0.014810 0.010230 0.002030 0.0005500
## Cumulative Proportion 0.987190 0.997420 0.999450 1.0000000
The table above shows the SD of each component, the proportion of variance explained and the cumulative variance proportion, respectively. Note that the first component explains roughly 60% of the variation.
In order to save the individual principal component indexes (for use downstream, e.g. in regressions) - do the following:
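A minimal sketch (the score series live in pca$x, one column per component):
# pca$x holds the principal component series (rows = dates, cols = PCs):
pc_scores <- tibble::as_tibble(pca$x)
# In practice you would bind back the date column kept aside when building
# data_wide, so the scores can be joined to other data downstream.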
For plotting PCA output, I’d rather use factoextra and FactoMineR personally - we will do a nice example at the end.
There are many other powerful ways to extract PCs (see also the caret package); the psych package is quite easy to use and powerful.
The psych package gives us a very nice, quick visual indication of how the return series relate to each other. Consider the pairs.panels function below: it plots bivariate scatterplots below the diagonal, each series’ histogram on the diagonal, and Pearson correlations above the diagonal.
This is a great visualization tool for analysing bivariate relationships using scatterplots, histograms and correlations.
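A minimal sketch, assuming data_wide is the wide returns matrix built earlier:
psych::pairs.panels(data_wide)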
Note that the package uses Mahalanobis distances to identify outliers (see ?outlier). Note: we could also use the package’s scrub function to remove outliers (but this is very basic).
Another very nice and quick way of looking at the data’s characteristics, available in psych, is the violin plot. It is essentially a combination of a boxplot and a density plot:
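A minimal sketch, again on the wide returns matrix:
psych::violinBy(data_wide)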
Note: This is a very useful figure to include in your reports on the behaviour of returns data as it shows similarity in distribution and density between the returns.
Finally, using psych to do PCA is very simple:
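A hedged sketch (the number of components and the rotation are illustrative choices; names() lists what the fitted object contains, as printed below):
pca_psych <- psych::principal(data_wide, nfactors = 3, rotate = "varimax")
names(pca_psych)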
## [1] "values" "rotation" "n.obs" "communality" "loadings"
## [6] "fit" "fit.off" "fn" "Call" "uniquenesses"
## [11] "complexity" "chi" "EPVAL" "R2" "objective"
## [16] "residual" "rms" "factors" "dof" "null.dof"
## [21] "null.model" "criteria" "STATISTIC" "PVAL" "weights"
## [26] "r.scores" "Vaccounted" "Structure" "scores"
See the package vignette for a detailed discussion of how to use psych in more depth.
To see a really interesting application of PCA to portfolio construction, see @meucci's paper and visit this site. This could give you some nice ideas for applications to SA.
We could also use PCA components in regressions. This is particularly useful if our parameter space is large, and dimension reduction would be of great benefit.
As you can imagine, having 50 independent variables in a regression might be suboptimal. If we can instead find a linear combination of the 50 variables that explains a large % of their variation, we might as well include that component itself. See this insightful blogpost on doing a regression using PCs as inputs.
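A hedged sketch of such a principal component regression (target_df and its column y are hypothetical; pc_scores is the scores object saved earlier):
# Hypothetical: target_df is a data.frame with a response column y,
# row-aligned with the PC scores
reg_df <- dplyr::bind_cols(target_df, pc_scores)
fit    <- lm(y ~ PC1 + PC2 + PC3, data = reg_df)
summary(fit)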
These are excellent packages for using supplementary variables and for plotting PCA results.
Let’s explore them in some detail, followed by an application example on currency data.
First, let’s use factoextra’s data on athletics outcomes to facilitate the interpretation:
Let’s first focus on identifying individuals (rows) and variables (columns):
# Important: if using rownames as identifier for individuals:
# Rows are treated as individuals
# Columns are treated as variables
pacman::p_load(FactoMineR, factoextra)
# The active set (assumed here: the first 23 athletes and the 10 event
# variables, matching the 23 x 10 structure in the str output below):
decathlon2.active <- factoextra::decathlon2[1:23, 1:10]
# So let's make a tbl_df that has rownames:
decathlon2.active <-
  decathlon2.active %>% tibble::rownames_to_column() %>% tibble::as_tibble() %>%
  column_to_rownames("rowname")
# Added rowname column, then add rownames back... :/
# PCA:
res.pca <- PCA(decathlon2.active, graph = FALSE)
# View the str:
str(res.pca)
## List of 5
## $ eig : num [1:10, 1:3] 4.124 1.839 1.239 0.819 0.702 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:10] "comp 1" "comp 2" "comp 3" "comp 4" ...
## .. ..$ : chr [1:3] "eigenvalue" "percentage of variance" "cumulative percentage of variance"
## $ var :List of 4
## ..$ coord : num [1:10, 1:5] -0.851 0.794 0.734 0.61 -0.702 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:10] "X100m" "Long.jump" "Shot.put" "High.jump" ...
## .. .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## ..$ cor : num [1:10, 1:5] -0.851 0.794 0.734 0.61 -0.702 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:10] "X100m" "Long.jump" "Shot.put" "High.jump" ...
## .. .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## ..$ cos2 : num [1:10, 1:5] 0.724 0.631 0.539 0.372 0.492 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:10] "X100m" "Long.jump" "Shot.put" "High.jump" ...
## .. .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## ..$ contrib: num [1:10, 1:5] 17.54 15.29 13.06 9.02 11.94 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:10] "X100m" "Long.jump" "Shot.put" "High.jump" ...
## .. .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## $ ind :List of 4
## ..$ coord : num [1:23, 1:5] 0.196 0.808 -1.359 -0.889 -0.108 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:23] "SEBRLE" "CLAY" "BERNARD" "YURKOV" ...
## .. .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## ..$ cos2 : num [1:23, 1:5] 0.00753 0.0487 0.1972 0.09611 0.00157 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:23] "SEBRLE" "CLAY" "BERNARD" "YURKOV" ...
## .. .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## ..$ contrib: num [1:23, 1:5] 0.0403 0.6881 1.9474 0.8331 0.0123 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:23] "SEBRLE" "CLAY" "BERNARD" "YURKOV" ...
## .. .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## ..$ dist : Named num [1:23] 2.25 3.66 3.06 2.87 2.72 ...
## .. ..- attr(*, "names")= chr [1:23] "SEBRLE" "CLAY" "BERNARD" "YURKOV" ...
## $ svd :List of 3
## ..$ vs: num [1:10] 2.031 1.356 1.113 0.905 0.838 ...
## ..$ U : num [1:23, 1:5] 0.0963 0.3978 -0.6693 -0.4377 -0.0532 ...
## ..$ V : num [1:10, 1:5] -0.419 0.391 0.361 0.3 -0.345 ...
## $ call:List of 9
## ..$ row.w : num [1:23] 0.0435 0.0435 0.0435 0.0435 0.0435 ...
## ..$ col.w : num [1:10] 1 1 1 1 1 1 1 1 1 1
## ..$ scale.unit: logi TRUE
## ..$ ncp : num 5
## ..$ centre : num [1:10] 11 7.35 14.62 2.01 49.43 ...
## ..$ ecart.type: num [1:10] 0.2945 0.3065 0.8261 0.0946 0.9849 ...
## ..$ X :'data.frame': 23 obs. of 10 variables:
## .. ..$ X100m : num [1:23] 11 10.8 11 11.3 11.1 ...
## .. ..$ Long.jump : num [1:23] 7.58 7.4 7.23 7.09 7.3 7.31 6.81 7.56 6.97 7.27 ...
## .. ..$ Shot.put : num [1:23] 14.8 14.3 14.2 15.2 13.5 ...
## .. ..$ High.jump : num [1:23] 2.07 1.86 1.92 2.1 2.01 2.13 1.95 1.86 1.95 1.98 ...
## .. ..$ X400m : num [1:23] 49.8 49.4 48.9 50.4 48.6 ...
## .. ..$ X110m.hurdle: num [1:23] 14.7 14.1 15 15.3 14.2 ...
## .. ..$ Discus : num [1:23] 43.8 50.7 40.9 46.3 45.7 ...
## .. ..$ Pole.vault : num [1:23] 5.02 4.92 5.32 4.72 4.42 4.42 4.92 4.82 4.72 4.62 ...
## .. ..$ Javeline : num [1:23] 63.2 60.1 62.8 63.4 55.4 ...
## .. ..$ X1500m : num [1:23] 292 302 280 276 268 ...
## ..$ row.w.init: num [1:23] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ call : language PCA(X = decathlon2.active, graph = FALSE)
## - attr(*, "class")= chr [1:2] "PCA" "list"
# The latter allows mapping in ggplot space
# Control variable colors using their contribution
# Possible values for the argument col.var are :
# "cos2", "contrib", "coord", "x", "y"
fviz_pca_var(res.pca, col.var=c("cos2", "contrib", "coord", "x", "y")[1]) +
scale_color_gradient2(low="white", mid="blue",
high="red", midpoint=0.5) + theme_minimal()
The graph of variables shows the relationships between all variables:
Positively correlated variables are grouped together.
Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants).
The distance between a variable and the origin measures the quality of that variable on the factor map. Variables that are far from the origin are well represented on the factor map.
The cos2 of variables are calculated as the squared coordinates : var.cos2 = var.coord * var.coord
The cos2 values are used to estimate the quality of the representation
The closer a variable is to the circle of correlations, the better its representation on the factor map (and the more important it is to interpret these components)
To establish the most important variables in the data, use their contributions.
The contribution of a variable to, e.g., PC1 is given as: (variable.cos2 * 100) / (total cos2 of the component).
E.g., to find the variables that contribute most to each PC:
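The contributions are stored in the fitted PCA object:
res.pca$var$contrib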
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## X100m 1.754429e+01 1.7505098 7.3386590 0.13755240 5.389252
## Long.jump 1.529317e+01 4.2904162 2.9300944 1.62485936 7.748815
## Shot.put 1.306014e+01 0.3967224 21.6204325 2.01407269 8.824401
## High.jump 9.024811e+00 11.7715838 8.7928883 2.54987951 23.115504
## X400m 1.193554e+01 4.5799296 6.4876363 22.65090599 1.539012
## X110m.hurdle 1.415754e+01 0.0332933 16.2612611 0.03483735 7.166193
## Discus 1.339309e+01 0.1341398 2.5147385 19.04132022 23.755756
## Pole.vault 1.144592e+00 35.4618611 0.7139512 14.02307063 7.005084
## Javeline 4.446377e+00 8.1086683 29.4531777 13.42963254 5.577615
## X1500m 4.438531e-04 33.4728757 3.8871610 24.49386930 9.878367
Visualizing this:
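A minimal sketch using factoextra’s helper (axes = 1:2 gives the combined contribution to the first two components; top is illustrative):
fviz_contrib(res.pca, choice = "var", axes = 1:2, top = 10)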
It might be that a variable loads very highly on one component but hardly at all on the other (e.g. X1500m, which contributes to PC2 but not PC1), so a variable like X100m that loads on both would be more important for the first two components combined.
The contribution of a variable to PC1 and PC2 combined is calculated as (weighting each component’s contribution by its eigenvalue):

$$C_1 \cdot Eig_1 + C_2 \cdot Eig_2$$
The contribution of a variable to a given principal component is (in percentage) : (var.cos2 * 100) / (total cos2 of the component)
How much of this component is due to each variable, in essence.
Let’s plot the loadings onto the first two components and colour them by total contribution:
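A hedged sketch of such a plot (the gradient colours are illustrative):
fviz_pca_var(res.pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")) +
  theme_minimal()
The table that follows describes the first dimension by the variables most correlated with it (the kind of output FactoMineR’s dimdesc() returns).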
## $quanti
## correlation p.value
## Long.jump 0.7941806 6.059893e-06
## Discus 0.7432090 4.842563e-05
## Shot.put 0.7339127 6.723102e-05
## High.jump 0.6100840 1.993677e-03
## Javeline 0.4282266 4.149192e-02
## X400m -0.7016034 1.910387e-04
## X110m.hurdle -0.7641252 2.195812e-05
## X100m -0.8506257 2.727129e-07
##
## attr(,"class")
## [1] "condes" "list"
Contribution of variables and individuals together:
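A hedged sketch of a biplot that overlays variables and individuals (colouring the variables by contribution is illustrative):
fviz_pca_biplot(res.pca, col.var = "contrib") + theme_minimal()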
We can also focus solely on the individuals, and how they load onto the different components:
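A minimal sketch (colouring by cos2 is illustrative):
fviz_pca_ind(res.pca, col.ind = "cos2") + theme_minimal()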
We can also view the individual contribution of the athletes on the first two components:
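For example, using factoextra’s contribution helper on the individuals:
fviz_contrib(res.pca, choice = "ind", axes = 1:2, top = 10)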
Supplementary variables and individuals are not used for the determination of the principal components. Rather, their coordinates are predicted using only the information provided by the performed principal component analysis on active variables/individuals.
Specify supplementary variables using quanti.sup (quantitative) or quali.sup (qualitative), and supplementary individuals using ind.sup.
# Let's add back the competition column and use it as a qualitative supplementary variable:
decathlon2 <-
factoextra::decathlon2 %>% tibble::rownames_to_column() %>% tibble::as_tibble() %>%
# Added rowname column
column_to_rownames("rowname")
res.pca <-
PCA(decathlon2, quanti.sup = 11:12, quali.sup = 13, graph=FALSE)
Let’s now map the quantitative supplementary variables in blue (remember: these are not used to compute the PCs; their coordinates simply show how they relate to the components):
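A minimal sketch (factoextra draws supplementary quantitative variables as blue dashed arrows by default):
fviz_pca_var(res.pca) + theme_minimal()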
Let’s now expand our analysis to also include qualitative supplementary variables into the mix (using colours):
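A hedged sketch, colouring individuals by the qualitative supplementary variable (column 13, the Competition column) and adding concentration ellipses:
fviz_pca_ind(res.pca, habillage = 13, addEllipses = TRUE) + theme_minimal()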
The currency PCA example will be loaded in the next practical: Practical 4 Application.
This should give you a nice hands-on example of how to do a deeper dimension-reduction analysis, interpreting the components by virtue of their co-dependence with factor indexes.
As mentioned in class, factor analysis seeks to establish a structural model that explains what is driving return covariance in the data. We distinguish between two broad classes of FA: Exploratory and Confirmatory.
Broadly speaking, Factor Analysis models our returns series as:

$$\text{rtn} = \Lambda f + e$$

where $\Lambda$ is the matrix of factor loadings, $f$ the common factors and $e$ the idiosyncratic (unique) errors.
After fitting the FA, the factor loadings indicate how strongly each factor influences each of the measured variables. In order to label the factors in the model, one should examine the factor pattern to see which items load highly on which factors and then determine what those items have in common. It is then up to the researcher to make assumptions as to what these factors mean.
Naturally, the first question is how many factors to include? There are many ways this can be decided (e.g. Kaiser’s eigenvalue-greater-than-unity rule, or using the scree plot to determine where it drops off significantly). Good packages for doing this include nFactors, GPArotation and FactoMineR.
To do a simple FA, let’s use R’s built-in factanal function. By default, the factanal function fits a common factor model using the maximum likelihood approach.
Although the screeplot probably suggests we should only include one factor, let’s use two:
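The call that produces the output below (the assignment name is incidental):
fa_2 <- factanal(x = data_wide, factors = 2, rotation = "varimax")
fa_2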
##
## Call:
## factanal(x = data_wide, factors = 2, rotation = "varimax")
##
## Uniquenesses:
## FINI15TR INDI25TR JALSHTR JMOTETR JNCCGTR JSAPYTR JSHRALTR JSMLCTR
## 0.186 0.177 0.005 0.632 0.639 0.402 0.017 0.511
## MIDCAPTR RESI20TR
## 0.235 0.297
##
## Loadings:
## Factor1 Factor2
## FINI15TR 0.780 0.453
## INDI25TR 0.482 0.769
## JALSHTR 0.539 0.840
## JMOTETR 0.497 0.347
## JNCCGTR 0.528 0.287
## JSAPYTR 0.762 0.129
## JSHRALTR 0.651 0.748
## JSMLCTR 0.638 0.285
## MIDCAPTR 0.754 0.442
## RESI20TR 0.153 0.824
##
## Factor1 Factor2
## SS loadings 3.661 3.238
## Proportion Var 0.366 0.324
## Cumulative Var 0.366 0.690
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 4444.33 on 26 degrees of freedom.
## The p-value is 0
From the factanal output, we can see that the χ2 statistic at the bottom indicates the hypothesis of perfect model fit is rejected (thus it is not a great model for explaining variation in rtn). Let’s refit using a single factor (in line with the screeplot) and, in the print-output, do the following to make interpretation easier:
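A hedged sketch of the refit and a cleaner print (the digits, cutoff and sort values are assumed):
fa_1 <- factanal(x = data_wide, factors = 1, rotation = "varimax")
print(fa_1, digits = 2, cutoff = 0.3, sort = TRUE)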
##
## Call:
## factanal(x = data_wide, factors = 1, rotation = "varimax")
##
## Uniquenesses:
## FINI15TR INDI25TR JALSHTR JMOTETR JNCCGTR JSAPYTR JSHRALTR JSMLCTR
## 0.28 0.18 0.04 0.65 0.69 0.67 0.00 0.63
## MIDCAPTR RESI20TR
## 0.33 0.51
##
## Loadings:
## [1] 0.85 0.91 0.98 0.59 0.56 0.58 1.00 0.61 0.82 0.70
##
## Factor1
## SS loadings 6.03
## Proportion Var 0.60
##
## Test of the hypothesis that 1 factor is sufficient.
## The chi square statistic is 7578.74 on 35 degrees of freedom.
## The p-value is 0
That’s easier to interpret, but our model is still not great. This possibly follows from the very high degree of comovement between the sectors, and the fact that the first PC explains most of the variance. For a more intuitive model, we would then be able (after rotating and making the effects clearer) to assign hypothesized interpretations to the factors based on which variables load onto them. E.g. (hypothetically speaking), we could assign the first factor as a local effect and the second as a global effect (although that would probably be unwise here, but you see the point).
Let’s do a more intuitive example, using the fEcofin package to get daily prices of the Dow Jones 30 companies:
if(!require("fEcofin")) install.packages("fEcofin", repos="http://R-Forge.R-project.org")
library(fEcofin)
pacman::p_load("tidyr", "dplyr")
DJ = DowJones30
# Let's dplyr and fix this dataset:
DJ <-
  DJ %>%
  mutate(Date = as.Date(X.Y..m..d)) %>% select(-X.Y..m..d) %>%
  as_tibble()   # tbl_df() is deprecated; as_tibble() does the same job
# Let's create log returns for each column:
DJ <-
DJ %>% mutate_at(vars(-Date), ~log(.)-log(lag(.))) %>%
filter(Date > first(Date))
Let’s check the distributions and scree plots first:
Note again from the screeplot how the first PC explains most of the variation. Let’s now fit an EFA with three factors on the DJ30 companies:
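The call echoed in the output below, with a hedged print (the cutoff and sorting are assumed in order to reproduce the layout shown):
fa_dj <- factanal(x = DJ %>% select(-Date), factors = 3, rotation = "varimax")
print(fa_dj, digits = 2, cutoff = 0.2, sort = TRUE)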
##
## Call:
## factanal(x = DJ %>% select(-Date), factors = 3, rotation = "varimax")
##
## Uniquenesses:
## AA AXP T BA CAT C KO DD EK XOM GE GM HWP HD HON INTC
## 0.69 0.64 0.86 0.83 0.67 0.60 0.65 0.60 0.88 0.85 0.52 0.76 0.67 0.68 0.74 0.56
## IBM IP JPM JNJ MCD MRK MSFT MMM MO PG SBC UTX WMT DIS
## 0.73 0.62 0.65 0.56 0.77 0.60 0.58 0.69 0.86 0.68 0.83 0.70 0.68 0.82
##
## Loadings:
## Factor1 Factor2 Factor3
## KO 0.55
## JNJ 0.65
## MRK 0.61
## PG 0.54
## AA 0.54
## CAT 0.53
## DD 0.24 0.58
## IP 0.60
## HWP 0.55
## INTC 0.66
## MSFT 0.62
## AXP 0.38 0.31 0.35
## T 0.28
## BA 0.24 0.29
## C 0.38 0.32 0.39
## EK 0.25
## XOM 0.28 0.25
## GE 0.47 0.35 0.36
## GM 0.36 0.30
## HD 0.34 0.26 0.37
## HON 0.44
## IBM 0.48
## JPM 0.31 0.29 0.41
## MCD 0.40
## MMM 0.26 0.49
## MO 0.31
## SBC 0.35
## UTX 0.25 0.46
## WMT 0.43 0.23 0.29
## DIS 0.25 0.29
##
## Factor1 Factor2 Factor3
## SS loadings 3.27 3.04 2.70
## Proportion Var 0.11 0.10 0.09
## Cumulative Var 0.11 0.21 0.30
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 1906.86 on 348 degrees of freedom.
## The p-value is 7.37e-213
At the bottom of the output you can see that the p-value of the χ2 fit statistic is very small. This indicates that the hypothesis of perfect model fit is rejected. We could add more factors (adjust the code above and check for yourself), but this might not lead to a satisfactory model at all. So let’s stick with three factors for now.
We can now interpret the top results by looking at which stocks load together. If, e.g., energy stocks all load onto the first common factor, we might conclude that the drivers of energy stock prices (factors that influence energy stock values, such as oil prices) have a large effect on overall Dow Jones index movements, explaining roughly 11% of the variation. And so our analysis can continue.
Rotation can be applied easily by setting rotation at the top to one of several inputs. The analysis could then also be repeated on an annual basis, and the consistency of loadings compared across different periods (e.g. pre-crisis, during the crisis and post-crisis).
The psych package also offers more functionality for those interested in delving deeper into EFA; see its fa() function and the corresponding inputs.