• Introduction
  • Principal Component Analysis
    • Doing PCA by hand
    • PCA using prcomp
    • PCA using psych package
      • Pairs.Panel
      • Outlier detection
      • PCA Regression
    • FactoMineR and factoextra
    • Interpret
      • Cos2 : quality of representation for variables on the factor map
      • Contributions of the variables to the principal components
    • Dimension Description
    • Adding supplementary variables
      • PCA Application
  • Factor Analysis
    • Exploratory FA
      • Dow Jones EFA Example

Introduction

In this tutorial we will be covering the basics behind Principal Component Analysis (PCA) and Factor Analysis (FA) in R. As mentioned in class, although the techniques are similar, PCA is a descriptive model, while FA is a structural model. Broadly speaking, PCA attempts to account for the entire variance of the dispersion matrix (be it correlation or covariance), while FA accounts only for the common variance.

We will first look at PCA, and see how it can be used in financial modeling. Thereafter we will look at FA in broad terms.

Principal Component Analysis

PCA allows the modeler a means of identifying structure in the covariance / correlation matrix to locate low-dimensional subspaces containing most of the variation in the data @ruppert.

There are several very useful functions that can be used to extract the PCs in R. We will do so in four ways:

  • By hand (doing the math - good for intuition)

  • Using R’s prcomp function

  • Briefly look at the psych package, which allows rotation of the principal components

  • Consider the FactoMineR and factoextra packages for adding supplementary information

If you want to replicate the results found in @ruppert, you could use the following package: fEcofin:

# Need to explicitly specify the package source if you have a newer R version.
pacman::p_load(fEcofin)

See the documentation for fEcofin to get an idea of what the datafile contains.

This is awesome for you to explore different techniques with real (and easily available) data.

Let’s load a dataset a bit closer to home: the SA sector TRIs.

pacman::p_load("tidyverse", "lubridate")

# Always ensure fmxdat is updated to the latest version please:
# pacman::p_install_gh("Nicktz/fmxdat")

data_Raw <- fmxdat::SA_Indexes

First we calculate the log returns. We do so to remove the scaling effect of prices and because log returns are better suited to statistical analysis.

# Sectors Considered:
data <-  
  data_Raw %>%
  group_by(Tickers) %>% 
  arrange(date) %>% 
  mutate(Returns = ((log(Price)-log(lag(Price))))) %>% 
  filter(date > first(date)) %>% 
  ungroup() %>% mutate(Tickers = gsub(" Index", "", Tickers)) %>% 
  
  filter(date > ymd(20100101)) %>% 
  filter(!Tickers %in% c("J205LCTR", "J430TR", "J433TR", "JGROWTR", "JVALUTR", "TOP40TR", "JSHR40TR", "JCAP40TR"))

Now let’s plot the returns series in order to check for significant outliers first:

ggplot(data) + geom_line(aes(x = date, 
                             y = Returns, 
                             color = Tickers, 
                             alpha = 0.9)) +
  ggtitle("Sector Log Returns: SA") +
  guides(alpha=FALSE)

Clearly there are a few periods where incredibly high daily returns will bias our analysis (typically you’d interrogate this carefully of course, but for now let’s fix it).

Let’s fix this by constraining returns to be capped at +/- 25% for daily returns:

To achieve this, we use the vectorised conditional function: ifelse.

data <- 
  data %>% mutate(Returns = ifelse(Returns > 0.25, 0.25,
                                   ifelse(Returns < -0.25, -0.25, Returns)))

# Now plotting:
ggplot(data) + geom_line(aes(x = date, 
                             y = Returns, 
                             color = Tickers, 
                             alpha = 0.9)) +
  labs(title = "Sector Log Returns: SA") +
  guides(alpha=FALSE) + 
  fmxdat::theme_fmx() + 
  fmxdat::fmx_cols()

Next we also need to mean-center our data to do PCA (although the returns do seem largely homogeneously distributed, so it is less of an issue here).

Let’s nonetheless mean-center our data:

data <-
  data %>% 
  group_by(Tickers) %>% 
  mutate(Returns = Returns - mean(Returns)) %>% 
  ungroup()

Doing PCA by hand

Remember that PCA is done via an eigen decomposition on a square matrix, which in financial econometrics is either a covariance matrix or a correlation matrix of asset returns (data in our example).

The eigen decomposition splits this square matrix of dispersion into its corresponding eigenvectors (loadings) and eigenvalues. Each eigenvalue is the variance of the factor associated with the corresponding eigenvector.

For now, let’s compute the PCs directly by hand.

Remember - any symmetric square matrix (such as a covariance matrix) can be decomposed into an eigenvalue and eigenvector combination.

For simplicity, let’s use the wide data frame format:

data_byhand <- 
  data %>% select(date, Tickers, Returns) %>% spread(Tickers, Returns)
covmat <- 
  cov( data_byhand %>% select(-date)) # Covariance Matrix
#eigenvectors:
evec <- eigen(covmat, symmetric=TRUE)$vector
# eigenvalues:
eval <- eigen(covmat, symmetric=TRUE)$values

Note here we used the built-in eigen function to extract the eigenvectors and eigenvalues of the covariance matrix of the returns.

But what did it do mathematically? Remember from the theory session:

For a symmetric square matrix $A$ (here the covariance matrix of the returns):

$$AE = \lambda E$$

with eigenvector, $E$, and eigenvalue $\lambda$.

Now, notice that this implies that the PCA can decompose this matrix $A$ into orthogonal statistical factors:

$$E^{T}\Sigma E = \Lambda$$

where $E$ is the matrix of eigenvectors, $\Lambda$ the diagonal matrix of eigenvalues, and $\Sigma$ the covariance matrix of the demeaned returns. Note, eigen does not demean the data - we have to do so ourselves (as done earlier).

We can check this using:

lambda = diag(t(evec) %*% covmat %*% evec)
#Which should be equal to eval:
all.equal(lambda, eval)
## [1] TRUE

This is very useful, as we can now see how much of the variation in the dispersion matrix is due to each orthogonal (independent) factor, e.g. PC1.

Note that as we have 10 unique series, we will have 10 unique Principal Components (or unique sources of variation).

To calculate the proportion of variance ascribed to each orthogonal source of risk, we note that the eigenvalues sum to the total variance of the data (when working with the correlation matrix, this equals the number of series $N$):

$$\sum_{i=1}^{N} \lambda_i = \operatorname{tr}(\Sigma)$$

So if we want to see what % of variation in A is due to each orthogonal PC, we can use:

prop = eval / sum(eval)
print(prop)
##  [1] 0.6001351322 0.1395411775 0.0976162481 0.0667933925 0.0452434436
##  [6] 0.0230445986 0.0148134184 0.0102298436 0.0020340428 0.0005487025

Let’s show a barplot of the % variation due to each component:

prop <- 
  tibble(Loadings = prop) %>% 
  mutate(PC = paste0("PC_", row_number()))

prop <- prop %>% mutate(PC = factor(PC, levels = PC))

g <- 
prop %>% 
  
  ggplot() + geom_bar(aes(PC, Loadings), stat = "identity", fill = "steelblue") + 
  
  fmxdat::theme_fmx(axis.size.title = fmxdat::ggpts(38), axis.size = fmxdat::ggpts(35), title.size = fmxdat::ggpts(42), 
                    CustomCaption = T) + 
  
  scale_y_continuous(breaks = scales::pretty_breaks(10), labels = scales::percent_format(accuracy = 1)) + 
  
  labs(x = "Principal Components", y = "Loadings", title = 'Eigenvalue proportions', caption = "Source: Fmxdat Package")

g

Note two things:

  • Eigenvalues are always ordered by largest to smallest

  • The eigenvalues sum to the total variance (equal to the number of columns when the correlation matrix is used), so that 100% of variation is accounted for

The plot thus says: nearly 60% of variation in the sectors is explained by a single component. It does not give insight into what this component or factor might be, BUT it tells us how this component is calculated linearly. Refer again to the decomposition above: this implies that if we multiply the (mean-centered) returns matrix by the first column of the eigenvector matrix, we get the first PC.
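
To make this concrete, here is a minimal sketch (reusing the objects created above) that computes the first PC’s scores by hand and verifies that their variance equals the first eigenvalue:

# Multiply the (already mean-centered) wide returns matrix by the first eigenvector:
X <- as.matrix(data_byhand %>% select(-date))
PC1_scores <- X %*% evec[, 1]
# The variance of these scores should equal the first eigenvalue:
all.equal(as.numeric(var(PC1_scores)), eval[1])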

Note that the eigenvectors can now be interpreted as the loadings or the weights of each PC. Looking at the first two PCs’ loadings we can make an interesting observation:

names <- covmat %>% colnames

evecdf <- as_tibble(data.frame(evec))

evecdf <- 
  evecdf %>% purrr::set_names( c(paste0("Eigenv_", 1:ncol(evecdf)) )) %>% 
  mutate(Names = names)

gg1 <- 
evecdf %>% 
  ggplot() + 
  geom_bar(aes(Names, Eigenv_1), stat = "identity", fill = "steelblue") + 
  fmxdat::theme_fmx(axis.size.title = fmxdat::ggpts(38), axis.size = fmxdat::ggpts(35), 
                    title.size = fmxdat::ggpts(42), CustomCaption = T) + 
  
  scale_y_continuous(breaks = scales::pretty_breaks(10), labels = scales::percent_format(accuracy = 1)) + 
  
  labs(x = "Principal Components", y = "Loadings: Eigenvector 1", title = 'Eigenvector proportions\n', caption = "Source: Fmxdat Package")

gg2 <- 
  
evecdf %>% 
  ggplot() + 
  geom_bar(aes(Names, Eigenv_2), stat = "identity", fill = "steelblue") + 
  
    labs(x = "Principal Components", y = "Loadings: Eigenvector 2", title = 'Eigenvector proportions\n', caption = "Source: Fmxdat Package") + 

  fmxdat::theme_fmx(axis.size.title = fmxdat::ggpts(38), axis.size = fmxdat::ggpts(35), 
                    title.size = fmxdat::ggpts(42), CustomCaption = T) + 
  
  scale_y_continuous(breaks = scales::pretty_breaks(10), labels = scales::percent_format(accuracy = 1)) + 
  
  geom_label(aes(Names, Eigenv_2, label = glue::glue("{Names}: {round(Eigenv_2, 2)*100}%")), alpha = 0.5, size = fmxdat::ggpts(12))
  
fmxdat::finplot(gg1, x.vert = T)

fmxdat::finplot(gg2, x.vert = T)

So we noted from the eigenvalues that a unique linear combination of all the sector returns can explain roughly 60% of the variation in all the returns series. From the first barplot above, we note that this unique combination has an almost equal weighted input for all the series. From the second barplot, we note that the second eigenvalue (which explains about 14% of the total variation) loads nearly equally on Telecoms and Resi (but in opposite direction).

This implies (loosely) that holding an equal long Telecoms, short Resi position explains a sizeable part of the total variation in these sectors’ overall returns.
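
We can loosely check this interpretation with a quick sketch. An assumption here: JMOTETR and RESI20TR are the Telecoms and Resources series, and recall that the sign of a PC is arbitrary:

# Correlate the second PC's scores with an equal long-short portfolio:
PC2_scores <- as.matrix(data_byhand %>% select(-date)) %*% evec[, 2]
long_short <- data_byhand$JMOTETR - data_byhand$RESI20TR
cor(PC2_scores, long_short) # Expect a large magnitude; the sign may be negative.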

PCA using prcomp

While doing PCA by hand is informative, a more production-ready approach is to use the prcomp command in R. Below follows an example of this approach. Let’s ensure our data are mean-centered, as well as scaled, when doing the PCA (this is important, as returns data are typically skewed, non-normal and non-zero-meaned; thus a log transformation, scaling and mean-centering is standard practice).

The psych package, discussed below, is useful if you want to rotate components after fitting. If you want to add supplementary information, I suggest using FactoMineR, discussed later.

# prcomp requires wide, numeric data:
data_wide <- data %>% select(date, Tickers, Returns) %>% spread(Tickers, Returns) %>% select(-date)

pca <- 
  prcomp(data_wide)
  # We have already centered our data, but we could also use these built-in measures:
  # prcomp(data_wide, center = TRUE, scale. = TRUE)

Quick digression:

When R gives you an object (and not a data.frame) as a result, as e.g. given above - treat it as a little box, whose content you can extract as follows:

  • Find the names:
str(pca)
## List of 5
##  $ sdev    : num [1:10] 0.03225 0.01555 0.01301 0.01076 0.00886 ...
##  $ rotation: num [1:10, 1:10] 0.368 0.287 0.308 0.472 0.301 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:10] "FINI15TR" "INDI25TR" "JALSHTR" "JMOTETR" ...
##   .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:10] 1.07e-19 1.27e-19 6.36e-20 7.80e-20 1.01e-19 ...
##   ..- attr(*, "names")= chr [1:10] "FINI15TR" "INDI25TR" "JALSHTR" "JMOTETR" ...
##  $ scale   : logi FALSE
##  $ x       : num [1:2644, 1:10] -0.004794 0.000138 0.000813 -0.02051 0.022231 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

Now we can use the $-operator to extract what we want from this object, e.g.:

pca$sdev
##  [1] 0.0322539134 0.0155528177 0.0130082521 0.0107603065 0.0088559614
##  [6] 0.0063203694 0.0050674041 0.0042110693 0.0018777507 0.0009752734
# From the documentation, note that sdev is the square root of the eigenvalues.
# To get these (and compare to the by hand calcs) - we need to square it:

eigenvalues <- pca$sdev^2

# With these now equivalent to our earlier by hand calcs:

all.equal(eigenvalues, eval)
## [1] TRUE
pca$rotation
##                PC1         PC2         PC3         PC4         PC5         PC6
## FINI15TR 0.3675187  0.06891435  0.32968726  0.22551232  0.32151847 -0.71418519
## INDI25TR 0.2874321  0.08602660 -0.05176250 -0.03401930  0.54444671  0.54660137
## JALSHTR  0.3081486  0.19925627 -0.05405935  0.01681976  0.21379793  0.13940827
## JMOTETR  0.4722423 -0.72101309 -0.43031931  0.22424509 -0.14236473 -0.02296100
## JNCCGTR  0.3009887 -0.20497697  0.25424786 -0.87774653 -0.09680977 -0.06142506
## JSAPYTR  0.2410986  0.02022523  0.54481579  0.33719782 -0.44025822  0.32387720
## JSHRALTR 0.3146571  0.13261673  0.02973097  0.04798887  0.26518509  0.11691076
## JSMLCTR  0.1672359  0.02772952  0.18494091  0.01604428 -0.29244376  0.17524816
## MIDCAPTR 0.2458327  0.10441068  0.20058137  0.05243034 -0.11631452 -0.01903880
## RESI20TR 0.3580656  0.59716411 -0.51355979 -0.08926075 -0.40205894 -0.13338885
##                   PC7         PC8          PC9          PC10
## FINI15TR  0.013282841 -0.20495927  0.212116755 -0.0042916433
## INDI25TR  0.031602772 -0.05761235  0.546306920 -0.0904494000
## JALSHTR  -0.001288173 -0.05725225 -0.426603398  0.7821265070
## JMOTETR   0.012689686  0.02312145  0.006718183  0.0114500238
## JNCCGTR   0.135980606 -0.02432682 -0.008972556  0.0010448556
## JSAPYTR   0.474742819 -0.07985421  0.021314244 -0.0005572392
## JSHRALTR -0.030190017  0.05122516 -0.654645029 -0.6034318831
## JSMLCTR  -0.779373765 -0.45954750  0.034249823 -0.0280530714
## MIDCAPTR -0.354131335  0.85192252  0.131368596  0.0579017328
## RESI20TR  0.145075047 -0.06566351  0.164298395 -0.1080577924

You will use this type of information extraction often in R… so take note!

Further, dissecting the prcomp command’s output, we see it provides a simple means of plotting the individual components’ importance:

plot(pca, type = "l")

It also gives us a summary of the PCA results:

summary(pca)
## Importance of components:
##                            PC1     PC2     PC3     PC4      PC5     PC6
## Standard deviation     0.03225 0.01555 0.01301 0.01076 0.008856 0.00632
## Proportion of Variance 0.60014 0.13954 0.09762 0.06679 0.045240 0.02304
## Cumulative Proportion  0.60014 0.73968 0.83729 0.90409 0.949330 0.97237
##                             PC7      PC8      PC9      PC10
## Standard deviation     0.005067 0.004211 0.001878 0.0009753
## Proportion of Variance 0.014810 0.010230 0.002030 0.0005500
## Cumulative Proportion  0.987190 0.997420 0.999450 1.0000000

The table above shows the SD of each component, the proportion of variance explained and the cumulative variance proportion, respectively. Note that component one explains 60% of variation.

In order to save the individual principal component indexes (for use downstream, e.g. in regressions) - do the following:

PC_Indexes <- predict(pca) 
# This is simply multiplying the eigenvectors with the rescaled and adjusted X variables to give us the principal components...

For plotting PCA output, I’d rather use factoextra and FactoMineR personally - we will do a nice example at the end.

PCA using psych package

There are many other powerful ways to extract PCAs (see also the caret package), the psych package is quite easy to use and powerful.

pacman::p_load("psych")

psych allows us a very nice quick visual indication of the relation of the returns to each other. Consider the function pairs.panels below. It plots:

  • bivariate scatterplots below the diagonal
  • each series’ histogram on the diagonal
  • Pearson correlations above the diagonal

Pairs.Panel

pairs.panels( data_wide )

This is a great visualization tool for analysing bivariate relationships using scatterplots, histograms and correlations.

Outlier detection

g <- outlier(data_wide, plot = TRUE, na.rm = TRUE)

Note that the package uses the Mahalanobis distance to identify outliers (see ?outlier). Note: we could also use the package’s scrub function to remove outliers (but this is very basic).
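
To see roughly what this does under the hood, here is a minimal sketch using the base stats::mahalanobis function (the chi-squared cutoff is an illustrative assumption, not the package’s own rule):

md <- mahalanobis(data_wide,
                  center = colMeans(data_wide),
                  cov = cov(data_wide))
# Flag days beyond the 99.9% chi-squared quantile (df = number of series):
head(which(md > qchisq(0.999, df = ncol(data_wide))))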

Another very nice and quick way of looking at the data’s characteristics, available in psych, is the violin plot. It is essentially a combination of a boxplot and a density plot:

violinBy( data_wide %>% as.matrix )

# or ggplot:
gviolion <- 
data_wide %>% 
  gather(Type, val) %>% 
  ggplot() + geom_violin(aes(Type, val, fill = Type), alpha = 0.7) + 
  fmxdat::theme_fmx() + 
  fmxdat::fmx_fills()

fmxdat::finplot(gviolion, y.pct = T, y.pct_acc = 1, x.vert = T)

Note: This is a very useful figure to include in your reports on the behaviour of returns data as it shows similarity in distribution and density between the returns.

Lastly, using psych to do PCA is very simple:

pcapsych <- principal(data_wide, rotate = 'varimax')
# Rotations: "none", "varimax", "quartimax", 
# "promax", "oblimin", "simplimax", or "cluster".

# To see the options of what you can save using the $ sign:
names(pcapsych)
##  [1] "values"       "rotation"     "n.obs"        "communality"  "loadings"    
##  [6] "fit"          "fit.off"      "fn"           "Call"         "uniquenesses"
## [11] "complexity"   "chi"          "EPVAL"        "R2"           "objective"   
## [16] "residual"     "rms"          "factors"      "dof"          "null.dof"    
## [21] "null.model"   "criteria"     "STATISTIC"    "PVAL"         "weights"     
## [26] "r.scores"     "Vaccounted"   "Structure"    "scores"
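
For example, the rotated loadings and the variance accounted for can be pulled out directly (a quick illustration using two of the names listed above):

pcapsych$loadings
pcapsych$Vaccounted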

See the vignette for a very detailed discussion on how to use the psych package.

Application to portfolio construction

To see a really interesting application of PCA to portfolio construction, see @meucci paper and visit this site. This could give you some nice ideas for application to SA?

PCA Regression

We could also use PCA components in regressions. This is particularly useful if our parameter space is large, and dimension reduction would be of great benefit.

As you can imagine, having 50 independent variables in a regression might be suboptimal. If we can instead find a linear combination of the 50 variables that explains a large % of their variation, we might as well include the component itself. See this insightful blogpost on doing a regression using PCs as inputs.
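
Here is a minimal sketch of the idea, using the sector data from earlier. The choice of JALSHTR as the dependent variable is purely illustrative; we re-extract the PCs from the remaining series so the dependent variable does not enter its own regressors:

# Extract PCs from the remaining series and regress JALSHTR on the first three:
pcr_pca <- prcomp(data_wide %>% select(-JALSHTR))
pcr_df  <- bind_cols(tibble(JALSHTR = data_wide$JALSHTR),
                     as_tibble(predict(pcr_pca)[, 1:3]))
summary(lm(JALSHTR ~ PC1 + PC2 + PC3, data = pcr_df))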

FactoMineR and factoextra

These are excellent packages for using supplementary variables and for plotting PCA results.

Let’s explore these in some detail, followed by an application example on currency data.

First, let’s use factoextra’s data on athletics outcomes to facilitate the interpretation:

if(!require(FactoMineR)) install.packages("FactoMineR")
if(!require(factoextra)) install.packages("factoextra")
data(decathlon2)
# Define active variables:
decathlon2.active <- decathlon2[1:23, 1:10]

Now let’s first focus on how individuals (rows) and variables (columns) are identified:

# Important: if using rownames as identifier for individuals:

#   Rows are treated as individuals
#   Columns are treated as variables

# So let's make a tbl_df that has rownames:
# Convert to a tibble while preserving the rownames (used to identify individuals):
decathlon2.active <- 
  decathlon2.active %>% tibble::rownames_to_column() %>% tibble::as_tibble() %>% 
  column_to_rownames("rowname")

# PCA:
res.pca <- PCA(decathlon2.active, graph = FALSE)

# View the str:
str(res.pca)
## List of 5
##  $ eig : num [1:10, 1:3] 4.124 1.839 1.239 0.819 0.702 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:10] "comp 1" "comp 2" "comp 3" "comp 4" ...
##   .. ..$ : chr [1:3] "eigenvalue" "percentage of variance" "cumulative percentage of variance"
##  $ var :List of 4
##   ..$ coord  : num [1:10, 1:5] -0.851 0.794 0.734 0.61 -0.702 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:10] "X100m" "Long.jump" "Shot.put" "High.jump" ...
##   .. .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##   ..$ cor    : num [1:10, 1:5] -0.851 0.794 0.734 0.61 -0.702 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:10] "X100m" "Long.jump" "Shot.put" "High.jump" ...
##   .. .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##   ..$ cos2   : num [1:10, 1:5] 0.724 0.631 0.539 0.372 0.492 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:10] "X100m" "Long.jump" "Shot.put" "High.jump" ...
##   .. .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##   ..$ contrib: num [1:10, 1:5] 17.54 15.29 13.06 9.02 11.94 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:10] "X100m" "Long.jump" "Shot.put" "High.jump" ...
##   .. .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##  $ ind :List of 4
##   ..$ coord  : num [1:23, 1:5] 0.196 0.808 -1.359 -0.889 -0.108 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:23] "SEBRLE" "CLAY" "BERNARD" "YURKOV" ...
##   .. .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##   ..$ cos2   : num [1:23, 1:5] 0.00753 0.0487 0.1972 0.09611 0.00157 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:23] "SEBRLE" "CLAY" "BERNARD" "YURKOV" ...
##   .. .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##   ..$ contrib: num [1:23, 1:5] 0.0403 0.6881 1.9474 0.8331 0.0123 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:23] "SEBRLE" "CLAY" "BERNARD" "YURKOV" ...
##   .. .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##   ..$ dist   : Named num [1:23] 2.25 3.66 3.06 2.87 2.72 ...
##   .. ..- attr(*, "names")= chr [1:23] "SEBRLE" "CLAY" "BERNARD" "YURKOV" ...
##  $ svd :List of 3
##   ..$ vs: num [1:10] 2.031 1.356 1.113 0.905 0.838 ...
##   ..$ U : num [1:23, 1:5] 0.0963 0.3978 -0.6693 -0.4377 -0.0532 ...
##   ..$ V : num [1:10, 1:5] -0.419 0.391 0.361 0.3 -0.345 ...
##  $ call:List of 9
##   ..$ row.w     : num [1:23] 0.0435 0.0435 0.0435 0.0435 0.0435 ...
##   ..$ col.w     : num [1:10] 1 1 1 1 1 1 1 1 1 1
##   ..$ scale.unit: logi TRUE
##   ..$ ncp       : num 5
##   ..$ centre    : num [1:10] 11 7.35 14.62 2.01 49.43 ...
##   ..$ ecart.type: num [1:10] 0.2945 0.3065 0.8261 0.0946 0.9849 ...
##   ..$ X         :'data.frame':   23 obs. of  10 variables:
##   .. ..$ X100m       : num [1:23] 11 10.8 11 11.3 11.1 ...
##   .. ..$ Long.jump   : num [1:23] 7.58 7.4 7.23 7.09 7.3 7.31 6.81 7.56 6.97 7.27 ...
##   .. ..$ Shot.put    : num [1:23] 14.8 14.3 14.2 15.2 13.5 ...
##   .. ..$ High.jump   : num [1:23] 2.07 1.86 1.92 2.1 2.01 2.13 1.95 1.86 1.95 1.98 ...
##   .. ..$ X400m       : num [1:23] 49.8 49.4 48.9 50.4 48.6 ...
##   .. ..$ X110m.hurdle: num [1:23] 14.7 14.1 15 15.3 14.2 ...
##   .. ..$ Discus      : num [1:23] 43.8 50.7 40.9 46.3 45.7 ...
##   .. ..$ Pole.vault  : num [1:23] 5.02 4.92 5.32 4.72 4.42 4.42 4.92 4.82 4.72 4.62 ...
##   .. ..$ Javeline    : num [1:23] 63.2 60.1 62.8 63.4 55.4 ...
##   .. ..$ X1500m      : num [1:23] 292 302 280 276 268 ...
##   ..$ row.w.init: num [1:23] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ call      : language PCA(X = decathlon2.active, graph = FALSE)
##  - attr(*, "class")= chr [1:2] "PCA" "list"
  • Now with this PCA object, we can do so much more in terms of visualization with this package:
# Extract eigenvalues:
eigenvalues <- get_eigenvalue(res.pca)

# Scree plot:
fviz_screeplot(res.pca, ncp=10)

  • quality of representation of the variables: cos2
fviz_pca_var(res.pca, col.var="steelblue")+
  theme_minimal()

# The latter allows mapping in ggplot space

# Control variable colors using their contribution
# Possible values for the argument col.var are :
  # "cos2", "contrib", "coord", "x", "y"
fviz_pca_var(res.pca, col.var=c("cos2", "contrib", "coord", "x", "y")[1]) +
  scale_color_gradient2(low="white", mid="blue", 
      high="red", midpoint=0.5) + theme_minimal()

fviz_pca_var(res.pca, alpha.var=c("cos2", "contrib", "coord", "x", "y")[2]) # have transparency...

Interpret

The graph of variables shows the relationships between all variables:

  • Positively correlated variables are grouped together.

  • Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants).

  • The distance between variables and the origin measures the quality of the variables on the factor map. Variables that are far from the origin are well represented on the factor map.

    • Individuals on the components (coordinates)

Cos2 : quality of representation for variables on the factor map

  • The cos2 of variables are calculated as the squared coordinates: var.cos2 = var.coord * var.coord (see the quick check after this list)

    • The cos2 values are used to estimate the quality of the representation

    • The closer a variable is to the circle of correlations, the better its representation on the factor map (and the more important it is to interpret these components)
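
Since the data were standardized (PCA’s default scale.unit = TRUE), the variable coordinates are correlations, so the squared coordinates should match cos2 exactly. A minimal check:

all.equal(res.pca$var$coord^2, res.pca$var$cos2)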

Contributions of the variables to the principal components

To establish the most important variables in the data, use their contributions.

  • A variable’s contribution to, e.g., PC1 is given as: (variable.cos2 * 100) / (total cos2 of the component)

  • E.g., to find the variables that contribute most to each PC:

res.pca$var$contrib  # Gives percentage of PC due to this variable, i.e. measure of contribution to the component.
##                     Dim.1      Dim.2      Dim.3       Dim.4     Dim.5
## X100m        1.754429e+01  1.7505098  7.3386590  0.13755240  5.389252
## Long.jump    1.529317e+01  4.2904162  2.9300944  1.62485936  7.748815
## Shot.put     1.306014e+01  0.3967224 21.6204325  2.01407269  8.824401
## High.jump    9.024811e+00 11.7715838  8.7928883  2.54987951 23.115504
## X400m        1.193554e+01  4.5799296  6.4876363 22.65090599  1.539012
## X110m.hurdle 1.415754e+01  0.0332933 16.2612611  0.03483735  7.166193
## Discus       1.339309e+01  0.1341398  2.5147385 19.04132022 23.755756
## Pole.vault   1.144592e+00 35.4618611  0.7139512 14.02307063  7.005084
## Javeline     4.446377e+00  8.1086683 29.4531777 13.42963254  5.577615
## X1500m       4.438531e-04 33.4728757  3.8871610 24.49386930  9.878367

Visualizing this:

# Contributions of variables on PC1
fviz_contrib(res.pca, choice = "var", axes = 1)

# Red line:
# If the contribution of the variables were uniform, the expected value would be 1/length(variables) = 1/10 = 10%. If above red line, this variable is important in determining the PC.

# Contributions of variables on PC2
fviz_contrib(res.pca, choice = "var", axes = 2)

# Total contribution on PC1 and PC2
fviz_contrib(res.pca, choice = "var", axes = 1:2)

  • It might be that a variable loads very highly on PC1, but not at all on PC2 (e.g. X1500m) - so X100m would be more important, as it loads highly on both.

The contribution of variables to PC1 and PC2, collectively, is calculated as:

$$C_1 \times \text{Eig}_1 + C_2 \times \text{Eig}_2$$

where $C$ is the variable’s contribution to each PC and $\text{Eig}$ the eigenvalue of each PC.

The contribution of a variable to a given principal component is (in percentage) : (var.cos2 * 100) / (total cos2 of the component)

  • How much of this component is due to each variable, in essence.

  • Plot loadings onto first two components and colour by total contribution

# Change the gradient color
fviz_pca_var(res.pca, col.var="contrib") +
scale_color_gradient2(low="green", mid="yellow", 
                  high="red", midpoint=9) + theme_minimal()
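
We can also verify the contribution formula quoted above directly; a minimal check for Dim.1 (the total cos2 of a component equals its eigenvalue):

all.equal(res.pca$var$contrib[, 1],
          res.pca$var$cos2[, 1] * 100 / sum(res.pca$var$cos2[, 1]))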

Dimension Description

  • Here we find the variables most correlated with a given principal component
DimDesc <- 
dimdesc(res.pca, 
        axes = 1:3, # Which axes to check
        proba = 0.05) # Significance level considered
  • Used as follows:
res.desc <- dimdesc(res.pca, axes = c(1,2), proba = 0.05)
# Description of dimension 1
res.desc$Dim.1
## $quanti
##              correlation      p.value
## Long.jump      0.7941806 6.059893e-06
## Discus         0.7432090 4.842563e-05
## Shot.put       0.7339127 6.723102e-05
## High.jump      0.6100840 1.993677e-03
## Javeline       0.4282266 4.149192e-02
## X400m         -0.7016034 1.910387e-04
## X110m.hurdle  -0.7641252 2.195812e-05
## X100m         -0.8506257 2.727129e-07
## 
## attr(,"class")
## [1] "condes" "list"
  • Contribution of variables and individuals together:
fviz_pca_biplot(res.pca,  geom = "text")

We can also focus solely on the individuals, and how they load onto the different components:

fviz_pca_ind(res.pca, col.ind = "cos2", pointsize = "cos2",
             gradient.cols = c("darkred", "steelblue", "springgreen3"),
             repel = TRUE # Avoid text overlapping (slow if many points)
             )

We can also view the individual contribution of the athletes on the first two components:

# Total contribution on PC1 and PC2
fviz_contrib(res.pca, choice = "ind", axes = 1:2, fill = "darkorange")

Adding supplementary variables

  • We can add supplementary variables to our analysis too…

Supplementary variables and individuals are not used for the determination of the principal components. Rather, their coordinates are predicted using only the information provided by the performed principal component analysis on active variables/individuals.

Specify supplementary variables using quanti.sup (quantitative) or quali.sup (qualitative), and supplementary individuals using ind.sup.

# Let's add back the competition column and use it as a qualitative supplementary variable:
decathlon2 <- 
  factoextra::decathlon2 %>% tibble::rownames_to_column() %>% tibble::as_tibble() %>% 
  # Added rowname column
  column_to_rownames("rowname")
res.pca <- 
  PCA(decathlon2, quanti.sup = 11:12, quali.sup = 13, graph=FALSE)

Let’s now map the quantitative supplementary variables in blue (remember: this shows the influence of the PC on the variable)

fviz_pca_var(res.pca)

Let’s now expand our analysis to also include qualitative supplementary variables into the mix (using colours):

res.pca <- PCA(decathlon2, ind.sup = 24:27, quanti.sup = 11:12, quali.sup = 13, graph=FALSE)
fviz_pca_ind(res.pca, habillage = 13,
             addEllipses =TRUE, ellipse.type = "confidence",
             palette = "jco", repel = TRUE) 

fviz_pca_ind(res.pca, habillage = 13,
  addEllipses =TRUE, ellipse.level = 0.68) +
  scale_color_brewer(palette="Dark2") +
  theme_minimal()

Colour by group

  • If we have a grouping column (groups of individuals) - we can add colour to it
data(iris)
iris <- as_tibble(iris)
# The variable Species (index = 5) is removed
# before PCA analysis
iris.pca <- PCA(iris %>% select(-Species), graph = FALSE)

Look at biplots, while colouring according to Species. Note below we:

  • Make a biplot of individuals and variables
  • Change the color of individuals by groups
  • Change the transparency of variable colors by their contribution values
  • Show only the labels for variables
fviz_pca_biplot(iris.pca, 
  habillage = iris$Species, addEllipses = TRUE,
  col.var = "red", alpha.var ="cos2",
  label = "var") +
  scale_color_brewer(palette="Dark2")+
  theme_minimal()

PCA Application

The currency PCA example will be loaded in the next practical: Practical 4 Application.

This should give you a nice hands-on example of how to do a deep dimension-reduction analysis, interpreting the components by virtue of their co-dependence on factor indexes.

Factor Analysis

As mentioned in class, factor analysis seeks to establish a structural model that explains what is driving return covariance in the data. We distinguish between two broad classes of FA: Exploratory and Confirmatory.

Exploratory FA

Broadly speaking, Factor Analysis models our returns series as:

$$\text{rtn} = \Lambda f + e$$

$\Lambda$ is the loadings matrix, while $f$ is a $k$-dimensional vector of scores, leaving the error vector $e$. Note that only $\text{rtn}$ is observed, with the added restriction that the scores are uncorrelated and have unit variance (think of these as a type of PCA). The errors (unexplained parts) then have variances $\Psi$, which are interpreted as unique variation. Relating this back to PCA, we use the correlation matrix (dispersion) decomposed as:

$$\Sigma = \Lambda\Lambda^{T} + \Psi$$
The model remains unchanged if $\Lambda$ is replaced by $\Lambda G$ for any orthogonal matrix $G$. These matrices, $G$, are known as rotations. The goal of factor rotation is to rotate factors in multidimensional space to arrive at a solution with the simplest structure for interpretation. Two types of rotations exist:

  • Orthogonal rotations which constrain factors to remain uncorrelated (include varimax (most used), quartimax, varimin, bifactor).
  • Oblique rotations, which allow the factors to correlate (include oblimin, quartimin, biquartimin)
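
As a minimal illustration of an orthogonal rotation, we can apply base R’s stats::varimax to the first two principal component loadings from the prcomp fit earlier (purely to show the mechanics; psych’s principal above does this internally):

# Rotate the first two PC loadings towards a simpler structure:
varimax(pca$rotation[, 1:2])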

After fitting the FA, the factor loadings indicate how strongly each factor influences each of the measured variables. In order to label the factors in the model, one should examine the factor pattern to see which items load highly on which factors and then determine what those items have in common. It is then up to the researcher to make assumptions as to what these factors mean.

Naturally, the first question is how many factors to include? There are many ways this can be done (e.g. Kaiser’s eigenvalue-greater-than-unity rule, or using the scree plot to determine where it drops off significantly); a parallel-analysis sketch follows below. Good packages for doing this include nFactors, GPArotation and FactoMineR.
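
One such approach is parallel analysis; a minimal sketch using psych’s fa.parallel (loaded earlier) on our sector returns:

# Compare observed eigenvalues to those from resampled / simulated data:
fa.parallel(data_wide, fa = "fa")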

To do a simple FA, let’s use R’s built-in factanal function. By default, the factanal function fits a common factor model using the maximum likelihood approach.

Although the screeplot probably suggests we should only include one factor, let’s use two:

fa <- 
  factanal(data_wide, factors = 2, rotation="varimax")
print(fa)
## 
## Call:
## factanal(x = data_wide, factors = 2, rotation = "varimax")
## 
## Uniquenesses:
## FINI15TR INDI25TR  JALSHTR  JMOTETR  JNCCGTR  JSAPYTR JSHRALTR  JSMLCTR 
##    0.186    0.177    0.005    0.632    0.639    0.402    0.017    0.511 
## MIDCAPTR RESI20TR 
##    0.235    0.297 
## 
## Loadings:
##          Factor1 Factor2
## FINI15TR 0.780   0.453  
## INDI25TR 0.482   0.769  
## JALSHTR  0.539   0.840  
## JMOTETR  0.497   0.347  
## JNCCGTR  0.528   0.287  
## JSAPYTR  0.762   0.129  
## JSHRALTR 0.651   0.748  
## JSMLCTR  0.638   0.285  
## MIDCAPTR 0.754   0.442  
## RESI20TR 0.153   0.824  
## 
##                Factor1 Factor2
## SS loadings      3.661   3.238
## Proportion Var   0.366   0.324
## Cumulative Var   0.366   0.690
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 4444.33 on 26 degrees of freedom.
## The p-value is 0

From the factanal output, we can see that the χ2 statistic at the bottom indicates the hypothesis of perfect model fit is rejected (thus it is not a great model for explaining variation in the returns). Given how dominant the first PC proved earlier, let’s also fit a single-factor model. In addition, let’s do the following in the print-output to make interpretation easier:

  • hide loadings which are small (cutoff = .2)
  • Make decimal spaces only equal 2 (digits = 2)
  • sort the loadings to make structure more obvious (sort = TRUE)
fa <- factanal(data_wide, factors = 1, rotation="varimax")
print(fa, digits = 2, cutoff = .2, sort = TRUE) 
## 
## Call:
## factanal(x = data_wide, factors = 1, rotation = "varimax")
## 
## Uniquenesses:
## FINI15TR INDI25TR  JALSHTR  JMOTETR  JNCCGTR  JSAPYTR JSHRALTR  JSMLCTR 
##     0.28     0.18     0.04     0.65     0.69     0.67     0.00     0.63 
## MIDCAPTR RESI20TR 
##     0.33     0.51 
## 
## Loadings:
##  [1] 0.85 0.91 0.98 0.59 0.56 0.58 1.00 0.61 0.82 0.70
## 
##                Factor1
## SS loadings       6.03
## Proportion Var    0.60
## 
## Test of the hypothesis that 1 factor is sufficient.
## The chi square statistic is 7578.74 on 35 degrees of freedom.
## The p-value is 0

The output is now easier to read, but our model is still not great. This possibly follows from the very high degree of comovement between the sectors, and the fact that the first PC explains most of the variance. For a more intuitive model, we would then be able (after rotating and making the effects clearer) to assign hypothesized reasons for the factors based on which variables load. E.g. (hypothetically speaking), we could assign the first factor as the local effect and the second as the global effect (although that would probably be unwise, but you see the point).

Dow Jones EFA Example

Let’s do a more intuitive example, using the fEcofin package to get daily prices of the Dow Jones 30 companies:

if(!require("fEcofin")) install.packages("fEcofin", repos="http://R-Forge.R-project.org")
library(fEcofin)
pacman::p_load("tidyr", "dplyr")
DJ = DowJones30
# Let's use dplyr to fix this dataset:
DJ <- 
  DJ %>% 
  mutate(Date = as.Date(X.Y..m..d)) %>% select(-X.Y..m..d) %>% 
  as_tibble()

# Let's create log returns for each column:
DJ <- 
  DJ %>% mutate_at(vars(-Date), ~log(.)-log(lag(.))) %>% 
  filter(Date > first(Date))

Let’s check the distributions and scree plots first:

# Return densities again quite similar:
violinBy(DJ %>% select(-Date) %>% as.matrix())

# Let's check the Scree plot:
d.pca <- princomp(DJ %>% select(-Date), cor=F)
plot(d.pca, type='l')

Note again from the Screeplot how the first PC explains most of the variation. Let’s now fit an EFA with three factors on the DJ30 companies:

d.fa <- 
  factanal(DJ %>% select(-Date), factors = 3, rotation="varimax")
print(d.fa, digits = 2, cutoff = .2, sort = TRUE)
## 
## Call:
## factanal(x = DJ %>% select(-Date), factors = 3, rotation = "varimax")
## 
## Uniquenesses:
##   AA  AXP    T   BA  CAT    C   KO   DD   EK  XOM   GE   GM  HWP   HD  HON INTC 
## 0.69 0.64 0.86 0.83 0.67 0.60 0.65 0.60 0.88 0.85 0.52 0.76 0.67 0.68 0.74 0.56 
##  IBM   IP  JPM  JNJ  MCD  MRK MSFT  MMM   MO   PG  SBC  UTX  WMT  DIS 
## 0.73 0.62 0.65 0.56 0.77 0.60 0.58 0.69 0.86 0.68 0.83 0.70 0.68 0.82 
## 
## Loadings:
##      Factor1 Factor2 Factor3
## KO   0.55                   
## JNJ  0.65                   
## MRK  0.61                   
## PG   0.54                   
## AA           0.54           
## CAT          0.53           
## DD   0.24    0.58           
## IP           0.60           
## HWP                  0.55   
## INTC                 0.66   
## MSFT                 0.62   
## AXP  0.38    0.31    0.35   
## T                    0.28   
## BA   0.24    0.29           
## C    0.38    0.32    0.39   
## EK           0.25           
## XOM  0.28    0.25           
## GE   0.47    0.35    0.36   
## GM           0.36    0.30   
## HD   0.34    0.26    0.37   
## HON          0.44           
## IBM                  0.48   
## JPM  0.31    0.29    0.41   
## MCD  0.40                   
## MMM  0.26    0.49           
## MO   0.31                   
## SBC  0.35                   
## UTX  0.25    0.46           
## WMT  0.43    0.23    0.29   
## DIS  0.25            0.29   
## 
##                Factor1 Factor2 Factor3
## SS loadings       3.27    3.04    2.70
## Proportion Var    0.11    0.10    0.09
## Cumulative Var    0.11    0.21    0.30
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 1906.86 on 348 degrees of freedom.
## The p-value is 7.37e-213

At the bottom of the output you can see that the significance level of the χ2 fit statistic is small. This indicates that the hypothesis of perfect model fit is rejected. We can add more factors (adjust above code and check for yourself), but this might not lead to a satisfactory model at all. So let’s stick with three factors for now.

We can now interpret the top results by looking at which stocks load together. If, e.g., energy stocks all load onto the first common factor, we might conclude that the driver of energy stock prices (or factors that influence energy stock values such as oil prices, e.g.), have a large effect on the overall Dow Jones index movements, explaining roughly 11% of variation. And so our analysis can continue on.

Rotation can be applied easily by setting rotation at the top to one of several inputs. The analysis could then also be repeated on an annual basis - and the consistency of loadings compared across different time periods (e.g. pre-crisis, during and post-crisis).

The psych package also offers more functionality for those interested in delving deeper into EFA; see its fa() function and corresponding inputs.
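
As a parting sketch, here is a hedged example of fitting the same three-factor model with psych’s fa() and an oblique rotation (oblimin requires the GPArotation package; the settings below are illustrative, not prescriptive):

# Fit a 3-factor ML model with an oblique rotation and print sorted loadings:
pacman::p_load("psych", "GPArotation")
fa_psych <- fa(DJ %>% select(-Date), nfactors = 3, rotate = "oblimin", fm = "ml")
print(fa_psych$loadings, cutoff = 0.2, sort = TRUE)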