Introduction

In this tutorial we work through a practical PCA application to show the use of this powerful dimension-reduction technique - particularly when used in regression analysis.

Suppose our question is: how much does the global movement in currencies affect the rand? We then want to find the global PCs that explain the majority of common currency movement, while preserving the flexibility to also include other exogenous variables.

Including principal components in regressions allows for reduced regressor dimensions, as well as reduced multicollinearity between regressors.

PCA effectively decomposes a matrix of returns into \(k\) statistical factors. These are latent (unobserved) factors that are the implicit driving forces behind our data.
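As a minimal sketch of what this decomposition amounts to (using simulated returns, not the data below): the eigenvalues of the scaled return covariance matrix give each latent factor's share of total variance:

```r
set.seed(42)
R <- matrix(rnorm(200 * 5), ncol = 5)  # 200 periods, 5 simulated return series
E <- eigen(cov(scale(R)))              # eigendecomposition of the correlation matrix
E$values / sum(E$values)               # share of variance explained by each latent factor
```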

Initialization

pacman::p_load("tidyverse", "devtools", "FactoMineR", "factoextra",
    "broom", "rmsfuns")

Factors <- fmxdat::PCA_EX_Factors
FactorInfo <- fmxdat::PCA_EX_Factors_Info
Spots <- fmxdat::PCA_EX_Spots
SpotsInfo <- fmxdat::PCA_EX_Spots_Info

Let’s now do the following:

  • Create principal component indexes using all the currencies

  • Include the supplementary variables in order to understand what these components represent

  • Run regressions on each spot by including PC1, PC2 and PC3 - saving and comparing the R-squareds in the process.

Inspect the data

# Let's see which spots we have:
SpotsInfo %>%
    View()

# Let's see which Factor Indexes we have:

FactorInfo %>%
    View()

So based on the information above, let’s do the following:

  • Exclude the Location SA from our Factors;

  • Exclude all Middle East spots (as these are typically strongly pegged), G10 currencies, and all African currencies except for the rand:

# ========== Spots:

SpotsExclude <- SpotsInfo %>%
    filter(Type1 %in% c("MiddleEast", "G10") | Type3 %in% "AfricaME") %>%
    pull(Spot)
# Note the use of pull...

# Now we want to exclude these, but note spots is in the
# format of xxx_Spot. So let's correct SpotsExclude:
SpotsExclude <- gsub(" Curncy", "_Spot", SpotsExclude)
# Translation: sub --first part-- with --second part-- for
# SpotsExclude.

# Let's now filter these spots out of our dataframe:
Spots <- Spots %>%
    filter(!Spots %in% SpotsExclude)
# Read as: Spots is not in SpotsExclude

# ========== Indexes:
FactorsExclude <- FactorInfo %>%
    filter(Location %in% "SA") %>%
    pull(Index) %>%
    paste0("Supp_", .)
# Make it equivalent to Factors Index definition which has
# 'Supp_' at start...
Factors <- Factors %>%
    filter(!Index %in% FactorsExclude)

View our returns

Spots %>%
    ggplot() + geom_line(aes(date, Return)) + facet_wrap(~Spots) +
    theme_bw()

Note - in addition to a few return outliers, it is clear that the series are centered around a zero mean, but have distinct variance levels. The series should therefore definitely be scaled prior to doing any comparative analysis…

Fit the PCA

# Make the data wide (as the PCA process requires this...)

# Then also, we are joining the Factors df with the Spots
# df.  This is done using left_join, which joins by the
# 'date' column of the first df below. Right join is the
# opposite, joining by the second df's date column.
# full_join joins both sides. See dplyr cheatsheet for more
# on this, but you tend to mostly use left_join.

PCA_AllData <- left_join(Factors %>%
    spread(Index, Return), Spots %>%
    spread(Spots, Return), by = "date")

# First remove columns with only NAs:
PCA_AllData <- PCA_AllData[, colSums(is.na(PCA_AllData)) < nrow(PCA_AllData)] %>%
    # Also, columns must have at most 150 NA or zero points
    # ( | stands for 'or')
.[, colSums(is.na(.) | . == 0, na.rm = TRUE) <= 150] %>%
    select(-date)

PCA_AllData[is.na(PCA_AllData)] <- 0  # Set NA to zero for the following reason:
# If not, the PCA function will by definition replace NA
# with column means. This could be problematic if a column
# has, e.g., only 5 valid returns - with all other returns
# now equal to the mean...

# Next, save the supplementary variables' positions:

SuppPos <- grep("Supp", colnames(PCA_AllData))

# PCA FAQ factominer:
# http://factominer.free.fr/faq/index.html
df_PCA <- PCA(PCA_AllData, quanti.sup = SuppPos, graph = FALSE,
    scale.unit = TRUE)

Analyse

  • Now that we have the PCA saved, we can analyse it in some detail.

  • Please explore the plotting capabilities here and what they imply…

fviz_screeplot(df_PCA, ncp = 10)
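The numbers behind the scree plot can also be extracted directly - a sketch using factoextra's get_eigenvalue (equivalently, the eig element of the fitted PCA object):

```r
# Eigenvalue, percentage of variance and cumulative percentage per component:
factoextra::get_eigenvalue(df_PCA)
# The same information sits in:
df_PCA$eig
```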

Now to calculate the loadings of our supplementary variables onto the principal component indexes, use the following:

fviz_pca_var(df_PCA)
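If you prefer these loadings as a table rather than a plot, they can be pulled from the fitted object directly (a sketch assuming FactoMineR's standard output structure, where the coordinates equal correlations for a scaled PCA):

```r
# Supplementary variables' correlation with each component:
head(df_PCA$quanti.sup$coord)
# Active variables (the spots):
head(df_PCA$var$coord)
```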

Using the principal components in regressions

Let’s use the first three principal component indexes and the VIX index in a regression for each of the spot rates, i.e. let’s fit:

\[ S_i = \alpha + \beta_1 \cdot PCA_1 + \dots + \beta_3 \cdot PCA_3 + \beta_V \cdot \text{VIX} \]

PC Indexes

Suppose you are now left with running these regressions and printing the tables in your paper - and you want to keep the code flexible so as to include different factors and study different currencies…

Sound daunting?

See the following neat code that allows you to stay super flexible and have your results in no time. For yourself, wrap the following in a function with appropriate parameters.

  • Let’s first save and rescale each of the principal indexes (scaling implies standardization)…

In order to calculate the principal component series - we will use the following formula:

\[ R_{pca} = E^{-1} R^T \] with \(E\) the matrix of eigenvectors, \(R\) the returns, and the \(i\)th row of \(R_{pca}\) the \(i\)th principal component series used in our regression.

pcaseries <- Spots %>%
    spread(Spots, Return) %>%
    select(-date)
demean <- scale(pcaseries, center = TRUE, scale = TRUE)
# Probably should consider a more robustified covariance
# matrix here (as in previous tuts)
sigma <- cov(demean)

evec <- eigen(sigma, symmetric = TRUE)$vectors  # eigenvectors
eval <- eigen(sigma, symmetric = TRUE)$values  # eigenvalues

# or done automagically (using standard covariance matrix
# warts and all):
pcrun <- prcomp(demean, center = T, scale. = T)
ev <- pcrun$rotation

e_min1 <- solve(ev)  # Invert the eigenvector matrix, as in the equation above
R_PCA <- e_min1 %*% t(pcaseries)  # Note: pcaseries, not demean.

PC_Series <- t(R_PCA) %>%
    tibble::as_tibble()

PC_Series <- PC_Series %>%
    mutate(date = unique(Spots$date)) %>%
    gather(Type, Val, -date)

#---- Let's look at these series:

# Returns:
PC_Series %>%
    filter(Type %in% paste0("PC", 1:5)) %>%
    ggplot() + geom_line(aes(date, Val, color = Type))

# Cumulative:

PC_Series %>%
    filter(Type %in% paste0("PC", 1:5)) %>%
    group_by(Type) %>%
    mutate(Idx = cumprod(1 + Val)) %>%
    ggplot() + geom_line(aes(date, Idx, color = Type), size = 1.2,
    alpha = 0.7) + fmxdat::theme_fmx()

Next, we merge the PC Indexes with the standardized spots and factor indexes to run the regressions:

# Regression calls don't like spaces in colnames, so let's
# fix this by replacing spaces with '.'
Regression_data <- left_join(PC_Series %>%
    filter(Type %in% paste0("PC", 1:3)) %>%
    spread(Type, Val), left_join(Factors %>%
    spread(Index, Return), Spots %>%
    spread(Spots, Return), by = "date"), by = "date") %>%
    # Now gather the spots...
gather(Spots, Returns, ends_with("_Spot"))

head(Regression_data)
## # A tibble: 6 × 63
##   date            PC1      PC2     PC3 `Supp_.G7FXVOL3 Index` `Supp_BBDXY Index`
##   <date>        <dbl>    <dbl>   <dbl>                  <dbl>              <dbl>
## 1 2012-01-04 -0.0152   0.0428  0.0204                -0.0215            -0.00911
## 2 2012-01-11  0.00156  0.0351  0.0217                -0.0935             0.0107 
## 3 2012-01-18 -0.0684  -0.0159  0.0138                -0.0384            -0.0103 
## 4 2012-01-25 -0.0715  -0.0290  0.0141                 0.00116           -0.0115 
## 5 2012-02-01 -0.0539   0.00409 0.0160                -0.00717           -0.00716
## 6 2012-02-08 -0.0418   0.00455 0.00544               -0.0443            -0.00374
## # ℹ 57 more variables: `Supp_BCOM Index` <dbl>, `Supp_BCOMF1 Index` <dbl>,
## #   `Supp_BEMS Index` <dbl>, `Supp_BGSV Index` <dbl>,
## #   `Supp_BLCNTRUU Index` <dbl>, `Supp_BRIT Index` <dbl>,
## #   `Supp_CO1 Comdty` <dbl>, `Supp_CRY Index` <dbl>,
## #   `Supp_CTOTCNY Index` <dbl>, `Supp_CTOTUSD Index` <dbl>,
## #   `Supp_CVICMRIS Index` <dbl>, `Supp_DBHVG10U index` <dbl>,
## #   `Supp_DBHVGUSI index` <dbl>, `Supp_DBMOMUSF Index` <dbl>, …
colnames(Regression_data) <- gsub(" ", ".", colnames(Regression_data))


Regressions <- Regression_data %>%
    group_by(Spots) %>%
    do(reg = lm(Returns ~ lag(Returns) + PC1 + PC2 + PC3 + Supp_VIX.Index,
        data = .)) %>%
    ungroup()



# And that's it.  Now all the regressions are in their own
# list according to each Spot And now we use broom to tidy
# up our results...


tidymaker <- function(X) {
    X %>%
        pull(reg) %>%
        .[[1]] %>%
        broom::tidy() %>%
        # Add spots label
    mutate(Spots = X$Spots)
}

RegressionCoeffs <- Regressions %>%
    group_split(Spots) %>%
    map_df(~tidymaker(.))

# Previously, this worked: RegressionCoeffs <- Regressions
# %>% rowwise() %>% tidy(reg)

head(RegressionCoeffs)
## # A tibble: 6 × 6
##   term           estimate std.error statistic  p.value Spots   
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl> <chr>   
## 1 (Intercept)     0.00489   0.00142     3.44  0.000667 ARS_Spot
## 2 lag(Returns)    0.0202    0.0592      0.341 0.733    ARS_Spot
## 3 PC1             0.0884    0.0516      1.71  0.0877   ARS_Spot
## 4 PC2            -0.0549    0.0730     -0.752 0.453    ARS_Spot
## 5 PC3             0.0985    0.101       0.972 0.332    ARS_Spot
## 6 Supp_VIX.Index -0.00978   0.0103     -0.951 0.342    ARS_Spot
# Again, had to do some gymnastics here...

Rsqd_Maker <- function(X) {
    X %>%
        pull(reg) %>%
        .[[1]] %>%
        broom::glance() %>%
        select(r.squared) %>%
        mutate(Spots = X$Spots)
}
RegressionStats <- Regressions %>%
    group_split(Spots) %>%
    map_df(~Rsqd_Maker(.)) %>%
    arrange(desc(r.squared))

head(RegressionStats)
## # A tibble: 6 × 2
##   r.squared Spots   
##       <dbl> <chr>   
## 1     0.924 BGN_Spot
## 2     0.881 CZK_Spot
## 3     0.870 PLN_Spot
## 4     0.860 RON_Spot
## 5     0.832 HUF_Spot
## 6     0.732 KRW_Spot

And finally… To include the regressions for selected spots (including ZAR) in your pdf, simply use the huxtable package and the following code:

pacman::p_load("huxtable")

SpotsToTable <- c("BRL_Spot", "ZAR_Spot", "RUB_Spot", "GBP_Spot",
    "EUR_Spot")

Title <- "PCA Regression Table"

ht <- huxreg(Regressions %>%
    filter(Spots %in% SpotsToTable) %>%
    select(reg) %>%
    .[[1]], statistics = c(N = "nobs", R2 = "r.squared"), note = "%stars%.")

for (i in 1:(ncol(ht) - 1)) {
    ht[1, ][[1 + i]] <- SpotsToTable[i]
}

ht %>%
    set_caption(Title)
PCA Regression Table

                    BRL_Spot     ZAR_Spot     RUB_Spot
(Intercept)          0.000        0.000       -0.000
                    (0.001)      (0.001)      (0.001)
lag(Returns)        -0.063        0.008       -0.016
                    (0.042)      (0.043)      (0.036)
PC1                  0.214 ***    0.165 ***    0.305 ***
                    (0.033)      (0.039)      (0.028)
PC2                 -0.191 ***   -0.120 *     -0.071
                    (0.046)      (0.055)      (0.039)
PC3                 -0.425 ***   -0.518 ***   -0.320 ***
                    (0.065)      (0.077)      (0.055)
Supp_VIX.Index      -0.020 **     0.012        0.001
                    (0.006)      (0.008)      (0.006)
N                    290          290          290
R2                   0.534        0.486        0.648

*** p < 0.001; ** p < 0.01; * p < 0.05.

The above will look especially neat in your Texevier format.

As shown in the Texevier template - you will have to add results = 'asis' to the chunk heading for it to work though.
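A minimal chunk heading for the table above would then look something like this (assuming a standard knitr setup):

````markdown
```{r, results='asis'}
ht %>%
    set_caption(Title)
```
````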

Parts of Meucci’s study replicated

In this remaining section we calculate the effective number of bets as discussed in Meucci’s (2009) paper.

This is a very nice way of understanding exactly how much true diversification a portfolio achieves - by considering its exposure to the true risk sources that affect the portfolio.
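In symbols (following Meucci, 2009): with \(p_n\) the fraction of total portfolio variance attributed to the \(n\)th uncorrelated risk source, the effective number of bets is the exponential of the entropy:

\[ N_{Ent} = \exp\left( - \sum_{n=1}^{N} p_n \ln p_n \right) \]

This is exactly what the entropy calculations below implement.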

See the paper here, below follows an application to our DAX data again.

Let’s fit the MVO on our data, and compare the effective number of bets / entropy between that and the EW portfolio.

The following is from practical 3 last time around:

StartDate <- lubridate::ymd(20190601)
Dax <- fmxdat::DAX %>% filter(date > StartDate)

# Let's now pick and only use the top 50 stocks from the last date:
Top_50 <- 
  Dax %>% filter(date == last(date)) %>% 
  arrange(desc(weight)) %>% top_n(50, weight) %>% pull(Tickers)

# Right, with this let's subset our entire dataframe:
Dax <- 
  Dax %>% filter(Tickers %in% Top_50)

return_mat <- Dax %>% select(date, Tickers, return) %>% spread(Tickers, return)

impute_missing_returns <- function(return_mat, impute_returns_method = "NONE", Seed = 1234){
  # Make sure we have a date column called date:
  if( !"date" %in% colnames(return_mat) ) stop("No 'date' column provided in return_mat. Try again please.")

  # Note my use of 'any' below...
  # Also note that I 'return' return_mat - which stops the function and returns return_mat. 
  if( impute_returns_method %in% c("NONE", "None", "none") ) {
    if( any(is.na(return_mat)) ) warning("There are missing values in the return matrix.. Consider maybe using impute_returns_method = 'Drawn_Distribution_Own' / 'Drawn_Distribution_Collective'")
    return(return_mat)
  }

  
  if( impute_returns_method  == "Average") {

    return_mat <-
      return_mat %>% gather(Stocks, Returns, -date) %>%
      group_by(date) %>%
      mutate(Avg = mean(Returns, na.rm=T)) %>%
      mutate(Avg = coalesce(Avg, 0)) %>% # date with no returns - set avg to zero
      ungroup() %>%
      mutate(Returns = coalesce(Returns, Avg)) %>% select(-Avg) %>% spread(Stocks, Returns)

    # That is just so much easier when tidy right? See how I gathered and spread again to give back a wide df?
    
  } else

    if( impute_returns_method  == "Drawn_Distribution_Own") {

      set.seed(Seed)
      N <- nrow(return_mat)
      return_mat <-

        left_join(return_mat %>% gather(Stocks, Returns, -date),
                  return_mat %>% gather(Stocks, Returns, -date) %>% group_by(Stocks) %>%
                    do(Dens = density(.$Returns, na.rm=T)) %>%
                    ungroup() %>% group_by(Stocks) %>% # done to avoid warning.
                    do(Random_Draws = sample(.$Dens[[1]]$x, N, replace = TRUE, prob=.$Dens[[1]]$y)),
                  by = "Stocks"
        ) %>%  group_by(Stocks) %>% mutate(Row = row_number()) %>% mutate(Returns = coalesce(Returns, Random_Draws[[1]][Row])) %>%
        select(-Random_Draws, -Row) %>% ungroup() %>% spread(Stocks, Returns)

    } else

      if( impute_returns_method  == "Drawn_Distribution_Collective") {

        set.seed(Seed)
        NAll <- nrow(return_mat %>% gather(Stocks, Returns, -date))

        return_mat <-
          bind_cols(
          return_mat %>% gather(Stocks, Returns, -date),
          return_mat %>% gather(Stocks, Returns, -date) %>%
            do(Dens = density(.$Returns, na.rm=T)) %>%
            do(Random_Draws = sample(.$Dens[[1]]$x, NAll, replace = TRUE, prob=.$Dens[[1]]$y)) %>% unnest(Random_Draws)
          ) %>%
          mutate(Returns = coalesce(Returns, Random_Draws)) %>% select(-Random_Draws) %>% spread(Stocks, Returns)

      } else

        if( impute_returns_method  == "Zero") {
          warning("This is probably not the best idea but who am I to judge....")
          return_mat[is.na(return_mat)] <- 0
          return(return_mat)
        } else
          stop("Please provide a valid impute_returns_method method. Options include:\n'Average', 'Drawn_Distribution_Own', 'Drawn_Distribution_Collective' and 'Zero'.")
}

# Now we will use this function as follows (after saving and sourcing it of course....):
# Note my seed is the year, day, hour and minute - so unless you run this multiple times a minute, it will always differ.

options(scipen = 999) # Stop scientific notation
return_mat <- 
  impute_missing_returns(return_mat, impute_returns_method = "Drawn_Distribution_Collective", Seed = as.numeric(format( Sys.time(), "%Y%d%H%M")))

# Drop date column for this...
return_mat_Nodate <- data.matrix(return_mat[, -1])

# Simple Sample covariance and mean:
Sigma <- RiskPortfolios::covEstimation(return_mat_Nodate)
Mu <- RiskPortfolios::meanEstimation(return_mat_Nodate)

Entropy

Now let’s calculate comparable entropy estimates for MVO and EW

pacman::p_load(RiskPortfolios)

LB = 0
UB = 0.08

Type = "mv"
# Type = 'erc'
Opt_W_MVO <- RiskPortfolios::optimalPortfolio(mu = Mu, Sigma = Sigma,
    control = list(type = Type, constraint = "user", LB = rep(LB,
        ncol(Sigma)), UB = rep(UB, ncol(Sigma))))

df_W <- tibble(Tickers = colnames(Sigma), MVO = Opt_W_MVO) %>%
    mutate(EW = 1/n())

Now that we have weights, let’s proceed to calculate entropy

rt <- return_mat_Nodate
# From the paper:

# Follow Meucci (2009) in calculating the effective number
# of bets:
scaledrtn <- scale(rt, center = TRUE, scale = TRUE)
sigma <- cov(scaledrtn)
evec <- eigen(sigma, symmetric = TRUE)$vectors  # eigenvectors
eval <- eigen(sigma, symmetric = TRUE)$values  # eigenvalues

e_min1 <- solve(evec)  # Invert eigenvector (E^-1)
wt_mvo <- df_W %>%
    pull(MVO)
wt_EW <- df_W %>%
    pull(EW)

wtilde_MVO <- e_min1 %*% wt_mvo
v_n_MVO <- wtilde_MVO^2 * eval
p_n_MVO <- v_n_MVO/sum(v_n_MVO)  # Proportion of variance attributed to each PC.
p_n_MVO <- ifelse(p_n_MVO < 0, 0.00001, p_n_MVO)  # eq 18
entropy_MVO <- exp(-sum(p_n_MVO * log(p_n_MVO)))
print(entropy_MVO)
## [1] 1.707157

And comparing it to EW:

wtilde_EW <- e_min1 %*% wt_EW
v_n_EW <- wtilde_EW^2 * eval
p_n_EW <- v_n_EW/sum(v_n_EW)  # Proportion of variance attributed to each PC.
p_n_EW <- ifelse(p_n_EW < 0, 0.00001, p_n_EW)  # eq 18
entropy_EW <- exp(-sum(p_n_EW * log(p_n_EW)))
print(entropy_EW)
## [1] 1.053738

Following the interpretation from Meucci - a lower level of entropy suggests less diversification (fewer effective bets across risk sources), with zero risk source diversification implied by entropy = 1. Conversely, higher entropy suggests better risk source diversification.
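As a quick sanity check of this interpretation, the entropy measure collapses to 1 when all variance loads on a single principal component, and equals N under an equal spread across N components (a small self-contained sketch):

```r
# Effective number of bets for a given variance-share vector p:
ENB <- function(p) exp(-sum(p[p > 0] * log(p[p > 0])))

ENB(c(1, 0, 0, 0))  # 1: fully concentrated on one risk source
ENB(rep(1/4, 4))    # 4: equal exposure across all four risk sources
```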

Oddly - MVO performs best, with EW the worst (you can also uncomment Type = 'erc' above to compare with the equal risk contribution setting). This shows that risk source diversification is not necessarily related to traditional portfolio diversification measures (e.g. HHI) - where EW is by definition the best diversifier.

Considering risk source diversification is therefore key - although to make it a real comparison, the above needs to be done in an out-of-sample setting (it is applied in-sample here), e.g. by iteratively re-estimating the entropy using rebalancing weights and the subsequent quarter's returns.