• Introduction
    • What is clustering?
    • Distinction - overlapping or not
      • Distances?
    • k-means
  • Hierarchical Clustering
    • Clustering stocks
    • Application: Stock comovement visualization
    • Use in portfolio optimization

Introduction

In this session we will be discussing clustering and partitioning our data.

This follows from our previous sessions on dimension reduction using PCA, but extends the idea to subdividing our data into sources of risk.

We are, in effect, trying to identify statistical sources of risk within our datasets / groups of assets.

What is clustering?

Simply put, clustering means grouping assets according to some underlying similarity metric (e.g. correlation), such that similar objects end up in the same group.

With clustering, the ultimate goal is always to:

  • Maximize inter-cluster distances

  • Minimize intra-cluster distances

Notice from the above: we want the distances between the bubbles to be maximized (distinct groupings), and the distances between the dots within each bubble to be minimized (tight groups).

Identifying clusters is typically hard for humans, as we struggle to comprehend data in many dimensions. Clustering algorithms give us a statistical means of identifying such groups.

Distinction - overlapping or not

  • Partitioning: Creating non-overlapping clusters

    • Ensuring clusters are distinct

    • Often not feasible, as multi-dimensional series typically interact multiplicatively rather than additively, making strictly exclusive clusters hard to motivate.

  • Hierarchical: building a cluster tree

    • Involves clusters nesting other clusters.

Distances?

First things first, let’s discuss what we mean by distance between objects in an n-dimensional space.

Using the Pythagorean theorem, the minimum distance between two points with coordinates $(x_A, y_A)$ and $(x_B, y_B)$ is $\sqrt{(x_A - x_B)^2 + (y_A - y_B)^2}$.

So if we have two points A = (3, 7) and B = (4, 2), the minimum distance is: $\sqrt{(3 - 4)^2 + (7 - 2)^2} = \sqrt{26} \approx 5.099$

Adding a point C = (4, 6) then shows that C is closer to A than to B, as $\sqrt{(3 - 4)^2 + (7 - 6)^2} = \sqrt{2} \approx 1.4142 < 5.099$.
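More generally, for two points $A$ and $B$ in an $n$-dimensional space, the Euclidean distance extends naturally to:

$$d(A, B) = \sqrt{\sum_{k=1}^{n} (A_k - B_k)^2}$$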

In R, this is luckily easy to calculate using dist:

library(tidyverse)
A <- c(3,7)
B <- c(4,2)
C <- c(4,6)

Distmat <- rbind(A, B, C)
colnames(Distmat) <- c("X", "Y")

dist(Distmat, method = "euclidean")
##          A        B
## B 5.099020         
## C 1.414214 4.000000

Of course there are other distance measures, with other merits, but for this session let’s use Euclidean distances.

Notice from the above that the data coordinates first need to be placed on the same scale (you can use the scale function, or do the scaling by hand), as sketched below.
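As a quick illustration, using the toy coordinates from above (purely illustrative, since X and Y here already happen to be on comparable scales):

# Scale each column to mean 0 and standard deviation 1 before computing distances:
Distmat_scaled <- scale(Distmat)
# ...or by hand:
apply(Distmat, 2, function(x) (x - mean(x)) / sd(x))

dist(Distmat_scaled, method = "euclidean")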

In our later applications to financial data - we calculate distances on correlation estimates, which is a scaled version of covariance. We return to this later.
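Recall that correlation is simply covariance scaled by the respective standard deviations,

$$\rho_{i,j} = \frac{\sigma_{i,j}}{\sigma_i \, \sigma_j}$$

which is exactly the rescaling that cov2cor performs on an entire covariance matrix later in this session.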

k-means

  • Form of partition clustering;

  • Each observation is assigned to a cluster (total of k);

  • Algorithm tries to optimally assign each series to a cluster by minimizing the sum of distances to each cluster’s centroid.1

Effectively, we want to find the structure which produces minimal intra-cluster variation.
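Formally, with $k$ clusters $C_1, \dots, C_k$ and centroids $\mu_1, \dots, \mu_k$, the k-means objective is to find the assignment that minimizes the total within-cluster sum of squares:

$$\min_{C_1, \dots, C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$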

We will use the function kmeans, and use the argument centers to set the number of clusters k (we decide this ourselves, not the machine - i.e. k is a user-specified input).

Using first a common dataset - on USA arrests (from the datasets package) - let’s interrogate our data using k-means clustering (before applying it on our currencies set):

pacman::p_load("datasets")

df <- datasets::USArrests

# Compute k-means with k = 5
km_est <- kmeans(df, 5)

km_est
## K-means clustering with 5 clusters of sizes 14, 6, 4, 16, 10
## 
## Cluster means:
##      Murder   Assault UrbanPop     Rape
## 1  8.214286 173.28571 70.64286 22.84286
## 2  2.533333  50.83333 56.33333 11.71667
## 3  3.575000  80.50000 50.25000 11.20000
## 4 11.812500 272.56250 68.31250 28.37500
## 5  5.590000 112.40000 65.60000 17.27000
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              4              4              4              1              4 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              1              5              4              4              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              5              4              5              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              5              5              4              3              4 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              1              4              3              4              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              5              5              4              2              1 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              4              4              4              2              5 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              1              1              5              1              4 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              1              1              5              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              1              1              3              2              1 
## 
## Within cluster sum of squares by cluster:
## [1]  9136.6429  1792.9083   550.6775 19563.8625  1480.2100
##  (between_SS / total_SS =  90.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

From the above we see the cluster centers (Cluster means), the clustering vector - telling us which states belong to which cluster - as well as several other measures (e.g. the within-cluster sum of squares).
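These can all be extracted directly from the fitted object, e.g.:

km_est$centers                   # cluster centroids
km_est$cluster                   # cluster membership per state
km_est$withinss                  # within-cluster sum of squares per cluster
km_est$betweenss / km_est$totss  # the between_SS / total_SS ratio reported above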

Joining our cluster info to the dataset makes it easier to analyse:

df_Joined <- left_join(df %>% rownames_to_column("State"), 
                data.frame(km_est$cluster) %>% rownames_to_column("State"),
                by = "State"
) %>% as_tibble() # tbl_df() is deprecated in newer dplyr versions; as_tibble() gives the same result

df_Joined
## # A tibble: 50 × 6
##    State       Murder Assault UrbanPop  Rape km_est.cluster
##    <chr>        <dbl>   <int>    <int> <dbl>          <int>
##  1 Alabama       13.2     236       58  21.2              4
##  2 Alaska        10       263       48  44.5              4
##  3 Arizona        8.1     294       80  31                4
##  4 Arkansas       8.8     190       50  19.5              1
##  5 California     9       276       91  40.6              4
##  6 Colorado       7.9     204       78  38.7              1
##  7 Connecticut    3.3     110       77  11.1              5
##  8 Delaware       5.9     238       72  15.8              4
##  9 Florida       15.4     335       80  31.9              4
## 10 Georgia       17.4     211       60  25.8              1
## # … with 40 more rows

Visualizing this

In order to visualize the above, we can make use of the package factoextra:

g <- factoextra::fviz_cluster(object = km_est, data = df, ellipse = T, show.clust.cent = T)
g + theme_bw() + labs(title = "Illustrating the k-means partitioning done on USA Arrests by state") + theme(legend.position = "bottom")

Typically one would estimate k-means solutions for several values of k, and choose the k beyond which the within-cluster sum of squares stops improving materially (the "elbow").
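One way to eyeball this, sketched below using factoextra (already used for plotting above), is to plot the total within-cluster sum of squares against k:

# Plot total within-cluster sum of squares for k = 1..10 and look for the 'elbow':
factoextra::fviz_nbclust(df, kmeans, method = "wss", k.max = 10)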

  • If you do want to utilize this form of data partitioning, consider the pam algorithm from the cluster package, which builds clusters around medoids. This is arguably more robust than k-means as done above (do this yourself), and, crucially, less distorted by inevitable outliers - see the short sketch below.
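A minimal sketch of what this might look like on the same data (scaling first, and assuming k = 5 for comparability with the k-means fit above):

pacman::p_load(cluster)

pam_est <- cluster::pam(scale(df), k = 5)  # partition around medoids
pam_est$medoids                            # the representative (medoid) states
table(pam_est$clustering)                  # cluster sizes
# factoextra::fviz_cluster(pam_est) should visualize this much as before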

Application of k-means clustering

Let’s apply this on our currency dataset from the PCA session.

In this application, we will now extend our analysis to the comovement space using clustering analysis.

Spots <- fmxdat::PCA_EX_Spots

# Let's calculate the sample covariance matrix (from previous practical on using multi-variate t for our cov calc):
Sigma <- 
  Spots %>% spread(Spots, Return) %>% select(-date) %>% 
  fitHeavyTail::fit_mvt() %>% .$cov # the piped data supplies fit_mvt's first argument

# To calculate the correlation structure from our covariance matrix, let's use cov2cor:
# cov2cor scales a covariance matrix into the corresponding correlation matrix efficiently...

corr <- cov2cor(Sigma)
# Let's now calculate our distance matrix as:
distmat <- ((1 - corr) / 2)^0.5

# Use PAM for medoid grouping:
clust <- cluster::pam(dist(distmat), k = 3)  
# df grouping:
cluster_df <-
left_join(
  Spots,
  data.frame(clust$clustering) %>% rownames_to_column() %>% set_names("Spots", "Clusters"),
  by = "Spots"
)

# Cluster for plotting: let's again use standard k-means here
clust_plot <- kmeans(dist(distmat), 5)  
g <- factoextra::fviz_cluster(object = clust_plot, data = corr, ellipse = T, show.clust.cent = T)
g + theme_bw() + labs(title = "Illustrating the k-means partitioning for currency comovement") + theme(legend.position = "bottom")

Let’s digress quickly….

Note that you can always use the widyr package to calculate a df of bivariate correlation estimates very simply as:

pacman::p_load("widyr")

Spots %>% mutate(Year = format(date, "%Y")) %>% widyr::pairwise_cor(Spots, Return, upper = F, sort = T)
## # A tibble: 465 × 3
##    item1    item2    correlation
##    <chr>    <chr>          <dbl>
##  1 CAD_Spot TWD_Spot     -0.0264
##  2 CLP_Spot PEN_Spot     -0.0293
##  3 AUD_Spot CLP_Spot     -0.0294
##  4 CLP_Spot PHP_Spot     -0.0294
##  5 CLP_Spot GBP_Spot     -0.0295
##  6 CLP_Spot JPY_Spot     -0.0295
##  7 CLP_Spot THB_Spot     -0.0295
##  8 AUD_Spot PEN_Spot     -0.0295
##  9 BGN_Spot CLP_Spot     -0.0296
## 10 CAD_Spot CLP_Spot     -0.0296
## # … with 455 more rows

Hierarchical Clustering

In contrast to using k-means clustering, hierarchical clustering decomposes our comovement structure into nested clusters.

Intuitively, we are interested in creating a cluster tree - where at the lowest level each stock represents its own unique cluster (the starting point), and stocks are then sequentially grouped into the nodes of a tree by merging the closest observations (e.g. by Euclidean distance) at each layer. The tree is built up iteratively until all the stocks are part of a single cluster (step 4 in the list below).

Another visual, from here, summarizes these iterative steps nicely:

I applied this process to the Top 40 Index below. The process involves:

  • Compute a proximity matrix (using, e.g., the Euclidean distances discussed above)

  • Start with each stock as its own cluster

  • Group the closest stocks / clusters / combinations into sequential clusters

  • Iterate this process until we have a single cluster containing all the stocks

  • We then plot the result using a dendrogram (note - I followed a sequential grouping process as per the above illustration, building the tree bottom-up. This process is called Agglomerative Nesting or AGNES):2

For the above, I coloured five main nodes.

Dendrograms provide easy-to-digest visualizations of the complex comovement structure underlying our data.

The above dendrogram can be interpreted as follows:

  • The higher the height of the fusion of clusters, the less similar the objects are.

  • The banks and retailers (orange) form the closest clusters (lowest common nodes) - while AMS, ANG and GFI form the clusters least similar to the rest of the TOPI.

  • SOL is entirely in its own cluster.

So, while the distance calculations between points have been discussed above (similar to k-means), what remains to be discussed is how exactly we calculate the distance between clusters.

There are several ways to do this:

  1. Minimum distance (single linkage)

  • This defines the distance between two clusters, $C_i$ and $C_j$, as the minimum distance between points $p_i$ and $p_j$ in clusters $i$ and $j$.

  • Downside - not ideal with noisy and non-cleanly separated clusters (clusters tend to look more jumbled than the simple picture above).

  2. Maximum distance (complete linkage) - the reverse of the above.

  3. Group average

  • Take all pairs of points across the two clusters and average their similarities to produce a single similarity score for the clusters.

  4. Ward’s method

  • Similar to group average, but we consider the sum of squared distances between all points in the two distinct clusters, scaled by the cluster sizes:

$$\frac{\sum_{p_i \in C_1,\, p_j \in C_2} \mathrm{dist}(p_i, p_j)^2}{|C_1| \, |C_2|}$$

To illustrate the above, let’s create some fake data and showcase our ability to create cluster dendrograms using the cluster package, with a Ward clustering approach and an agglomerative nesting (AGNES) procedure:

pacman::p_load(cluster)
set.seed(12345)
X <- data.frame(x = rnorm(10), Y = rnorm(10))
cluster <- cluster::agnes(dist(X), method = "ward") 
plot(cluster)

We could also use the native hclust functionality (probably easier to digest):

X <- data.frame(x = rnorm(10), Y = rnorm(10))
cluster <- hclust(dist(X), method = "ward.D")

plot(cluster)

# We can also overlay rectangles on the dendrogram, highlighting a chosen number of clusters (below: k = 2, 3, 4 and 5 in different colours)
rect.hclust(cluster,
  k = 2, # k is used to specify the number of clusters
  border = "red"
)

rect.hclust(cluster,
  k = 3, # k is used to specify the number of clusters
  border = "darkgreen"
)

rect.hclust(cluster,
  k = 4, # k is used to specify the number of clusters
  border = "steelblue"
)

rect.hclust(cluster,
  k = 5, # k is used to specify the number of clusters
  border = "orange"
)

You should now stop, and view the cheatsheet for clustering in R here to get a fuller understanding of the theory and application.

Next, we will apply these techniques to split a set of stocks into meaningful clusters.

Clustering stocks

Below follows a nice walk-through of plotting dendrograms using the ggdendro package in R.

First, though, source some auxiliary functions from my gist here as follows:

devtools::source_gist("https://gist.github.com/Nicktz/bd2614f8f8a551881a1dc3c11a1e7268")

Now we can use the loaded function, cluster_aux(), and then plot nicer dendrograms (for more, see here):

# Auxiliary functions required to run the following:
cluster_aux()
# Notice several functions are now loaded in your environment. Let's proceed:
mtc <- scale(mtcars)
D   <- dist(mtc)
hc  <- hclust(D)
hcdata <- dendro_data_k(hc, 3)
p <- plot_ggdendro(hcdata,
                   direction = "lr",
                   expand.y  = 0.2)
p

# But, alas - we can do much better than this.... using the gist sourced function!
cols <- c("#a9a9a9", "#1f77b4", "#ff7f0e", "#2ca02c")
p <- plot_ggdendro(hcdata,
                   direction   = "tb",
                   scale.color = cols,
                   label.size  = 2.5,
                   branch.size = 0.5,
                   expand.y    = 0.2)
p <- p + theme_void() + expand_limits(x = c(-1, 32)) + labs(title = "Nice Dendrogram of Cars", caption = "Dendrogram created from scaled mtcars data using hclust on Euclidean distances")
p

Let’s apply this to an actual group of stocks and discuss the process of doing so.

Application: Stock comovement visualization

  • In the following, we calculate the correlation matrix of the 50 largest German index stocks, derived from a Ledoit-Wolf shrunk covariance matrix

  • We then display the complex comovement structure using a dendrogram, colouring several groups that are of note.

  • We will again require the NA replacement function from our previous portfolio optimization tutorial.

dfrets <- fmxdat::DAX
# 
Top_50 <- dfrets %>% filter(date == last(date)) %>% top_n(50, weight) %>% pull(Tickers)
return_mat <- 
  dfrets %>% 
  filter(Tickers %in% Top_50) %>% 
  mutate(Tickers = gsub(" GR Equity", "", Tickers)) %>% 
  select(date, Tickers, return) %>% 
  spread(Tickers, return)

…Let’s load our old imputation function…

impute_missing_returns <- function(return_mat, impute_returns_method = "NONE", Seed = 1234){
  # Make sure we have a date column called date:
  if( !"date" %in% colnames(return_mat) ) stop("No 'date' column provided in return_mat. Try again please.")

  # Note my use of 'any' below...
  # Also note that I 'return' return_mat - which stops the function and returns return_mat. 
  if( impute_returns_method %in% c("NONE", "None", "none") ) {
    if( any(is.na(return_mat)) ) warning("There are missing values in the return matrix.. Consider maybe using impute_returns_method = 'Drawn_Distribution_Own' / 'Drawn_Distribution_Collective'")
    return(return_mat)
  }

  
  if( impute_returns_method  == "Average") {

    return_mat <-
      return_mat %>% gather(Stocks, Returns, -date) %>%
      group_by(date) %>%
      mutate(Avg = mean(Returns, na.rm=T)) %>%
      mutate(Avg = coalesce(Avg, 0)) %>% # date with no returns - set avg to zero
      ungroup() %>%
      mutate(Returns = coalesce(Returns, Avg)) %>% select(-Avg) %>% spread(Stocks, Returns)

    # That is just so much easier when tidy right? See how I gathered and spread again to give back a wide df?
    return(return_mat)
  } else

    if( impute_returns_method  == "Drawn_Distribution_Own") {

      set.seed(Seed)
      N <- nrow(return_mat)
      return_mat <-
        # DIY: see what density function does
left_join(return_mat %>% gather(Stocks, Returns, -date),
          return_mat %>% gather(Stocks, Returns, -date) %>% group_by(Stocks) %>%
          mutate(Dens = list(density(Returns, na.rm=T))) %>%
          summarise(Random_Draws = list(sample(Dens[[1]]$x, N, replace = TRUE, prob=.$Dens[[1]]$y))),
          by = "Stocks"
) %>%  group_by(Stocks) %>% 
  # Random draw from sample:
  mutate(Returns = coalesce(Returns, Random_Draws[[1]][row_number()])) %>%
  select(-Random_Draws) %>% ungroup() %>% spread(Stocks, Returns)
return(return_mat)
    } else

      if( impute_returns_method  == "Drawn_Distribution_Collective") {

        set.seed(Seed)
        NAll <- nrow(return_mat %>% gather(Stocks, Returns, -date))
        # DIY: see what density function does
        return_mat <-
          bind_cols(
          return_mat %>% gather(Stocks, Returns, -date),
          return_mat %>% gather(Stocks, Returns, -date) %>%
            mutate(Dens = list(density(Returns, na.rm=T))) %>%
            summarise(Random_Draws = list(sample(Dens[[1]]$x, NAll, replace = TRUE, prob=.$Dens[[1]]$y))) %>% 
            unnest(Random_Draws)
          ) %>%
          mutate(Returns = coalesce(Returns, Random_Draws)) %>% select(-Random_Draws) %>% spread(Stocks, Returns)
return(return_mat)
      } else

        if( impute_returns_method  == "Zero") {
        warning("This is probably not the best idea but who am I to judge....")
          return_mat[is.na(return_mat)] <- 0
  return(return_mat)
        } else
          stop("Please provide a valid impute_returns_method method. Options include:\n'Average', 'Drawn_Distribution_Own', 'Drawn_Distribution_Collective' and 'Zero'.")
  
  return_mat
  
}

# Now we will use this function as follows (after saving and sourcing it of course....):
# Note my seed is the year, day hour and minute - so unless you do this multiple times a minute, it will always differ.

options(scipen = 999) # Stop scientific notation in printed output

Let’s get the correlation matrix first, and then calculate the distance matrix that will be used to construct our dendrogram, given (using the Dower metric) as:

$$D = \{d_{i,j}\}_{i,j \in 1, \dots, N} \quad \text{with:} \quad d_{i,j} = \sqrt{\tfrac{1}{2}(1 - \rho_{i,j})}$$

with $\rho_{i,j}$ the pairwise correlation between the returns of stocks $i$ and $j$.
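A quick sanity check of this mapping (illustrative values only): perfectly positively correlated series sit at zero distance, uncorrelated series at roughly 0.71, and perfectly negatively correlated series at the maximum distance of 1.

rho <- c(1, 0, -1)
sqrt((1 - rho) / 2)  # 0.0000000 0.7071068 1.0000000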

We then cluster the distance matrix using AGNES clustering:

return_mat <- impute_missing_returns(return_mat, impute_returns_method = "Drawn_Distribution_Collective", Seed = as.numeric(format( Sys.time(), "%Y%d%H%M"))) %>% select(-date)
# Ledoit Wolf shrunken covariance matrix:
Sigma <- RiskPortfolios::covEstimation(as.matrix(return_mat), control = list(type = "lw"))
# Or, controlling for fat tails as before:
# Sigma <- fitHeavyTail::fit_mvt(return_mat) %>% .$cov

corr <- cov2cor(Sigma)

Now we calculate the distance matrix:

distmat <- ((1 - corr) / 2)^0.5

And finally, calculate the clustered object using AGNES clustering and Ward distances:

cluster <- cluster::agnes(dist(distmat), method = "ward")

Now let’s plot this using our loaded auxiliary plots:

hcdata <- dendro_data_k(cluster, 4)  # cut the AGNES tree into 4 coloured groups

cols <- c("#a9a9a9", "#1f77b4", "#ff7f0e", "#2ca02c", "#AD3636")
p <- plot_ggdendro(hcdata,
                   direction   = "tb",
                   scale.color = cols,
                   label.size  = 2.5,
                   branch.size = 0.5,
                   expand.y    = 0.2)
p <- p + theme_void() + expand_limits(x = c(-1, 32))
p + labs(title = "Dendrogram of Top 50 German Stocks", caption = "Dendrogram created using Ward distances and AGNES clustering")

Use in portfolio optimization

The above framework can now be used in a portfolio construction process - where capital is allocated along the identified risk sources.

The canonical framework for doing so is HRP (Hierarchical Risk Parity) - but in principle, any optimization routine could be utilized in allocating capital along the cluster dimensions.

  • See our working paper here
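To make the idea of allocating along cluster dimensions concrete, here is a deliberately simple sketch (not HRP proper, and all inputs below are toy placeholders): equal weight within each cluster, inverse-variance weights across clusters.

# Toy sketch: equal weight within clusters, inverse-variance across clusters.
# Sigma and members below are illustrative placeholders (in practice they would come
# from your shrunk covariance matrix and, e.g., cutree on the fitted cluster tree).
set.seed(1)
R <- matrix(rnorm(200 * 4, sd = 0.01), ncol = 4,
            dimnames = list(NULL, c("A", "B", "C", "D")))
Sigma   <- cov(R)                        # toy covariance matrix
members <- c(A = 1, B = 1, C = 2, D = 2) # toy cluster membership

cluster_var <- sapply(split(names(members), members), function(nms) {
  w <- rep(1 / length(nms), length(nms)) # equal weight within the cluster
  as.numeric(t(w) %*% Sigma[nms, nms] %*% w)
})
cluster_w <- (1 / cluster_var) / sum(1 / cluster_var) # inverse-variance across clusters

weights <- unlist(lapply(names(cluster_w), function(cl) {
  nms <- names(members)[members == as.numeric(cl)]
  setNames(rep(cluster_w[[cl]] / length(nms), length(nms)), nms)
}))
weights  # sums to 1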

END


  1. For more, see: Hartigan, JA, and MA Wong. 1979. “Algorithm AS 136: A K-means clustering algorithm.” Applied Statistics. Royal Statistical Society, 100–108.↩︎

  2. Divisive Nesting starts from a single node and divides it into sequentially smaller nodes. This is not discussed here for brevity.↩︎