In this session we will be discussing clustering and partitioning our data.
This follows from our previous sessions on data dimension reduction using PCA, but extends this to subdividing our data into sources of risk.
We are, in effect, trying to identify statistical sources of risk within our datasets / groups of assets.
Simply put, clustering means grouping assets according to some underlying similarity metric (e.g. correlation) - such that similar objects end up in the same group.
With clustering, the ultimate goal is always to:
Maximize inter-cluster differences
Minimize intra-cluster distances
Notice from the above - we want the distances between bubbles (the clusters) to be maximized (distinct groupings), and the distances between dots within each bubble to be minimized (tight groups).
Identifying clusters is typically hard for humans, as we struggle to reason beyond a few dimensions. Clustering algorithms give us a statistical means of identifying such groups.
Partitioning: Creating non-overlapping clusters
Ensuring clusters are distinct
Mostly hard to motivate for financial data, as the drivers of multi-dimensional return series tend to be multiplicative rather than additive - making strictly exclusive clusters difficult to justify.
Hierarchical: building a cluster tree
First things first, let’s discuss what we mean by distance between objects in an n-dimensional space.
Using the Pythagorean theorem, the minimum distance between two points with coordinates $(x_A, y_A)$ and $(x_B, y_B)$ is $\sqrt{(x_A - x_B)^2 + (y_A - y_B)^2}$.
So if we have two points A = (3, 7) and B = (4, 2), the minimum distance is: $\sqrt{(3-4)^2 + (7-2)^2} = \sqrt{26} \approx 5.099$
Adding a point C = (4, 6) then shows that C is closer to A than to B, as $\sqrt{(3-4)^2 + (7-6)^2} = \sqrt{2} \approx 1.4142$, while the distance between C and B is $\sqrt{(4-4)^2 + (6-2)^2} = 4$.
In R, this is luckily easy to calculate using dist:
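Here is a minimal sketch of the call producing the output below (the coordinates are those of A, B and C above):
pts <- rbind(A = c(3, 7), B = c(4, 2), C = c(4, 6)) # the three points from the example
dist(pts, method = "euclidean")                     # pairwise Euclidean distances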
## A B
## B 5.099020
## C 1.414214 4.000000
Of course there are other distance measures, with other merits, but for this session let’s use Euclidean distances.
Notice from the above that the data coordinates first need to be placed on the same scale (you can use the scale function, or standardize by hand).
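As a quick, hedged illustration (toy data of my own; scale simply subtracts each column's mean and divides by its standard deviation):
dat <- data.frame(Height = c(170, 180, 165), Weight = c(70, 95, 60)) # columns on very different scales
dat_scaled <- scale(dat) # equivalent to scaling by hand: (x - mean(x)) / sd(x), column-wise
dist(dat_scaled)         # distances now computed on comparable scales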
In our later applications to financial data - we calculate distances on correlation estimates, which is a scaled version of covariance. We return to this later.
A form of partition clustering;
Each observation is assigned to a cluster (total of k);
The algorithm tries to optimally assign each series to a cluster by minimizing the sum of squared distances to each cluster’s centroid.1
Effectively, we want to find the structure which produces minimal intra-cluster variation, as formalized below.
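Formally (standard notation, not the notes' own), for $k$ clusters $C_1, \dots, C_k$ with centroids $\mu_1, \dots, \mu_k$, k-means seeks the assignment that minimizes the total within-cluster sum of squares:
$$\min_{C_1, \dots, C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$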
We will use the function kmeans, with the argument centers setting the number of clusters to identify (we decide this, not the machine - i.e. the number of clusters is a user input).
Using first a common dataset - USArrests (from the datasets package) - let’s interrogate our data using k-means clustering (before applying it to our currencies set):
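A sketch of the call generating output along the lines shown below (five centers are chosen for illustration; the seed is an assumption, and without one the cluster sizes and labels may differ):
set.seed(1234)                           # assumed seed - k-means depends on random starting points
km_est <- kmeans(USArrests, centers = 5) # USArrests ships with the datasets package
km_est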
## K-means clustering with 5 clusters of sizes 14, 6, 4, 16, 10
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 8.214286 173.28571 70.64286 22.84286
## 2 2.533333 50.83333 56.33333 11.71667
## 3 3.575000 80.50000 50.25000 11.20000
## 4 11.812500 272.56250 68.31250 28.37500
## 5 5.590000 112.40000 65.60000 17.27000
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 4 4 4 1 4
## Colorado Connecticut Delaware Florida Georgia
## 1 5 4 4 1
## Hawaii Idaho Illinois Indiana Iowa
## 2 5 4 5 2
## Kansas Kentucky Louisiana Maine Maryland
## 5 5 4 3 4
## Massachusetts Michigan Minnesota Mississippi Missouri
## 1 4 3 4 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 5 5 4 2 1
## New Mexico New York North Carolina North Dakota Ohio
## 4 4 4 2 5
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 1 1 5 1 4
## South Dakota Tennessee Texas Utah Vermont
## 3 1 1 5 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 1 1 3 2 1
##
## Within cluster sum of squares by cluster:
## [1] 9136.6429 1792.9083 550.6775 19563.8625 1480.2100
## (between_SS / total_SS = 90.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
From the above we see the cluster centers (Cluster means) and the clustering vector - telling us which states belong to which cluster - as well as several other measures (e.g. the within-cluster sum of squares).
Joining our cluster info to the dataset makes it easier to analyse:
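One way of doing this join (a sketch - the exact wrangling used to produce the tibble below is an assumption; km_est is the kmeans object from above):
library(tidyverse)
USArrests %>%
  rownames_to_column("State") %>%             # the states are stored as row names
  mutate(km_est.cluster = km_est$cluster) %>% # append each state's assigned cluster
  as_tibble()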
## # A tibble: 50 × 6
## State Murder Assault UrbanPop Rape km_est.cluster
## <chr> <dbl> <int> <int> <dbl> <int>
## 1 Alabama 13.2 236 58 21.2 4
## 2 Alaska 10 263 48 44.5 4
## 3 Arizona 8.1 294 80 31 4
## 4 Arkansas 8.8 190 50 19.5 1
## 5 California 9 276 91 40.6 4
## 6 Colorado 7.9 204 78 38.7 1
## 7 Connecticut 3.3 110 77 11.1 5
## 8 Delaware 5.9 238 72 15.8 4
## 9 Florida 15.4 335 80 31.9 4
## 10 Georgia 17.4 211 60 25.8 1
## # … with 40 more rows
In order to visualize the above, we can make use of the package factoextra:
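For example (a sketch, using the km_est object estimated above):
factoextra::fviz_cluster(km_est, data = USArrests, ellipse = TRUE, show.clust.cent = TRUE)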
Typically one would estimate k-means for several values of k - and choose the level of k beyond which the within-cluster sum of squares stops falling meaningfully (the so-called elbow method), as sketched below.
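A sketch of this idea using factoextra's elbow plot (scaling the inputs here is my own choice):
# Total within-cluster sum of squares for k = 1..10; look for the 'elbow' where the curve flattens:
factoextra::fviz_nbclust(scale(USArrests), kmeans, method = "wss", k.max = 10)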
Let’s apply this to our currency dataset from the PCA session.
In this application, we extend our analysis to the comovement space using clustering.
Spots <- fmxdat::PCA_EX_Spots
# Let's calculate the sample covariance matrix (from the previous practical on using the multivariate t for our cov calc):
Sigma <-
  Spots %>% spread(Spots, Return) %>% select(-date) %>%
  fitHeavyTail::fit_mvt() %>% .$cov
# To calculate the correlation structure from our covariance matrix, let's use cov2cor:
# cov2cor scales a covariance matrix into the corresponding correlation matrix efficiently...
corr <- cov2cor(Sigma)
# Let's now calculate our distance matrix as:
distmat <- ((1 - corr) / 2)^0.5
# Use PAM for medoid grouping:
clust <- cluster::pam(dist(distmat), k = 3)
# df grouping:
cluster_df <-
  left_join(
    Spots,
    data.frame(clust$clustering) %>%
      rownames_to_column() %>%
      set_names("Spots", "Clusters"),
    by = "Spots"
  )
# Cluster for plotting: let's again use standard k-means for plotting purposes
clust_plot <- kmeans(dist(distmat), 5)
g <- factoextra::fviz_cluster(object = clust_plot, data = corr, ellipse = T, show.clust.cent = T)
g + theme_bw() + labs(title = "Illustrating the k-means partitioning for currency comovement") + theme(legend.position = "bottom")
Let’s digress quickly…
Note that you can always use the widyr package to calculate a df of bivariate correlation estimates very simply, as below:
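A sketch of such a call, assuming Spots is still in long format with columns Spots, date and Return (the ordering of the printed rows below is not reproduced here):
library(widyr)
Spots %>%
  pairwise_cor(item = Spots, feature = date, value = Return)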
## # A tibble: 465 × 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 CAD_Spot TWD_Spot -0.0264
## 2 CLP_Spot PEN_Spot -0.0293
## 3 AUD_Spot CLP_Spot -0.0294
## 4 CLP_Spot PHP_Spot -0.0294
## 5 CLP_Spot GBP_Spot -0.0295
## 6 CLP_Spot JPY_Spot -0.0295
## 7 CLP_Spot THB_Spot -0.0295
## 8 AUD_Spot PEN_Spot -0.0295
## 9 BGN_Spot CLP_Spot -0.0296
## 10 CAD_Spot CLP_Spot -0.0296
## # … with 455 more rows
In contrast to using k-means clustering, hierarchical clustering decomposes our comovement structure into nested clusters.
Intuitively, we are interested in creating a cluster tree - where at the lowest level each stock represents its own unique cluster (the starting point), while we sequentially group stocks into the nodes of a tree by joining the closest observations (e.g. by Euclidean distance) at each layer. The tree is then iteratively built up until all the stocks are part of a single cluster (step 4 below):
Another visual from here summarizes the iterative steps nicely:
I applied this process to the Top 40 Index below. The process involves:
Computing a proximity matrix (using, e.g., Euclidean distances as discussed above)
Starting with each stock as its own cluster
Grouping the closest stocks / clusters / combinations thereof into sequentially larger clusters.
Iterating the process until we have a single cluster containing all the stocks.
We then plot it using a dendrogram (note - I followed a sequential grouping process as per the above illustration, building the tree bottom-up. This process is called Agglomerative Nesting, or AGNES):2
For the above, I coloured five main nodes.
Dendrograms are useful in providing us with easy-to-digest visualizations of the complex comovement structure underlying our data.
The above dendrogram can be interpreted as follows:
The higher the height of the fusion of clusters, the less similar the objects are.
The banks and retailers (orange) form the closest clusters (lowest common nodes) - while AMS, ANG and GFI form clusters least similar to the rest of the TOPI.
SOL is entirely in its own cluster.
So… while the distance calculations have been discussed above (similar to k-means) - what remains is to discuss how exactly we calculate the distance between clusters.
There are several ways to do this (a short illustrative sketch follows the list below):
Max Distance (the converse of the minimum distance above)
Group Average
$\frac{\sum \mathrm{dist}(p_i, p_j)^2}{|c_1| \ast |c_2|}$
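As an illustration only (toy data of my own, not part of the notes), these linkage choices map directly onto the method argument of cluster::agnes:
library(cluster)
set.seed(42)                                     # purely illustrative data
toy <- matrix(rnorm(60), ncol = 3)
d_toy <- dist(toy)
ag_single   <- agnes(d_toy, method = "single")   # minimum distance between clusters
ag_complete <- agnes(d_toy, method = "complete") # maximum distance between clusters
ag_average  <- agnes(d_toy, method = "average")  # group average linkage
pltree(ag_average, main = "AGNES, average linkage")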
We could also use the native hclust functionality (probably easier to digest):
X <- data.frame(x = rnorm(10), Y = rnorm(10))
cluster <- hclust(dist(X), method = "ward.D")
plot(cluster)
# We can also plot the dendrogram with rectangles (k specifies the number of clusters to highlight)
rect.hclust(cluster, k = 2, border = "red")
rect.hclust(cluster, k = 3, border = "darkgreen")
rect.hclust(cluster, k = 4, border = "steelblue")
rect.hclust(cluster, k = 5, border = "orange")
You should now stop, and view the cheatsheet for clustering in R here to get a fuller understanding of the theory and application.
Next, we will apply these techniques to split a set of stocks into meaningful clusters.
Below follows a nice walk-through of plotting dendrograms using the ggdendro package in R.
First, though, source some auxiliary functions from my gist here as follows:
Now we can use the loaded function, cluster_aux(), and then plot nicer dendrograms (for more, see here).
# But, alas - we can do much better than this... using the gist-sourced function!
cols <- c("#a9a9a9", "#1f77b4", "#ff7f0e", "#2ca02c")
p <- plot_ggdendro(hcdata,
                   direction = "tb",
                   scale.color = cols,
                   label.size = 2.5,
                   branch.size = 0.5,
                   expand.y = 0.2)
p <- p + theme_void() + expand_limits(x = c(-1, 32)) +
  labs(title = "Nice Dendrogram of Cars",
       caption = "Dendrogram created using Ward distances and AGNES clustering")
p
Let’s apply this to an actual group of stocks and discuss the process of doing so.
In the following, we calculate the correlation matrix of the 50 largest German index stocks, derived from a Ledoit-Wolf shrunk covariance matrix.
We then display the complex comovement structure using a dendrogram, colouring several groups that are of note.
We will again require the NA replacement function from our previous portfolio optimization tutorial.
…Let’s load our old imputation function…
impute_missing_returns <- function(return_mat, impute_returns_method = "NONE", Seed = 1234){
  # Make sure we have a date column called date:
  if( !"date" %in% colnames(return_mat) ) stop("No 'date' column provided in return_mat. Try again please.")

  # Note my use of 'any' below...
  # Also note that I 'return' return_mat - which stops the function and returns return_mat.
  if( impute_returns_method %in% c("NONE", "None", "none") ) {
    if( any(is.na(return_mat)) ) warning("There are missing values in the return matrix.. Consider maybe using impute_returns_method = 'Drawn_Distribution_Own' / 'Drawn_Distribution_Collective'")
    return(return_mat)
  }

  if( impute_returns_method == "Average") {
    return_mat <-
      return_mat %>% gather(Stocks, Returns, -date) %>%
      group_by(date) %>%
      mutate(Avg = mean(Returns, na.rm = T)) %>%
      mutate(Avg = coalesce(Avg, 0)) %>% # date with no returns - set avg to zero
      ungroup() %>%
      mutate(Returns = coalesce(Returns, Avg)) %>% select(-Avg) %>% spread(Stocks, Returns)
    # That is just so much easier when tidy right? See how I gathered and spread again to give back a wide df?
    return(return_mat)
  } else
    if( impute_returns_method == "Drawn_Distribution_Own") {
      set.seed(Seed)
      N <- nrow(return_mat)
      return_mat <-
        # DIY: see what density function does
        left_join(return_mat %>% gather(Stocks, Returns, -date),
                  return_mat %>% gather(Stocks, Returns, -date) %>% group_by(Stocks) %>%
                    mutate(Dens = list(density(Returns, na.rm = T))) %>%
                    summarise(Random_Draws = list(sample(Dens[[1]]$x, N, replace = TRUE, prob = .$Dens[[1]]$y))),
                  by = "Stocks"
        ) %>% group_by(Stocks) %>%
        # Random draw from sample:
        mutate(Returns = coalesce(Returns, Random_Draws[[1]][row_number()])) %>%
        select(-Random_Draws) %>% ungroup() %>% spread(Stocks, Returns)
      return(return_mat)
    } else
      if( impute_returns_method == "Drawn_Distribution_Collective") {
        set.seed(Seed)
        NAll <- nrow(return_mat %>% gather(Stocks, Returns, -date))
        # DIY: see what density function does
        return_mat <-
          bind_cols(
            return_mat %>% gather(Stocks, Returns, -date),
            return_mat %>% gather(Stocks, Returns, -date) %>%
              mutate(Dens = list(density(Returns, na.rm = T))) %>%
              summarise(Random_Draws = list(sample(Dens[[1]]$x, NAll, replace = TRUE, prob = .$Dens[[1]]$y))) %>%
              unnest(Random_Draws)
          ) %>%
          mutate(Returns = coalesce(Returns, Random_Draws)) %>% select(-Random_Draws) %>% spread(Stocks, Returns)
        return(return_mat)
      } else
        if( impute_returns_method == "Zero") {
          warning("This is probably not the best idea but who am I to judge....")
          return_mat[is.na(return_mat)] <- 0
          return(return_mat)
        } else
          stop("Please provide a valid impute_returns_method method. Options include:\n'Average', 'Drawn_Distribution_Own', 'Drawn_Distribution_Collective' and 'Zero'.")

  return_mat
}
# Now we will use this function as follows (after saving and sourcing it of course....):
# Note my seed is the year, day hour and minute - so unless you do this multiple times a minute, it will always differ.
options(scipen = 999) # Stop scientific notation in printed output
Let’s get the correlation matrix first, and then calculate the distance matrix that will be used to construct our dendrogram, given as (using the Dower metric):
$$D = d_{i,j} \;\; \forall \; i,j \in 1, \dots, N, \quad \text{with: } \; d_{i,j} = \sqrt{\tfrac{1}{2}\left(1 - \rho_{i,j}\right)}$$
We then cluster the distance matrix using AGNES clustering:
return_mat <-
  impute_missing_returns(return_mat,
                         impute_returns_method = "Drawn_Distribution_Collective",
                         Seed = as.numeric(format(Sys.time(), "%Y%d%H%M"))) %>%
  select(-date)
# Ledoit Wolf shrunken covariance matrix:
Sigma <- RiskPortfolios::covEstimation(as.matrix(return_mat), control = list(type = "lw"))
# Or, controlling for fat tails as before:
# Sigma <- fitHeavyTail::fit_mvt(return_mat) %>% .$cov
corr <- cov2cor(Sigma)
distmat <- ((1 - corr) / 2)^0.5
Now we convert our distance matrix into a distance object:
D <- dist(distmat)
And finally, calculate the clustered object using agglomerative (AGNES-style) clustering and Ward distances:
hc <- hclust(D, method = "ward.D")
Now let’s plot this using our loaded auxiliary plotting functions:
hcdata <- dendro_data_k(hc, 4)
p <- plot_ggdendro(hcdata,
                   direction = "lr",
                   expand.y = 0.2)
cols <- c("#a9a9a9", "#1f77b4", "#ff7f0e", "#2ca02c", "#AD3636")
p <- plot_ggdendro(hcdata,
                   direction = "tb",
                   scale.color = cols,
                   label.size = 2.5,
                   branch.size = 0.5,
                   expand.y = 0.2)
p <- p + theme_void() + expand_limits(x = c(-1, 32))
p + labs(title = "Dendrogram of Top 50 German Stocks",
         caption = "Dendrogram created using Ward distances and AGNES clustering")
The above framework can now be used in a portfolio construction process - where capital is allocated along the identified risk sources.
The canonical framework for doing so is HRP (Hierarchical Risk Parity) - but in principle, any optimization routine could be utilized in allocating capital along the cluster dimensions.
END