library(tidyverse) # Used for data wrangling
library(lubridate) # Date handling
In this tutorial, we will use the tidyfit package to analyze the factor loadings of a universe of funds (e.g. to identify style tilts for individual funds). The primary purpose is to illustrate the usage and advantages of tidyfit in R, and how it can help us fit, compare and evaluate many models with very little code.
To install tidyfit, run the following code chunk:
# devtools::install_github("jpfitzinger/tidyfit")
library(tidyfit)
If you’d like to do some background reading, have a look at the website. If you run into problems or find bugs in the package, please don’t hesitate to contact me.
Some additional packages are necessary for the models we will fit (tidyfit does not install all model-specific dependencies upfront). You will be prompted along the way to install bestglm, shrinkTVP and arm.
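If you would rather install these backends upfront instead of waiting for the prompts, a one-liner does the trick (optional; all three packages are on CRAN):

# Optional: install the model backends used in this tutorial upfront
install.packages(c("bestglm", "shrinkTVP", "arm"))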
Let’s begin by loading some data.
# Run the following to update the used data....
# devtools::install_github("Nicktz/fmxdat")

# Fund & factor returns
data <- fmxdat::asisa %>%
  select(date, Funds, Returns) %>%
  left_join(fmxdat::Local_Factors, by = "date") %>%
  select(-SWIX, -ALSI, -Multifactor)

data
## # A tibble: 54,000 × 9
## date Funds Returns Capped SWI…¹ Yield Value Momen…² Quality Size
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2002-07-31 Fund_1 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 2 2002-07-31 Fund_2 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 3 2002-07-31 Fund_3 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 4 2002-07-31 Fund_4 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 5 2002-07-31 Fund_5 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 6 2002-07-31 Fund_6 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 7 2002-07-31 Fund_7 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 8 2002-07-31 Fund_8 NA -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 9 2002-07-31 Fund_9 -0.0874 -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## 10 2002-07-31 Fund_10 -0.0990 -0.113 -0.0495 -0.100 -0.0532 -0.0416 -0.109
## # … with 53,990 more rows, and abbreviated variable names ¹`Capped SWIX`,
## # ²Momentum
We’ve got monthly fund returns plus six different factors. Note that Capped SWIX is the market factor (i.e. that coefficient will give us the market beta). Next we will add some external variables:
# Some external data
df_cmdty <- fmxdat::Commodities %>%
  select(date, Name, returns) %>%
  spread(Name, returns)

df_usd <- fmxdat::USD_Strength %>%
  spread(Name, returns)

# Join
data <- data %>%
  left_join(df_cmdty) %>%
  left_join(df_usd)

# Handling (i.e. dropping in our case) missing data
data <- data %>%
  drop_na()

data
## # A tibble: 16,141 × 14
## date Funds Returns Capped S…¹ Yield Value Momen…² Quality Size
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2005-01-31 Fund_9 0.00300 0.00756 0.00414 0.0246 0.00522 0.00889 0.00292
## 2 2005-01-31 Fund_10 0.00373 0.00756 0.00414 0.0246 0.00522 0.00889 0.00292
## 3 2005-01-31 Fund_12 0.0158 0.00756 0.00414 0.0246 0.00522 0.00889 0.00292
## 4 2005-01-31 Fund_20 0.00939 0.00756 0.00414 0.0246 0.00522 0.00889 0.00292
## 5 2005-01-31 Fund_25 0.0116 0.00756 0.00414 0.0246 0.00522 0.00889 0.00292
## 6 2005-01-31 Fund_26 0.0116 0.00756 0.00414 0.0246 0.00522 0.00889 0.00292
## 7 2005-01-31 Fund_41 -0.00553 0.00756 0.00414 0.0246 0.00522 0.00889 0.00292
## 8 2005-01-31 Fund_50 0.0153 0.00756 0.00414 0.0246 0.00522 0.00889 0.00292
## 9 2005-01-31 Fund_53 0.0110 0.00756 0.00414 0.0246 0.00522 0.00889 0.00292
## 10 2005-01-31 Fund_54 0.0208 0.00756 0.00414 0.0246 0.00522 0.00889 0.00292
## # … with 16,131 more rows, 5 more variables: Bcom_Index <dbl>, Gold <dbl>,
## # Oil_Brent <dbl>, Plat <dbl>, BBDXY <dbl>, and abbreviated variable names
## # ¹`Capped SWIX`, ²Momentum
Before we begin modeling, we need to take a closer look at the sample. The minimum sample size across dates is 50 funds:
# Plot number of funds by date
data %>%
  group_by(date) %>%
  summarise(`# of funds` = n()) %>%
  ggplot(aes(date, `# of funds`)) +
  geom_line() +
  theme_bw()
… however, some funds have very short histories.
# Tabulate number of months by fund (20 shortest histories)
data %>%
  group_by(Funds) %>%
  summarise(`# of dates` = n()) %>%
  arrange(`# of dates`) %>%
  head(20)
## # A tibble: 20 × 2
## Funds `# of dates`
## <chr> <int>
## 1 Fund_117 2
## 2 Fund_184 2
## 3 Fund_28 7
## 4 Fund_120 8
## 5 Fund_128 8
## 6 Fund_94 8
## 7 Fund_154 9
## 8 Fund_169 9
## 9 Fund_103 10
## 10 Fund_140 10
## 11 Fund_6 10
## 12 Fund_7 10
## 13 Fund_8 10
## 14 Fund_190 11
## 15 Fund_189 12
## 16 Fund_63 12
## 17 Fund_178 13
## 18 Fund_191 13
## 19 Fund_199 13
## 20 Fund_225 13
We will drop any funds with a sample size of less than 4 years (48 months), as well as any funds that are no longer active. tidyfit does include many regularized regression techniques that can handle very small sample sizes, but this will simplify things a little later on:
data <- data %>%
  group_by(Funds) %>%
  filter(n() > 48, max(date) == max(data$date))
This leaves us with a data set of exactly 100 Funds. Now we are ready to get modeling!
tidyfit has a number of nice features. Have a look at the repo and the website for more complete documentation. One feature is that we can fit a regression on grouped tibbles: a separate regression is fitted for each group in the data. The syntax is extremely simple:
models <- data %>%
  group_by(Funds) %>%  # So we estimate a model for each fund
  regress(Returns ~ ., m("lm"), .mask = "date")

models
## # A tibble: 100 × 7
## Funds model estimator `size (MB)` grid_id model_object settings
## <chr> <chr> <chr> <dbl> <chr> <list> <list>
## 1 Fund_1 lm stats::lm 0.263 #0010000 <tidyFit> <tibble [1 × 0]>
## 2 Fund_10 lm stats::lm 0.290 #0010000 <tidyFit> <tibble [1 × 0]>
## 3 Fund_101 lm stats::lm 0.261 #0010000 <tidyFit> <tibble [1 × 0]>
## 4 Fund_107 lm stats::lm 0.290 #0010000 <tidyFit> <tibble [1 × 0]>
## 5 Fund_108 lm stats::lm 0.290 #0010000 <tidyFit> <tibble [1 × 0]>
## 6 Fund_109 lm stats::lm 0.249 #0010000 <tidyFit> <tibble [1 × 0]>
## 7 Fund_111 lm stats::lm 0.288 #0010000 <tidyFit> <tibble [1 × 0]>
## 8 Fund_112 lm stats::lm 0.290 #0010000 <tidyFit> <tibble [1 × 0]>
## 9 Fund_114 lm stats::lm 0.250 #0010000 <tidyFit> <tibble [1 × 0]>
## 10 Fund_115 lm stats::lm 0.270 #0010000 <tidyFit> <tibble [1 × 0]>
## # … with 90 more rows
regress takes (1) a data argument (piped in this case), (2) a formula (here, a regression of Returns on everything else), and (3) a model wrapper m(). There are also some additional arguments like .mask that help us set things up. The model wrapper stipulates the method used and can take any additional arguments to pass to that method. For instance, here we are fitting OLS regressions using lm(). To see what methods you could use here, check ?m(). It is important to note, however, that you need some familiarity with the underlying package in order to know what types of arguments you can pass. To see what the underlying method is, and for more details about hyperparameters etc., see ?.model.<method> (e.g. ?.model.lm).
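As a quick illustration of this pass-through, here is a sketch using the “subset” method that we will meet later in the tutorial. The method and IC arguments go straight through to the underlying bestglm package (BIC is assumed here to be among its supported criteria; check ?.model.subset):

# Sketch: method-specific arguments are passed straight through m().
# Here, forward selection with a BIC criterion (both handled by bestglm).
models_fwd <- data %>%
  group_by(Funds) %>%
  regress(Returns ~ ., m("subset", method = "forward", IC = "BIC"), .mask = "date")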
The models object contains fitted models for each fund. In our case there should be a total of 100 fund-specific models in the model frame. We are interested in the loadings, which are accessible with coef:
factor_loadings <- coef(models)

factor_loadings
## # A tibble: 1,200 × 5
## # Groups: Funds, model [100]
## Funds model term estimate model_info
## <chr> <chr> <chr> <dbl> <list>
## 1 Fund_1 lm (Intercept) 0.00175 <tibble [1 × 3]>
## 2 Fund_1 lm Capped SWIX 0.372 <tibble [1 × 3]>
## 3 Fund_1 lm Yield -0.119 <tibble [1 × 3]>
## 4 Fund_1 lm Value -0.112 <tibble [1 × 3]>
## 5 Fund_1 lm Momentum 0.118 <tibble [1 × 3]>
## 6 Fund_1 lm Quality 0.0202 <tibble [1 × 3]>
## 7 Fund_1 lm Size 0.404 <tibble [1 × 3]>
## 8 Fund_1 lm Bcom_Index -0.0297 <tibble [1 × 3]>
## 9 Fund_1 lm Gold 0.0770 <tibble [1 × 3]>
## 10 Fund_1 lm Oil_Brent 0.0224 <tibble [1 × 3]>
## # … with 1,190 more rows
Suppose we want to identify some value funds with commodities exposure. Here’s a plot — we’re looking for the funds with positive value loadings and a positive coefficient on the Bcom_Index:
factor_loadings %>%
  select(Funds, term, estimate) %>%
  spread(term, estimate) %>%
  ggplot(aes(Value, Bcom_Index)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_rect(aes(xmin = 0, xmax = max(Value) + 0.02,
                ymin = 0, ymax = max(Bcom_Index) + 0.02),
            color = "red", alpha = 0) +
  theme_bw()
A problem is that linear regression is often spurious or imprecise (particularly when T is small). Let’s apply Occam’s Razor here and select sparse models using a backward selection algorithm (i.e. if a factor does not improve the model, it is dropped).
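To build some intuition before we hand this over to tidyfit, here is a rough conceptual analogue for a single fund using base R’s step(), which performs the same kind of AIC-based backward elimination. This is a sketch only; it is not what tidyfit calls internally (tidyfit wraps bestglm, as we will see below):

# Conceptual sketch only: AIC-based backward elimination with step()
fund_df <- data %>%
  ungroup() %>%
  filter(Funds == "Fund_1") %>%
  select(-date, -Funds)
fit_full <- lm(Returns ~ ., data = fund_df)
step(fit_full, direction = "backward", trace = 0)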
# install.packages("bestglm")
To perform a backward selection, we use the “subset” method in m(). This method can do forward selection, backward selection and exhaustive search. It wraps bestglm, which builds on the R standard for these algorithms, leaps, but adds some helpful extensions. Here is where the power of tidyfit kicks in, because only a slight adjustment to the code is needed to generate the new models:
models_subset <- data %>%
  group_by(Funds) %>%
  regress(Returns ~ ., m("subset", method = "backward", IC = "AIC"), .mask = "date")

models_subset
## # A tibble: 100 × 8
## Funds model estimator size (M…¹ grid_id model_o…² settings warni…³
## <chr> <chr> <chr> <dbl> <chr> <list> <list> <chr>
## 1 Fund_1 subset bestglm::bestglm 0.0533 #00100… <tidyFit> <tibble> <NA>
## 2 Fund_10 subset bestglm::bestglm 0.0814 #00100… <tidyFit> <tibble> <NA>
## 3 Fund_101 subset bestglm::bestglm 0.0486 #00100… <tidyFit> <tibble> <NA>
## 4 Fund_107 subset bestglm::bestglm 0.0903 #00100… <tidyFit> <tibble> model …
## 5 Fund_108 subset bestglm::bestglm 0.0903 #00100… <tidyFit> <tibble> <NA>
## 6 Fund_109 subset bestglm::bestglm 0.0491 #00100… <tidyFit> <tibble> <NA>
## 7 Fund_111 subset bestglm::bestglm 0.0839 #00100… <tidyFit> <tibble> model …
## 8 Fund_112 subset bestglm::bestglm 0.0815 #00100… <tidyFit> <tibble> <NA>
## 9 Fund_114 subset bestglm::bestglm 0.0423 #00100… <tidyFit> <tibble> <NA>
## 10 Fund_115 subset bestglm::bestglm 0.0628 #00100… <tidyFit> <tibble> <NA>
## # … with 90 more rows, and abbreviated variable names ¹`size (MB)`,
## # ²model_object, ³warnings
Once again we fetch the coefficients and plot:
factor_loadings_subset <- coef(models_subset)

factor_loadings_subset %>%
  select(Funds, term, estimate) %>%
  spread(term, estimate) %>%
  drop_na(Value, Bcom_Index) %>%
  ggplot(aes(Value, Bcom_Index)) +
  geom_point() +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  geom_rect(aes(xmin = 0, xmax = max(Value) + 0.02,
                ymin = 0, ymax = max(Bcom_Index) + 0.02),
            color = "red", alpha = 0) +
  theme_bw()
Lo and behold, we have only one fund in our quadrant of interest, and very few with robust loadings on value and commodities. Let’s see which fund this is:
factor_loadings_subset %>%
  select(Funds, term, estimate) %>%
  spread(term, estimate) %>%
  filter(Value > 0, Bcom_Index > 0)
## # A tibble: 1 × 14
## # Groups: Funds, model [1]
## model Funds (Intercep…¹ BBDXY Bcom_…² Cappe…³ Gold Momen…⁴ Oil_B…⁵ Plat
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 subset Fund_50 -0.0000791 NA 0.0641 0.774 -0.0359 NA NA NA
## # … with 4 more variables: Quality <dbl>, Size <dbl>, Value <dbl>, Yield <dbl>,
## # and abbreviated variable names ¹`(Intercept)`, ²Bcom_Index, ³`Capped SWIX`,
## # ⁴Momentum, ⁵Oil_Brent
# install.packages("shrinkTVP")
Now we may be interested in exploring style drift in this fund. Let’s estimate the loadings for this specific fund in a time-varying parameter framework. Again, tidyfit helps us out here:
# Notice how we no longer need .mask = "date", because we are passing
# the arg index_col = "date". This ensures that for time-varying parameter
# estimation, the result of coef() contains a date index.
if (!require("shrinkTVP")) install.packages("shrinkTVP")

models_tvp <- data %>%
  filter(Funds == "Fund_50") %>%
  regress(Returns ~ Value + `Capped SWIX` + Bcom_Index,
          m("tvp", niter = 5000, index_col = "date"))
We can fetch and plot the coefficients over time:
factor_loadings_tvp <- coef(models_tvp)
factor_loadings_tvp <- factor_loadings_tvp %>%
  unnest(model_info)

factor_loadings_tvp
## # A tibble: 604 × 8
## # Groups: Funds, model [1]
## Funds model term estimate upper lower posterior…¹ index
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <date>
## 1 Fund_50 tvp (Intercept) -0.0000899 0.000598 -0.00118 0.000593 2005-01-31
## 2 Fund_50 tvp (Intercept) -0.0000993 0.000569 -0.00120 0.000606 2005-02-28
## 3 Fund_50 tvp (Intercept) -0.000106 0.000575 -0.00123 0.000609 2005-03-31
## 4 Fund_50 tvp (Intercept) -0.000114 0.000616 -0.00126 0.000636 2005-05-31
## 5 Fund_50 tvp (Intercept) -0.000110 0.000652 -0.00127 0.000638 2005-06-30
## 6 Fund_50 tvp (Intercept) -0.000117 0.000629 -0.00131 0.000665 2005-08-31
## 7 Fund_50 tvp (Intercept) -0.000121 0.000647 -0.00137 0.000657 2005-09-30
## 8 Fund_50 tvp (Intercept) -0.000118 0.000678 -0.00137 0.000673 2005-10-31
## 9 Fund_50 tvp (Intercept) -0.000118 0.000734 -0.00137 0.000687 2005-11-30
## 10 Fund_50 tvp (Intercept) -0.000125 0.000713 -0.00137 0.000708 2006-01-31
## # … with 594 more rows, and abbreviated variable name ¹posterior.sd
factor_loadings_tvp %>%
  ggplot(aes(index, estimate)) +
  geom_ribbon(aes(ymax = upper, ymin = lower), alpha = 0.25) +
  geom_line() +
  facet_wrap("term", scales = "free") +
  theme_bw()
The value loading has been steadily increasing over the period of the fund’s existence. Similarly, the commodities loading reflects an upward trend. The credible interval (obtained by taking the 90% quantile interval of the beta posteriors) is very wide.
# install.packages("arm")
Now, we may ask whether the backward selection is actually any good at predicting returns. Let’s use a predictive workflow to assess the out-of-sample performance of the models. Here I will introduce two additional methods into the mix. The first is a LASSO regression, which is estimated with glmnet. The LASSO also estimates sparse models, but unlike backward selection it is an embedded method (i.e. it estimates parameters and performs variable selection simultaneously, not sequentially like the subset selection algorithms). The second method is a Bayesian regression using arm.
The LASSO regression requires hyperparameter tuning. A reasonable grid for the penalty parameter is chosen automatically (but could also be passed manually using m("lasso", lambda = <grid vector>)). However, we need to define a .cv argument to tell regress how to cross-validate the grid. tidyfit leverages the rsample package to build cross validation samples. Here we will use a simple train/validate split with time-ordered data: .cv = "initial_time_split".
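For reference, a minimal sketch of what such a manual grid could look like (the values below are purely illustrative; the automatic grid is usually sufficient):

# Sketch: an illustrative manual penalty grid for the LASSO
m("lasso", lambda = 10^seq(-4, 0, length.out = 50))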
# IMPORTANT:
# Chunk takes a little longer to run
if (!require("bestglm")) install.packages("bestglm")
if (!require("arm")) install.packages("arm")

models <- data %>%
  filter(year(date) < 2021) %>%
  regress(Returns ~ .,
          m("lm"),
          m("subset", method = "backward", IC = "AIC"),
          m("lasso"),
          m("bayes"),
          .mask = "date", .cv = "initial_time_split")
Next we generate out-of-sample predictions. Here we can simply use predict:
predictions <- data %>%
  filter(year(date) >= 2021) %>%
  predict(models, .)
Now, calculating the accuracy is made very easy using the yardstick package. It is important to note that the predictions tibble is already grouped appropriately. If this were not the case, you would still need to group by model and Funds, as sketched below.
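A minimal sketch of that fallback, assuming the grouping had been dropped somewhere along the way:

# Only needed if 'predictions' had lost its grouping:
predictions %>%
  group_by(model, Funds) %>%
  yardstick::rmse(truth, prediction)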
if(!require("yardstick")) install.packages("yardstick")
<- predictions %>%
acc ::rmse(truth, prediction) yardstick
Let’s plot the result. To make it easier to compare, I will present all metrics relative to “lm”. So a positive value is a larger error, and a negative value is a smaller error:
acc %>%
  group_by(Funds) %>%
  mutate(.estimate = .estimate - .estimate[model == "lm"]) %>%
  filter(model != "lm") %>%
  ggplot(aes(model, .estimate)) +
  geom_hline(yintercept = 0, color = "red") +
  geom_boxplot() +
  theme_bw()
Interesting result: backward selection does not actually improve things at all. The Bayesian regression does a better job than OLS, and so does the LASSO. I have seen this quite often: LASSO and other embedded methods are much better than sequential methods at variable selection.
We’ve gone through quite a lot of analysis in a very short time. Now it is your turn to get familiar with tidyfit and try out some of your own workflows. Here are some ideas:

- Fit quantile regressions (m("quantile")); a sketch follows below this list
- Try a classification problem using classify instead of regress
- Fit a mixed-effects regression (m("glmm")). This requires some knowledge of lme4.
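A minimal sketch of the first idea (assuming the tau argument is passed through to the underlying quantile regression backend; check ?.model.quantile for the details):

# Sketch: median (tau = 0.5) factor regressions for each fund
data %>%
  group_by(Funds) %>%
  regress(Returns ~ ., m("quantile", tau = 0.5), .mask = "date")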
Thanks a lot for reading! I hope you have fun with the package :)
~ Dr. Johann Pfitzinger