# Predicting realized volatility using Google Trends
In this article, we'll study whether the daily realized volatility, $RV_t$, of a stock index is better modelled by augmenting the [heterogeneous autoregressive (HAR) model](http://statmath.wu.ac.at/~hauser/LVs/FinEtricsQF/References/Corsi2009JFinEtrics_LMmodelRealizedVola.pdf) with Google search volumes for the name of the index, $SV_t$.
The extended model, which we term the HAR+SV model, is specified by:
$$
v^d_t = c + \beta_d v^d_{t-1} + \beta_w v^w_{t-1} + \beta_m v^m_{t-1} + \gamma s^d_{t-1} + \epsilon_t \tag{1}
$$
where $v^d_t = \log(RV_t)$, $v^w_t = \frac{1}{5} \sum_{i=0}^{4} v^d_{t-i}$, $v^m_t = \frac{1}{22} \sum_{i=0}^{21} v^d_{t-i}$, and $s^d_t = \log(SV_t)$. The standard HAR model is a special case when $\gamma=0$.
The full code and data for this article are available here: https://github.com/jackdry/predicting-realized-volatility-using-google-trends.
## Import packages
```python
import matplotlib.ticker as plticker
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas as pd
import numpy as np
```
## Define `indices`, `in_sample` and `out_of_sample`
We'll focus our analysis on the FTSE 100, CAC 40, DJIA and DAX indices from July 2006 – June 2011. The first two years of this period will be used to assess the goodness of fit of the HAR and HAR+SV models, and the last three will be used to compare their out-of-sample forecasting performance.
```python
indices = ["FTSE 100", "CAC 40", "DJIA", "DAX"]
```
```python
in_sample = slice("2006-07-01", "2008-06-30")
out_of_sample = slice("2008-07-01", "2011-06-30")
```
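These `slice` objects work directly as `.loc` row indexers on a `DatetimeIndex`, with both string endpoints included. A minimal illustration (the frame `df` here is hypothetical, not part of the article's data):
```python
# slice objects act as .loc row indexers on a DatetimeIndex;
# both endpoints of the string bounds are inclusive.
df = pd.DataFrame({"x": [1, 2, 3]},
                  index=pd.to_datetime(["2006-07-03", "2008-06-30", "2008-07-01"]))
df.loc[in_sample]  # keeps the first two rows only
```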
## Read and plot the data
The data has two columns: `sv` holds relative search volume data obtained from [Google Trends](https://trends.google.com/trends/) (for the terms "FTSE", "CAC", "Dow" and "DAX"), and `rv` holds 10-minute realized volatility data from the [OMI's realized library](https://realized.oxford-man.ox.ac.uk/).
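For reference, series like `sv` can be pulled programmatically with the unofficial [pytrends](https://github.com/GeneralMills/pytrends) package. The sketch below is illustrative only (the CSV already bundles the data, and the package isn't used anywhere in this article's code); note also that Google Trends returns coarser-than-daily granularity over multi-year windows, so building a daily series requires stitching together shorter requests.
```python
# Illustrative sketch using the unofficial pytrends package
# (an assumption; not part of this article's pipeline).
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=360)
pytrends.build_payload(["FTSE"], timeframe="2006-07-01 2011-06-30")
ftse_sv = pytrends.interest_over_time()  # weekly/monthly at this range
```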
```python
svrv = pd.read_csv("svrv.csv", index_col=[0, 1], parse_dates=True)
```
```python
svrv.head()
```
```output
                         sv       rv
datetime   index
2006-05-01 DJIA     0.50169  0.00636
2006-05-02 CAC 40   0.70759  0.00578
           DAX      0.68387  0.00473
           DJIA     0.71293  0.00425
           FTSE 100 0.72174  0.00640
```
Plotting the search volume (dark blue) and realized volatility (light blue) data for each index over time reveals a strong co-movement between the variables.
```python
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()
for i, index in enumerate(indices):
    xs = svrv.xs(index, level=1)
    ax1 = axes[i]
    ax1.grid()
    ax1.plot(xs.index, xs["rv"], c="#7e96cd")
    ax1.set_title(index, loc="left")
    ax1.yaxis.set_major_locator(plticker.MultipleLocator(base=0.04))
    ax2 = ax1.twinx()
    ax2.plot(xs.index, xs["sv"], c="#17223B")
    ax2.set_xticks(pd.to_datetime(["2007-01-01", "2009-01-01", "2011-01-01"]))
    ax2.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
    ax2.yaxis.set_major_locator(plticker.MultipleLocator(base=5))
plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.show()
```
![Search volume (dark blue) and realized volatility (light blue) for each index](https://jackdry.s3.us-east-2.amazonaws.com/content/articles/predicting-realized-volatility-using-google-trends/data.jpg)
## Prepare data for modelling
Because [`sm.OLS`](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html), which we're going to use to fit the models, doesn't include an intercept by default, we first need to add a column of 1s to `svrv`.
```python
svrv["c"] = 1
```
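Equivalently, statsmodels ships a helper, `sm.add_constant`, that prepends a column of 1s named `const`. We keep the manual `c` column so the name matches equation 1, but the one-liner below would do the same job:
```python
# Equivalent built-in helper: prepends a "const" column of 1s.
svrv_with_const = sm.add_constant(svrv)
```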
We then add columns for each of the lagged regressors in the HAR+SV model, using names based on the notation in equation 1. Because `v_d` is already lagged by one day, 5- and 22-day rolling means of `v_d` yield exactly $v^w_{t-1}$ and $v^m_{t-1}$.
```python
svrv[["s", "v"]] = np.log(svrv[["sv", "rv"]])
svrv[["s_d", "v_d"]] = svrv[["s", "v"]].groupby(level=1).shift()
svrv["v_w"] = svrv.groupby(level=1)["v_d"].apply(lambda x: x.rolling(5).mean())
svrv["v_m"] = svrv.groupby(level=1)["v_d"].apply(lambda x: x.rolling(22).mean())
```
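As a quick sanity check (an illustrative snippet, not in the original code), the latest `v_w` value for an index should equal the mean of its five most recent `v_d` values:
```python
# Spot-check one index: the last v_w equals the mean of the
# five most recent lagged daily values, v_d.
djia = svrv.xs("DJIA", level=1)
assert np.isclose(djia["v_w"].iloc[-1], djia["v_d"].iloc[-5:].mean())
```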
Lastly, we rearrange the columns and discard the data from before the sample period.
```python
cols = ["c", "v_d", "v_w", "v_m", "s_d", "v"]
svrv = svrv.loc[pd.IndexSlice[in_sample.start:, :], cols]
```
```python
svrv.head()
```
```output
                    c      v_d      v_w      v_m      s_d        v
datetime   index
2006-07-03 CAC 40   1 -4.80021 -4.96926 -4.66049 -0.12541 -5.28451
           DAX      1 -4.54562 -4.75624 -4.53791 -0.29126 -5.29235
           DJIA     1 -5.35771 -5.18182 -4.94551 -0.26401 -5.27107
           FTSE 100 1 -4.40144 -4.81162 -4.59088  0.33995 -5.19562
2006-07-04 CAC 40   1 -5.28451 -4.98638 -4.70326 -0.60096 -5.23280
```
## Assess goodness of fit of models
We first create a mapping from each model's name to its covariates.
```python
models = {"HAR": ["c", "v_d", "v_w", "v_m"],
"HAR+SV": ["c", "v_d", "v_w", "v_m", "s_d"]}
```
Next, we fit the models to the in-sample data for each index and calculate the associated AICs.
```python
aic_index = pd.Index(indices, name="index")
aic = pd.DataFrame(index=aic_index, columns=models.keys(), dtype=float)
for index in indices:
    xs = svrv.xs(index, level=1).loc[in_sample]
    for model in models:
        fit = sm.OLS(xs["v"], xs[models[model]]).fit()
        aic.loc[index, model] = fit.aic
```
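As an aside, the same fits can be written with statsmodels' formula API, which adds the intercept automatically; a sketch for a single index (this import and fit are illustrative, not part of the article's pipeline):
```python
import statsmodels.formula.api as smf

# HAR fit for one index via the formula API; the intercept is
# implicit in "v ~ v_d + v_w + v_m", so no "c" column is needed.
xs = svrv.xs("FTSE 100", level=1).loc[in_sample]
har_fit = smf.ols("v ~ v_d + v_w + v_m", data=xs).fit()
```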
```python
aic
```
```output
                HAR    HAR+SV
index
FTSE 100  225.17236 218.54891
CAC 40    189.77295 185.45307
DJIA      300.70891 294.64466
DAX       178.28250 166.28124
```
The AICs for the HAR+SV model are lower for every index, so even after the penalty for its extra parameter it describes the in-sample realized volatility of the indices better.
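To quantify the gap (a follow-up computation, not in the original code), the per-index AIC reductions from adding the search-volume term run from about 4.3 (CAC 40) to 12.0 (DAX):
```python
# AIC reduction per index; larger values favour the HAR+SV model.
(aic["HAR"] - aic["HAR+SV"]).round(2)
```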
## Compare forecasting performance of models
We first compute one-step-ahead forecasts for each model over the out-of-sample period, refitting both models each day on an expanding window of all observations up to, but excluding, that day.
```python
preds_index = svrv.loc[out_of_sample].index
preds = pd.DataFrame(index=preds_index, columns=models.keys(), dtype=float)
for datetime, index in preds.index:
    xs = svrv.xs(index, level=1).loc[:datetime]
    for model in models:
        f = sm.OLS(xs[:-1]["v"], xs[:-1][models[model]]).fit()
        preds.loc[(datetime, index), model] = f.predict(xs[-1:][models[model]])[0]
```
```python
preds.head()
```
```output
                         HAR   HAR+SV
datetime   index
2008-07-01 CAC 40   -4.30653 -4.33577
           DAX      -4.39112 -4.42122
           DJIA     -4.64193 -4.59459
           FTSE 100 -4.43101 -4.45533
2008-07-02 CAC 40   -4.36807 -4.37539
```
We then calculate the MSE ($\times 10^4$) of these forecasts for each model and index, first exponentiating the log predictions and log actuals back to the realized-volatility scale.
```python
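# Convert the log forecasts and log actuals back to the RV scale, then
# take errors as actual minus predicted (the leading minus flips
# pred - actual); squared errors are averaged per index and scaled by 1e4.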
errors = -np.exp(preds).subtract(np.exp(svrv.loc[out_of_sample, "v"]), axis=0)
mse = (errors**2).groupby(level=1).mean().reindex(indices) * 10**4
```
```python
mse
```
```output
              HAR  HAR+SV
index
FTSE 100  0.37139 0.35620
CAC 40    0.19564 0.18535
DJIA      0.27438 0.25806
DAX       0.18616 0.17329
```
The MSEs for the HAR+SV model are lower for all indices, strongly indicating that search volumes are helpful for forecasting realized volatility.
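In relative terms (again a follow-up computation, not in the original code), the improvement is roughly 4–7% per index:
```python
# Percentage reduction in out-of-sample MSE from adding search volumes.
((1 - mse["HAR+SV"] / mse["HAR"]) * 100).round(1)
```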
## Plot net SSE over out-of-sample period
To investigate in which periods search volumes are most helpful for forecasting realized volatility, we finally plot the net SSE ($\times 10^4$) of the two models over the out-of-sample period, given by:
$$
\text{net SSE}(\tau) = \sum_{t=1}^{\tau}(\hat{e}^2_{\text{HAR},t} - \hat{e}^2_{\text{HAR+SV},t}) \tag{2}
$$
where $\hat{e}_{\text{HAR},t}$ and $\hat{e}_{\text{HAR+SV},t}$ are the one-step-ahead prediction errors of the HAR and HAR+SV models respectively. When the net SSE is rising, the HAR+SV forecasts are the more accurate over that stretch; where it is flat, the two models perform similarly.
```python
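# Cumulative difference of squared errors per index, as in equation 2.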
net_sse = (errors["HAR"]**2 - errors["HAR+SV"]**2).groupby(level=1).cumsum() * 10**4
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()
for i, index in enumerate(indices):
    xs = net_sse.xs(index, level=1)
    ax = axes[i]
    ax.grid()
    ax.plot(xs.index, xs, c="#17223B")
    ax.set_title(index, loc="left")
    ax.xaxis.set_major_locator(mdates.YearLocator(1))
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
    ax.yaxis.set_major_locator(plticker.MultipleLocator(base=5))
plt.subplots_adjust(wspace=0.15, hspace=0.3)
plt.show()
```
![Net SSE ($\times 10^4$) of the HAR model relative to the HAR+SV model for each index](https://jackdry.s3.us-east-2.amazonaws.com/content/articles/predicting-realized-volatility-using-google-trends/net-SSE.jpg)
Since the net SSE increases sharply across all indices towards the end of the 2008 financial crisis, and again around the 2010 flash crash for every index except the DAX, we conclude that search volumes are most helpful in high-volatility periods.