# Predicting realized volatility using Google Trends

In this article, we'll study whether the daily realized volatility, $RV_t$, of a stock index is better modelled by augmenting the [heterogeneous autoregressive (HAR) model](http://statmath.wu.ac.at/~hauser/LVs/FinEtricsQF/References/Corsi2009JFinEtrics_LMmodelRealizedVola.pdf) with Google search volumes for the name of the index, $SV_t$. The extended model, which we term the HAR+SV model, is specified by: $$v^d_t = c + \beta_d v^d_{t-1} + \beta_w v^w_{t-1} + \beta_m v^m_{t-1} + \gamma s^d_{t-1} + \epsilon_t \tag{1}$$ where $v^d_t = \log(RV_t)$, $v^w_t = \frac{1}{5} \sum_{i=0}^{4} v^d_{t-i}$, $v^m_t = \frac{1}{22} \sum_{i=0}^{21} v^d_{t-i}$, and $s^d_t = \log(SV_t)$. The standard HAR model is a special case when $\gamma=0$. The full code and data for this article are available here: https://github.com/jackdry/predicting-realized-volatility-using-google-trends. ## Import packages python import matplotlib.ticker as plticker import matplotlib.dates as mdates import matplotlib.pyplot as plt import statsmodels.api as sm import pandas as pd import numpy as np  ## Define indices, in_sample and out_of_sample We'll focus our analysis on the FTSE 100, CAC 40, DJIA and DAX indices from July 2006 – June 2011. The first two years of this period will be used to assess the goodness of fit of the HAR and HAR+SV models, and the last three will be used to compare their out-of-sample forecasting performance. python indices = ["FTSE 100", "CAC 40", "DJIA", "DAX"]  python in_sample = slice("2006-07-01", "2008-06-30") out_of_sample = slice("2008-07-01", "2011-06-30")  ## Read and plot the data The data has two columns: sv holds relative search volume data obtained from [Google Trends](https://trends.google.com/trends/) (for the terms "FTSE", "CAC", "Dow" and "DAX"), and rv holds 10-minute realized volatility data from the [OMI's realized library](https://realized.oxford-man.ox.ac.uk/). python svrv = pd.read_csv("svrv.csv", index_col=[0, 1], parse_dates=True)  python svrv.head()  output sv rv datetime index 2006-05-01 DJIA 0.50169 0.00636 2006-05-02 CAC 40 0.70759 0.00578 DAX 0.68387 0.00473 DJIA 0.71293 0.00425 FTSE 100 0.72174 0.00640  Plotting the search volume (dark blue) and realized volatility (light blue) data for each index over time reveals a strong co-movement between the variables. python fig, axes = plt.subplots(2, 2, figsize=(12, 8)) axes = axes.ravel() for i, index in enumerate(indices): xs = svrv.xs(index, level=1) ax1 = axes[i] ax1.grid() ax1.plot(xs.index, xs["rv"], c="#7e96cd") ax1.set_title(index, loc="left") ax1.yaxis.set_major_locator(plticker.MultipleLocator(base=0.04)) ax2 = ax1.twinx() ax2.plot(xs.index, xs["sv"], c="#17223B") ax2.set_xticks(["2007", "2009", "2011"]) ax2.xaxis.set_major_formatter(mdates.DateFormatter("%Y")) ax2.yaxis.set_major_locator(plticker.MultipleLocator(base=5)) plt.subplots_adjust(wspace=0.3, hspace=0.3) plt.show()  output <img src = "https://jackdry.s3.us-east-2.amazonaws.com/content/articles/predicting-realized-volatility-using-google-trends/data.jpg">  ## Prepare data for modelling Because [sm.OLS](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html), which we're going to use to fit the models, doesn't include an intercept by default, we first need to add a column of 1s to svrv. python svrv["c"] = 1  We then add columns for each of the lagged regressors in the HAR+SV model, using names based on the notation in equation 1. python svrv[["s", "v"]] = np.log(svrv[["sv", "rv"]]) svrv[["s_d", "v_d"]] = svrv[["s", "v"]].groupby(level=1).shift() svrv["v_w"] = svrv.groupby(level=1)["v_d"].apply(lambda x: x.rolling(5).mean()) svrv["v_m"] = svrv.groupby(level=1)["v_d"].apply(lambda x: x.rolling(22).mean())  We lastly rearrange the columns and discard data before the sample period. python cols = ["c", "v_d", "v_w", "v_m", "s_d", "v"] svrv = svrv.loc[pd.IndexSlice[in_sample.start:, :], cols]  python svrv.head()  output c v_d v_w v_m s_d v datetime index 2006-07-03 CAC 40 1 -4.80021 -4.96926 -4.66049 -0.12541 -5.28451 DAX 1 -4.54562 -4.75624 -4.53791 -0.29126 -5.29235 DJIA 1 -5.35771 -5.18182 -4.94551 -0.26401 -5.27107 FTSE 100 1 -4.40144 -4.81162 -4.59088 0.33995 -5.19562 2006-07-04 CAC 40 1 -5.28451 -4.98638 -4.70326 -0.60096 -5.23280  ## Assess goodness of fit of models We first create a mapping from each model's name to its covariates. python models = {"HAR": ["c", "v_d", "v_w", "v_m"], "HAR+SV": ["c", "v_d", "v_w", "v_m", "s_d"]}  Next, we fit the models to the in-sample data for each index and calculate the associated AICs. python aic_index = pd.Index(indices, name="index") aic = pd.DataFrame(index=aic_index, columns=models.keys(), dtype=float) for index in indices: xs = svrv.xs(index, level=1).loc[in_sample] for model in models: fit = sm.OLS(xs["v"], xs[models[model]]).fit() aic.loc[index, model] = fit.aic  python aic  output HAR HAR+SV index FTSE 100 225.17236 218.54891 CAC 40 189.77295 185.45307 DJIA 300.70891 294.64466 DAX 178.28250 166.28124  Since the AICs for the HAR+SV model are considerably lower, it's clear that it better describes the realized volatility of the indices. ## Compare forecasting performance of models We first compute one-step-ahead forecasts for each model over the out-of-sample period. python preds_index = svrv.loc[out_of_sample].index preds = pd.DataFrame(index=preds_index, columns=models.keys(), dtype=float) for datetime, index in preds.index: xs = svrv.xs(index, level=1).loc[:datetime] for model in models: f = sm.OLS(xs[:-1]["v"], xs[:-1][models[model]]).fit() preds.loc[(datetime, index), model] = f.predict(xs[-1:][models[model]])  python preds.head()  output HAR HAR+SV datetime index 2008-07-01 CAC 40 -4.30653 -4.33577 DAX -4.39112 -4.42122 DJIA -4.64193 -4.59459 FTSE 100 -4.43101 -4.45533 2008-07-02 CAC 40 -4.36807 -4.37539  We then calculate the MSE ($\times 10^4$) of these forecasts for each model and index. python errors = -np.exp(preds).subtract(np.exp(svrv.loc[out_of_sample, "v"]), axis=0) mse = (errors**2).groupby(level=1).mean().reindex(indices) * 10**4  python mse  output HAR HAR+SV index FTSE 100 0.37139 0.35620 CAC 40 0.19564 0.18535 DJIA 0.27438 0.25806 DAX 0.18616 0.17329  The MSEs for the HAR+SV model are lower for all indices, strongly indicating that search volumes are helpful for forecasting realized volatility. ## Plot net SSE over out-of-sample period To investigate in which periods search volumes are most helpful for forecasting realized volatility, we finally plot the net SSE ($\times 10^4$) of the two models over the out-of-sample period, given by: $$\text{net SSE}(\tau) = \sum_{t=1}^{\tau}(\hat{e}^2_{\text{HAR},t} - \hat{e}^2_{\text{HAR+SV},t}) \tag{2}$$ where $\hat{e}_{\text{HAR},t}$ and $\hat{e}_{\text{HAR+SV},t}$ are the one-step-ahead prediction errors of the HAR and HAR+SV models respectively. python net_sse = (errors["HAR"]**2 - errors["HAR+SV"]**2).groupby(level=1).cumsum() * 10**4 fig, axes = plt.subplots(2, 2, figsize=(12, 8)) axes = axes.ravel() for i, index in enumerate(indices): xs = net_sse.xs(index, level=1) ax = axes[i] ax.grid() ax.plot(xs.index, xs, c="#17223B") ax.set_title(index, loc="left") ax.xaxis.set_major_locator(mdates.YearLocator(1)) ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y")) ax.yaxis.set_major_locator(plticker.MultipleLocator(base=5)) plt.subplots_adjust(wspace=0.15, hspace=0.3) plt.show()  output <img src = "https://jackdry.s3.us-east-2.amazonaws.com/content/articles/predicting-realized-volatility-using-google-trends/net-SSE.jpg">  Since the net SSE increases sharply across all indices towards the end of the 2008 financial crisis and also during the 2010 flash crash (with the exception of the DAX), we conclude that search volumes are most helpful in high-volatility phases.