# Using an LSTM-based model to predict stock returns

Jack Dry


In this tutorial, we'll build an LSTM-based model to predict whether EasyJet's stock price will go up or down on a particular day, given pricing data from the past 30 trading days. If you don't know what an LSTM is, I recommend reading [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) and [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). The full code and data for this tutorial are available here: https://github.com/jackdry/using-an-lstm-based-model-to-predict-stock-returns.

---

## 1. Import packages

```python
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator
import tensorflow as tf
import pandas as pd
import numpy as np
import os
```

---

## 2. Define training, validation and test periods

```python
train_start_date = "2008-01-01"
train_end_date = "2016-12-31"
val_start_date = "2017-01-01"
val_end_date = "2017-12-31"
test_start_date = "2018-01-01"
test_end_date = "2018-12-31"
```

We'll use the data in the training period to train models, the data in the validation period to optimize hyperparameters, and the data in the test period to evaluate our final model.

---

## 3. Read and label the data

```python
ezj = pd.read_csv("ezj.csv", index_col=0, parse_dates=True)
ezj["return"] = ezj["close"] / ezj["close"].shift() - 1
```

```python
ezj.head()
```

```output
                  open        high         low       close      volume    return
date
2008-01-02  664.908997  677.455017  654.544983  657.273010   1721833.0       NaN
2008-01-03  651.273010  654.000000  617.455017  632.182007   2740650.0 -0.038174
2008-01-04  627.817993  632.726990  590.726990  596.726990   4711938.0 -0.056084
2008-01-07  596.182007  597.817993  560.726990  583.091003   4103622.0 -0.022851
2008-01-08  567.273010  573.273010  497.782013  502.091003  12687374.0 -0.138915
```

We'll assign a label of 1 to dates on which EasyJet's stock return is positive, and a label of 0 to dates on which it is not.

```python
ezj["label"] = np.where(ezj["return"] > 0, 1, 0)
```

---

## 4. Engineer features

To predict the labels, we'll use returns and volumes from the past 30 trading days. [To reduce the chances of getting stuck in local optima](https://www.youtube.com/watch?v=FDCfw-YqWTE), we'll standardize the returns using statistics computed over the training period, and the volumes using a sliding window approach.

```python
ezj["std_return"] = (ezj["return"] - ezj["return"][:val_start_date].mean()) / ezj["return"][:val_start_date].std()
ezj["std_volume"] = (ezj["volume"] - ezj["volume"].rolling(50).mean()) / ezj["volume"].rolling(50).std()
```

```python
ezj.dropna(inplace=True)
```

---

## 5. Create generators

Before creating generators for the train, validation and test sets, we need the integer locations corresponding to the start of the validation and test periods.

```python
val_start_iloc = ezj.index.get_loc(val_start_date, method="bfill")
test_start_iloc = ezj.index.get_loc(test_start_date, method="bfill")
```

We'll use [`TimeseriesGenerator`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/TimeseriesGenerator) to create the generators, and pass `length=30` so that data from the past 30 trading days is used to make predictions.
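Concretely, each sample pairs a 30-row window of the feature columns with the label of the row that immediately follows it. As a quick sanity check (a hypothetical illustration, not part of the pipeline itself), the first sample would look like this:

```python
# Hypothetical illustration: sample i pairs feature rows i..i+29
# with the label at row i+30, mirroring TimeseriesGenerator's indexing.
features = ezj[["std_return", "std_volume"]].values
labels = ezj[["label"]].values

window = features[0:30]  # inputs: 30 days of (std_return, std_volume)
target = labels[30]      # output: the following day's label
print(window.shape, target.shape)  # (30, 2) (1,)
```

With that in mind, we can create the three generators.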
```python
train_generator = TimeseriesGenerator(ezj[["std_return", "std_volume"]].values, ezj[["label"]].values,
                                      length=30, batch_size=64, end_index=val_start_iloc-1)
val_generator = TimeseriesGenerator(ezj[["std_return", "std_volume"]].values, ezj[["label"]].values,
                                    length=30, batch_size=64, start_index=val_start_iloc,
                                    end_index=test_start_iloc-1)
test_generator = TimeseriesGenerator(ezj[["std_return", "std_volume"]].values, ezj[["label"]].values,
                                     length=30, batch_size=64, start_index=test_start_iloc)
```

---

## 6. Create `model_fn`

`model_fn` trains an LSTM-based model for a maximum of 100 epochs, stopping early if validation accuracy does not improve for 5 epochs. If you don't have a GPU, make sure to swap [`CuDNNLSTM`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/CuDNNLSTM) for [`LSTM`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM).

```python
def model_fn(params):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.CuDNNLSTM(params["lstm_size"], input_shape=(30, 2)))
    model.add(tf.keras.layers.Dropout(params["dropout"]))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer=tf.keras.optimizers.Adam(params["learning_rate"]),
                  loss="binary_crossentropy", metrics=["accuracy"])
    callbacks = [tf.keras.callbacks.EarlyStopping(monitor="val_acc", patience=5,
                                                  restore_best_weights=True)]
    history = model.fit_generator(train_generator, validation_data=val_generator,
                                  callbacks=callbacks, epochs=100, verbose=0).history
    return (history, model)
```

---

## 7. Create `random_search`

We'll use `random_search` to optimize hyperparameters; it runs a [random search](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) and saves the results and best model in `search_dir`.

```python
def random_search(model_fn, search_space, n_iter, search_dir):
    results = []
    os.mkdir(search_dir)
    best_model_path = os.path.join(search_dir, "best_model.h5")
    results_path = os.path.join(search_dir, "results.csv")
    for i in range(n_iter):
        # Sample a random combination of hyperparameters.
        params = {k: v[np.random.randint(len(v))] for k, v in search_space.items()}
        history, model = model_fn(params)
        # Record the metrics from the epoch with the best validation accuracy.
        epochs = np.argmax(history["val_acc"]) + 1
        result = {k: v[epochs - 1] for k, v in history.items()}
        params["epochs"] = epochs
        # Save the model if it beats the best validation accuracy so far.
        if i == 0:
            best_val_acc = result["val_acc"]
            model.save(best_model_path)
        if result["val_acc"] > best_val_acc:
            best_val_acc = result["val_acc"]
            model.save(best_model_path)
        result = {**params, **result}
        results.append(result)
        tf.keras.backend.clear_session()
        print(f"iteration {i + 1} – {', '.join(f'{k}:{v:.4g}' for k, v in result.items())}")
    best_model = tf.keras.models.load_model(best_model_path)
    results = pd.DataFrame(results)
    results.to_csv(results_path)
    return (results, best_model)
```
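Before launching a full search, it can help to sanity-check the pieces with a single training run. Here's a hypothetical call with hand-picked, untuned parameter values:

```python
# Hypothetical single run with illustrative, untuned parameters.
history, model = model_fn({"lstm_size": 100, "dropout": 0.2, "learning_rate": 0.005})
print(max(history["val_acc"]))  # best validation accuracy across epochs
```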

---

## 8. Run random search

We'll run the random search for 200 iterations. It should take somewhere between 10 and 90 minutes to complete, depending on your hardware.

```python
search_space = {"lstm_size": np.linspace(50, 200, 16, dtype=int),
                "dropout": np.linspace(0, 0.4, 9),
                "learning_rate": np.linspace(0.004, 0.01, 13)}
```

```python
results, best_model = random_search(model_fn, search_space, 200, "search")
```

```python
results.sort_values("val_acc", ascending=False).head()
```

```output
          acc  dropout  epochs  learning_rate      loss  lstm_size   val_acc  val_loss
90   0.512528     0.20       3         0.0090  0.690524        190  0.581081  0.689752
100  0.509339     0.35       2         0.0080  0.692828         80  0.576577  0.689374
101  0.518451     0.20       2         0.0080  0.692348        100  0.576577  0.689398
114  0.536219     0.35       6         0.0085  0.690760         70  0.567568  0.692731
140  0.526196     0.00       2         0.0070  0.691741         70  0.567568  0.691383
```

---

## 9. Evaluate final model

All that's left is to evaluate our final model over the test period.

```python
best_model.evaluate_generator(test_generator)
```

```output
[0.6878005266189575, 0.5515695]
```

It achieved an accuracy of 55.2%. Not bad at all!
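For context, it's worth comparing that figure with the naive baseline of always predicting the majority class over the same test samples. This is a hypothetical check that isn't part of the tutorial's pipeline; it assumes the generator's targets start `length` rows after `test_start_iloc`, which matches `TimeseriesGenerator`'s indexing:

```python
# Hypothetical baseline check: accuracy of always predicting the
# majority class over the test generator's targets (which start
# 30 rows after test_start_iloc, matching length=30).
test_labels = ezj["label"].iloc[test_start_iloc + 30:]
baseline_acc = max(test_labels.mean(), 1 - test_labels.mean())
print(f"majority-class baseline: {baseline_acc:.1%}")
```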
