Stock Forecasting
Time Series Analysis of MSFT
I want to be entirely clear: I doubt anyone could predict stocks in a profitable and accurate way, least of all me. It’s still worth a try, just in case. If anything, this project is an introduction to time series data and modeling for some future work.
Let’s try to make some predictions using the closing price of MSFT stock.
But before we begin, let’s start with some mathematical formulation.
Mathematical Background
Time series data is data taken sequentially through time, with a minimum time period between subsequent data points. The data are indexed by timestamps.
For the purposes of explanation, “lagged” data means data shifted back by n time periods (so a single lag of daily data is the previous day’s value).
ARIMA
The basic Autoregressive Integrated Moving Average (ARIMA) model is given by the following equation:

\hat{Y}_t = C + \sum_{i=1}^{p} \phi_i Y_{t-i} + \sum_{i=1}^{q} \theta_i e_{t-i}

Where the Y_{t-i}’s are the values of the response variable at earlier times, and the corresponding \phi_i’s are their respective coefficients. \hat{Y}_t is the predicted value at time t. The e_{t-i}’s are the moving average (error) terms, the corresponding \theta_i’s are their respective coefficients, and C is an intercept.
The standard notation for an ARIMA model is ARIMA(p,d,q), with p representing the number of time-lagged (autoregressive) terms, d the number of nonseasonal differences applied to induce stationarity (meaning the mean and variance do not change statistically significantly over time), and q the number of moving average terms.
SARIMA
This is really just a modification of the ARIMA model with a seasonality component added in. The notation is very similar: a SARIMA model is represented as ARIMA(p,d,q)(P,D,Q)m. The parameters P, D, and Q represent the number of autoregressive terms, seasonal differences, and moving average terms for the seasonal part, and m represents the number of periods in each season.
Data Preparation
You can download the data from various sites, or if you want to get sophisticated, you can use a broker’s API to obtain that data. I initially downloaded data from MarketWatch, but I wanted to mess around with TD Ameritrade’s API and obtain data without having to leave the notebook.
Honestly, the API was a pain to figure out. The documentation on TD Ameritrade’s API page is not especially thorough, and there was a pretty steep learning curve (for someone who hasn’t really tangled with APIs before). I provide a link to my code below, with the API request code included.
Despite all that, here’s the data I managed to pull out of the API.
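For reference, such a request looks roughly like the sketch below. The endpoint and parameter names follow TD Ameritrade’s price-history documentation as I understand it, so treat them as assumptions, and the sample payload is made up for illustration:

```python
import pandas as pd
# import requests  # uncomment to actually hit the API

# Assumed TD Ameritrade price-history endpoint and parameters; check the
# API docs before relying on these names.
URL = "https://api.tdameritrade.com/v1/marketdata/MSFT/pricehistory"
PARAMS = {
    "apikey": "YOUR_API_KEY",  # placeholder
    "periodType": "year",
    "period": 20,
    "frequencyType": "daily",
    "frequency": 1,
}
# payload = requests.get(URL, params=PARAMS).json()

# The response contains a "candles" list; each candle holds OHLCV fields
# and a millisecond epoch timestamp. Sample payload for illustration:
payload = {"candles": [
    {"open": 250.0, "high": 252.1, "low": 248.7, "close": 251.3,
     "volume": 21000000, "datetime": 1655092800000},
]}

def candles_to_frame(payload):
    """Turn the candles list into a timestamp-indexed DataFrame."""
    df = pd.DataFrame(payload["candles"])
    df["datetime"] = pd.to_datetime(df["datetime"], unit="ms")
    return df.set_index("datetime")

df = candles_to_frame(payload)
print(df["close"])
```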
Data Analysis
Now that I have the stock data from the TD Ameritrade API, I need to figure out the nature of the data: how many time lags are highly correlated with the current time step, what type of mathematical relationship those lags have with the current time step (linear, exponential, etc.), whether the data is seasonal, and whether it is stationary.
Seasonality and Stationarity
Often, time series data expresses some type of seasonality, where the value of the current time step is partially influenced by the time index of that value.
One way to analyze the effect of the time index is through a periodogram, which indicates the strength of each frequency in a time series, and by plotting the differenced data (each value minus the previous value).
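A minimal sketch of both diagnostics, using scipy and a synthetic stand-in series rather than the real MSFT closes:

```python
import numpy as np
from scipy.signal import periodogram

# Synthetic stand-in for the closing-price series: trend plus noise.
rng = np.random.default_rng(1)
close = 100 + 0.05 * np.arange(500) + rng.normal(scale=1.0, size=500)

# Periodogram: power at each frequency (cycles per trading day).
# A strong seasonal pattern would show up as a sharp spike.
freqs, power = periodogram(close, detrend="linear")

# First difference: today's close minus yesterday's close.
diffed = np.diff(close)

print(freqs[np.argmax(power)])  # dominant frequency
print(diffed.mean())
```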
Looking at the first plot, it doesn’t seem as though the closing price of MSFT stock has very much seasonality. If it did, I could use a series of Fourier components and then remove most of the seasonality from the data.
Examining the second plot, it doesn’t seem as though there are any significant seasonal trends either, which is consistent with the results of the first plot.
If there were seasonality, built-in packages allow for a seasonal decomposition.
The first way to think about seasonality is through an additive model, given by:

V_t = T_t + Y_t + e_t

Or by a multiplicative model, given by:

V_t = T_t \times Y_t \times e_t

Where V_t is the value of the series at time t, T_t is the overall trend in the data, Y_t is the seasonal component, and e_t is the error for that value.
If the data were seasonal, it could be decomposed by specifying either a multiplicative or an additive model and applying a seasonal decomposition. This works by identifying the trend in the data and removing it over the specified period.
Multiplicative Decomposition
Ideally, if this were an accurate seasonal decomposition, there would be no pattern in the residuals; they would be close to zero and relatively constant over time.
The next question to ask is whether this data is stationary or not, otherwise it will be quite difficult to build an accurate ARIMA model for this series.
The standard deviation is fairly constant across time, but the mean is increasing, indicating less-than-perfect stationarity. That is something that will need to be corrected for.
Detrending, Number of Lags, and Moving Average Terms
Before I identify the number of lagged terms and moving average or differenced terms, I want to remove the overall trend from the data, to hopefully make the data more stationary.
I’ll perform a linear regression on the training dataset to determine the overall trend in the stock price, and remove that from the data.
With the trend removed, the data will likely be more stationary. The final model will consist of adding this trend together with the result of the ARIMA model.
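A minimal sketch of that detrending step with numpy’s polyfit, on a synthetic trending series standing in for the training closes:

```python
import numpy as np

# Synthetic trending series standing in for the training closes.
rng = np.random.default_rng(4)
t = np.arange(400)
close = 100 + 0.08 * t + rng.normal(scale=2.0, size=400)

# Fit a first-degree polynomial (straight line) to price vs. time index.
slope, intercept = np.polyfit(t, close, deg=1)
trend = slope * t + intercept

# Remove the trend; the ARIMA model is fit on this detrended series,
# and the trend is added back to its forecasts at the end.
detrended = close - trend
print(detrended.mean())
```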
Now I’m going to dive into deciding which lags and how many lags to include in the model.
The partial autocorrelation function is given as follows:

\phi_{kk} = \operatorname{Corr}(y_t,\, y_{t+k} \mid y_{t+1}, \dots, y_{t+k-1})

In the context of time series analysis, this represents the correlation of the current value y_t with the value k steps away, y_{t+k}, adjusted for the correlations between y_t and the intermediate values y_{t+1} through y_{t+k-1}.
Using Python’s plotting functionality, it becomes clear which lags have high partial autocorrelation:
The optimal number of lags appears to be two. While some later lags are also statistically significant (their bars extend beyond the blue shaded area representing a 99% confidence interval), none are nearly as highly correlated as the first two lags. So the value of p in the ARIMA model should be 2.
It’s also important to determine what mathematical relationship these lags share with the current value and whether any transformations should be performed on the data before putting them into the ARIMA model.
Luckily enough, it seems as though the first six lags have a linear relationship with the current value.
I’m also going to look at the autocorrelation plot of the lags to determine the value of q, the number of moving average terms to include.
Few lags rise above the shaded significance area, and the ones that do have low autocorrelation with the current value. So the model will likely perform best with a low q value (between 0 and 3).
So to recap, the optimal number of lags, p, indicated by the graph of the partial autocorrelation function, appears to be 2.
From the autocorrelation plot of the differenced MSFT stock close data, the ideal q value seems to be somewhere between 0 and 3.
Luckily, there’s a way to automatically find the optimal parameters and check whether the optimal ARIMA model matches my interpretation of the ARIMA parameters from the previous plots.
Overall it seems as though the parameters I determined through the diagnostic plots are the optimal parameters for building this ARIMA model, p = 2, d = 1, and q = 0.
Model Selection
Linear Regression
One day previous lag, forecasting one day ahead
This model, which predicted the next day’s closing price from the current day’s closing price, volume, open, high, and low, performed the best based on the root mean squared error on the validation data. Still, I’d say this model is not necessarily the most useful, though it’s interesting nonetheless. It is effectively simulating a one-step random walk, and between transaction fees and imperfect accuracy, very little if any profit could be generated from a model like this.
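A sketch of this kind of one-day-ahead regression with scikit-learn; the features here are synthetic stand-ins for the real open/high/low/close/volume columns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic OHLCV-style features and next-day close targets.
rng = np.random.default_rng(8)
n = 500
close = 100 + np.cumsum(rng.normal(size=n))
features = np.column_stack([
    close,                                   # today's close
    close + rng.normal(scale=0.5, size=n),   # stand-ins for open/high/low
    close + rng.normal(scale=0.5, size=n),
    close - rng.normal(scale=0.5, size=n),
    rng.integers(1_000_000, 50_000_000, size=n),  # volume
])

# Predict tomorrow's close from today's features: drop the last feature
# row and the first close so the pairs line up one day apart.
X, y = features[:-1], close[1:]
split = int(0.8 * len(X))
model = LinearRegression().fit(X[:split], y[:split])

rmse = np.sqrt(np.mean((model.predict(X[split:]) - y[split:]) ** 2))
print(round(rmse, 2))
```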
ARIMA
Base ARIMA model, forecasting 20 days ahead
This model is by far the worst performing of the three models tested. However, unlike the first model, it generated a longer forecast (20 days into the future) with a decent degree of accuracy, though it slightly overestimated the performance of MSFT stock.
SARIMA
ARIMA with seasonality, forecasting 20 days ahead
With a root mean squared error of approximately 8.06 on the validation data, this model performed second best by that metric. However, given the larger forecasting window (again, 20 days), I think this model is the most useful, and I will use it on the full dataset to make predictions about MSFT stock 20 days beyond June 13th, 2022.
Predictions
Now that I’ve found the model that will be the most useful for the task ahead, it’s time to implement it.
Before fully diving in, there’s a little bit of setup: retraining the model on the full dataset, from June 13th, 2002 all the way through June 13th, 2022, and building a time index with the relevant dates (excluding weekends and holidays, since the market is closed on those days).
Finally, I let the model make a forecast 20 market days into the future, all the way to July 12th, 2022, and added back in the linear trend I made using the training data.
Looking at the predicted closing price for July 13th, 2022, my model predicted $245.13 while the actual price was $252.72, around a 3% error on the final prediction, with a root mean squared error of 27.72 over the interval.
All things considered, given that the stock was priced at over $200 and the task I attempted (simulating a random walk 20 timesteps ahead), I’m pretty happy with the model’s performance.