Commit d63a42f8 authored by Rémy Huet

Make prediction

parent f0d4a406
%% Cell type:markdown id:b04d6b74 tags:
# ASO1 Problem
Authors: Remy Huet, Mathilde Rineau
Date 24/10/2021
%% Cell type:markdown id:ce33a2fc tags:
Subject:
We have the monthly retail debit card usage in Iceland (in million ISK) from January 2000 to December 2012.
We want to estimate the cumulative debit card usage over the first four months of 2013.
%% Cell type:code id:bd017ee6 tags:
```
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
```
%% Cell type:code id:9f8d9f85 tags:
```
# read the CSV file, using the first column as a datetime index
ts = pd.read_csv("data/debitcards.csv", index_col=0, parse_dates=True)
print(ts)
```
%% Cell type:code id:a3686021 tags:
```
# sanity checks on the data: 156 monthly observations, one column, datetime index
assert ts.shape == (156, 1)
assert isinstance(ts.index, pd.DatetimeIndex)
```
%% Cell type:code id:c5271112 tags:
```
# MS: month start frequency
ts.index.freq = "MS"
```
%% Cell type:code id:919229a4 tags:
```
plt.plot(ts.V1)
```
%% Cell type:markdown id:615221ca tags:
The plot suggests that neither the mean nor the standard deviation is constant over time, so the time series is probably not stationary.
We nevertheless run an augmented Dickey-Fuller test to check this.
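
The visual impression can be made more concrete with rolling statistics. The following is a minimal sketch on a synthetic series: the numbers are made up and only mimic the general shape of the data, and `demo` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (made-up numbers, NOT the debit card data):
# a linear trend plus a December peak whose size grows with the level.
index = pd.date_range("2000-01-01", periods=156, freq="MS")
level = 10_000 + 50 * np.arange(156)
peak = np.where(index.month == 12, 0.3, 0.0)
demo = pd.Series(level * (1 + peak), index=index)

# 12-month rolling mean and standard deviation
rolling_mean = demo.rolling(window=12).mean()
rolling_std = demo.rolling(window=12).std()

# For a stationary series both would stay roughly flat over time;
# here we compare the first and the last complete 12-month window.
print(rolling_mean.dropna().iloc[[0, -1]])
print(rolling_std.dropna().iloc[[0, -1]])
```

On the real data, the same two calls could be applied to `ts.V1` and plotted next to the series.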
%% Cell type:code id:61b00901 tags:
```
from statsmodels.tsa.stattools import adfuller
# perform the augmented Dickey-Fuller test (null hypothesis: the series has a unit root)
test = adfuller(ts.V1, autolag='AIC')
pvalue = test[1]
print(pvalue)
```
%% Cell type:markdown id:a195677c tags:
%% Cell type:markdown id:a49775f2 tags:
The p-value is 0.79, so we cannot reject the null hypothesis of a unit root: the test gives no evidence that the data is stationary, as we expected.
%% Cell type:markdown id:3d4ddcfd tags:
%% Cell type:markdown id:bf951e7a tags:
By inspecting the data, we first see a trend (debit card usage increases over time).
We also see regular peaks.
We will "zoom" in on the data to see when those peaks happen.
%% Cell type:code id:05987f75 tags:
%% Cell type:code id:68947c59 tags:
```
plt.rcParams['figure.figsize'] = [12, 5]
ts_zoom = ts.loc['2000-01-01':'2003-01-01']  # first three years only
plt.plot(ts_zoom)
```
%% Cell type:markdown id:fdc7d106 tags:
%% Cell type:markdown id:701d104d tags:
We can see that the peaks seem to occur every year in December (which is quite logical).
We will thus assume a seasonality of 12 months in the data.
We thus have:
- A global increasing trend over time
- A seasonal effect with a period of twelve months
- A remainder that is (maybe) stationary
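
To illustrate how differencing acts on the first two components, here is a minimal sketch on a synthetic series (made-up values, `demo` is a hypothetical name). The lag-12 difference is shown for illustration only; it is not necessarily the approach taken in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (hypothetical values, not the real data):
# linear trend + a fixed additive pattern repeating every 12 months.
index = pd.date_range("2000-01-01", periods=48, freq="MS")
trend = 100.0 + 2.0 * np.arange(48)
seasonal = np.tile([0, -3, 1, 0, 2, 1, 3, 2, 0, 1, 0, 15], 4)
demo = pd.Series(trend + seasonal, index=index)

# First difference: removes the linear trend, but the 12-month pattern remains.
first_diff = demo.diff().dropna()

# Additional difference at lag 12: removes the repeating pattern as well.
both_diff = first_diff.diff(12).dropna()

print(first_diff.head(12))
print(both_diff.abs().max())
```

In this idealized case nothing is left after both differences; on real data, a (hopefully stationary) remainder would be left.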
%% Cell type:markdown id:d7839e3a tags:
%% Cell type:markdown id:066f28ff tags:
We will first assume that the increase is constant over time (a linear trend).
We will thus use an integration of order 1 (first differencing) to remove this effect.
After reading [the documentation on SARIMAX](https://www.statsmodels.org/dev/examples/notebooks/generated/statespace_sarimax_stata.html#ARIMA-Example-2:-Arima-with-additive-seasonal-effects), we decided to try the following:
%% Cell type:code id:d2d59724 tags:
%% Cell type:code id:8f73ab30 tags:
```
from statsmodels.tsa.statespace.sarimax import SARIMAX as sarimax
ar = 1  # AR order (degree of the autoregressive polynomial)
ma = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)  # MA lags to include: 1 and 12 (seasonal effect over twelve months)
i = 1  # order of integration (differencing)
model = sarimax(ts.V1, trend='c', order=(ar, i, ma))
res = model.fit()
plt.rcParams['figure.figsize'] = [10, 10]
_ = res.plot_diagnostics()
```
%% Cell type:markdown id:46a5c37e tags:
%% Cell type:markdown id:de02fe39 tags:
These diagnostics show that the residuals are not really normally distributed and that some autocorrelation remains in them.
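
For reference, the specification fitted above (`order=(1, 1, (1, 0, ..., 0, 1))` with `trend='c'`) can be written in lag-operator notation, following the convention of the linked statsmodels example. This is a sketch of our reading of that convention: $L$ is the lag operator, $\Delta y_t = y_t - y_{t-1}$, $\epsilon_t$ is white noise, and $c$, $\phi_1$, $\theta_1$, $\theta_{12}$ are the estimated parameters.

$$
(1 - \phi_1 L)\,\Delta y_t = c + (1 + \theta_1 L + \theta_{12} L^{12})\,\epsilon_t
$$

The seasonal effect therefore enters additively, through the single extra moving-average term at lag 12.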
%% Cell type:markdown id:d9eed6c7 tags:
%% Cell type:markdown id:ed813d90 tags:
Looking at the data again, we see that the variance might not be constant.
To counter this effect, we will try the same ARIMA model on the log of the data.
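
The following minimal sketch shows why the log can help when fluctuations scale with the level. It uses a synthetic, purely multiplicative series (made-up numbers; `spread` is a hypothetical helper), not the debit card data.

```python
import numpy as np

# Synthetic series (made-up numbers): exponential growth times a
# multiplicative yearly pattern, so fluctuations scale with the level.
months = np.arange(120)
pattern = np.tile([1.0, 0.95, 1.0, 1.0, 1.05, 1.0, 1.1, 1.05, 1.0, 1.0, 1.0, 1.4], 10)
y = 1000.0 * np.exp(0.01 * months) * pattern


def spread(x):
    """Standard deviation of the first differences of x."""
    return np.diff(x).std()


# Ratio of the spread in the last five years to the spread in the first five.
raw_ratio = spread(y[60:]) / spread(y[:60])
log_ratio = spread(np.log(y[60:])) / spread(np.log(y[:60]))

print(f"raw: {raw_ratio:.2f}, log: {log_ratio:.2f}")
```

A ratio close to 1 means the spread of the month-to-month changes is the same at the start and at the end of the series; in this idealized case that only holds on the log scale.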
%% Cell type:code id:7c3f26a5 tags:
%% Cell type:code id:7793fe53 tags:
```
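# NOTE: attribute assignment stores these Series as plain attributes of `ts`,
# not as DataFrame columns (pandas emits a UserWarning when doing this)
ts.log_V1 = np.log(ts.V1)
ts.diff_log_V1 = ts.log_V1.diff()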
# Graph data
fig, axes = plt.subplots(1, 2, figsize=(15,4))
# Levels
axes[0].plot(ts.index, ts.V1, '-')
axes[0].set(title='Original data')
# Log difference
axes[1].plot(ts.index, ts.diff_log_V1, '-')
axes[1].hlines(0, ts.index[0], ts.index[-1], 'r')
axes[1].set(title='Diff of the log of the data')
plt.show()
```
%% Cell type:markdown id:e48b1dd4 tags:
%% Cell type:markdown id:7507165f tags:
By using the diff of the log, we seem to obtain something stationary with a clear seasonal effect.
We will try our previous model on the log of the data to see if the results are better.
%% Cell type:code id:cbc86671 tags:
%% Cell type:code id:f4bc108f tags:
```
model = sarimax(ts.log_V1, trend='c', order=(ar, i, ma))
res = model.fit()
plt.rcParams['figure.figsize'] = [10, 10]
_ = res.plot_diagnostics()
```
%% Cell type:markdown id:5845d2dd tags:
%% Cell type:markdown id:56192cab tags:
The residuals seem to be very close to a normal distribution (especially on the Q-Q plot), but we still see some autocorrelation between them.
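
The remaining correlation can be quantified by comparing the sample autocorrelations with an approximate 95% band. This is a minimal sketch: `sample_acf` is a hypothetical helper, and the residuals are synthetic stand-ins, not those of the fitted model.

```python
import numpy as np


def sample_acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )


# Stand-in residuals (synthetic, NOT the notebook's): white noise plus a
# leftover pattern that repeats every 12 observations.
rng = np.random.default_rng(0)
n = 156
resid = rng.normal(size=n) + 4.0 * (np.arange(n) % 12 == 11)

acf = sample_acf(resid, max_lag=24)
band = 1.96 / np.sqrt(n)  # approximate 95% band under the white-noise hypothesis

significant = [k for k in range(1, 25) if abs(acf[k]) > band]
print("lags outside the band:", significant)
```

With the real model, `res.resid` would take the place of `resid`; lags that fall outside the band point to structure the model has not captured.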
%% Cell type:markdown id:979b8d56 tags:
Using this model, we can try to predict the cumulative debit card usage for the first four months of 2013.
%% Cell type:code id:a23c5733 tags:
```
forecast = np.exp(res.forecast(4))
ts.plot(label='Data', legend=True)
forecast.plot(label='Forecast', legend=True)
```
%% Cell type:markdown id:b83cccc1 tags:
The predictions obtained seem consistent with our data:
- The amount in January is far less than the amount of the December peak;
- The amount in February is a little less than the amount in January;
- The amount grows after February.
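
Since the subject asks for the cumulated usage, the last step is to sum the four monthly forecasts after transforming them back from the log scale. This is a minimal sketch: `cumulative_usage` is a hypothetical helper and the forecast values are placeholders, not the model's output.

```python
import numpy as np
import pandas as pd


def cumulative_usage(log_forecast):
    """Back-transform a forecast made on the log scale and sum it."""
    return float(np.exp(log_forecast).sum())


# Placeholder log-scale forecast (made-up values for illustration only;
# in the notebook this would be `res.forecast(4)`).
dates = pd.date_range("2013-01-01", periods=4, freq="MS")
log_forecast = pd.Series(np.log([100.0, 90.0, 110.0, 105.0]), index=dates)

print(cumulative_usage(log_forecast))
```

With the fitted model, `cumulative_usage(res.forecast(4))` would give the estimate in million ISK.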