Commit 5fbdbcea authored by Mathilde Rineau's avatar Mathilde Rineau 🙂
Browse files

Update problem_remy.ipynb

parent d63a42f8
......@@ -42,7 +42,7 @@
"outputs": [],
"source": [
"# reading csv file\n",
"ts = pd.read_csv(\"data/debitcards.csv\", index_col = 0,parse_dates=True)\n",
"ts = pd.read_csv(\"debitcards.csv\", index_col = 0,parse_dates=True)\n",
"print(ts)"
]
},
......@@ -284,6 +284,196 @@
"- The amount in february is a little less than the amount in january\n",
"- The amount grows after february"
]
},
{
"cell_type": "markdown",
"id": "b88f48fe",
"metadata": {},
"source": [
"Another method consists of using the [LOESS](https://fr.wikipedia.org/wiki/R%C3%A9gression_locale)"
]
},
{
"cell_type": "markdown",
"id": "6cca3ef7",
"metadata": {},
"source": [
"We decompose the data with the STL function\n",
"\n",
"According to the [manual](https://www.statsmodels.org/v0.11.0/generated/statsmodels.tsa.seasonal.STL.html) ,\"The period (12) is automatically detected from the data’s frequency (‘M’)\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0316887d",
"metadata": {},
"outputs": [],
"source": [
"from statsmodels.tsa.seasonal import STL\n",
"res = STL(ts).fit()\n",
"res.plot()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "4af9c502",
"metadata": {},
"source": [
"We obtain a decomposition with the trend, the seasonal component and a stationary time series (the resid)\n",
"\n",
"We have to verify if is the resid is really stationary"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58dd2521",
"metadata": {},
"outputs": [],
"source": [
"test = adfuller(res.resid, autolag='AIC')\n",
"pvalue = test[1]\n",
"print(pvalue)"
]
},
{
"cell_type": "markdown",
"id": "3fbbfdfa",
"metadata": {},
"source": [
"The p-value is approximately 7.5e-11 so we can conclude that the resid is a stationary time series (as expected)"
]
},
{
"cell_type": "markdown",
"id": "46547d97",
"metadata": {},
"source": [
"We divide the data into two data sets: a training data set and a test dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0fb793a3",
"metadata": {},
"outputs": [],
"source": [
"ts_train=ts[:\"2009-01-01\"]\n",
"ts_test=ts[\"2009-01-01\":]"
]
},
{
"cell_type": "markdown",
"id": "bc7757d9",
"metadata": {},
"source": [
"We use the `forecast` function of STL on the training data set in order to compare the result with the test data set and we plot them together\n",
"\n",
"According to the [manual] (https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html), we use an ARIMA model because we are working with integration models"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "720cf330",
"metadata": {},
"outputs": [],
"source": [
"from statsmodels.tsa.arima.model import ARIMA\n",
"from statsmodels.tsa.forecasting.stl import STLForecast\n",
"\n",
"stlf = STLForecast(ts_train, ARIMA, model_kwargs=dict(order=(1, 1, 0), trend=\"t\"))\n",
"stlf_res = stlf.fit()\n",
"\n",
"forecast = stlf_res.forecast(48)\n",
"plt.plot(ts)\n",
"plt.plot(forecast)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "8c4a021f",
"metadata": {},
"source": [
"The orange curve is the forecast data, we can see that the curve is closed to the real data (the blue curve) on the test dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b0567e95",
"metadata": {},
"outputs": [],
"source": [
"mse = np.mean((forecast - ts_test.V1)**2)\n",
"print(mse)\n",
"error = np.mean(abs(forecast - ts_test.V1))\n",
"print(error)"
]
},
{
"cell_type": "markdown",
"id": "4e276ed2",
"metadata": {},
"source": [
"We compute the mean of the absolute value of the error and the mean squared error\n",
"\n",
"The MSE is huge but the mean of the absolute value of the error seems acceptable"
]
},
{
"cell_type": "markdown",
"id": "08326b2b",
"metadata": {},
"source": [
"To improve the prediction, we re-compute the model on the whole data set and we compute the forecasting for the 4 first months of 2013"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "da0c7ac1",
"metadata": {},
"outputs": [],
"source": [
"stlf = STLForecast(ts, ARIMA, model_kwargs=dict(order=(1, 1, 0), trend=\"t\"))\n",
"stlf_res = stlf.fit()\n",
"\n",
"forecast = stlf_res.forecast(4)\n",
"plt.plot(ts)\n",
"plt.plot(forecast)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "c010c89a",
"metadata": {},
"source": [
"We compute the cumulative value of the 4 first months of 2013"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15d1be3b",
"metadata": {},
"outputs": [],
"source": [
"print(\"Cumulative value of the 4 first months of 2013: \",np.sum(forecast))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f2bba421",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
......@@ -291,7 +481,8 @@
"hash": "78ff2a7d75990e26f7862f23aec114522929670ec71bbfd9a70bdb18a9100993"
},
"kernelspec": {
"display_name": "Python 3.8.10 64-bit ('AOS1-3HAiNONq': pipenv)",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
......@@ -303,7 +494,8 @@
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
......
%% Cell type:markdown id:b04d6b74 tags:
# ASO1 Problem
Authors: Remy Huet, Mathilde Rineau
Date 24/10/2021
%% Cell type:markdown id:ce33a2fc tags:
Subject:
We have the monthly retail debit card usage in Iceland (million ISK) from january 2000 to december 2012.
We want to estimate the cumulated debit card usage during the 4 first months of 2013.
%% Cell type:code id:bd017ee6 tags:
```
``` python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
```
%% Cell type:code id:9f8d9f85 tags:
```
``` python
# reading csv file
ts = pd.read_csv("data/debitcards.csv", index_col = 0,parse_dates=True)
ts = pd.read_csv("debitcards.csv", index_col = 0,parse_dates=True)
print(ts)
```
%% Cell type:code id:a3686021 tags:
```
``` python
# verification on the data
assert(ts.shape == (156, 1))
assert(type(ts.index) is pd.core.indexes.datetimes.DatetimeIndex)
```
%% Cell type:code id:c5271112 tags:
```
``` python
# MS: month start frequency
ts.index.freq = "MS"
```
%% Cell type:code id:919229a4 tags:
```
``` python
plt.plot(ts.V1)
```
%% Cell type:markdown id:615221ca tags:
By plotting the data, we can see that the expectancy and the standard deviation do not seem to be constant so the time series is probably not stationary.
But, we perform a augmented Dickey-Fuller test to decide if it is or not a stationary time series.
%% Cell type:code id:61b00901 tags:
```
``` python
from statsmodels.tsa.stattools import adfuller
#perform augmented Dickey-Fuller test
test = adfuller(ts.V1, autolag='AIC')
pvalue = test[1]
print(pvalue)
```
%% Cell type:markdown id:a49775f2 tags:
The given p-value is 0.79 so we are highly confident that the data is not stationary, as we expected.
%% Cell type:markdown id:bf951e7a tags:
By inspecting the data, we fist see a trend (debit card usage increases over time).
We also see regular peaks.
We will "zoom" on the data to see when those peaks append.
%% Cell type:code id:68947c59 tags:
```
``` python
plt.rcParams['figure.figsize'] = [12, 5]
ts_zoom = ts['2000-01-01':'2003-01-01']
plt.plot(ts_zoom)
```
%% Cell type:markdown id:701d104d tags:
We can see that the peaks seems to appear annually in december (which is quite logical).
We will thus presume a seasonality of 12 months on the data.
We thus have :
- A global increasing trend over time
- A seasonal effect with a period of twelve months
- A (maybe) stationary time series
%% Cell type:markdown id:066f28ff tags:
We will first bet on a constant augmentation.
We will thus use an integration of order 1 to reduce this effect.
By reading [the documentation on SARIMAX](https://www.statsmodels.org/dev/examples/notebooks/generated/statespace_sarimax_stata.html#ARIMA-Example-2:-Arima-with-additive-seasonal-effects) we decided to try the following :
%% Cell type:code id:8f73ab30 tags:
```
``` python
from statsmodels.tsa.statespace.sarimax import SARIMAX as sarimax
ar = 1 # Max dregree of the polynomial
ma = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1) # This is a seasonal effect on twelve months
i = 1
model = sarimax(ts.V1, trend='c', order=(ar, i, ma))
res = model.fit()
plt.rcParams['figure.figsize'] = [10, 10]
_ = res.plot_diagnostics()
```
%% Cell type:markdown id:de02fe39 tags:
We can see with this diagnostics that the residuals are not really normally distributed and that there is some correlation on them.
%% Cell type:markdown id:ed813d90 tags:
By re-inspecting our data, we see that the variance might not be constant.
To counter this effect, we will try to use the same ARIMA model on the log of the data.
%% Cell type:code id:7793fe53 tags:
```
``` python
ts.log_V1 = np.log(ts.V1)
ts.diff_log_V1 = ts.log_V1.diff()
# Graph data
fig, axes = plt.subplots(1, 2, figsize=(15,4))
# Levels
axes[0].plot(ts.index, ts.V1, '-')
axes[0].set(title='Original data')
# Log difference
axes[1].plot(ts.index, ts.diff_log_V1, '-')
axes[1].hlines(0, ts.index[0], ts.index[-1], 'r')
axes[1].set(title='Diff of the log of the data')
plt.show()
```
%% Cell type:markdown id:7507165f tags:
By using the diff of the log, we seems to retrieve something stationary with a clear seasonal effect.
We will try our previous model on the log of the data to see if the results are better.
%% Cell type:code id:f4bc108f tags:
```
``` python
model = sarimax(ts.log_V1, trend='c', order=(ar, i, ma))
res = model.fit()
plt.rcParams['figure.figsize'] = [10, 10]
_ = res.plot_diagnostics()
```
%% Cell type:markdown id:56192cab tags:
The residuals seems to be very close to a normal distribution (especially on the Q-Q plot), but we see some correlation between them.
%% Cell type:markdown id:979b8d56 tags:
Using this model, we can try to predict the cumulated debit card usage for the 4 first months of 2013.
%% Cell type:code id:a23c5733 tags:
```
``` python
forecast = np.exp(res.forecast(4))
ts.plot(label='Data', legend=True)
forecast.plot(label='Forecast', legend=True)
```
%% Cell type:markdown id:b83cccc1 tags:
The obtained predictions seems coherent with our data :
- The amount in january is far less than the amount of the peak of december;
- The amount in february is a little less than the amount in january
- The amount grows after february
%% Cell type:markdown id:b88f48fe tags:
Another method consists of using the [LOESS](https://fr.wikipedia.org/wiki/R%C3%A9gression_locale)
%% Cell type:markdown id:6cca3ef7 tags:
We decompose the data with the STL function
According to the [manual](https://www.statsmodels.org/v0.11.0/generated/statsmodels.tsa.seasonal.STL.html) ,"The period (12) is automatically detected from the data’s frequency (‘M’)".
%% Cell type:code id:0316887d tags:
``` python
from statsmodels.tsa.seasonal import STL
res = STL(ts).fit()
res.plot()
plt.show()
```
%% Cell type:markdown id:4af9c502 tags:
We obtain a decomposition with the trend, the seasonal component and a stationary time series (the resid)
We have to verify if is the resid is really stationary
%% Cell type:code id:58dd2521 tags:
``` python
test = adfuller(res.resid, autolag='AIC')
pvalue = test[1]
print(pvalue)
```
%% Cell type:markdown id:3fbbfdfa tags:
The p-value is approximately 7.5e-11 so we can conclude that the resid is a stationary time series (as expected)
%% Cell type:markdown id:46547d97 tags:
We divide the data into two data sets: a training data set and a test dataset
%% Cell type:code id:0fb793a3 tags:
``` python
ts_train=ts[:"2009-01-01"]
ts_test=ts["2009-01-01":]
```
%% Cell type:markdown id:bc7757d9 tags:
We use the `forecast` function of STL on the training data set in order to compare the result with the test data set and we plot them together
According to the [manual] (https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html), we use an ARIMA model because we are working with integration models
%% Cell type:code id:720cf330 tags:
``` python
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.forecasting.stl import STLForecast
stlf = STLForecast(ts_train, ARIMA, model_kwargs=dict(order=(1, 1, 0), trend="t"))
stlf_res = stlf.fit()
forecast = stlf_res.forecast(48)
plt.plot(ts)
plt.plot(forecast)
plt.show()
```
%% Cell type:markdown id:8c4a021f tags:
The orange curve is the forecast data, we can see that the curve is closed to the real data (the blue curve) on the test dataset
%% Cell type:code id:b0567e95 tags:
``` python
mse = np.mean((forecast - ts_test.V1)**2)
print(mse)
error = np.mean(abs(forecast - ts_test.V1))
print(error)
```
%% Cell type:markdown id:4e276ed2 tags:
We compute the mean of the absolute value of the error and the mean squared error
The MSE is huge but the mean of the absolute value of the error seems acceptable
%% Cell type:markdown id:08326b2b tags:
To improve the prediction, we re-compute the model on the whole data set and we compute the forecasting for the 4 first months of 2013
%% Cell type:code id:da0c7ac1 tags:
``` python
stlf = STLForecast(ts, ARIMA, model_kwargs=dict(order=(1, 1, 0), trend="t"))
stlf_res = stlf.fit()
forecast = stlf_res.forecast(4)
plt.plot(ts)
plt.plot(forecast)
plt.show()
```
%% Cell type:markdown id:c010c89a tags:
We compute the cumulative value of the 4 first months of 2013
%% Cell type:code id:15d1be3b tags:
``` python
print("Cumulative value of the 4 first months of 2013: ",np.sum(forecast))
```
%% Cell type:code id:f2bba421 tags:
``` python
```
......
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment