Commit f0d4a406 authored by Rémy Huet

Little corrections, new work

parent d39a1067
@@ -42,7 +42,7 @@
 "outputs": [],
 "source": [
 "# reading csv file\n",
-"ts = pd.read_csv(\"debitcards.csv\", index_col = 0,parse_dates=True)\n",
+"ts = pd.read_csv(\"data/debitcards.csv\", index_col = 0,parse_dates=True)\n",
 "print(ts)"
 ]
 },
@@ -107,7 +107,7 @@
 "id": "3c24d32d",
 "metadata": {},
 "source": [
-"The p-value is approximately 0.78, so much higher than 0.005 so we accept the hypothesis that the data is stationary."
+"The p-value is approximately 0.78, so much higher than 0.005 so we accept the hypothesis that the data is not stationary."
 ]
 },
 {
@@ -149,7 +149,7 @@
 "id": "cb24ff00",
 "metadata": {},
 "source": [
-"The p-value is approximately 0.71, so much higher than 0.005 so we accept the hypothesis that the data is stationary."
+"The p-value is approximately 0.71, so much higher than 0.005 so we accept the hypothesis that the data is not stationary."
 ]
 },
 {
......@@ -416,9 +416,11 @@
}
],
"metadata": {
"interpreter": {
"hash": "78ff2a7d75990e26f7862f23aec114522929670ec71bbfd9a70bdb18a9100993"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"display_name": "Python 3.8.10 64-bit ('AOS1-3HAiNONq': pipenv)",
"name": "python3"
},
"language_info": {
@@ -430,8 +432,7 @@
 "mimetype": "text/x-python",
 "name": "python",
 "nbconvert_exporter": "python",
-"pygments_lexer": "ipython3",
-"version": "3.8.8"
+"pygments_lexer": "ipython3"
 }
 },
 "nbformat": 4,
%% Cell type:markdown id:b04d6b74 tags:
# ASO1 Problem
Authors: Remy Huet, Mathilde Rineau
Date: 24/10/2021
%% Cell type:markdown id:ce33a2fc tags:
Subject:
We have the monthly retail debit card usage in Iceland (million ISK) from January 2000 to December 2012.
We want to estimate the cumulative debit card usage over the first 4 months of 2013.
%% Cell type:code id:bd017ee6 tags:
``` python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
```
%% Cell type:code id:9f8d9f85 tags:
``` python
# reading csv file
ts = pd.read_csv("data/debitcards.csv", index_col=0, parse_dates=True)
print(ts)
```
%% Cell type:code id:a3686021 tags:
``` python
# verification on the data
assert(ts.shape == (156, 1))
assert(type(ts.index) is pd.core.indexes.datetimes.DatetimeIndex)
```
%% Cell type:code id:c5271112 tags:
``` python
# MS: month start frequency
ts.index.freq = "MS"
```
%% Cell type:code id:919229a4 tags:
``` python
plt.plot(ts.V1)
```
%% Cell type:markdown id:615221ca tags:
By plotting the data, we can see that the mean and the standard deviation do not seem to be constant, so the time series is probably not stationary.
We nevertheless perform an augmented Dickey-Fuller test to decide whether the time series is stationary.
%% Cell type:code id:61b00901 tags:
``` python
from statsmodels.tsa.stattools import adfuller
#perform augmented Dickey-Fuller test
test = adfuller(ts.V1, autolag='AIC')
pvalue = test[1]
print(pvalue)
```
%% Cell type:markdown id:3c24d32d tags:
The p-value is approximately 0.78, much higher than 0.05. Since the null hypothesis of the augmented Dickey-Fuller test is that the series has a unit root, we cannot reject the hypothesis that the data is not stationary.
%% Cell type:markdown id:e19bf18c tags:
We apply a Box-Cox transformation in an attempt to obtain a stationary time series.
%% Cell type:code id:edd8dd63 tags:
``` python
from scipy import stats
# Box-Cox transformation
ts_V1_transform, ts_V1_lambda = stats.boxcox(ts.V1)
plt.plot(ts_V1_transform)
```
%% Cell type:code id:d46d9cc8 tags:
``` python
#perform augmented Dickey-Fuller test
test = adfuller(ts_V1_transform, autolag='AIC')
pvalue = test[1]
print(pvalue)
```
%% Cell type:markdown id:cb24ff00 tags:
The p-value is approximately 0.71, again much higher than 0.05, so we still cannot reject the hypothesis that the data is not stationary.
%% Cell type:markdown id:25f27bd7 tags:
The Box-Cox transformation failed to give us a stationary time series. This is expected: Box-Cox stabilizes the variance but does not remove the trend.
%% Cell type:markdown id:72295080 tags:
We difference the series (Y_t - Y_{t-1}, i.e. an integrated model of order 1) in order to obtain a stationary series.
%% Cell type:code id:a6851e0d tags:
``` python
# first-order differencing by hand: each value becomes Y_{t+1} - Y_t,
# and the last difference is duplicated so the series keeps its length
ts_stationary = ts.copy()
for j, i in enumerate(ts.index):
    if j < len(ts) - 1:
        ts_stationary.loc[i] = ts.V1.iloc[j + 1] - ts.V1.iloc[j]
    else:
        ts_stationary.loc[i] = ts.V1.iloc[j] - ts.V1.iloc[j - 1]
print(ts_stationary)
```
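%% Cell type:markdown id:0a1f2b3c tags:
Note: pandas can compute this differencing directly. A minimal equivalent sketch (not in the original notebook); `dropna()` removes the first value, which has no predecessor:
%% Cell type:code id:4d5e6f70 tags:
``` python
# idiomatic first-order differencing with pandas
ts_diff = ts.V1.diff().dropna()
print(ts_diff)
```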
%% Cell type:code id:c602973d tags:
``` python
plt.plot(ts_stationary)
```
%% Cell type:code id:fc7d373f tags:
``` python
from statsmodels.tsa.stattools import adfuller
test = adfuller(ts_stationary, autolag='AIC')
pvalue = test[1]
print(pvalue)
```
%% Cell type:markdown id:616ad4c8 tags:
The p-value is approximately 0.67, still much higher than 0.05, so we again cannot reject the hypothesis that the data is not stationary.
%% Cell type:markdown id:ce066bc0 tags:
The integrated model failed to give us a stationary time series.
%% Cell type:markdown id:bb89b4e9 tags:
We aggregate the data into 4-month sums in order to reduce the variability.
%% Cell type:code id:cd01d964 tags:
``` python
# aggregate the 156 monthly values into 39 sums over 4-month periods
ts_quartil = pd.DataFrame(columns=["V1"], index=pd.date_range(pd.Timestamp("2000-01-01"), periods=39, freq='4MS'))
j = 0
for i in ts_quartil.index:
    ts_quartil.loc[i] = ts.V1.iloc[j] + ts.V1.iloc[j + 1] + ts.V1.iloc[j + 2] + ts.V1.iloc[j + 3]
    j += 4
plt.plot(ts_quartil)
```
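%% Cell type:markdown id:1b2c3d4e tags:
An equivalent one-liner (a sketch, not in the original notebook) uses pandas' resampling, since the index frequency was set to month start earlier:
%% Cell type:code id:5f607182 tags:
``` python
# idiomatic alternative: resample the monthly series into 4-month sums
ts_4m = ts.V1.resample("4MS").sum()
plt.plot(ts_4m)
```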
%% Cell type:code id:82dc234e tags:
``` python
test = adfuller(ts_quartil, autolag='AIC')
pvalue = test[1]
print(pvalue)
```
%% Cell type:markdown id:0b482df4 tags:
The p-value is approximately 0.68, still much higher than 0.05, so we cannot reject the hypothesis that the data is not stationary.
%% Cell type:markdown id:1502bd9b tags:
This change also failed to give us a stationary time series.
%% Cell type:code id:eadb0062 tags:
``` python
# first-order differencing of the 4-month series, with the same
# convention as before (the last difference is duplicated)
ts_stationary = ts_quartil.copy()
length = len(ts_quartil)
for j, i in enumerate(ts_quartil.index):
    if j < length - 1:
        ts_stationary.loc[i] = ts_quartil.V1.iloc[j + 1] - ts_quartil.V1.iloc[j]
    else:
        ts_stationary.loc[i] = ts_quartil.V1.iloc[j] - ts_quartil.V1.iloc[j - 1]
plt.plot(ts_stationary)
```
%% Cell type:code id:366acc41 tags:
``` python
# note: the label "2007-01-01" is included by both slices, so that point
# appears in both the train and the test set
ts_stationary_train = ts_stationary[:"2007-01-01"]
ts_stationary_test = ts_stationary["2007-01-01":]
```
%% Cell type:code id:f2bd2d5b tags:
``` python
test = adfuller(ts_stationary_train, autolag='AIC')
pvalue = test[1]
print(pvalue)
test = adfuller(ts_stationary_test, autolag='AIC')
pvalue = test[1]
print(pvalue)
```
%% Cell type:markdown id:66abd1a8 tags:
If we split the data in two parts, the first part seems to be stationary: the p-value of the ADF test is 0.0, lower than 0.05, so we reject the null hypothesis of non-stationarity.
But the second part is not stationary, because its p-value is almost 1.
%% Cell type:code id:4b35ea92 tags:
``` python
print(type(ts_stationary_train))
plt.plot(ts_stationary_train)
```
%% Cell type:code id:70235c46 tags:
``` python
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(ts_stationary_train)
```
%% Cell type:code id:edce5bba tags:
``` python
# raises an error (presumably the series is too short for plot_pacf's default number of lags)
plot_pacf(ts_stationary_train)
```
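%% Cell type:markdown id:2c3d4e5f tags:
A sketch of a workaround (not in the original notebook): cast the series to float (the DataFrame was created with object dtype) and limit the number of lags to what such a short series allows.
%% Cell type:code id:60718293 tags:
``` python
# PACF with an explicit, small lag count suited to the short training series
series = ts_stationary_train.V1.astype(float)
plot_pacf(series, lags=8)
```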
%% Cell type:code id:e111eb2c tags:
``` python
# raised an error in the original run: the import below was missing, and
# ts_stationary_train has object dtype, which SARIMAX rejects
from statsmodels.tsa.statespace.sarimax import SARIMAX as sarimax
model = sarimax(ts_stationary_train, order=(1, 0, 0))
model_fit = model.fit()
```
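%% Cell type:markdown id:3d4e5f60 tags:
A sketch of a fix (not in the original notebook): the same float cast gives SARIMAX numeric data to work with.
%% Cell type:code id:718293a4 tags:
``` python
# same AR(1) model on the float-cast training series
model = sarimax(ts_stationary_train.V1.astype(float), order=(1, 0, 0))
model_fit = model.fit()
print(model_fit.summary())
```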
%% Cell type:code id:ad11fc71 tags:
``` python
```
%% Cell type:code id:b0fba332 tags:
``` python
```
%% Cell type:code id:b9b66fb6 tags:
``` python
```
%% Cell type:markdown id:b04d6b74 tags:
# ASO1 Problem
Authors: Remy Huet, Mathilde Rineau
Date: 24/10/2021
%% Cell type:markdown id:ce33a2fc tags:
Subject:
We have the monthly retail debit card usage in Iceland (million ISK) from January 2000 to December 2012.
We want to estimate the cumulative debit card usage over the first 4 months of 2013.
%% Cell type:code id:bd017ee6 tags:
```
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
```
%% Cell type:code id:9f8d9f85 tags:
```
# reading csv file
ts = pd.read_csv("data/debitcards.csv", index_col = 0,parse_dates=True)
print(ts)
```
%% Cell type:code id:a3686021 tags:
```
# verification on the data
assert(ts.shape == (156, 1))
assert(type(ts.index) is pd.core.indexes.datetimes.DatetimeIndex)
```
%% Cell type:code id:c5271112 tags:
```
# MS: month start frequency
ts.index.freq = "MS"
```
%% Cell type:code id:919229a4 tags:
```
plt.plot(ts.V1)
```
%% Cell type:markdown id:615221ca tags:
By plotting the data, we can see that the mean and the standard deviation do not seem to be constant, so the time series is probably not stationary.
We nevertheless perform an augmented Dickey-Fuller test to decide whether the time series is stationary.
%% Cell type:code id:61b00901 tags:
```
from statsmodels.tsa.stattools import adfuller
#perform augmented Dickey-Fuller test
test = adfuller(ts.V1, autolag='AIC')
pvalue = test[1]
print(pvalue)
```
%% Cell type:markdown id:a195677c tags:
The p-value is 0.79, far above 0.05, so we cannot reject the null hypothesis of non-stationarity: the data is very likely not stationary, as we expected.
%% Cell type:markdown id:3d4ddcfd tags:
By inspecting the data, we first see a trend (debit card usage increases over time).
We also see regular peaks.
We will "zoom" in on the data to see when those peaks happen.
%% Cell type:code id:05987f75 tags:
```
plt.rcParams['figure.figsize'] = [12, 5]
ts_zoom = ts['2000-01-01':'2003-01-01']
plt.plot(ts_zoom)
```
%% Cell type:markdown id:fdc7d106 tags:
We can see that the peaks seem to appear annually in December (which is quite logical).
We will thus presume a seasonality of 12 months on the data.
We thus have:
- A global increasing trend over time
- A seasonal effect with a period of twelve months
- A (maybe) stationary time series
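%% Cell type:markdown id:4e5f6071 tags:
As a quick visual check (a sketch, not in the original notebook), a classical seasonal decomposition separates exactly these three components:
%% Cell type:code id:8293a4b5 tags:
``` python
# decompose the series into trend, seasonal and residual parts;
# period=12 matches the presumed 12-month seasonality
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(ts.V1, model="additive", period=12)
_ = decomposition.plot()
```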
%% Cell type:markdown id:d7839e3a tags:
We will first bet on a constant increase over time.
We will thus use an integration of order 1 to reduce this effect.
Following [the documentation on SARIMAX](https://www.statsmodels.org/dev/examples/notebooks/generated/statespace_sarimax_stata.html#ARIMA-Example-2:-Arima-with-additive-seasonal-effects), we decided to try the following:
%% Cell type:code id:d2d59724 tags:
```
from statsmodels.tsa.statespace.sarimax import SARIMAX as sarimax
ar = 1 # max degree of the AR polynomial
ma = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1) # MA lag mask: terms at lags 1 and 12, i.e. an additive seasonal effect over twelve months
i = 1
model = sarimax(ts.V1, trend='c', order=(ar, i, ma))
res = model.fit()
plt.rcParams['figure.figsize'] = [10, 10]
_ = res.plot_diagnostics()
```
%% Cell type:markdown id:46a5c37e tags:
These diagnostics show that the residuals are not really normally distributed and that there is some correlation among them.
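%% Cell type:markdown id:5f60718a tags:
To quantify this residual correlation (a sketch, not in the original notebook), a Ljung-Box test can be run on the residuals; small p-values indicate autocorrelation:
%% Cell type:code id:93a4b5c6 tags:
``` python
# Ljung-Box test on the model residuals at lag 12
from statsmodels.stats.diagnostic import acorr_ljungbox

print(acorr_ljungbox(res.resid, lags=[12]))
```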
%% Cell type:markdown id:d9eed6c7 tags:
By re-inspecting our data, we see that the variance might not be constant.
To counter this effect, we will try to use the same ARIMA model on the log of the data.
%% Cell type:code id:7c3f26a5 tags:
```
# create real columns (attribute assignment like ts.log_V1 = ... would only set an attribute, not a column)
ts["log_V1"] = np.log(ts.V1)
ts["diff_log_V1"] = ts["log_V1"].diff()
# Graph data
fig, axes = plt.subplots(1, 2, figsize=(15,4))
# Levels
axes[0].plot(ts.index, ts.V1, '-')
axes[0].set(title='Original data')
# Log difference
axes[1].plot(ts.index, ts.diff_log_V1, '-')
axes[1].hlines(0, ts.index[0], ts.index[-1], 'r')
axes[1].set(title='Diff of the log of the data')
plt.show()
```
%% Cell type:markdown id:e48b1dd4 tags:
By using the diff of the log, we seem to retrieve something stationary with a clear seasonal effect.
We will try our previous model on the log of the data to see if the results are better.
%% Cell type:code id:cbc86671 tags:
```
model = sarimax(ts.log_V1, trend='c', order=(ar, i, ma))
res = model.fit()
plt.rcParams['figure.figsize'] = [10, 10]
_ = res.plot_diagnostics()
```
%% Cell type:markdown id:5845d2dd tags:
The residuals seem to be very close to a normal distribution (especially on the Q-Q plot), but we still see some correlation between them.
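%% Cell type:markdown id:6071829b tags:
The notebook stops here. As a sketch of the missing final step (not in the original, and ignoring the log-normal bias correction when transforming back from the log scale), the fitted model could answer the original question as follows:
%% Cell type:code id:a4b5c6d7 tags:
``` python
# forecast the first 4 months of 2013 on the log scale, transform back,
# and sum to estimate the cumulative usage (million ISK)
forecast_log = res.forecast(steps=4)
forecast = np.exp(forecast_log)
print("Estimated cumulative usage, Jan-Apr 2013:", forecast.sum())
```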