Commit e48ac5ce authored by Rémy Huet 💻

Add explanations

parent b674df38
@@ -30,6 +30,7 @@
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import random\n",
"\n",
"from sklearn.linear_model import Lasso, ElasticNet\n",
"from sklearn.pipeline import make_pipeline\n",
@@ -51,13 +52,13 @@
"- m, s the parameters of the normal law used for the generation of the first feature\n",
"and the outputs X and y\n",
"\n",
"For this purpose, we will proceed in X steps :\n",
"For this purpose, we will proceed in 3 steps :\n",
"\n",
"- First, we will generate the first dimension of X randomly from a normal law (m, s)\n",
"- For the other dimensions of X, noted i, the value will be calculated as follow :\n",
" - We generate a number from a normal law N(i, 1)\n",
" - We generate a number from a normal law N(i / 2, 1)\n",
" - We add it to the value of the first column\n",
"- For Y, the value is calculated as the mean of the values we generated for X"
"- For Y, we select 2 over 3 values of X and we sum them"
]
},
{
@@ -66,7 +67,6 @@
"metadata": {},
"outputs": [],
"source": [
"# /!\\ THIS IS A THIRD TEST VERSION, COULD (AND WILL CERTAINLY) CHANGE\n",
"def generate_data(n_samples, n_features, m, s):\n",
" X = np.ndarray((n_samples, n_features))\n",
" y = np.ndarray((n_samples,))\n",
@@ -75,14 +75,11 @@
" X[i, 0] = np.random.normal(m, s)\n",
" for j in range(1, n_features):\n",
" X[i, j] = X[i, 0] + np.random.normal(i / 2, 1)\n",
" selected_features = [X[i, j] for j in range(n_features) if j % 3 != 0]\n",
" y[i] = np.sum(selected_features)\n",
" \n",
" y = np.mean(X, axis=1)\n",
"\n",
" return X, y\n",
"# /!\\ THIS IS A THIRD TEST VERSION, COULD (AND WILL CERTAINLY) CHANGE\n",
"\n",
"\n",
"import random\n",
"def generate_data_2(n_samples, n_features):\n",
" X = []\n",
" y = np.ndarray((n_samples,))\n",
@@ -167,6 +164,15 @@
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These results show the instability of the Lasso regression :\n",
"- We see that the number of selected feature changes from one training to another\n",
"- We see that the regression model selects different features : if the first features are selected (almost) each time, the others may be selected or not (some of them are selected 1 time over two for example)."
]
},
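{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of how this instability could be quantified (not part of the original experiment), we can refit the Lasso on freshly generated data several times and count how often each feature receives a nonzero coefficient. The `alpha` and the arguments passed to `generate_data` below are illustrative assumptions, not necessarily the values used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Hypothetical stability check: refit the Lasso on fresh data several times\n",
"# and count how often each feature gets a nonzero coefficient\n",
"n_runs, n_feat = 10, 50\n",
"selection_counts = np.zeros(n_feat)\n",
"\n",
"for _ in range(n_runs):\n",
"    X_run, y_run = generate_data(1000, n_feat, 10, 2)  # illustrative m, s\n",
"    lasso_run = Lasso(alpha=3.0, max_iter=10000)  # alpha is an assumption\n",
"    make_pipeline(StandardScaler(), lasso_run).fit(X_run, y_run)\n",
"    selection_counts += (lasso_run.coef_ != 0)\n",
"\n",
"# A perfectly stable selector would only give frequencies of 0 or 1\n",
"print(selection_counts / n_runs)"
]
},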
{
"cell_type": "code",
"execution_count": null,
@@ -203,7 +209,7 @@
"source": [
"### Demonstrate stability of elastic net\n",
"\n",
"In a second time, we can do the same test with an elastic net regression model to highlight the difference between the two methods"
"In a second time, we can do the same test with an elastic net regression model to highlight the difference between the two methods."
]
},
{
@@ -214,7 +220,7 @@
"source": [
"# We use the same alpha as the lasso regression\n",
"# Assume we really want to select features, we give the priority to l1\n",
"elastic_net = ElasticNet(alpha=3.0, l1_ratio=0.9, fit_intercept=True, max_iter=10000)\n",
"elastic_net = ElasticNet(alpha=3.0, l1_ratio=0.99, fit_intercept=True, max_iter=10000)\n",
"\n",
"model = make_pipeline(standard_scaler, elastic_net)\n",
"\n",
@@ -242,6 +248,15 @@
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this data, wo observe that elastic net is stable in term of selected features, but the result on this sample is not satisfying : we always use 49 of the 50 features, even with a very high (0.99) importance given to the L1 factor. The bypassed feature is always the first one.\n",
"\n",
"It is **like** the elastic net « found » that each $X[i], i > 0$ were generated from $X[0]$ but did not « found » a link between the elements.\n"
]
},
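{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick illustrative check (with assumed arguments for `generate_data`, not the original ones) makes this behaviour plausible: every column is a noisy copy of the first one, so all features are strongly correlated with $X[0]$, and the L2 part of the penalty tends to spread the weights across correlated features instead of picking a few of them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative check: correlation of each feature with the first one.\n",
"# Since X[i, j] = X[i, 0] + noise, all columns should correlate strongly\n",
"# with column 0, which encourages the L2 term to keep all of them.\n",
"X_check, _ = generate_data(1000, 50, 10, 2)  # illustrative m, s\n",
"corr_with_first = np.corrcoef(X_check, rowvar=False)[0, 1:]\n",
"print(corr_with_first.min(), corr_with_first.max())"
]
},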
{
"cell_type": "code",
"execution_count": null,
@@ -278,15 +293,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Notes : **migh change if we found a better dataset**\n",
"### Conclusion\n",
"\n",
"- Instability of Lasso is proved\n",
"- Stability of elastic_net is OK for this sample.\n",
"With our correlated data, we were able to demonstrate the instability of the Lasso regression : the number of features and the features selected from one trainig to another can vary a lot.\n",
"\n",
"BUT :\n",
"- Feature selection w/ elastic net for this sample is not satisfying (we only remove the first one)\n",
"However, if we managed to show a form of stability for the elastic net regression, the restults are not satisfying : the regression dit not select much features among the 50 ones of the sample (one for the first method, 0 a lot of time for the second one).\n",
"\n",
"It is **like** the elastic net « found » that each $X[i], i > 0$ were generated from $X[0]$ but did not « found » a link between the elements.\n"
"We tried various changes in the method for the generation of the data and various params for the regularization without success."
]
}
],
@@ -295,8 +308,7 @@
"hash": "3abb0a1ef4892304d86bb3a3dfd052bcca35057beadba016173999c775e8d3ba"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"display_name": "Python 3.9.7 64-bit ('AOS1-QteoCFsS': pipenv)",
"name": "python3"
},
"language_info": {
......