Commit 76fd4c8b authored by Rémy Huet's avatar Rémy Huet 💻

It's coming along like crazy

parent ed9386e5
......@@ -29,8 +29,11 @@
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from sklearn.linear_model import Lasso, ElasticNet"
"from sklearn.linear_model import Lasso, ElasticNet\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler"
]
},
{
......@@ -51,8 +54,8 @@
"For this purpose, we will proceed in three steps:\n",
"\n",
"- First, we will generate the first dimension of X randomly from a normal distribution N(m, s)\n",
"- For the other dimensions of X, the value will be calculated as follow :\n",
" - We generate a number from a normal law N(0, 1)\n",
"- For the other dimensions of X, denoted i, the value will be calculated as follows:\n",
" - We generate a number from a normal distribution N(i, 1)\n",
" - We add it to the value of the first column\n",
"- For Y, the value is calculated as the mean of the values we generated for X"
]
......@@ -63,7 +66,7 @@
"metadata": {},
"outputs": [],
"source": [
"# /!\\ THIS IS A FIRST TEST VERSION, COULD (AND WILL CERTAINLY) CHANGE\n",
"# /!\\ THIS IS A THIRD TEST VERSION, COULD (AND WILL CERTAINLY) CHANGE\n",
"def generate_data(n_samples, n_features, m, s):\n",
" X = np.ndarray((n_samples, n_features))\n",
" y = np.ndarray((n_samples,))\n",
......@@ -71,7 +74,7 @@
" for i in range(n_samples):\n",
" X[i, 0] = np.random.normal(m, s)\n",
" for j in range(1, n_features):\n",
" X[i, j] = X[i, 0] + np.random.normal(1, 0)\n",
" X[i, j] = X[i, 0] + np.random.normal(j, 1)\n",
" \n",
" y = np.mean(X, axis=1)\n",
"\n",
......@@ -82,7 +85,16 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Demonstrate instability of Lasso"
"### Demonstrate instability of Lasso\n",
"\n",
"Now that we have a way to generate highly correlated data, we will fit a Lasso regression to it.\n",
"The aim of this part is to demonstrate the instability of the Lasso regression on this data.\n",
"For this purpose, we will use a loop to generate several datasets using **the same parameters** for our `generate_data` function.\n",
"\n",
"We will demonstrate instability by counting the number of selected features each time and recording which features are selected.\n",
"\n",
"**Note:** the model fits an intercept by default (`fit_intercept=True`).\n",
"The parameter `normalize` is deprecated, so we use a pipeline to normalize the data before the regression, as suggested in the deprecation message. We set the `with_mean` parameter to `False` because we just scale the data and do not center it; we leave that work to the Lasso."
]
},
{
......@@ -91,21 +103,45 @@
"metadata": {},
"outputs": [],
"source": [
"# TODO\n",
"\n",
"########## TMP TESTS ##########\n",
"X, y = generate_data(1000, 50, 30, 3)\n",
"model = Lasso(alpha=1.0)\n",
"model.fit(X, y)\n",
"model.coef_\n",
"###############################"
"# Params for data generation:\n",
"n_samples = 500\n",
"n_features = 50\n",
"m = 30\n",
"s = 3\n",
"\n",
"# Number of tests\n",
"n_tests = 100\n",
"\n",
"standard_scaler = StandardScaler(with_mean=False)\n",
"lasso = Lasso(alpha=1.0, fit_intercept=True, max_iter=5000)\n",
"\n",
"model = make_pipeline(standard_scaler, lasso)\n",
"\n",
"# Arrays to store results\n",
"n_selected = []\n",
"\n",
"# We test our regression n_tests times\n",
"for i in range(n_tests):\n",
" # Generate the data\n",
" X, y = generate_data(n_samples, n_features, m, s)\n",
" # Fit the model (pipeline with the data)\n",
" model.fit(X, y)\n",
" # We can now retrieve the selected features:\n",
" selected_features = lasso.coef_ != 0\n",
" n_selected.append(np.count_nonzero(selected_features))\n",
"\n",
"uniq, count = np.unique(n_selected, return_counts=True)\n",
"plt.bar(uniq, count)\n",
"plt.title('Number of selected features per training')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Demonstrate stability of elastic net"
"### Demonstrate stability of elastic net\n",
"\n",
"Next, we can run the same test with an elastic net regression model to highlight the difference between the two methods."
]
},
{
......@@ -114,7 +150,49 @@
"metadata": {},
"outputs": [],
"source": [
"# TODO"
"# We use the same alpha as the lasso regression\n",
"# Assuming we really want to select features, we give priority to the l1 penalty\n",
"elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.8, fit_intercept=True, max_iter=10000)\n",
"\n",
"model = make_pipeline(standard_scaler, elastic_net)\n",
"\n",
"# Arrays to store results\n",
"n_selected = []\n",
"zero_removed = 0\n",
"\n",
"# We test our regression n_tests times\n",
"for i in range(n_tests):\n",
" # Generate the data\n",
" X, y = generate_data(n_samples, n_features, m, s)\n",
" # Fit the model (pipeline with the data)\n",
" model.fit(X, y)\n",
" # We can now retrieve the selected features:\n",
" selected_features = elastic_net.coef_ != 0\n",
" n_selected.append(np.count_nonzero(selected_features))\n",
"\n",
" # Quickly check that the first feature X[0] is always removed\n",
" if not selected_features[0]:\n",
" zero_removed += 1\n",
"\n",
"\n",
"uniq, count = np.unique(n_selected, return_counts=True)\n",
"print(f'Features selected: {uniq}, counts: {count}')\n",
"print(f'Number of times the first feature was ignored: {zero_removed}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notes: **might change if we find a better dataset**\n",
"\n",
"- Instability of the Lasso is demonstrated\n",
"- Stability of the elastic net is OK for this sample.\n",
"\n",
"BUT:\n",
"- Feature selection with the elastic net for this sample is not satisfactory (we only remove the first feature)\n",
"\n",
"It is **as if** the elastic net \"found\" that each $X[i], i > 0$ was generated from $X[0]$, but did not \"find\" the link between the elements.\n"
]
}
],
......
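For reference, the loop-based generator in this commit can be sketched in vectorized NumPy form. This is an illustrative rewrite following the markdown description (each extra dimension j adds noise from N(j, 1) to the first column), not the committed code verbatim; the function name and `seed` parameter are hypothetical.

```python
import numpy as np

def generate_correlated_data(n_samples, n_features, m, s, seed=None):
    """Sketch of the notebook's generator: column 0 ~ N(m, s), each other
    column j equals column 0 plus noise from N(j, 1); y is the row mean."""
    rng = np.random.default_rng(seed)
    X = np.empty((n_samples, n_features))
    # First dimension drawn from N(m, s)
    X[:, 0] = rng.normal(m, s, size=n_samples)
    # Every other dimension is the first column plus N(j, 1) noise
    for j in range(1, n_features):
        X[:, j] = X[:, 0] + rng.normal(j, 1, size=n_samples)
    # The target is the mean of the generated features
    y = X.mean(axis=1)
    return X, y

X, y = generate_correlated_data(500, 50, 30, 3, seed=0)
# Columns share the N(m, s) component, so they are strongly correlated
print(X.shape, float(np.corrcoef(X[:, 0], X[:, 1])[0, 1]))
```

With s = 3 and unit noise, the expected correlation between columns is about 3/√10 ≈ 0.95, which is the multicollinearity the Lasso-instability experiment relies on.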