### It's coming along like crazy

parent ed9386e5
... ...
@@ -29,8 +29,11 @@
    "outputs": [],
    "source": [
     "import numpy as np\n",
     "import matplotlib.pyplot as plt\n",
     "\n",
-    "from sklearn.linear_model import Lasso, ElasticNet"
+    "from sklearn.linear_model import Lasso, ElasticNet\n",
+    "from sklearn.pipeline import make_pipeline\n",
+    "from sklearn.preprocessing import StandardScaler"
    ]
   },
   {
... ...
@@ -51,8 +54,8 @@
    "For this purpose, we will proceed in X steps :\n",
    "\n",
    "- First, we will generate the first dimension of X randomly from a normal law (m, s)\n",
-    "- For the other dimensions of X, the value will be calculated as follow :\n",
-    "  - We generate a number from a normal law N(0, 1)\n",
+    "- For the other dimensions of X, denoted i, the value will be calculated as follows:\n",
+    "  - We generate a number from a normal law N(i, 1)\n",
     "  - We add it to the value of the first column\n",
    "- For Y, the value is calculated as the mean of the values we generated for X"
    ]
... ...
@@ -63,7 +66,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# /!\ THIS IS A FIRST TEST VERSION, COULD (AND WILL CERTAINLY) CHANGE\n",
+    "# /!\ THIS IS A THIRD TEST VERSION, COULD (AND WILL CERTAINLY) CHANGE\n",
     "def generate_data(n_samples, n_features, m, s):\n",
     "    X = np.ndarray((n_samples, n_features))\n",
     "    y = np.ndarray((n_samples,))\n",
... ...
@@ -71,7 +74,7 @@
     "    for i in range(n_samples):\n",
     "        X[i, 0] = np.random.normal(m, s)\n",
     "        for j in range(1, n_features):\n",
-    "            X[i, j] = X[i, 0] + np.random.normal(1, 0)\n",
+    "            X[i, j] = X[i, 0] + np.random.normal(i / 2, 1)\n",
     "    \n",
     "    y = np.mean(X, axis=1)\n",
     "\n",
... ...
@@ -82,7 +85,16 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Demonstrate instability of Lasso"
+    "### Demonstrate instability of Lasso\n",
+    "\n",
+    "Now that we have a way to generate highly correlated data, we will run a lasso regression on it.\n",
+    "The aim of this part is to demonstrate the instability of the Lasso regression on this data.\n",
+    "For this purpose, we will use a loop to generate several datasets with **the same params** for our `generate_data` function.\n",
+    "\n",
+    "We will demonstrate instability by counting the number of selected features each time, and register which features are selected.\n",
+    "\n",
+    "**Note:** the model fits the intercept by default (`fit_intercept=True`).\n",
+    "The `normalize` parameter is deprecated, so we use a pipeline to normalize the data before the regression, as suggested in the deprecation message. We set the `with_mean` parameter to `False` because we just scale the data and do not center it; we leave that work to the Lasso."
    ]
   },
   {
... ...
@@ -91,21 +103,45 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# TODO\n",
-    "\n",
-    "########## TMP TESTS ##########\n",
-    "X, y = generate_data(1000, 50, 30, 3)\n",
-    "model = Lasso(alpha=1.0)\n",
-    "model.fit(X, y)\n",
-    "model.coef_\n",
-    "###############################"
+    "# Params for data generation:\n",
+    "n_samples = 500\n",
+    "n_features = 50\n",
+    "m = 30\n",
+    "s = 3\n",
+    "\n",
+    "# Number of tests\n",
+    "n_tests = 100\n",
+    "\n",
+    "standard_scaler = StandardScaler(with_mean=False)\n",
+    "lasso = Lasso(alpha=1.0, fit_intercept=True, max_iter=5000)\n",
+    "\n",
+    "model = make_pipeline(standard_scaler, lasso)\n",
+    "\n",
+    "# Arrays to store results\n",
+    "n_selected = []\n",
+    "\n",
+    "# We run our regression n_tests times\n",
+    "for i in range(n_tests):\n",
+    "    # Generate the data\n",
+    "    X, y = generate_data(n_samples, n_features, m, s)\n",
+    "    # Fit the model (pipeline) with the data\n",
+    "    model.fit(X, y)\n",
+    "    # We can now retrieve the selected features:\n",
+    "    selected_features = lasso.coef_ != 0\n",
+    "    n_selected.append(np.count_nonzero(selected_features))\n",
+    "\n",
+    "uniq, count = np.unique(n_selected, return_counts=True)\n",
+    "plt.bar(uniq, count, label='Number of selected features per training')\n",
+    "plt.show()"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Demonstrate stability of elastic net"
+    "### Demonstrate stability of elastic net\n",
+    "\n",
+    "Next, we run the same test with an elastic net regression model to highlight the difference between the two methods."
    ]
   },
   {
... ...
@@ -114,7 +150,49 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# TODO"
+    "# We use the same alpha as the lasso regression\n",
+    "# Assuming we really want to select features, we give priority to l1\n",
+    "elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.8, fit_intercept=True, max_iter=10000)\n",
+    "\n",
+    "model = make_pipeline(standard_scaler, elastic_net)\n",
+    "\n",
+    "# Arrays to store results\n",
+    "n_selected = []\n",
+    "zero_removed = 0\n",
+    "\n",
+    "# We run our regression n_tests times\n",
+    "for i in range(n_tests):\n",
+    "    # Generate the data\n",
+    "    X, y = generate_data(n_samples, n_features, m, s)\n",
+    "    # Fit the model (pipeline) with the data\n",
+    "    model.fit(X, y)\n",
+    "    # We can now retrieve the selected features:\n",
+    "    selected_features = elastic_net.coef_ != 0\n",
+    "    n_selected.append(np.count_nonzero(selected_features))\n",
+    "\n",
+    "    # Quickly check whether the first feature was removed\n",
+    "    if not selected_features[0]:\n",
+    "        zero_removed += 1\n",
+    "\n",
+    "\n",
+    "uniq, count = np.unique(n_selected, return_counts=True)\n",
+    "print(f'Features selected: {uniq}, count: {count}')\n",
+    "print(f'Number of times the first feature was ignored: {zero_removed}')"
    ]
-  }
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Notes: **might change if we find a better dataset**\n",
+    "\n",
+    "- Instability of Lasso is demonstrated\n",
+    "- Stability of elastic_net is OK for this sample\n",
+    "\n",
+    "BUT:\n",
+    "- Feature selection with elastic net for this sample is not satisfying (we only remove the first feature)\n",
+    "\n",
+    "It is **as if** the elastic net « found » that each $X[i], i > 0$ was generated from the first feature but did not « find » a link between the elements.\n"
+   ]
+  }
 ],
... ...
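For reference, here is a minimal sketch of the data generator described in the markdown cell of the diff, written to draw whole columns at once. It follows the written description (the offset for feature i is drawn from N(i, 1)), which differs slightly from the committed loop (where the offset depends on the sample index); the name `generate_data_vectorized`, the `seed` parameter, and the use of `numpy.random.default_rng` are illustrative assumptions, not part of the commit.

```python
import numpy as np

def generate_data_vectorized(n_samples, n_features, m, s, seed=None):
    """Sketch: feature 0 ~ N(m, s); feature i (i > 0) = feature 0 + N(i, 1); y = row mean."""
    rng = np.random.default_rng(seed)  # hypothetical seeding, for reproducibility
    X = np.empty((n_samples, n_features))
    # First feature drawn from a normal law N(m, s)
    X[:, 0] = rng.normal(m, s, size=n_samples)
    for i in range(1, n_features):
        # Every other feature is the first one plus N(i, 1) noise, hence highly correlated
        X[:, i] = X[:, 0] + rng.normal(i, 1, size=n_samples)
    # The target is the mean of the features
    y = X.mean(axis=1)
    return X, y

# Example: X, y = generate_data_vectorized(500, 50, 30, 3, seed=0)
```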
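Beyond counting how many coefficients survive each fit, one way to make the stability comparison more quantitative is the mean pairwise Jaccard similarity between the sets of selected features across runs. The helper below is only an illustrative sketch: `selection_stability` is a hypothetical name, and it assumes `generate_data` and the parameters `n_samples`, `n_features`, `m`, `s` are defined as in the notebook cells shown in the diff.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def selection_stability(estimator, n_runs=20):
    """Mean pairwise Jaccard similarity of the selected-feature sets across runs."""
    masks = []
    for _ in range(n_runs):
        # generate_data and the params are assumed to be defined as in the notebook cells
        X, y = generate_data(n_samples, n_features, m, s)
        model = make_pipeline(StandardScaler(with_mean=False), estimator)
        model.fit(X, y)
        masks.append(model[-1].coef_ != 0)  # boolean mask of selected features
    scores = []
    for a, b in combinations(masks, 2):
        union = np.count_nonzero(a | b)
        scores.append(np.count_nonzero(a & b) / union if union else 1.0)
    return float(np.mean(scores))

# A score close to 1 means the selected set barely changes between runs:
# print(selection_stability(Lasso(alpha=1.0, max_iter=5000)))
# print(selection_stability(ElasticNet(alpha=1.0, l1_ratio=0.8, max_iter=10000)))
```

A noticeably higher score for the elastic net than for the Lasso would be consistent with the observations in the final markdown cell of the diff.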