### Add explanations

parent b674df38
```diff
@@ -30,6 +30,7 @@
 import numpy as np
 import matplotlib.pyplot as plt
+import random

 from sklearn.linear_model import Lasso, ElasticNet
 from sklearn.pipeline import make_pipeline
```

```diff
@@ -51,13 +52,13 @@
 - m, s the parameters of the normal law used for the generation of the first feature,
 and the outputs X and y

-For this purpose, we will proceed in X steps:
+For this purpose, we will proceed in 3 steps:

 - First, we generate the first dimension of X randomly from a normal law N(m, s)
 - For the other dimensions of X, noted i, the value is calculated as follows:
-  - We generate a number from a normal law N(i, 1)
+  - We generate a number from a normal law N(i / 2, 1)
   - We add it to the value of the first column
-- For Y, the value is calculated as the mean of the values we generated for X
+- For Y, we select 2 out of 3 values of X and sum them
```

```diff
@@ -66,7 +67,6 @@
-# /!\ THIS IS A THIRD TEST VERSION, COULD (AND WILL CERTAINLY) CHANGE
 def generate_data(n_samples, n_features, m, s):
     X = np.ndarray((n_samples, n_features))
     y = np.ndarray((n_samples,))
```

```diff
@@ -75,14 +75,11 @@
         X[i, 0] = np.random.normal(m, s)
         for j in range(1, n_features):
             X[i, j] = X[i, 0] + np.random.normal(i / 2, 1)
+        selected_features = [X[i, j] for j in range(n_features) if j % 3 != 0]
+        y[i] = np.sum(selected_features)
-    y = np.mean(X, axis=1)

     return X, y
-# /!\ THIS IS A THIRD TEST VERSION, COULD (AND WILL CERTAINLY) CHANGE

-import random
 def generate_data_2(n_samples, n_features):
     X = []
     y = np.ndarray((n_samples,))
```
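For reference, the updated `generate_data` from the hunks above can be assembled into a self-contained sketch. This is a reconstruction, not the notebook's exact cell: the imports and the illustrative call at the bottom (sizes `100, 50` and parameters `0, 1`) are assumptions, and the noise mean uses the sample index `i` exactly as in the committed code, even though the markdown cell describes it per feature.

```python
import numpy as np

def generate_data(n_samples, n_features, m, s):
    # Column 0 of each sample is drawn from N(m, s); every other column
    # adds noise drawn from N(i / 2, 1) to that first value (i is the
    # sample index here, as in the committed code).
    X = np.ndarray((n_samples, n_features))
    y = np.ndarray((n_samples,))
    for i in range(n_samples):
        X[i, 0] = np.random.normal(m, s)
        for j in range(1, n_features):
            X[i, j] = X[i, 0] + np.random.normal(i / 2, 1)
        # y is the sum of 2 out of 3 columns of X
        # (those whose index is not a multiple of 3)
        selected_features = [X[i, j] for j in range(n_features) if j % 3 != 0]
        y[i] = np.sum(selected_features)
    return X, y

X, y = generate_data(100, 50, 0, 1)  # illustrative sizes and parameters
print(X.shape, y.shape)  # (100, 50) (100,)
```

By construction every column is the first column plus noise, so the 50 features are strongly correlated, which is exactly the regime where Lasso's feature selection is known to be unstable.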
```diff
@@ -167,6 +164,15 @@
 plt.show()
+These results show the instability of the Lasso regression:
+- The number of selected features changes from one training to another
+- The model selects different features: while the first features are selected (almost) every time, the others may or may not be, some of them being selected about one time out of two.
```

```diff
@@ -203,7 +209,7 @@
 ### Demonstrate stability of elastic net

-In a second time, we can do the same test with an elastic net regression model to highlight the difference between the two methods
+As a second step, we can run the same test with an elastic net regression model to highlight the difference between the two methods.
```

```diff
@@ -214,7 +220,7 @@
 # We use the same alpha as the lasso regression
 # Assuming we really want to select features, we give priority to l1
-elastic_net = ElasticNet(alpha=3.0, l1_ratio=0.9, fit_intercept=True, max_iter=10000)
+elastic_net = ElasticNet(alpha=3.0, l1_ratio=0.99, fit_intercept=True, max_iter=10000)

 model = make_pipeline(standard_scaler, elastic_net)
```

```diff
@@ -242,6 +248,15 @@
 plt.show()
+With this data, we observe that the elastic net is stable in terms of selected features, but the result on this sample is not satisfying: we always use 49 of the 50 features, even with a very high weight (0.99) given to the L1 term. The bypassed feature is always the first one.
+
+It is **as if** the elastic net "found" that each $X[i], i > 0$ was generated from $X$ but did not "find" a link between the elements.
```
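The stability comparison these hunks describe can be sketched end to end. This is a hypothetical reconstruction under stated assumptions, not the notebook's code: data is generated as in the commit, and stability is probed by refitting each model on random subsamples of the data (the subsample size `80` and the number of runs are illustrative choices).

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_features = 100, 50

# Correlated data as in the commit: every column is column 0 plus noise,
# y sums the columns whose index is not a multiple of 3.
X0 = rng.normal(0, 1, n_samples)
X = np.column_stack(
    [X0] + [X0 + rng.normal(j / 2, 1, n_samples) for j in range(1, n_features)]
)
y = X[:, [j for j in range(n_features) if j % 3 != 0]].sum(axis=1)

def count_selected(estimator, X, y, n_runs=5, subsample=80):
    # Refit on random subsamples and count nonzero coefficients each time:
    # a stable selector should report (roughly) the same count every run.
    counts = []
    for _ in range(n_runs):
        idx = rng.permutation(len(X))[:subsample]
        model = make_pipeline(StandardScaler(), estimator)
        model.fit(X[idx], y[idx])
        counts.append(int(np.sum(estimator.coef_ != 0)))
    return counts

lasso_counts = count_selected(Lasso(alpha=3.0, max_iter=10000), X, y)
enet_counts = count_selected(
    ElasticNet(alpha=3.0, l1_ratio=0.99, max_iter=10000), X, y
)
print("lasso selected per run:      ", lasso_counts)
print("elastic net selected per run:", enet_counts)
```

With strongly correlated columns, Lasso tends to pick an arbitrary representative of each correlated group, so the counts (and the identities) of the selected features fluctuate between runs; the L2 part of the elastic net spreads weight across the group, which stabilizes the selection at the cost of keeping almost all features.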
```diff
@@ -278,15 +293,13 @@
-Notes: **might change if we find a better dataset**
+### Conclusion

-- Instability of Lasso is proved
-- Stability of elastic_net is OK for this sample.
+With our correlated data, we were able to demonstrate the instability of the Lasso regression: the number of features and the set of features selected can vary a lot from one training to another.

-BUT:
-- Feature selection with elastic net for this sample is not satisfying (we only remove the first one)
+However, while we managed to show a form of stability for the elastic net regression, the results are not satisfying: the regression did not drop many of the 50 features in the sample (one for the first method, often zero for the second one).

-It is **as if** the elastic net "found" that each $X[i], i > 0$ was generated from $X$ but did not "find" a link between the elements.
+We tried various changes in the data generation method and various regularization parameters, without success.
```

```diff
@@ -295,8 +308,7 @@
 "kernelspec": {
-    "display_name": "Python 3",
-    "language": "python",
+    "display_name": "Python 3.9.7 64-bit ('AOS1-QteoCFsS': pipenv)",
     "name": "python3"
 },
```