" selected_features = [X[i, j] for j in range(n_features) if j % 3 != 0]\n",
" y[i] = np.sum(selected_features)\n",
" \n",
" y = np.mean(X, axis=1)\n",
"\n",
" return X, y\n",
"# /!\\ THIS IS A THIRD TEST VERSION, COULD (AND WILL CERTAINLY) CHANGE\n",
"\n",
"\n",
"import random\n",
"def generate_data_2(n_samples, n_features):\n",
" X = []\n",
" y = np.ndarray((n_samples,))\n",
...
...
@@ -167,6 +164,15 @@
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These results show the instability of the Lasso regression :\n",
"- We see that the number of selected feature changes from one training to another\n",
"- We see that the regression model selects different features : if the first features are selected (almost) each time, the others may be selected or not (some of them are selected 1 time over two for example)."
]
},
{
"cell_type": "code",
"execution_count": null,
...
...
@@ -203,7 +209,7 @@
"source": [
"### Demonstrate stability of elastic net\n",
"\n",
"In a second time, we can do the same test with an elastic net regression model to highlight the difference between the two methods"
"In a second time, we can do the same test with an elastic net regression model to highlight the difference between the two methods."
]
},
{
...
...
@@ -214,7 +220,7 @@
"source": [
"# We use the same alpha as the lasso regression\n",
"# Assume we really want to select features, we give the priority to l1\n",
"With this data, wo observe that elastic net is stable in term of selected features, but the result on this sample is not satisfying : we always use 49 of the 50 features, even with a very high (0.99) importance given to the L1 factor. The bypassed feature is always the first one.\n",
"\n",
"It is **like** the elastic net « found » that each $X[i], i > 0$ were generated from $X[0]$ but did not « found » a link between the elements.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
...
...
@@ -278,15 +293,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Notes : **migh change if we found a better dataset**\n",
"### Conclusion\n",
"\n",
"- Instability of Lasso is proved\n",
"- Stability of elastic_net is OK for this sample.\n",
"With our correlated data, we were able to demonstrate the instability of the Lasso regression : the number of features and the features selected from one trainig to another can vary a lot.\n",
"\n",
"BUT :\n",
"- Feature selection w/ elastic net for this sample is not satisfying (we only remove the first one)\n",
"However, if we managed to show a form of stability for the elastic net regression, the restults are not satisfying : the regression dit not select much features among the 50 ones of the sample (one for the first method, 0 a lot of time for the second one).\n",
"\n",
"It is **like** the elastic net « found » that each $X[i], i > 0$ were generated from $X[0]$ but did not « found » a link between the elements.\n"
"We tried various changes in the method for the generation of the data and various params for the regularization without success."
Now that we have a way to generate highly correlated data, we will use a Lasso regression on it.
The aim of this part is to demonstrate the instability of the Lasso regression on this data.
For this purpose, we will use a loop to generate several datasets using **the same parameters** with our `generate_data` function.
We will demonstrate instability by counting the number of selected features each time and by recording which features are selected.
**Note:** the model fits the intercept by default (`fit_intercept=True`).
The parameter `normalize` is deprecated, so we use a pipeline to normalize the data before the regression, as suggested in the deprecation message. We set the `with_mean` parameter to `False` because we only scale the data and do not center it; we leave that work to the Lasso.
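A minimal sketch of this setup could look like the following (the `alpha` value here is illustrative, not necessarily the one used in the rest of the notebook):

%% Cell type:code id: tags:

```python
# Sketch of the model setup described above (alpha is illustrative)
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

alpha = 0.1  # illustrative regularization strength

# `normalize` is deprecated, so the scaling is done in a pipeline instead;
# with_mean=False: we only scale the data, the Lasso fits the intercept itself
# (fit_intercept=True by default)
lasso = Lasso(alpha=alpha)
model = make_pipeline(StandardScaler(with_mean=False), lasso)
```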
%% Cell type:code id: tags:

```python
plt.bar(range(n_features), selection_count, label='Number of times each feature was selected')
plt.legend()
plt.show()
```
%% Cell type:markdown id: tags:
These results show the instability of the Lasso regression:
- We see that the number of selected features changes from one training to another.
- We see that the regression model selects different features: while the first features are selected (almost) every time, the others may or may not be selected (some of them are selected about one time out of two, for example).
%% Cell type:code id: tags:
```python
```
%% Cell type:code id: tags:

```python
# Arrays to store results
n_selected = []
selection_count = np.zeros(n_features)
# We test our regression n_tests times
for i in range(n_tests):
    # Generate the data
    X, y = generate_data_2(n_samples, n_features)
    # Fit the model (pipeline) on the data
    model.fit(X, y)
    # Retrieve the selected features:
    # (lasso.coef_ != 0) gives an array of True / False; * 1 turns True into 1 and False into 0
    selected_features = (lasso.coef_ != 0) * 1
    # Record the number of selected features and which features were selected
    n_selected.append(np.sum(selected_features))
    selection_count += selected_features
plt.bar(range(n_features), selection_count, label='Number of times each feature was selected')
plt.legend()
plt.show()
```
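%% Cell type:markdown id: tags:

The `uniq` and `count` arrays used in the bar plots further below can be obtained, for instance, by counting how many trainings ended up with each number of selected features; a possible sketch (an assumption, since the corresponding cell is not shown here) based on the `n_selected` list filled in the loop above:

%% Cell type:code id: tags:

```python
import numpy as np

# For each observed number of selected features, count how many trainings produced it
uniq, count = np.unique(n_selected, return_counts=True)
print(dict(zip(uniq.tolist(), count.tolist())))
```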
%% Cell type:markdown id: tags:
With this data, we observe that the elastic net is stable in terms of selected features, but the result on this sample is not satisfying: we always use 49 of the 50 features, even with a very high (0.99) weight given to the L1 penalty. The bypassed feature is always the first one.
It is **as if** the elastic net « found » that each $X[i], i > 0$ was generated from $X[0]$ but did not « find » a link between the elements.
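As a reference, here is a minimal sketch of how the elastic net model described above could be configured, reusing the same pipeline idea as for the Lasso (the `alpha` value is illustrative; the loop producing the results plotted below is assumed to mirror the Lasso one):

%% Cell type:code id: tags:

```python
# Sketch of the elastic net setup described above (alpha is illustrative)
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same alpha as in the Lasso test; l1_ratio=0.99 gives almost all the weight
# to the L1 penalty, since the goal here is feature selection
enet = ElasticNet(alpha=0.1, l1_ratio=0.99)
model = make_pipeline(StandardScaler(with_mean=False), enet)
```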
%% Cell type:code id: tags:

```python
plt.bar(uniq, count, label='Number of selected features per training')
plt.legend()
plt.show()
plt.bar(range(n_features), selection_count, label='Number of times each feature was selected')
plt.legend()
plt.show()
```
%% Cell type:markdown id: tags:
### Conclusion

With our correlated data, we were able to demonstrate the instability of the Lasso regression: the number of selected features and which features are selected can vary a lot from one training to another.

However, while we managed to show a form of stability for the elastic net regression, the results are not satisfying: the elastic net did not really perform feature selection, discarding only one of the 50 features with the first data generation method and often none with the second one.

We tried various changes in the data generation method and various regularization parameters, without success.