Now that we have a way to generate highly correlated data, we will run a Lasso regression on it.
The aim of this part is to demonstrate the instability of the Lasso regression on such data.
For this purpose, we will use a loop to generate several datasets using **the same parameters** for our `generate_data` function.
We will demonstrate the instability by counting the number of selected features each time, and by recording which features are selected.

**Note:** the model fits an intercept by default (`fit_intercept=True`).
The `normalize` parameter is deprecated, so we use a pipeline to normalize the data before the regression, as suggested in the deprecation message. We set the `with_mean` parameter to `False` because we only scale the data and do not center it; we leave that work to the Lasso.
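For reference, here is a minimal sketch of the setup and of the selection-counting loop described above. The exact values of `alpha`, `n_tests`, `n_samples` and `n_features`, and the use of `make_pipeline`, are assumptions of this sketch, not reproduced from the original cells.

``` python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed experiment sizes (illustrative values only)
n_tests, n_samples, n_features = 100, 1000, 50

# Scale (but do not center) the data, then fit the Lasso
lasso = Lasso(alpha=0.1, fit_intercept=True)
model = make_pipeline(StandardScaler(with_mean=False), lasso)

n_selected = []                         # number of selected features per run
selection_count = np.zeros(n_features)  # how often each feature was selected

for _ in range(n_tests):
    # generate_data is the data-generation function defined earlier in the notebook
    X, y = generate_data(n_samples, n_features)
    model.fit(X, y)
    selected = (lasso.coef_ != 0).astype(int)   # 1 if the coefficient is non-zero
    n_selected.append(selected.sum())
    selection_count += selected

# Distribution of the number of selected features over the runs
uniq, count = np.unique(n_selected, return_counts=True)
```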
%% Cell type:code id: tags:
``` python
plt.bar(uniq, count, label='Number of selected features per training')
plt.legend()
plt.show()
# Using selection_count, we can display the number of times each feature was selected
plt.bar(range(n_features), selection_count, label='Number of times each feature was selected')
plt.legend()
plt.show()
```
%% Cell type:markdown id: tags:
These results show the instability of the Lasso regression:
- The number of selected features changes from one training run to another.
- The model selects different features: while the first features are selected (almost) every time, the others may or may not be selected (some of them are selected about one time out of two, for example).
%% Cell type:code id: tags:
``` python
# Arrays to store the results
n_selected = []
selection_count = np.zeros(n_features)
# We test our regression n_tests times
for i in range(n_tests):
    # Generate the data
    X, y = generate_data_2(n_samples, n_features)
    # Fit the model (pipeline) on the data
    model.fit(X, y)
    # We can now retrieve the selected features:
    # (lasso.coef_ != 0) gives an array of True/False; * 1 turns True into 1 and False into 0
    selected_features = (lasso.coef_ != 0) * 1
    # Record how many features were selected and which ones
    n_selected.append(selected_features.sum())
    selection_count += selected_features

plt.bar(range(n_features), selection_count, label='Number of times each feature was selected')
plt.legend()
plt.show()
```
%% Cell type:markdown id: tags:
With this data, we observe that the elastic net is stable in terms of selected features, but the result on this sample is not satisfying: it always uses 49 of the 50 features, even with a very high weight (0.99) given to the L1 term. The discarded feature is always the first one.
It is **as if** the elastic net "found" that each $X[i]$, $i > 0$, was generated from $X[0]$, but did not "find" any link between the other elements.
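For reference, here is a minimal sketch of such an elastic net model (the value of `alpha`, the variable names and the exact pipeline are assumptions of this sketch, not shown in this excerpt):

``` python
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# l1_ratio=0.99 puts almost all of the penalty weight on the L1 term,
# so the model is very close to a pure Lasso
enet = ElasticNet(alpha=0.1, l1_ratio=0.99, fit_intercept=True)
enet_model = make_pipeline(StandardScaler(with_mean=False), enet)
```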
%% Cell type:code id: tags:
``` python
plt.bar(uniq, count, label='Number of selected features per training')
plt.legend()
plt.show()
plt.bar(range(n_features), selection_count, label='Number of times each feature was selected')
plt.legend()
plt.show()
```
%% Cell type:markdown id: tags:
### Conclusion
With our correlated data, we were able to demonstrate the instability of the Lasso regression: both the number of selected features and which features are selected can vary a lot from one training run to another.
However, while we managed to show a form of stability for the elastic net regression, the results are not satisfying: the regression did not perform much feature selection among the 50 features of the sample (it discarded only one for the first method, and very often none for the second one).
We tried various changes to the data-generation method and various regularization parameters, without success.