"First, we will generate highly correlated data, containing a sample X (multidim) and a target y (one dim).\n",
"\n",
"#### Data generation first function: generate_data\n",
"We write a function for this.\n",
"Its parameters are :\n",
"- n_samples the number of samples\n",
...
...
@@ -58,7 +58,25 @@
"- For the other dimensions of X, noted i, the value will be calculated as follow :\n",
" - We generate a number from a normal law N(i / 2, 1)\n",
" - We add it to the value of the first column\n",
"- For Y, we select 2 over 3 values of X and we sum them"
"- For Y, we select 2 over 3 values of X and we sum them\n",
"\n",
"#### Data generation second function: generate_data_2\n",
"We have written a second function, which generate another highly correlated data set in order to compare our results.\n",
"Its parameters are\n",
"- n_samples the number of samples \n",
"- n_features the number of features in X\n",
"and the outputs X and y\n",
"\n",
"For this purpose, we proceed in 4 steps:\n",
"\n",
"- we generate samples of a geometric law of parameter p = 0.5, these samples are stored in the first column of X\n",
"- for the other columns of X we do\n",
" - we generate randomly a parameter p between 0 and 1\n",
" - we generate samples of a geometric law of parameter p\n",
" - we add this samples to the sum of the previous column\n",
" \n",
"At the end, we have the matrix X where each column `Xi` is a sum of a samples generated from a geometric law and the previous columns `X0+...+Xi-1`.\n",
"We generate `y` as the mean of `X` on the axis 1."
Now that we have a way to generate highly correlated data, we will apply a Lasso regression to it.
The aim of this part is to demonstrate the instability of the Lasso regression on this data.
For this purpose, we will use a loop to generate several datasets using **the same parameters** for our `generate_data` function.
We will demonstrate the instability by counting the number of selected features each time and recording which features are selected.
**Note:** the model fits an intercept by default (`fit_intercept=True`).
The `normalize` parameter is deprecated, so we use a pipeline to normalize the data before the regression, as suggested in the deprecation message. We set `with_mean=False` because we only scale the data and do not center it; we leave that to the Lasso.
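Below is a minimal sketch of this setup; the values of `alpha`, `n_samples`, `n_features`, and `n_tests` are illustrative, and the counting loop only sketches the approach used in the following cells.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

n_samples, n_features, n_tests = 100, 50, 20  # illustrative values

# Scale the data without centering it, then fit the Lasso
lasso = Lasso(alpha=0.1, fit_intercept=True)
model = make_pipeline(StandardScaler(with_mean=False), lasso)

n_selected = []                         # number of selected features per training
selection_count = np.zeros(n_features)  # how many times each feature was selected
for _ in range(n_tests):
    X, y = generate_data(n_samples, n_features)
    model.fit(X, y)
    selected = (lasso.coef_ != 0) * 1
    n_selected.append(selected.sum())
    selection_count += selected
uniq, count = np.unique(n_selected, return_counts=True)
```
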
%% Cell type:code id: tags:

```python
plt.bar(uniq,count,label='Number of selected features per training')
plt.legend()
plt.show()
# Using selection_count, we can display the number of times each feature was selected
plt.bar(range(n_features),selection_count,label='Number of times each feature was selected')
plt.legend()
plt.show()
```
%% Cell type:markdown id: tags:
These results show the instability of the Lasso regression:
- We see that the number of selected features changes from one training to another
- We see that the regression model selects different features: while the first features are selected (almost) every time, the others may or may not be selected (some of them are selected one time out of two, for example).
%% Cell type:code id: tags:
```python
# Arrays to store the results
n_selected = []
selection_count = np.zeros(n_features)
# We test our regression n_tests times
for i in range(n_tests):
    # Generate the data
    X, y = generate_data_2(n_samples, n_features)
    # Fit the model (pipeline) on the data
    model.fit(X, y)
    # We can now retrieve the selected features:
    selected_features = (lasso.coef_ != 0) * 1  # (lasso.coef_ != 0) gives an array of True / False; * 1 turns True into 1 and False into 0
    # Record the number of selected features and which features were selected
    n_selected.append(selected_features.sum())
    selection_count += selected_features
plt.bar(range(n_features),selection_count,label='Number of times each feature was selected')
plt.legend()
plt.show()
```
%% Cell type:markdown id: tags:
With this data, we observe that the elastic net is stable in terms of selected features, but the result on this sample is not satisfactory: we always use 49 of the 50 features, even with a very high weight (0.99) given to the L1 term. The discarded feature is always the first one.
It is **as if** the elastic net "found" that each $X[i], i > 0$ was generated from $X[0]$, but did not "find" a link between the elements.
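For reference, here is a sketch of the elastic net setup we refer to; the `alpha` value is illustrative, and only the high `l1_ratio` comes from the observation above.

```python
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 99% of the penalty weight on the L1 term, the rest on the L2 term
enet = ElasticNet(alpha=0.1, l1_ratio=0.99, fit_intercept=True)
model = make_pipeline(StandardScaler(with_mean=False), enet)
```
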
%% Cell type:code id: tags:

```python
plt.bar(uniq,count,label='Number of selected features per training')
plt.legend()
plt.show()
plt.bar(range(n_features),selection_count,label='Number of times each feature was selected')
#plt.legend()
plt.show()
```
%% Cell type:markdown id: tags:
### Conclusion
With our correlated data, we were able to demonstrate the instability of the Lasso regression: the number of selected features and which features are selected can vary a lot from one training to another.
However, while we managed to show a form of stability for the elastic net regression, the results are not satisfactory: the regression did not discard many of the 50 features in the sample (one for the first data-generation method, and often zero for the second one).
We tried various changes to the data-generation method and various regularization parameters, without success.