"We will demonstrate instability by counting the number of selected features each time, and registe which features are selected.\n",
"\n",
"**Note :** the model auto-correct intercept by default `fit_intercept=True`.\n",
"The parameter `normalize` is deprecated, so we use a pipeline to normalize the data before the regression, as suggested in the deprecation message. We set the `with_mean` parameter to `False` because we juste scale the data and do nnot center it, we leave this work to the Lasso."
"The parameter `normalize` is deprecated, so we use a pipeline to normalize the data before the regression, as suggested in the deprecation message. We set the `with_mean` parameter to `False` because we just scale the data and do not center it, we leave this work to the Lasso."
]
},
{
...
...
%% Cell type:markdown id: tags:
# AOS1 assignment
## Make elastic net outshine the Lasso
Authors: Mathilde Rineau, Rémy Huet
### Introduction
The aim of this work is to demonstrate experimentally that elastic net regularization outshines Lasso regularization in some cases.
We know that Lasso regularization may be unstable when used on highly correlated data.
Indeed, Lasso regularization may ignore some features (by setting their weights in the regression to 0).
When the data is highly correlated, small changes in the sample can lead to changes in the selection of features (what we call instability).
In contrast, elastic net regression should still be able to ignore some features, but with more stability than the Lasso.
In this work, we will construct a dataset of highly correlated data to demonstrate this.
%% Cell type:code id: tags:
```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
```
%% Cell type:markdown id: tags:
### Data generation
First, we will generate highly correlated data, consisting of a multidimensional sample X and a one-dimensional target y.
We write a function for this.
Its parameters are:
- n_samples: the number of samples
- n_features: the number of features in X
- m, s: the parameters of the normal law used to generate the first feature

Its outputs are X and y.
To build the data, we proceed in three steps:
- First, we generate the first feature of X randomly from a normal law N(m, s).
- For each other feature of X, indexed by j, the value is calculated as follows:
  - we draw a number from a normal law N(j, 1);
  - we add it to the value of the first feature.
- For y, the value is calculated as the mean of the features generated for each sample.
%% Cell type:code id: tags:
```
# /!\ THIS IS A THIRD TEST VERSION, COULD (AND WILL CERTAINLY) CHANGE
def generate_data(n_samples, n_features, m, s):
    X = np.empty((n_samples, n_features))
    for i in range(n_samples):
        # First feature: drawn from the normal law N(m, s)
        X[i, 0] = np.random.normal(m, s)
        # Other features: the first feature plus a noise drawn from N(j, 1)
        for j in range(1, n_features):
            X[i, j] = X[i, 0] + np.random.normal(j, 1)
    # The target is the mean of the features of each sample
    y = np.mean(X, axis=1)
    return X, y
```
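%% Cell type:markdown id: tags:
As a quick sanity check (not part of the experiment itself), we can look at the empirical correlation matrix of a generated sample: every feature shares the first-feature component, so pairwise correlations should be close to 1. The sample size and the normal-law parameters below are illustrative assumptions, not prescribed values.
%% Cell type:code id: tags:
```
# Sanity check: the generated features should be highly correlated.
# The sizes and the parameters (m=0, s=10) are illustrative assumptions.
X_check, y_check = generate_data(n_samples=1000, n_features=5, m=0, s=10)
print(np.round(np.corrcoef(X_check, rowvar=False), 3))
```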
%% Cell type:markdown id: tags:
### Demonstrate instability of Lasso
Now that we have a way to generate highly correlated data, we will run a Lasso regression on it.
The aim of this part is to demonstrate the instability of the Lasso regression on such data.
For this purpose, we will use a loop to generate several datasets with **the same parameters** passed to our `generate_data` function.
We will demonstrate instability by counting the number of selected features on each run and recording which features are selected.
**Note:** the model fits an intercept by default (`fit_intercept=True`).
The parameter `normalize` is deprecated, so we use a pipeline to scale the data before the regression, as suggested in the deprecation message. We set the `with_mean` parameter of the `StandardScaler` to `False` because we just scale the data and do not center it; we leave that work to the Lasso.
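A minimal sketch of such a loop is given below; the number of runs, the dataset dimensions and the `alpha` value are illustrative assumptions, not prescribed values.
%% Cell type:code id: tags:
```
# Minimal sketch of the instability experiment.
# n_runs, the dataset dimensions and alpha are illustrative assumptions.
n_runs = 20
selected_features = []
for _ in range(n_runs):
    X, y = generate_data(n_samples=100, n_features=10, m=0, s=10)
    model = make_pipeline(StandardScaler(with_mean=False), Lasso(alpha=0.1))
    model.fit(X, y)
    # A feature counts as "selected" when its Lasso coefficient is non-zero
    coefs = model.named_steps['lasso'].coef_
    selected_features.append(np.flatnonzero(coefs))
for sel in selected_features:
    print(f'{len(sel)} features selected: {sel}')
```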