Commit cdfef733 by Mathilde Rineau

### Update Devoir2.ipynb

parent 5611e3ae
%% Cell type:markdown id: tags:

# AOS1 assignment

## Make elastic net outshine the Lasso

Authors: Mathilde Rineau, Rémy Huet

### Introduction

The aim of this work is to demonstrate experimentally that elastic net regularization outshines Lasso regularization in some cases.

The Lasso may be unstable when used on highly correlated data. Indeed, Lasso regularization may ignore some features (by setting their weight in the regression to 0). When the data is highly correlated, small changes in the sample can change which features are selected (what we call instability). By contrast, an elastic net regression should also be able to ignore some features, but with more stability than the Lasso.

In this work, we construct a dataset of highly correlated features to demonstrate this.

%% Cell type:code id: tags:

``` python
import numpy as np
import matplotlib.pyplot as plt
import random

from sklearn.linear_model import Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
```

%% Cell type:markdown id: tags:

### Data generation

First, we generate highly correlated data: a sample X (multidimensional) and a target y (one-dimensional).

#### Data generation, first function: generate_data

We write a function for this.
Its parameters are:

- n_samples: the number of samples;
- n_features: the number of features in X;
- m, s: the parameters of the normal law used to generate the first feature.

It returns the outputs X and y. We proceed in three steps:

- first, we generate the first column of X from a normal law N(m, s);
- for the other columns of X, each value is computed as follows:
    - we draw a number from a normal law N(i / 2, 1), where i is the sample index;
    - we add it to the value of the first column;
- for y, we sum two out of every three features of X (those whose index is not a multiple of 3).

#### Data generation, second function: generate_data_2

We wrote a second function, which generates another highly correlated dataset in order to compare our results. Its parameters are:

- n_samples: the number of samples;
- n_features: the number of features in X.

It returns the outputs X and y. We proceed in four steps:

- we generate samples from a geometric law of parameter p = 0.5; these samples form the first row of X;
- for each other row of X:
    - we draw a parameter p between 0 and 1;
    - we generate samples from a geometric law of parameter p;
    - we add these samples to the previous row.

At the end, each row `Xi` of the matrix X is the sum of samples generated from a geometric law and the previous rows `X0 + ... + Xi-1`. We generate `y` as the mean of `X` along axis 1.
%% Cell type:code id: tags:

``` python
def generate_data(n_samples, n_features, m, s):
    X = np.ndarray((n_samples, n_features))
    y = np.ndarray((n_samples,))
    for i in range(n_samples):
        X[i, 0] = np.random.normal(m, s)
        for j in range(1, n_features):
            X[i, j] = X[i, 0] + np.random.normal(i / 2, 1)
        selected_features = [X[i, j] for j in range(n_features) if j % 3 != 0]
        y[i] = np.sum(selected_features)
    return X, y


def generate_data_2(n_samples, n_features):
    X = []
    # First row: samples from a geometric law of parameter p = 0.5
    sum_X = np.random.geometric(p=0.5, size=n_features)
    X.append(sum_X)
    for i in range(n_samples - 1):
        # 1 - random.random() lies in (0, 1], a valid geometric parameter
        p = 1 - random.random()
        temp = np.random.geometric(p=p, size=n_features)
        sum_X = sum_X + temp
        X.append(sum_X)
    X = np.array(X)
    y = np.mean(X, axis=1)
    return X, y
```

%% Cell type:markdown id: tags:

### Demonstrate instability of Lasso

Now that we have a way to generate highly correlated data, we use a Lasso regression on it. The aim of this part is to demonstrate the instability of the Lasso regression on such data.

For this purpose, we use a loop to generate several datasets with **the same params** of our `generate_data` function. We demonstrate instability by counting the number of selected features each time, and by recording which features are selected.

**Note:** the model fits an intercept by default (`fit_intercept=True`). The `normalize` parameter is deprecated, so we use a pipeline to normalize the data before the regression, as suggested in the deprecation message. We set `with_mean=False` because we only scale the data and do not center it; we leave that work to the Lasso.
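%% Cell type:markdown id: tags:

As a quick sanity check (an illustrative addition, not part of the original notebook), we can verify that data built this way really is highly correlated. The sketch below uses a simplified variant of the first construction (noise with mean 0 instead of i / 2, and an arbitrary seed) and prints the smallest pairwise feature correlation:

%% Cell type:code id: tags:

``` python
import numpy as np

# Illustrative check, not the notebook's exact construction:
# every feature j > 0 is feature 0 plus independent N(0, 1) noise,
# so with s = 3 all pairwise correlations should be around 0.9 or above.
rng = np.random.default_rng(0)
n_samples, n_features, m, s = 300, 5, 30, 3

X = np.empty((n_samples, n_features))
X[:, 0] = rng.normal(m, s, size=n_samples)
for j in range(1, n_features):
    X[:, j] = X[:, 0] + rng.normal(0, 1, size=n_samples)

corr = np.corrcoef(X, rowvar=False)  # feature-by-feature correlation matrix
print(corr.min())
```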
%% Cell type:code id: tags:

``` python
# Params for data generation
n_samples = 300
n_features = 50
m = 30
s = 3

# Number of tests
n_tests = 100

standard_scaler = StandardScaler(with_mean=False)
lasso = Lasso(alpha=3.0, fit_intercept=True, max_iter=5000)
model = make_pipeline(standard_scaler, lasso)

# Arrays to store results
n_selected = []
selection_count = np.zeros(n_features)

# We test our regression n_tests times
for i in range(n_tests):
    # Generate the data
    X, y = generate_data(n_samples, n_features, m, s)
    # Fit the model (pipeline) with the data
    model.fit(X, y)
    # We can now retrieve the selected features:
    # (lasso.coef_ != 0) gives an array of True / False; * 1 turns True into 1 and False into 0
    selected_features = (lasso.coef_ != 0) * 1
    n_selected.append(np.count_nonzero(selected_features))
    selection_count += selected_features

# Using n_selected, we can display the number of selected features per training
uniq, count = np.unique(n_selected, return_counts=True)
plt.bar(uniq, count, label='Number of selected features per training')
plt.legend()
plt.show()

# Using selection_count, we can display the number of times each feature was selected
plt.bar(range(n_features), selection_count, label='Number of times each feature was selected')
plt.legend()
plt.show()
```

%% Cell type:markdown id: tags:

These results show the instability of the Lasso regression:

- the number of selected features changes from one training to another;
- the model selects different features: while the first features are selected (almost) every time, the others may or may not be (some of them are selected about one time out of two, for example).
%% Cell type:code id: tags:

``` python
# Arrays to store results
n_selected = []
selection_count = np.zeros(n_features)

# We test our regression n_tests times
for i in range(n_tests):
    # Generate the data
    X, y = generate_data_2(n_samples, n_features)
    # Fit the model (pipeline) with the data
    model.fit(X, y)
    # We can now retrieve the selected features:
    selected_features = (lasso.coef_ != 0) * 1
    n_selected.append(np.count_nonzero(selected_features))
    selection_count += selected_features

uniq, count = np.unique(n_selected, return_counts=True)
plt.bar(uniq, count, label='Number of selected features per training')
plt.legend()
plt.show()

# Using selection_count, we can display the number of times each feature was selected
plt.bar(range(n_features), selection_count, label='Number of times each feature was selected')
plt.legend()
plt.show()
```

%% Cell type:markdown id: tags:

### Demonstrate stability of elastic net

Next, we run the same tests with an elastic net regression model to highlight the difference between the two methods.
%% Cell type:code id: tags:

``` python
# We use the same alpha as for the Lasso regression.
# Since we really want to select features, we give priority to the l1 penalty.
elastic_net = ElasticNet(alpha=3.0, l1_ratio=0.99, fit_intercept=True, max_iter=10000)
model = make_pipeline(standard_scaler, elastic_net)

# Arrays to store results
n_selected = []
selection_count = np.zeros(n_features)

# We test our regression n_tests times
for i in range(n_tests):
    # Generate the data
    X, y = generate_data(n_samples, n_features, m, s)
    # Fit the model (pipeline) with the data
    model.fit(X, y)
    # We can now retrieve the selected features:
    selected_features = (elastic_net.coef_ != 0) * 1
    n_selected.append(np.count_nonzero(selected_features))
    selection_count += selected_features

uniq, count = np.unique(n_selected, return_counts=True)
print(f'Features selected: {uniq}, count: {count}')

plt.bar(range(n_features), selection_count, label='Number of times each feature was selected')
plt.legend()
plt.show()
```

%% Cell type:markdown id: tags:

With this data, we observe that the elastic net is stable in terms of selected features, but the result on this sample is not satisfying: we always use 49 of the 50 features, even with a very high (0.99) weight given to the L1 penalty. The discarded feature is always the first one.

It is **as if** the elastic net "found" that each $X[i], i > 0$ was generated from $X[0]$, but did not "find" a link between those elements.
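%% Cell type:markdown id: tags:

For reference (an illustrative addition, not part of the original notebook), the objective that scikit-learn's `ElasticNet` minimizes, as documented, shows how `l1_ratio` mixes the two penalties. A minimal sketch with made-up numbers (the helper name and values are ours):

%% Cell type:code id: tags:

``` python
import numpy as np

def elastic_net_objective(X, y, w, alpha, l1_ratio):
    """scikit-learn's documented ElasticNet objective (intercept omitted):
    1 / (2 * n_samples) * ||y - X w||^2
      + alpha * l1_ratio * ||w||_1
      + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2
    """
    n = X.shape[0]
    resid = y - X @ w
    return (resid @ resid / (2 * n)
            + alpha * l1_ratio * np.abs(w).sum()
            + 0.5 * alpha * (1 - l1_ratio) * (w @ w))

# With l1_ratio = 0.99 the penalty is almost pure L1 (sparsity-inducing);
# the small L2 part is what stabilizes selection among correlated features.
w = np.array([1.0, -2.0, 0.0])
value = elastic_net_objective(np.zeros((4, 3)), np.zeros(4), w, alpha=3.0, l1_ratio=0.99)
print(value)  # 3 * 0.99 * 3 + 0.5 * 3 * 0.01 * 5 = 8.985
```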
%% Cell type:code id: tags:

``` python
# Arrays to store results
n_selected = []
selection_count = np.zeros(n_features)

# We test our regression n_tests times
for i in range(n_tests):
    # Generate the data
    X, y = generate_data_2(n_samples, n_features)
    # Fit the model (pipeline) with the data
    model.fit(X, y)
    # We can now retrieve the selected features:
    selected_features = (elastic_net.coef_ != 0) * 1
    n_selected.append(np.count_nonzero(selected_features))
    selection_count += selected_features

uniq, count = np.unique(n_selected, return_counts=True)
plt.bar(uniq, count, label='Number of selected features per training')
plt.legend()
plt.show()

plt.bar(range(n_features), selection_count, label='Number of times each feature was selected')
plt.legend()
plt.show()
```

%% Cell type:markdown id: tags:

### Conclusion

With our correlated data, we were able to demonstrate the instability of the Lasso regression: the number of selected features, and which features are selected, can vary a lot from one training to another.

However, while we managed to show a form of stability for the elastic net regression, the results are not satisfying: the regression discarded very few of the 50 features of the sample (one for the first generation method, and often none for the second). We tried various changes to the data generation method and various regularization parameters, without success.