Commit bcc4de02 authored by Rémy Huet's avatar Rémy Huet 💻

TP4

parent 0affb9fb
%% Cell type:markdown id: tags:
# TD PCA
## PCA by hand
%% Cell type:code id: tags:
```
import numpy as np
import matplotlib.pyplot as plt
# Generate dataset
X = np.random.multivariate_normal([1, 2], [[2, 1], [1, 2]], 100)
X.shape
plt.scatter(*X.T) # Same notation as (X[:, 0], X[:, 1])
plt.show()
```
%% Cell type:markdown id: tags:
### Question 1
Number of samples: 100
Each sample has 2 features, so n = 100 and p = 2 (checked in the cell below).
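A quick sanity check of these dimensions (a minimal sketch, assuming X from the cell above):
%% Cell type:code id: tags:
```
# Sanity check of the dataset dimensions: n samples, p features (assumes X from above)
n, p = X.shape
print(n, p)  # expected: 100 2
```
%% Cell type:markdown id: tags: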
### Question 2
We first compute X.T @ X (the covariance matrix, up to centering and the 1 / (n - 1) factor) and compare its eigenvalues with the squared singular values of X.
%% Cell type:code id: tags:
```
import scipy.linalg as linalg
V = X.T @ X  # @ = matrix multiplication
w, vl = linalg.eig(V)
wh, vlh = linalg.eigh(V)
eigvals = linalg.eigvals(V)
eigvalsh = linalg.eigvalsh(V)
print(eigvalsh)
print(linalg.svdvals(X) ** 2)
```
%% Cell type:markdown id: tags:
eig: computes eigenvalues and eigenvectors; the eigenvalues may come back as complex numbers.
eigh: for symmetric (Hermitian) matrices, where the eigenvalues are known to be real. The covariance matrix is symmetric.
eigvals, eigvalsh: same functions, but return only the eigenvalues.
linalg.svd: computes the full singular value decomposition.
linalg.svdvals: computes the singular values directly.
We indeed find eigvalsh(X.T @ X) = svdvals(X) ** 2.
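A quick check of this identity (a minimal sketch, assuming X, np and linalg from the cells above):
%% Cell type:code id: tags:
```
# Verify that the eigenvalues of X.T @ X equal the squared singular values of X
# (both sorted in ascending order; assumes X, np and linalg from the cells above)
print(np.allclose(np.sort(linalg.eigvalsh(X.T @ X)),
                  np.sort(linalg.svdvals(X) ** 2)))
```
%% Cell type:markdown id: tags: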
### Question 3
%% Cell type:code id: tags:
```
# New X with 15 features
n, p = 100, 15
X = np.random.normal(size=(n, p))
# First: center the data
Xc = X - X.mean(axis=0)  # subtract the column means from each row (numpy broadcasting)
V = 1 / (n - 1) * Xc.T @ Xc  # same convention as scikit-learn: unbiased covariance
# Calculate eigen values and vectors
# w eigenvalues, vl eigenvectors
w, vl = linalg.eigh(V)
# We can calculate the principal components by multiplying Xc by vl
Xpca = Xc @ vl
print(Xpca.shape)
```
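%% Cell type:markdown id: tags:
Equivalently, the principal components can be obtained from the SVD of the centered data, which is what scikit-learn does internally. A minimal sketch, assuming Xc, w and n from the cell above:
%% Cell type:code id: tags:
```
# PCA via SVD of the centered data (assumes Xc, w and n from the cell above)
U, s, Vt = linalg.svd(Xc, full_matrices=False)
# The explained variances s**2 / (n - 1) match the eigenvalues w of V
# (eigh returns ascending order, the SVD descending order, hence the sort)
print(np.allclose(np.sort(s ** 2 / (n - 1)), np.sort(w)))
Xpca_svd = Xc @ Vt.T  # same components as Xc @ vl, up to sign and column order
```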
%% Cell type:markdown id: tags:
We compare against scikit-learn. Note that eigh returns the eigenvalues in ascending order, while scikit-learn sorts the components by decreasing explained variance.
%% Cell type:code id: tags:
```
from sklearn.decomposition import PCA
pca = PCA() # All components are kept by default.
# PCA centers the data automatically, no need to do it beforehand
pca.fit(X)
print(n / (n - 1) * Xpca.std(axis=0) ** 2)  # unbiased variance of each hand-computed component
print(w)
print(pca.explained_variance_)
```
%% Cell type:markdown id: tags:
We find the same explained variances as scikit-learn, up to the ordering of the components.
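As a further check (a minimal sketch, assuming pca, X and Xpca from the cells above), the scores returned by scikit-learn match the hand-computed components up to sign and column ordering:
%% Cell type:code id: tags:
```
# Compare scikit-learn's scores with the hand-computed components
# (assumes pca, X and Xpca from above; columns are reversed because eigh sorts
# eigenvalues in ascending order, and the sign of each component is arbitrary;
# this assumes all eigenvalues are distinct, which holds almost surely here)
scores = pca.transform(X)
print(np.allclose(np.abs(scores), np.abs(Xpca[:, ::-1])))
```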
%% Cell type:markdown id: tags:
## PCA for dimension reduction
We use the Boston housing regression dataset.
%% Cell type:code id: tags:
```
from sklearn.datasets import load_boston  # note: deprecated in recent scikit-learn versions
boston = load_boston()
# We use a PCA object:
X = boston.data
y = boston.target
pca = PCA()
pca.fit(X)
# Study how the explained variance decreases:
ev = pca.explained_variance_
print(ev)
plt.bar(range(len(ev)), ev)
plt.show()
```
%% Cell type:code id: tags:
```
# Cumulative percentage of the total explained variance:
evr = np.cumsum(pca.explained_variance_ratio_)
print(evr)
plt.bar(range(len(evr)), evr)
plt.show()
```
%% Cell type:markdown id: tags:
Using the previous calculation, we could for example decide to keep the first two principal components (> 95 % of the explained variance) or the first three (99 %).
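scikit-learn can also choose the number of components for us: passing a float between 0 and 1 as n_components keeps just enough components to reach that fraction of explained variance. A minimal sketch, assuming X from above:
%% Cell type:code id: tags:
```
# Keep just enough components to explain at least 95 % of the variance
# (assumes X from above; compare n_components_ with the cumulative ratios printed earlier)
pca95 = PCA(n_components=0.95)
pca95.fit(X)
print(pca95.n_components_)
```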
%% Cell type:code id: tags:
```
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
# In a data-driven way, determine the number of principal components that gives the best score.
pca = PCA()
lin = LinearRegression()
# Define the sequence of steps applied to the data: here a PCA followed by a linear regression
pca_lin = Pipeline([("pca", pca), ("lin", lin)])
# The grid search is used to select the best pipeline
# Grid-search cross-validation, exploring different numbers of PCA components
clf = GridSearchCV(
    estimator=pca_lin,
    scoring="neg_mean_squared_error",
    cv=10,
    param_grid=dict(pca__n_components=range(1, X.shape[1] + 1))
)
# Fit using the inputs and the target values
# At the end, the best model is kept
clf.fit(X, y)
clf.best_params_
```
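%% Cell type:markdown id: tags:
To see how this choice was made, one can also inspect the cross-validated score obtained for each number of components. A minimal sketch, assuming clf fitted above:
%% Cell type:code id: tags:
```
# Inspect the cross-validated scores for each number of components (assumes clf from above)
print(clf.best_score_)  # negative MSE of the best pipeline
for n_comp, score in zip(clf.cv_results_["param_pca__n_components"],
                         clf.cv_results_["mean_test_score"]):
    print(n_comp, score)
```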