Commit bcc4de02 authored by Rémy Huet's avatar Rémy Huet 💻

TP4

parent 0affb9fb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TD PCA\n",
"\n",
"## PCA by hand"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Generate dataset\n",
"X = np.random.multivariate_normal([1, 2], [[2, 1], [1, 2]], 100)\n",
"X.shape\n",
"plt.scatter(*X.T) # Same notation as (X[:, 0], X[:, 1])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 1\n",
"\n",
"Number of samples: 100\n",
"Each sample has 2 features (n = 100, p = 2).\n",
"\n",
"### Question 2\n",
"\n",
"We first calculate the covariance matrix"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import scipy.linalg as linalg\n",
"\n",
"V = X.T @ X # @ = matrix multiplication (here X is not centered)\n",
"\n",
"w, vl = linalg.eig(V)\n",
"wh, vlh = linalg.eigh(V)\n",
"eigvals = linalg.eigvals(V)\n",
"eigvalsh = linalg.eigvalsh(V)\n",
"\n",
"print(eigvalsh)\n",
"print(linalg.svdvals(X) ** 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"eig: computes eigenvalues and eigenvectors; the results may be complex.\n",
"eigh: for symmetric (Hermitian) matrices: the eigenvalues are known to be real. The covariance matrix is symmetric.\n",
"\n",
"eigvals, eigvalsh: same, but return the eigenvalues only.\n",
"\n",
"linalg.svd: computes the full singular value decomposition.\n",
"linalg.svdvals: computes the singular values directly.\n",
"\n",
"We indeed have eigvalsh(X.T @ X) = svdvals(X) ** 2\n",
"\n",
"### Question 3"
]
},
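{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (a sketch using the variables defined above): the eigenvalues returned by the four routines should coincide up to ordering (and up to a zero imaginary part for eig), and match the squared singular values of X."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: all four routines agree on the symmetric matrix V,\n",
"# and the eigenvalues equal the squared singular values of X\n",
"np.testing.assert_allclose(np.sort(w.real), np.sort(wh))\n",
"np.testing.assert_allclose(np.sort(eigvalsh), np.sort(wh))\n",
"np.testing.assert_allclose(np.sort(wh), np.sort(linalg.svdvals(X) ** 2))\n",
"print(\"all eigenvalue checks passed\")"
]
},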
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## New X = 15 features\n",
"\n",
"n, p = 100, 15\n",
"\n",
"X = np.random.normal(size=(n, p))\n",
"\n",
"## First : center the data\n",
"\n",
"Xc = X - X.mean(axis=0) # Subtract the column means row-wise (NumPy broadcasting)\n",
"\n",
"V = 1 / (n - 1) * Xc.T @ Xc # (use same calc as scikit : unbiased variance)\n",
"\n",
"# Calculate eigen values and vectors\n",
"# w eigenvalues, vl eigenvectors\n",
"w, vl = linalg.eigh(V)\n",
"\n",
"# We can calculate the principal components by multiplying Xc by vl\n",
"\n",
"Xpca = Xc @ vl\n",
"\n",
"print(Xpca.shape)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We compare against scikit-learn:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\n",
"\n",
"pca = PCA() # All components are kept by default.\n",
"\n",
"# PCA automatically centers the data, no need to do it beforehand\n",
"pca.fit(X)\n",
"\n",
"print(n / (n - 1) * Xpca.std(axis=0) ** 2)\n",
"print(w)\n",
"print(pca.explained_variance_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We find the same explained variances as scikit-learn (note that eigh returns the eigenvalues in ascending order, while scikit-learn sorts the components by decreasing variance)"
]
},
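{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a numerical check (a small sketch using the variables above): since eigh sorts ascending and scikit-learn sorts descending, we reverse one of the two before comparing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# eigh sorts ascending, scikit-learn sorts descending: reverse before comparing\n",
"np.testing.assert_allclose(w[::-1], pca.explained_variance_)"
]
},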
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PCA for dimension reduction\n",
"\n",
"We use the boston regression dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_boston\n",
"boston = load_boston()  # Note: load_boston is deprecated and was removed in scikit-learn 1.2\n",
"\n",
"# We use PCA objects:\n",
"\n",
"X = boston.data\n",
"y = boston.target\n",
"\n",
"pca = PCA()\n",
"pca.fit(X)\n",
"\n",
"# Study how the explained variance decreases:\n",
"ev = pca.explained_variance_\n",
"print(ev)\n",
"plt.bar(range(len(ev)), ev)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# With total variance percentage :\n",
"evr = np.cumsum(pca.explained_variance_ratio_)\n",
"print(evr)\n",
"plt.bar(range(len(evr)), evr)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the previous calculation, we could for example decide to keep the first two principal components (> 95 % of the explained variance) or the first three (99 %)"
]
},
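{
"cell_type": "markdown",
"metadata": {},
"source": [
"This threshold-based choice can be automated; here is a minimal sketch (the 0.95 threshold is an arbitrary choice, not part of the original analysis)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"threshold = 0.95  # assumed target fraction of explained variance\n",
"# Index of the first cumulative ratio reaching the threshold\n",
"n_comp = int(np.searchsorted(evr, threshold) + 1)\n",
"print(n_comp, \"components reach\", threshold, \"of the total variance\")"
]
},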
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.pipeline import Pipeline\n",
"\n",
"# In a data-driven way, determine the number of principal components that gives the best score.\n",
"\n",
"pca = PCA()\n",
"lin = LinearRegression()\n",
"# Define a list of tasks to do with the data : here a PCA then linear regression\n",
"pca_lin = Pipeline([(\"pca\", pca), (\"lin\", lin)])\n",
"\n",
"# GridSearchCV is used to find the best pipeline\n",
"# Grid-search Cross Validation, exploring some different PCA parameters\n",
"clf = GridSearchCV(\n",
" estimator=pca_lin,\n",
" scoring=\"neg_mean_squared_error\",\n",
" cv=10,\n",
" param_grid=dict(pca__n_components=range(1, X.shape[1] + 1))\n",
")\n",
"\n",
"# Fit using input and target value\n",
"# At the end, the best model is kept\n",
"clf.fit(X, y)\n",
"\n",
"clf.best_params_"
]
}
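,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beyond best_params_, the fitted GridSearchCV also exposes cv_results_, which holds the mean test score for every candidate n_components; plotting it makes the trade-off visible (a sketch, not part of the original analysis)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Plot the cross-validated error as a function of the number of components kept\n",
"scores = -np.array(clf.cv_results_[\"mean_test_score\"])  # back to a positive MSE\n",
"plt.plot(range(1, X.shape[1] + 1), scores, marker=\"o\")\n",
"plt.xlabel(\"n_components\")\n",
"plt.ylabel(\"CV mean squared error\")\n",
"plt.show()"
]
}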
],
"metadata": {
"interpreter": {
"hash": "3abb0a1ef4892304d86bb3a3dfd052bcca35057beadba016173999c775e8d3ba"
},
"kernelspec": {
"display_name": "Python 3.9.7 64-bit ('AOS1-QteoCFsS': pipenv)",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}