Commit 21870265 authored by Rémy Huet

TP6 -> looooooong clf

parent ec4ef403
%% Cell type:markdown id: tags:
# AOS1
## TP3 - Kernel methods
%% Cell type:code id: tags:
```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people
```
%% Cell type:markdown id: tags:
### Question 1
Fetch the data.
%% Cell type:code id: tags:
```
faces = fetch_lfw_people(min_faces_per_person=60)
```
%% Cell type:code id: tags:
```
print(faces.target)
print(faces.target_names)
print(np.unique(faces.target))
print(faces.images.shape)
print(faces.data.shape) # Images "flattened"
plt.imshow(faces.images[5])
```
%% Cell type:markdown id: tags:
Each sample has 2914 features (pixels).
We will first apply PCA to reduce the number of features before training the SVM.
%% Cell type:markdown id: tags:
### Question 2
We split the data into train and test sets.
%% Cell type:code id: tags:
```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(faces.data, faces.target)
```
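%% Cell type:markdown id: tags:
The LFW classes are imbalanced, so a stratified split keeps each person's proportion comparable between train and test. A sketch of a variant of the call above (the random_state value is an arbitrary choice, added for reproducibility):
%% Cell type:code id: tags:
```
# Sketch: stratified, reproducible variant of the split above
X_train, X_test, y_train, y_test = train_test_split(
    faces.data, faces.target, stratify=faces.target, random_state=0)
```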
%% Cell type:markdown id: tags:
### Question 3
We use PCA to reduce the number of features to 100.
%% Cell type:code id: tags:
```
from sklearn.decomposition import PCA
pca = PCA(n_components=100, whiten=True)
X_train_pca = pca.fit_transform(X_train)
print(pca.n_components_)
print(pca.explained_variance_ratio_)
print(np.sum(pca.explained_variance_ratio_))  # total fraction of variance kept
```
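%% Cell type:markdown id: tags:
As an aside (a sketch, not required by the TP), PCA can also be driven by a variance target instead of a fixed count: passing a float in (0, 1) as n_components keeps the smallest number of components explaining at least that fraction of the variance.
%% Cell type:code id: tags:
```
# Sketch: let PCA pick the number of components needed
# to explain at least 90 % of the variance
pca_90 = PCA(n_components=0.9, whiten=True)
pca_90.fit(X_train)
print(pca_90.n_components_)
```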
%% Cell type:markdown id: tags:
With 100 components, we keep more than 90 % of the explained variance.
Now we want to train a vanilla SVM on this data.
%% Cell type:code id: tags:
```
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
svc = SVC()  # default hyperparameters (RBF kernel)
svc.fit(X_train_pca, y_train)
```
%% Cell type:markdown id: tags:
With the trained model, we can predict the labels for the test dataset and compare them to the test targets.
%% Cell type:code id: tags:
```
X_test_pca = pca.transform(X_test)
y_pred = svc.predict(X_test_pca)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
```
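%% Cell type:markdown id: tags:
A visual alternative to the printed matrix (a sketch assuming scikit-learn >= 1.0, where ConfusionMatrixDisplay.from_predictions is available):
%% Cell type:code id: tags:
```
from sklearn.metrics import ConfusionMatrixDisplay

# Sketch: plot the confusion matrix with the person names as labels
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=faces.target_names,
    xticks_rotation='vertical')
plt.show()
```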
%% Cell type:markdown id: tags:
### Question 4
The SVM was trained with default hyperparameters.
These parameters are the following:
%% Cell type:code id: tags:
```
print(svc.C)      # default C = 1.0
print(svc.gamma)  # default gamma = 'scale'
```
%% Cell type:markdown id: tags:
### Question 5
We use a GridSearchCV to perform a search over the hyperparameters.
%% Cell type:code id: tags:
```
from sklearn.model_selection import GridSearchCV
parameters = {'C': np.logspace(-2, 3, 10), 'gamma': np.logspace(-4, 1, 10)}
clf = GridSearchCV(svc, parameters)  # 5-fold cross-validation by default
print(clf)
clf.fit(X_train_pca, y_train)
```
%% Cell type:code id: tags:
```
clf.best_params_
```
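%% Cell type:markdown id: tags:
Note that GridSearchCV refits the best estimator on the whole training set by default (refit=True), so the fitted search object can predict directly; the sketch below should match the manual refit that follows.
%% Cell type:code id: tags:
```
# Sketch: use the refitted best estimator directly
y_best = clf.predict(X_test_pca)
print(accuracy_score(y_test, y_best))
```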
%% Cell type:markdown id: tags:
With these parameters:
%% Cell type:code id: tags:
```
svc = SVC(C=clf.best_params_['C'], gamma=clf.best_params_['gamma'])
svc.fit(X_train_pca, y_train)
y_pred = svc.predict(X_test_pca)
print(classification_report(y_test, y_pred))
```
%% Cell type:markdown id: tags:
### Question 6
We want to add the number of principal components to the cross-validated search.
%% Cell type:code id: tags:
```
from sklearn.pipeline import make_pipeline
pca = PCA(whiten=True)
svc = SVC()
estimator = make_pipeline(pca, svc)
parameters = {
    'pca__n_components': range(10, 101, 10),  # 0 components is invalid; step by 10 to keep the grid tractable
    'svc__C': np.logspace(-2, 3, 10),
    'svc__gamma': np.logspace(-4, 1, 10)
}
clf = GridSearchCV(estimator, parameters)
# Fit on the raw images: the pipeline applies PCA itself
clf.fit(X_train, y_train)
clf.best_params_
```
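%% Cell type:markdown id: tags:
This exhaustive grid (10 x 10 x 10 candidates, each cross-validated) is what makes the fit so long. A cheaper alternative, sketched here with the same pipeline and parameter ranges, is RandomizedSearchCV, which only samples a fixed number of candidates:
%% Cell type:code id: tags:
```
from sklearn.model_selection import RandomizedSearchCV

# Sketch: evaluate 50 random candidates instead of all 1000
rs = RandomizedSearchCV(estimator, parameters, n_iter=50, random_state=0)
rs.fit(X_train, y_train)
print(rs.best_params_)
```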