Commit 1b3c49af authored by Rémy Huet

Add some text

parent 5846c5ce
@@ -27,13 +27,13 @@
},
{
"cell_type": "markdown",
"id": "12aaeba6",
"id": "065f5872",
"metadata": {},
"source": [
"### First part : tests with a vanilla SVM\n",
"\n",
"In this first part, we will use a vanilla SVM on the MNIST dataset with the provided parameters.\n",
"We will observe the error of the SVM and the time for the test phase to compare them with the improved version"
"We will observe the error of the SVM and the time for the test phase to compare them with the improved version."
]
},
{
@@ -55,7 +55,7 @@
},
{
"cell_type": "markdown",
"id": "855cdb06",
"id": "3fe8ac1e",
"metadata": {},
"source": [
"We do some inspection of the dataset :"
@@ -64,7 +64,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "708c8ea1",
"id": "ab814c48",
"metadata": {},
"outputs": [],
"source": [
@@ -80,7 +80,7 @@
},
{
"cell_type": "markdown",
"id": "4e49be54",
"id": "3774c25a",
"metadata": {},
"source": [
"The dataset contains 70k samples of 784 features.\n",
@@ -93,7 +93,7 @@
},
{
"cell_type": "markdown",
"id": "60f31892",
"id": "f16e7cb9",
"metadata": {},
"source": [
"With our dataset, we can generate a training dataset and a testing dataset.\n",
@@ -118,7 +118,7 @@
},
{
"cell_type": "markdown",
"id": "d0532cc1",
"id": "b416f8ec",
"metadata": {},
"source": [
"From the article, we retrieve the parameters of the SVM used.\n",
@@ -145,7 +145,7 @@
},
{
"cell_type": "markdown",
"id": "a8cf4850",
"id": "3e624372",
"metadata": {},
"source": [
"Using the previously trained SVM, we make a prediction on the test dataset.\n",
@@ -175,7 +175,7 @@
},
{
"cell_type": "markdown",
"id": "90f08e8b",
"id": "3cf8a68b",
"metadata": {},
"source": [
"Of course the prediction time varies between two splits of the dataset and between two executions, but we will retain that it is close to 70 s.\n",
@@ -225,7 +225,7 @@
},
{
"cell_type": "markdown",
"id": "8f780139",
"id": "d379eaad",
"metadata": {},
"source": [
"As earlier, the values are affected by the selection of the training and testing dataset. We will retain a value of 3 % for the error.\n",
@@ -256,7 +256,7 @@
},
{
"cell_type": "markdown",
"id": "14cb2622",
"id": "a6639c96",
"metadata": {},
"source": [
"There are around 7500 support vectors in the SVM.\n",
@@ -266,7 +266,7 @@
},
{
"cell_type": "markdown",
"id": "fa1d441f",
"id": "f73d3b82",
"metadata": {},
"source": [
"### Implementing the \"Virtual Support Vectors\" method"
@@ -308,10 +308,46 @@
},
{
"cell_type": "markdown",
"id": "dd4bc011",
"id": "599220d1",
"metadata": {},
"source": [
"### Implementing the \"Reduced Set\" method"
"### Implementing the \"Reduced Set\" method\n",
"\n",
"The method implemented above improves the accuracy of the SVM classifier, but it also increases the number of support vectors and thus the computation time for the test data.\n",
"\n",
"To reduce the classification time, the goal is to find a set of vectors $z_k \\in L, k = 1,\\dots,N_z$ and corresponding weights $\\gamma_k \\in \\mathbb{R}$.\n",
"\n",
"With $\\bar\\Psi$ the normal to the decision hyperplane determined by the SVM and $\\Phi$ the mapping from the feature space to the decision space, we note $$\\bar\\Psi' = \\sum_{k=1}^{N_z} \\gamma_k\\Phi(z_k)$$\n",
"\n",
"We will now try to minimize, for a fixed $N_z$, the Euclidean distance to the original solution $$\\rho = ||\\bar\\Psi - \\bar\\Psi'||$$\n",
"\n",
"Using these vectors and weights, the decision rule for a test point $x$ is $$\\sum_{k=1}^{N_z}\\gamma_k K(z_k, x)$$\n",
"\n",
"The way to find the $z_k$ is not described in this article, so we looked it up in the referenced article (Burges 1996).\n",
"\n",
"We note the following properties for the reduced set vectors :\n",
"\n",
"- The decision rule of the approximated SVM using these vectors has the same form as that of the SVM using the support vectors;\n",
"- They are not support vectors of the original SVM and **they are not training samples**;\n",
"- We choose their number *a priori*.\n",
"\n",
"Here, the number is chosen to be 50 times smaller than the number of vectors found with the VSV method.\n",
"\n",
"We then apply the algorithm proposed in (Burges 1996) to determine the vectors and weights."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8860ad4d",
"metadata": {},
"outputs": [],
"source": [
"##### TMP : hard-coded number of vectors obtained with the VSV method\n",
"vsv = 28032\n",
"##### TMP\n",
"\n",
"nz = int(vsv / 50)"
]
}
],
%% Cell type:markdown id:5c8980bd tags:
# AOS1 - Assignment
## Improving the accuracy and speed of support vector machines
Authors : Mathilde Rineau, Rémy Huet
### Abstract
The paper "Improving the Accuracy and Speed of Support Vector Machines" by Burges and Schölkopf investigates a method to improve the speed and accuracy of a support vector machine.
As the authors say, SVMs are widely used in several applications.
To improve this method, the authors distinguish two types of improvements to achieve :
- improving the generalization performance;
- improving the speed in test phase.
The authors propose and combine two methods to improve SVM performances : the "virtual support vector" method and the "reduced set" method.
With those two improvements, they announce a machine much faster (22 times faster) and more precise (1.1 % vs 1.4 % error) than the original one.
In this work, we will describe and implement the two techniques they used, to see if these methods work as the authors claim.
%% Cell type:markdown id:065f5872 tags:
### First part : tests with a vanilla SVM
In this first part, we will use a vanilla SVM on the MNIST dataset with the provided parameters.
We will observe the error of the SVM and the time for the test phase to compare them with the improved version.
%% Cell type:code id:9f152334 tags:
```
# We will work on the mnist data set
# We load it from fetch_openml
from sklearn.datasets import fetch_openml
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
```
%% Cell type:markdown id:3fe8ac1e tags:
We do some inspection of the dataset :
%% Cell type:code id:ab814c48 tags:
```
# We print the characteristics of X and y
print(X.shape)
print(y.shape)
# Values taken by y
print(np.unique(y))
image = np.reshape(X[0], (28, 28))
plt.imshow(image, cmap='gray')
```
%% Cell type:markdown id:3774c25a tags:
The dataset contains 70k samples of 784 features.
The classes are 0 to 9 (the digits on the images).
The features are the pixels of a 28 x 28 image that we can retrieve using numpy's reshape function.
For example, the 1st image is a 5.
%% Cell type:markdown id:f16e7cb9 tags:
With our dataset, we can generate a training dataset and a testing dataset.
As in the article, we will use 60k samples as training samples and 10k as testing.
We split the dataset using the `train_test_split` function from `sklearn`.
%% Cell type:code id:4d3fa1c7 tags:
```
# We divide the data set in two parts: train set and test set
# According to the recommended values the train set's size is 60000 and the test set's size is 10000
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, train_size=60000, test_size=10000)
```
%% Cell type:markdown id:b416f8ec tags:
From the article, we retrieve the parameters of the SVM used.
We get C = 10, and a polynomial kernel of degree 5.
Coefficients `gamma` and `coef0` are respectively equal to 1 and 0.
We can now train an SVM with these parameters on the training dataset.
%% Cell type:code id:d809fc87 tags:
```
# First, we perform a SVC without preprocessing or improving in terms of accuracy or speed
from sklearn.svm import SVC
# we perform the default SVC, with the hyperparameter C=10 and a polynomial kernel of degree 5
# according to the recommendations
svc = SVC(C=10, kernel='poly', degree=5, gamma=1, coef0=0)
svc.fit(X_train, y_train)
```
%% Cell type:markdown id:3e624372 tags:
Using the previously trained SVM, we make a prediction on the test dataset.
One of the performances of the SVM measured in this article is the speed of the test phase.
We thus measure it.
%% Cell type:code id:8cb28178 tags:
```
import time
start = time.time()
# We predict the values for our test set
y_pred = svc.predict(X_test)
end = time.time()
# Elapsed time for one prediction pass over the 10k test samples
print(f'Elapsed time : {end - start}')
```
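Since a single run is noisy, here is a minimal sketch (not in the original notebook; the number of runs is an arbitrary choice of ours) of averaging the prediction time over several runs:
```
# A minimal sketch: average the prediction time over several runs
# to smooth out run-to-run variations (the number of runs is arbitrary).
n_runs = 5
times = []
for _ in range(n_runs):
    start = time.time()
    svc.predict(X_test)
    times.append(time.time() - start)
print(f'Mean elapsed time over {n_runs} runs : {np.mean(times):.1f}s')
```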
%% Cell type:markdown id:3cf8a68b tags:
Of course the prediction time varies between two splits of the dataset and between two executions, but we will retain that it is close to 70 s.
Using `y_test`, the real classes of the `X_test` samples, and `y_pred`, the classes predicted by the SVM, we can compute the confusion matrix and the error to see how good the predictions are.
%% Cell type:code id:c1248238 tags:
```
# We compute the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, accuracy_score
disp = ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
disp.figure_.suptitle('Confusion matrix for the vanilla SVM')
plt.show()
```
%% Cell type:code id:ba4e38ac tags:
```
# We print the classification report
print(classification_report(y_test, y_pred))
```
%% Cell type:code id:947b0895 tags:
```
# We print the accuracy of the SVC and the error rate
acc = accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print("Error rate: ", (1-acc) * 100, "%")
```
%% Cell type:markdown id:d379eaad tags:
As earlier, the values are affected by the selection of the training and testing dataset. We will retain a value of 3 % for the error.
The method described by the authors relies on the support vectors of the previously trained SVM.
We will thus do some inspection of them before going further.
%% Cell type:code id:81b09df7 tags:
```
s_vects = svc.support_vectors_
print(s_vects.shape)
v = s_vects[0]
v_index = svc.support_[0]
v_class = y_train[v_index]
print(f'Class of the first support vector : {v_class}')
img = np.reshape(v, (28, 28))
plt.imshow(img, cmap='gray')
```
%% Cell type:markdown id:a6639c96 tags:
There are around 7500 support vectors in the SVM.
Each support vector is a sample of `X_train`. We can thus retrieve its class using its index on the train dataset, and display it as an image as above.
%% Cell type:markdown id:f73d3b82 tags:
### Implementing the "Virtual Support Vectors" method
%% Cell type:code id:0e648133 tags:
```
def right_side_rescaling(support_vectors):
    # Shift all pixel values of the flattened support vectors by one
    # position (with wrap-around) to produce translated images.
    # The original loop had an off-by-one bug; np.roll does the same
    # rotation correctly.
    n, m = support_vectors.shape
    support_vector_lin = support_vectors.reshape(n * m)
    shifted = np.roll(support_vector_lin, -1)
    return shifted.reshape(n, m)
```
%% Cell type:code id:aa5535c9 tags:
```
# Quick check on a small matrix
m = np.array([[1, 2, 3, 4, 5],
              [1, 2, 3, 4, 5]])
print(right_side_rescaling(m))
```
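For completeness, here is a minimal sketch of the full VSV pipeline under our assumptions: each support vector image is shifted by one pixel in the four directions (with wrap-around, like the helper above), and an SVM with the same parameters is retrained on the original plus virtual support vectors. The helper name `make_virtual_svs` is ours, not the article's.
```
# A minimal sketch of the full VSV pipeline (helper name is ours):
# generate one-pixel translations of each support vector image,
# then retrain the SVM on the original and virtual support vectors.
def make_virtual_svs(vectors):
    virtuals = []
    for v in vectors:
        img = v.reshape(28, 28)
        # one-pixel shifts: down, up, right, left (with wrap-around)
        for shift, axis in ((1, 0), (-1, 0), (1, 1), (-1, 1)):
            virtuals.append(np.roll(img, shift, axis=axis).reshape(-1))
    return np.array(virtuals)

y_sv = y_train[svc.support_]
X_vsv = np.concatenate([s_vects, make_virtual_svs(s_vects)])
y_vsv = np.concatenate([y_sv, np.repeat(y_sv, 4)])

svc_vsv = SVC(C=10, kernel='poly', degree=5, gamma=1, coef0=0)
svc_vsv.fit(X_vsv, y_vsv)
```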
%% Cell type:markdown id:599220d1 tags:
### Implementing the "Reduced Set" method
The method implemented above improves the accuracy of the SVM classifier, but it also increases the number of support vectors and thus the computation time for the test data.
To reduce the classification time, the goal is to find a set of vectors $z_k \in L, k = 1,\dots,N_z$ and corresponding weights $\gamma_k \in \mathbb{R}$.
With $\bar\Psi$ the normal to the decision hyperplane determined by the SVM and $\Phi$ the mapping from the feature space to the decision space, we note $$\bar\Psi' = \sum_{k=1}^{N_z} \gamma_k\Phi(z_k)$$
We will now try to minimize, for a fixed $N_z$, the Euclidean distance to the original solution $$\rho = ||\bar\Psi - \bar\Psi'||$$
Using these vectors and weights, the decision rule for a test point $x$ is $$\sum_{k=1}^{N_z}\gamma_k K(z_k, x)$$
The way to find the $z_k$ is not described in this article, so we looked it up in the referenced article (Burges 1996).
We note the following properties for the reduced set vectors :
- The decision rule of the approximated SVM using these vectors has the same form as that of the SVM using the support vectors;
- They are not support vectors of the original SVM and **they are not training samples**;
- We choose their number *a priori*.
Here, the number is chosen to be 50 times smaller than the number of vectors found with the VSV method.
We then apply the algorithm proposed in (Burges 1996) to determine the vectors and weights.
%% Cell type:code id:8860ad4d tags:
```
##### TMP : hard-coded number of vectors obtained with the VSV method
vsv = 28032
##### TMP
nz = int(vsv / 50)
```
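Since $\bar\Psi$ and $\bar\Psi'$ only appear through inner products, $\rho^2$ can be evaluated with the kernel alone: $$\rho^2 = \sum_{i,j}\alpha_i\alpha_j K(s_i, s_j) - 2\sum_{i,k}\alpha_i\gamma_k K(s_i, z_k) + \sum_{k,l}\gamma_k\gamma_l K(z_k, z_l)$$ where $\alpha_i$ are the coefficients of the support vectors $s_i$ in $\bar\Psi$. As a placeholder while the (Burges 1996) algorithm is implemented, here is a minimal sketch that minimizes this objective with a generic optimizer, written for a binary SVM for simplicity; it is not Burges' fixed-point procedure, and all names are ours.
```
# A minimal sketch (names are ours): minimize rho^2 over the reduced set
# vectors Z and weights g with a generic optimizer, for a binary SVM.
from scipy.optimize import minimize

def poly_kernel(A, B, gamma=1, coef0=0, degree=5):
    # Same polynomial kernel as the SVC above
    return (gamma * A @ B.T + coef0) ** degree

def rho_squared(params, s, alpha, nz, d):
    # rho^2 expanded with the kernel trick; alpha are the signed dual
    # coefficients of a binary SVM with support vectors s. The first
    # term is constant in params but kept for readability.
    Z = params[:nz * d].reshape(nz, d)
    g = params[nz * d:]
    return (alpha @ poly_kernel(s, s) @ alpha
            - 2 * alpha @ poly_kernel(s, Z) @ g
            + g @ poly_kernel(Z, Z) @ g)

# Hypothetical usage on a binary classifier `clf` (e.g. one digit vs another):
# s = clf.support_vectors_
# alpha = clf.dual_coef_.ravel()
# d = s.shape[1]
# x0 = np.concatenate([s[:nz].ravel(), np.zeros(nz)])
# res = minimize(rho_squared, x0, args=(s, alpha, nz, d), method='L-BFGS-B')
# Z, gammas = res.x[:nz * d].reshape(nz, d), res.x[nz * d:]
```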