Commit 4fba66b9 authored by Rémy Huet

End of the problem

parent 5a9e29d1
......@@ -22,11 +22,7 @@
"The authors propose and combine two methods to improve SVM performances : the \"virtual support vector\" method and the \"reduced set\" method.\n",
"With those two improvements, they announce a machine much faster (22 times than the original one) and more precise (1.1% vs 1.4% error) than the original one.\n",
"\n",
"In this work, we will describe and program the two techniques they are used to see if these method are working as they say.\n",
"\n",
"**Warning :** the second method proposed in the article was far too difficult to understand and put in place in the time we had.\n",
"We decided to change the kernel used by a kernel of degree 2 to simplify the computation of the reduced set vectors.\n",
"We thus made all the etude with this kernel."
"In this assignment, we will implement the first method to test it."
]
},
{
......@@ -140,9 +136,9 @@
"source": [
"# First, we perform a SVC without preprocessing or improving in terms of accuracy or speed\n",
"from sklearn.svm import SVC\n",
"# we perform the default SVC, with the hyperparameter C=10 and a polynomial kernel of degree 2\n",
"# according to what we said in the introduction\n",
"svc = SVC(C=10, kernel = 'poly', degree = 2, gamma=1, coef0=0)\n",
"# we perform the default SVC, with the hyperparameter C=10 and a polynomial kernel of degree 5\n",
"# according to the article\n",
"svc = SVC(C=10, kernel = 'poly', degree = 5, gamma=1, coef0=0)\n",
"svc.fit(X_train, y_train)"
]
},
......@@ -179,7 +175,7 @@
"id": "90f08e8b",
"metadata": {},
"source": [
"Of course the prediction time varies between two splits of the dataset, two computers and two executions, but we will retain that is is close from 90s.\n",
"Of course the prediction time varies between two splits of the dataset, two computers and two executions, but we will retain that is is close from 70s.\n",
"\n",
"Using `y_test` the real classes of the `X_test` samples and `y_pred` the predicted classes from the SVM, we can compute the confusion matrix and the error to see the how good the predictions are."
]
......@@ -229,8 +225,7 @@
"id": "8f780139",
"metadata": {},
"source": [
"As earlier, the values are affected by the selection of the training and testing dateset. Wi will retain a value of 1.6 % for the error.\n",
"We notice that this error is smaller thant hte error we had with our tests in degree 5\n",
"As earlier, the values are affected by the selection of the training and testing dateset. Wi will retain a value of 3.4 % for the error.\n",
"\n",
"The method described py the authors relies on the support vectors of the previously trained SVM.\n",
"We will thus do some inspection on them before going further."
......@@ -263,7 +258,7 @@
"id": "14cb2622",
"metadata": {},
"source": [
"There are around 8500 support vectors in the SVM.\n",
"There are around 7300 support vectors in the SVM.\n",
"\n",
"Each support vector is a sample of `X_train`. We can thus retrieve its class using its index on the train dataset, and display it as an image as above."
]
......@@ -289,7 +284,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "bd185e34",
"id": "578b9f6c",
"metadata": {},
"outputs": [],
"source": [
......@@ -345,7 +340,7 @@
},
{
"cell_type": "markdown",
"id": "a5c2fba7",
"id": "78ee8e6a",
"metadata": {},
"source": [
"With this new dataset, we can now train a SVM with the same params."
......@@ -354,7 +349,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "5473ed82",
"id": "8dc4a410",
"metadata": {},
"outputs": [],
"source": [
......@@ -364,7 +359,7 @@
},
{
"cell_type": "markdown",
"id": "312a308a",
"id": "396c7337",
"metadata": {},
"source": [
"Let's make some inspection on the results of the training"
......@@ -373,7 +368,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "25b7b232",
"id": "6c8bfe0d",
"metadata": {},
"outputs": [],
"source": [
......@@ -382,11 +377,11 @@
},
{
"cell_type": "markdown",
"id": "2e6df941",
"id": "43d511a9",
"metadata": {},
"source": [
"With the \"vanilla SVM\", we got ~8500 support vectors for a training set of 60000 samples, so ~14 % of the dataset.\n",
"With this training, we notice that most of the data was selected (~65 %).\n",
"With the \"vanilla SVM\", we got ~7300 support vectors for a training set of 60000 samples, so ~12 % of the dataset.\n",
"With this training, we notice that most of the data was selected (~67 %).\n",
"\n",
"We can now use the trained SVM on the test date to measure the error and the time of the test."
]
......@@ -394,7 +389,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "e17ff49c",
"id": "a196291b",
"metadata": {},
"outputs": [],
"source": [
......@@ -408,10 +403,10 @@
},
{
"cell_type": "markdown",
"id": "264071fc",
"id": "bbda2f32",
"metadata": {},
"source": [
"The time for the SVM to predict the test dataset is about 230s, which is much more than th vanilla SVM.\n",
"The time for the SVM to predict the test dataset is about 200s, which is much more than th vanilla SVM.\n",
"It was predictable because the number of support vector is larger than the number of the support vectors of the vanilla SVM.\n",
"\n",
"Let's now see the error of the predictions."
......@@ -420,7 +415,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "214e3b7d",
"id": "2c99def0",
"metadata": {},
"outputs": [],
"source": [
......@@ -435,7 +430,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "99dd9008",
"id": "29d1ff85",
"metadata": {},
"outputs": [],
"source": [
......@@ -459,12 +454,26 @@
},
{
"cell_type": "markdown",
"id": "b8f73b8d",
"id": "ed73c147",
"metadata": {},
"source": [
"We can see that the error is smaller than the error of the vanilla SVM (1.2 % vs 1.6 %)\n",
"We can see that the error is smaller than the error of the vanilla SVM (2.2 % vs 3.4 %)."
]
},
{
"cell_type": "markdown",
"id": "2fc76a50",
"metadata": {},
"source": [
"### Conclusion\n",
"\n",
"By implementing the \"virtual support vectors\" technique, we were able to ass some invariance in the data.\n",
"This modification allowed an improvement of the accuracy of the SVM.\n",
"\n",
"However, most of the new data generated from the support vectors were selected as support vectors for the second machine.\n",
"The augmentation of the number of support vectors led to an augmentation of the computation time during the test phase.\n",
"\n",
"We will now try to reduce the time of execution of the test using the method proposed in the article."
"That is why the authors suggest in the article to use a second technique to create a reduced set of vectors to reduce the computation time in the test phase."
]
}
],
......
%% Cell type:markdown id:5c8980bd tags:
# AOS1 - Assignment
## Improving the accuracy and speed of support vector machines
Authors: Mathilde Rineau, Rémy Huet
### Abstract
The paper "Improving the Accuracy and Speed of Support Vector Machines" by Burges and Schölkopf is investigating a method to improve ht speed an accuracy of a support vector machine.
As the authors say, SVM are wildly used for several applications.
To improve this method, the authors make the difference between two types of improvements to achieve :
- improving the generalization performance;
- improving the speed in the test phase.
The authors propose and combine two methods to improve SVM performance: the "virtual support vector" method and the "reduced set" method.
With those two improvements, they report a machine much faster (22 times faster) and more precise (1.1% vs. 1.4% error) than the original one.
In this assignment, we will implement the first method to test it.
%% Cell type:markdown id:12aaeba6 tags:
### First part: tests with a vanilla SVM
In this first part, we will use a vanilla SVM on the MNIST dataset with the provided parameters.
We will observe the error of the SVM and the time of the test phase, to compare them with the improved version.
%% Cell type:code id:9f152334 tags:
```
# We will work on the mnist data set
# We load it from fetch_openml
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt
import numpy as np
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
```
%% Cell type:markdown id:855cdb06 tags:
We do some inspection on the dataset:
%% Cell type:code id:708c8ea1 tags:
```
# We print the characteristics of X and y
print(X.shape)
print(y.shape)
# Values taken by y
print(np.unique(y))
image = np.reshape(X[0], (28, 28))
plt.imshow(image, cmap='gray')
```
%% Cell type:markdown id:4e49be54 tags:
The dataset contains 70k samples of 784 features.
The classes are 0 to 9 (the digits on the images).
The features are the pixels of a 28 x 28 image that we can retrieve using numpy's reshape function.
For example, the 1st image is a 5.
%% Cell type:markdown id:60f31892 tags:
With our dataset, we can generate a training dataset and a testing dataset.
As in the article, we will use 60k samples as training samples and 10k as testing.
We split the dataset using the `train_test_split` function from `sklearn`.
%% Cell type:code id:4d3fa1c7 tags:
```
# We divide the data set in two parts: train set and test set
# According to the recommended values the train set's size is 60000 and the test set's size is 10000
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=60000, test_size=10000)
```
%% Cell type:markdown id:d0532cc1 tags:
From the article, we retrieve the parameters of the SVM used.
We get C = 10, and a polynomial kernel of degree 5.
Coefficients `gamma` and `coef0` are equal to 1 and 0, respectively.
We can now train an SVM with these parameters on the training dataset.
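For reference, with these values sklearn's polynomial kernel $K(x, x') = (\gamma \, x \cdot x' + \mathrm{coef0})^{d}$ reduces to $K(x, x') = (x \cdot x')^{5}$.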
%% Cell type:code id:d809fc87 tags:
```
# First, we perform a SVC without preprocessing or improving in terms of accuracy or speed
from sklearn.svm import SVC
# we perform the default SVC, with the hyperparameter C=10 and a polynomial kernel of degree 5
# according to the article
svc = SVC(C=10, kernel='poly', degree=5, gamma=1, coef0=0)
svc.fit(X_train, y_train)
```
%% Cell type:markdown id:a8cf4850 tags:
Using the previously trained SVM, we make a prediction on the test dataset.
One of the performance measures used in this article is the speed of the test phase.
We thus measure it.
%% Cell type:code id:8cb28178 tags:
```
import time
start = time.time()
# We predict the values for our test set
y_pred = svc.predict(X_test)
end = time.time()
print(f'Elapsed time : {end - start}')
```
%% Cell type:markdown id:90f08e8b tags:
Of course, the prediction time varies from one dataset split, machine, or run to another, but we will retain that it is close to 70 s.
Using `y_test`, the true classes of the `X_test` samples, and `y_pred`, the classes predicted by the SVM, we can compute the confusion matrix and the error to see how good the predictions are.
%% Cell type:code id:c1248238 tags:
```
# We compute the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, accuracy_score
disp = ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
disp.figure_.suptitle('Confusion matrix for the vanilla SVM')
plt.show()
```
%% Cell type:code id:ba4e38ac tags:
```
# We print the classification report
print(classification_report(y_test, y_pred))
```
%% Cell type:code id:947b0895 tags:
```
# We print the accuracy of the SVC and the error rate
acc = accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print("Error rate: ", (1-acc) * 100, "%")
```
%% Cell type:markdown id:8f780139 tags:
As earlier, the values are affected by the selection of the training and testing datasets. We will retain a value of 3.4 % for the error.
The method described by the authors relies on the support vectors of the previously trained SVM.
We will thus do some inspection on them before going further.
%% Cell type:code id:81b09df7 tags:
```
s_vects = svc.support_vectors_
print(s_vects.shape)
v = s_vects[0]
v_index = svc.support_[0]
v_class = y_train[v_index]
print(f'Index of the first support vector in X_train: {v_index}')
print(f'Class of the first support vector: {v_class}')
img = np.reshape(v, (28, 28))
plt.imshow(img, cmap='gray')
```
%% Cell type:markdown id:14cb2622 tags:
There are around 7300 support vectors in the SVM.
Each support vector is a sample of `X_train`. We can thus retrieve its class using its index on the train dataset, and display it as an image as above.
%% Cell type:markdown id:fa1d441f tags:
### Implementing the "Virtual Support Vectors" method
We will now implement the "Virtual Support Vectors" as proposed by the authors.
The aim of this method is to add some invariance to the data to make the predictions more robust.
For a given trained SVM, we know that the only data relevant for classification are the support vectors.
We will thus re-train an SVM, but with new data created from the support vectors.
Here, the invariance proposed is shifting the image in one of the four directions.
For each support vector, we will shift the image in the four directions, and use the resulting images as a new dataset.
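As a tiny illustration of the shift we will use (a sketch: `np.roll` wraps pixels that fall off one edge around to the opposite edge, which is mostly harmless here because MNIST digits are surrounded by black margins):
%% Cell type:code tags:
```
# Toy example of the wrap-around shift used below
a = np.array([[1, 2],
              [3, 4]])
print(np.roll(a, 1, axis=1))  # columns shifted right with wrap-around: [[2 1] [4 3]]
```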
%% Cell type:code id:578b9f6c tags:
```
# Retrieve the indexes of the support vectors
sv_indexes = svc.support_
# Arrays for storing the data
X_vsv = []
y_vsv = []
for i in sv_indexes:
    # Get the support vector and reshape it as an image
    sv = X_train[i].reshape((28, 28))
    sv_class = y_train[i]
    # Generate the four shifts (np.roll wraps around the borders) and flatten them back
    sv_1 = np.roll(sv, 1, axis=0).reshape(784)
    sv_2 = np.roll(sv, -1, axis=0).reshape(784)
    sv_3 = np.roll(sv, 1, axis=1).reshape(784)
    sv_4 = np.roll(sv, -1, axis=1).reshape(784)
    # Add them to the dataset, with the class of the original support vector
    X_vsv.extend([sv_1, sv_2, sv_3, sv_4])
    y_vsv.extend([sv_class] * 4)
X_vsv = np.array(X_vsv)
y_vsv = np.array(y_vsv)
print(X_vsv.shape)
print(y_vsv.shape)
# Display the four shifted versions of the first support vector
im0 = X_vsv[0].reshape((28, 28))
im1 = X_vsv[1].reshape((28, 28))
im2 = X_vsv[2].reshape((28, 28))
im3 = X_vsv[3].reshape((28, 28))
print(f'classes: {y_vsv[0]} {y_vsv[1]} {y_vsv[2]} {y_vsv[3]}')
_, axis = plt.subplots(1, 4)
axis[0].imshow(im0, cmap='gray')
axis[1].imshow(im1, cmap='gray')
axis[2].imshow(im2, cmap='gray')
axis[3].imshow(im3, cmap='gray')
```
%% Cell type:markdown id:78ee8e6a tags:
With this new dataset, we can now train an SVM with the same parameters.
%% Cell type:code id:8dc4a410 tags:
```
# note that we can re-fit the same SVC object with the new dataset
svc.fit(X_vsv, y_vsv)
```
%% Cell type:markdown id:396c7337 tags:
Let's inspect the results of the training.
%% Cell type:code id:6c8bfe0d tags:
```
print(svc.support_.shape)
```
%% Cell type:markdown id:43d511a9 tags:
With the "vanilla SVM", we got ~8500 support vectors for a training set of 60000 samples, so ~14 % of the dataset.
With this training, we notice that most of the data was selected (~65 %).
With the "vanilla SVM", we got ~7300 support vectors for a training set of 60000 samples, so ~12 % of the dataset.
With this training, we notice that most of the data was selected (~67 %).
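As a quick sanity check (a sketch; the ~7300 count is the approximate value observed above for the vanilla SVM), we can compute both fractions directly:
%% Cell type:code tags:
```
# Fraction of the original training set selected by the vanilla SVM (approximate count)
n_sv_vanilla = 7300
print(f'vanilla SVM: {n_sv_vanilla / X_train.shape[0]:.0%} of the training set')
# Fraction of the virtual support vector set selected by the re-trained SVM
print(f'VSV SVM: {svc.support_.shape[0] / X_vsv.shape[0]:.0%} of the VSV set')
```
%% Cell type:markdown tags: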
We can now use the trained SVM on the test data to measure the error and the time of the test.
%% Cell type:code id:a196291b tags:
```
start = time.time()
# We predict the values for our test set
y_pred = svc.predict(X_test)
end = time.time()
print(f'Elapsed time : {end - start}')
```
%% Cell type:markdown id:bbda2f32 tags:
The time for the SVM to predict the test dataset is about 200 s, which is much more than for the vanilla SVM.
This was predictable: prediction time grows with the number of support vectors, since each test sample is compared (through the kernel) to every support vector, and this machine has far more of them.
Let's now see the error of the predictions.
%% Cell type:code id:2c99def0 tags:
```
# We compute the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, accuracy_score
disp = ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
disp.figure_.suptitle('Confusion matrix for the virtual support vector SVM')
plt.show()
```
%% Cell type:code id:29d1ff85 tags:
```
# We print the classification report
print(classification_report(y_test, y_pred))
```
%% Cell type:code id:947b0895 tags:
```
# We print the accuracy of the SVC and the error rate
acc = accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print("Error rate: ", (1-acc) * 100, "%")
```
%% Cell type:markdown id:ed73c147 tags:
We can see that the error is smaller than the error of the vanilla SVM (2.2 % vs 3.4 %).
%% Cell type:markdown id:2fc76a50 tags:
### Conclusion
By implementing the "virtual support vectors" technique, we were able to add some invariance to the data.
This modification improved the accuracy of the SVM.
However, most of the new data generated from the support vectors were selected as support vectors by the second machine.
This increase in the number of support vectors led to an increase in computation time during the test phase.
That is why the authors suggest in the article a second technique, which builds a reduced set of vectors to cut the computation time of the test phase.
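For reference, the reduced-set idea we leave aside here (a sketch of the objective, following Burges and Schölkopf) is to approximate the kernel expansion $\Psi = \sum_{i=1}^{N_s} \alpha_i \Phi(s_i)$ given by the $N_s$ support vectors $s_i$ with a shorter expansion $\Psi' = \sum_{k=1}^{N_z} \beta_k \Phi(z_k)$, with $N_z \ll N_s$, choosing the vectors $z_k$ and weights $\beta_k$ to minimize $\|\Psi - \Psi'\|^2$.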