{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Supervised Learning Lab: Classification / Discrimination\n", "\n", "In this lab we do classification (also called discrimination), meaning that the \"true\" class label of each sample is known.\n", "\n", "We will use the Breast cancer dataset (classification).\n", "\n", "A description of this dataset is available at https://scikit-learn.org/stable/datasets/index.html#breast-cancer-wisconsin-diagnostic-dataset. Take a look to understand the problem.\n", "\n", "Import this morning's libraries: numpy and the scikit-learn datasets module." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn import datasets\n", "from matplotlib import pyplot as plt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "breast_cancer = datasets.load_breast_cancer()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "X = breast_cancer.data\n", "y = breast_cancer.target\n", "feature_names = breast_cancer.feature_names" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(569, 30) (569,)\n", "[0 0 0 0 0]\n", "['mean radius' 'mean texture' 'mean perimeter' 'mean area'\n", " 'mean smoothness' 'mean compactness' 'mean concavity'\n", " 'mean concave points' 'mean symmetry' 'mean fractal dimension'\n", " 'radius error' 'texture error' 'perimeter error' 'area error'\n", " 'smoothness error' 'compactness error' 'concavity error'\n", " 'concave points error' 'symmetry error' 'fractal dimension error'\n", " 'worst radius' 'worst texture' 'worst perimeter' 'worst area'\n", " 'worst smoothness' 'worst compactness' 'worst concavity'\n", " 'worst concave points' 'worst symmetry' 'worst fractal dimension']\n" ] } ], "source": [ "print(X.shape, y.shape)\n", 
"print(y[:5])\n", "print(feature_names)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2.057e+01, 1.777e+01, 1.329e+02, 1.326e+03, 8.474e-02])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X[1][:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the data with datasets.load_breast_cancer. What does this function return? Store the data and the labels in variables named X and y, respectively." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Preparing the dataset\n", "To train our algorithms, we split the dataset into 3 subsets: \n", "- train\n", "- validation\n", "- test\n", "\n", "Why is this necessary?\n", "\n", "To do this, use the scikit-learn function sklearn.model_selection.train_test_split. Import it and apply it to our data.\n", "\n", "We call train_test_split twice to split the data in two stages: first into train+validation versus test, then into train versus validation." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "X_tv,X_test, y_tv,y_test = train_test_split(X,y,test_size=.2, random_state=42)\n", "X_train,X_validation,y_train,y_validation = train_test_split(X_tv,y_tv,test_size=.25,random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# K-NNs\n", "We will run k-NN on this dataset. Try K = 1, then K = n (n is the number of samples). Observe the behavior in $R^2$. Comment." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier \n", "from sklearn.metrics import confusion_matrix, accuracy_score" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 0.9298245614035088\n", "[[40 4]\n", " [ 4 66]]\n", "4 0.9210526315789473\n", "[[40 4]\n", " [ 5 65]]\n", "7 0.9385964912280702\n", "[[40 4]\n", " [ 3 67]]\n", "10 0.9385964912280702\n", "[[40 4]\n", " [ 3 67]]\n", "13 0.9298245614035088\n", "[[39 5]\n", " [ 3 67]]\n", "16 0.9210526315789473\n", "[[39 5]\n", " [ 4 66]]\n", "19 0.9298245614035088\n", "[[39 5]\n", " [ 3 67]]\n" ] } ], "source": [ "# hyperparameter\n", "K_max = 20\n", "for K in range(1,K_max,3):\n", "    # declare classifier with hyperparameters\n", "    knn = KNeighborsClassifier(n_neighbors=K)\n", "    # train (aka fit) the classifier on the train dataset\n", "    knn.fit(X_train,y_train)\n", "    # predict the validation dataset\n", "    y_validation_hat = knn.predict(X_validation)\n", "    # check the result\n", "    print(K,accuracy_score(y_pred=y_validation_hat,y_true=y_validation))\n", "    print(confusion_matrix(y_pred=y_validation_hat,y_true=y_validation))\n", "    # Now, adjust hyperparameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How should we choose K? Try different values of K and look at the results.\n", "\n", "Our goal is to minimize the error rate. We plot 1 - accuracy as a function of K and choose the K with the lowest error rate:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "