### TP1 machine learning

parent 16e5e0e8
# TP1 Monday 21/01/2019

Today you will get hands-on with Python and a few of its libraries:

* Numpy
* Matplotlib
* Scikit-Learn

Start with the notebook python-numpy-matplotlib.ipynb (about 1 hour, no more), then move on to the notebooks in the machine learning folder, starting with 05.00-Machine-Learning.ipynb.
 { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Part Two: Introduction to Machine Learning\n", "The notebooks that follow are taken from the book \"Python Data Science Handbook\" ( https://github.com/jakevdp/PythonDataScienceHandbook ). The author is one of the biggest contributors to the Pandas project, which you will get to use during your project. His book is a reference, and he provides many notebooks on his GitHub for learning. We have selected a few of them to introduce Machine Learning. Don't worry if you don't understand everything: you will have time to play with scikit-learn and get a better understanding tomorrow with Sylvain Rousseau.\n", "\n", "Work through the notebook, taking care to understand the explanations and the code (when there is any). If you need explanations, don't hesitate to talk to the tutors and/or ask your questions on Slack! The tutors are not necessarily experts and won't be able to answer all your questions, but thinking things through with them and the other students is beneficial!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In many ways, machine learning is the primary means by which data science manifests itself to the broader world.\n", "Machine learning is where these computational and algorithmic skills of data science meet the statistical thinking of data science, and the result is a collection of approaches to inference and data exploration that are not about effective theory so much as effective computation.\n", "\n", "The term \"machine learning\" is sometimes thrown around as if it is some kind of magic pill: *apply machine learning to your data, and all your problems will be solved!*\n", "As you might expect, the reality is rarely this simple.\n", "While these methods can be incredibly powerful, to be effective they must be approached with a firm grasp of the strengths and weaknesses of each method, as well as a grasp of general concepts such as bias and variance, overfitting and underfitting, and more.\n", "\n", "This chapter will dive into practical aspects of machine learning, primarily using Python's [Scikit-Learn](http://scikit-learn.org) package.\n", "This is not meant to be a comprehensive introduction to the field of machine learning; that is a large subject and necessitates a more technical approach than we take here. Rather, the goals of this chapter are:\n", "\n", "- To introduce the fundamental vocabulary and concepts of machine learning.\n", "- To introduce the Scikit-Learn API and show some examples of its use.\n", "- To take a deeper dive into the details of several of the most important machine learning approaches, and develop an intuition into how they work and when and where they are applicable." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Go to the next notebook** \"05.01-What-Is-Machine-Learning\"" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" }, "toc": { "colors": { "hover_highlight": "#DAA520", "navigate_num": "#000000", "navigate_text": "#333333", "running_highlight": "#FF0000", "selected_highlight": "#FFD700", "sidebar_border": "#EEEEEE", "wrapper_background": "#FFFFFF" }, "moveMenuLeft": true, "nav_menu": { "height": "48px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 1 }
 { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# What Is Machine Learning?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we take a look at the details of various machine learning methods, let's start by looking at what machine learning is, and what it isn't.\n", "Machine learning is often categorized as a subfield of artificial intelligence, but I find that categorization can often be misleading at first brush.\n", "The study of machine learning certainly arose from research in this context, but in the data science application of machine learning methods, it's more helpful to think of machine learning as a means of *building models of data*.\n", "\n", "Fundamentally, machine learning involves building mathematical models to help understand data.\n", "\"Learning\" enters the fray when we give these models *tunable parameters* that can be adapted to observed data; in this way the program can be considered to be \"learning\" from the data.\n", "Once these models have been fit to previously seen data, they can be used to predict and understand aspects of newly observed data.\n", "I'll leave to the reader the more philosophical digression regarding the extent to which this type of mathematical, model-based \"learning\" is similar to the \"learning\" exhibited by the human brain.\n", "\n", "Understanding the problem setting in machine learning is essential to using these tools effectively, and so we will start with some broad categorizations of the types of approaches we'll discuss here." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categories of Machine Learning\n", "\n", "At the most fundamental level, machine learning can be categorized into two main types: supervised learning and unsupervised learning.\n", "\n", "*Supervised learning* involves modeling the relationship between measured features of data and some label associated with the data; once this model is determined, it can be used to apply labels to new, unknown data.\n", "This is further subdivided into *classification* tasks and *regression* tasks: in classification, the labels are discrete categories, while in regression, the labels are continuous quantities.\n", "We will see examples of both types of supervised learning in the following section.\n", "\n", "*Unsupervised learning* involves modeling the features of a dataset without reference to any label, and is often described as \"letting the dataset speak for itself.\"\n", "These models include tasks such as *clustering* and *dimensionality reduction.*\n", "Clustering algorithms identify distinct groups of data, while dimensionality reduction algorithms search for more succinct representations of the data.\n", "We will see an example of clustering in the following section.\n", "\n", "In addition, there are so-called *semi-supervised learning* methods, which fall somewhere between supervised learning and unsupervised learning.\n", "Semi-supervised learning methods are often useful when only incomplete labels are available." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Qualitative Examples of Machine Learning Applications\n", "\n", "To make these ideas more concrete, let's take a look at a few very simple examples of a machine learning task.\n", "These examples are meant to give an intuitive, non-quantitative overview of the types of machine learning tasks we will be looking at in this chapter.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classification: Predicting discrete labels\n", "\n", "We will first take a look at a simple *classification* task, in which you are given a set of labeled points and want to use these to classify some unlabeled points.\n", "\n", "Imagine that we have the data shown in this figure:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](figures/05.01-classification-1.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we have two-dimensional data: that is, we have two *features* for each point, represented by the *(x,y)* positions of the points on the plane.\n", "In addition, we have one of two *class labels* for each point, here represented by the colors of the points.\n", "From these features and labels, we would like to create a model that will let us decide whether a new point should be labeled \"blue\" or \"red.\"\n", "\n", "There are a number of possible models for such a classification task, but here we will use an extremely simple one. 
We will make the assumption that the two groups can be separated by drawing a straight line through the plane between them, such that points on each side of the line fall in the same group.\n", "Here the *model* is a quantitative version of the statement \"a straight line separates the classes\", while the *model parameters* are the particular numbers describing the location and orientation of that line for our data.\n", "The optimal values for these model parameters are learned from the data (this is the \"learning\" in machine learning), which is often called *training the model*.\n", "\n", "The following figure shows a visual representation of what the trained model looks like for this data:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](figures/05.01-classification-2.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that this model has been trained, it can be generalized to new, unlabeled data.\n", "In other words, we can take a new set of data, draw this model line through it, and assign labels to the new points based on this model.\n", "This stage is usually called *prediction*. See the following figure:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](figures/05.01-classification-3.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the basic idea of a classification task in machine learning, where \"classification\" indicates that the data has discrete class labels.\n", "At first glance this may look fairly trivial: it would be relatively easy to simply look at this data and draw such a discriminatory line to accomplish this classification.\n", "A benefit of the machine learning approach, however, is that it can generalize to much larger datasets in many more dimensions.\n", "\n", "For example, this is similar to the task of automated spam detection for email; in this case, we might use the following features and labels:\n", "\n", "- *feature 1*, *feature 2*, etc. 
$\\to$ normalized counts of important words or phrases (\"Viagra\", \"Nigerian prince\", etc.)\n", "- *label* $\\to$ \"spam\" or \"not spam\"\n", "\n", "For the training set, these labels might be determined by individual inspection of a small representative sample of emails; for the remaining emails, the label would be determined using the model.\n", "For a suitably trained classification algorithm with enough well-constructed features (typically thousands or millions of words or phrases), this type of approach can be very effective." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regression: Predicting continuous labels\n", "\n", "In contrast with the discrete labels of a classification algorithm, we will next look at a simple *regression* task in which the labels are continuous quantities.\n", "\n", "Consider the data shown in the following figure, which consists of a set of points each with a continuous label:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](figures/05.01-regression-1.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As with the classification example, we have two-dimensional data: that is, there are two features describing each data point.\n", "The color of each point represents the continuous label for that point.\n", "\n", "There are a number of possible regression models we might use for this type of data, but here we will use a simple linear regression to predict the points.\n", "This simple linear regression model assumes that if we treat the label as a third spatial dimension, we can fit a plane to the data.\n", "This is a higher-level generalization of the well-known problem of fitting a line to data with two coordinates.\n", "\n", "We can visualize this setup as shown in the following figure:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](figures/05.01-regression-2.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the *feature 1-feature 
2* plane here is the same as in the two-dimensional plot from before; in this case, however, we have represented the labels by both color and three-dimensional axis position.\n", "From this view, it seems reasonable that fitting a plane through this three-dimensional data would allow us to predict the expected label for any set of input parameters.\n", "Returning to the two-dimensional projection, when we fit such a plane we get the result shown in the following figure:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](figures/05.01-regression-3.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This plane of fit gives us what we need to predict labels for new points.\n", "Visually, we find the results shown in the following figure:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](figures/05.01-regression-4.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As with the classification example, this may seem rather trivial in a low number of dimensions.\n", "But the power of these methods is that they can be straightforwardly applied and evaluated in the case of data with many, many features.\n", "\n", "For example, this is similar to the task of computing the distance to galaxies observed through a telescope—in this case, we might use the following features and labels:\n", "\n", "- *feature 1*, *feature 2*, etc. 
$\\to$ brightness of each galaxy at one of several wavelengths or colors\n", "- *label* $\\to$ distance or redshift of the galaxy\n", "\n", "The distances for a small number of these galaxies might be determined through an independent set of (typically more expensive) observations.\n", "Distances to remaining galaxies could then be estimated using a suitable regression model, without the need to employ the more expensive observation across the entire set.\n", "In astronomy circles, this is known as the \"photometric redshift\" problem.\n", "\n", "Some important regression algorithms that we will discuss are linear regression, support vector machines, and random forest regression." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clustering: Inferring labels on unlabeled data\n", "\n", "The classification and regression illustrations we just looked at are examples of supervised learning algorithms, in which we are trying to build a model that will predict labels for new data.\n", "Unsupervised learning involves models that describe data without reference to any known labels.\n", "\n", "One common case of unsupervised learning is \"clustering,\" in which data is automatically assigned to some number of discrete groups.\n", "For example, we might have some two-dimensional data like that shown in the following figure:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](figures/05.01-clustering-1.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By eye, it is clear that each of these points is part of a distinct group.\n", "Given this input, a clustering model will use the intrinsic structure of the data to determine which points are related.\n", "Using the very fast and intuitive *k*-means algorithm, we find the clusters shown in the following figure:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](figures/05.01-clustering-2.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*k*-means 
fits a model consisting of *k* cluster centers; the optimal centers are assumed to be those that minimize the distance of each point from its assigned center.\n", "Again, this might seem like a trivial exercise in two dimensions, but as our data becomes larger and more complex, such clustering algorithms can be employed to extract useful information from the dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "Here we have seen a few simple examples of some of the basic types of machine learning approaches.\n", "Needless to say, there are a number of important practical details that we have glossed over, but I hope this section was enough to give you a basic idea of what types of problems machine learning approaches can solve.\n", "\n", "In short, we saw the following:\n", "\n", "- *Supervised learning*: Models that can predict labels based on labeled training data\n", "\n", " - *Classification*: Models that predict labels as two or more discrete categories\n", " - *Regression*: Models that predict continuous labels\n", " \n", "- *Unsupervised learning*: Models that identify structure in unlabeled data\n", "\n", " - *Clustering*: Models that detect and identify distinct groups in the data\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Go to the next notebook** \"05.02-Introducing-Scikit-Learn\"" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" }, "toc": { "colors": { "hover_highlight": "#DAA520", "navigate_num": "#000000", "navigate_text": "#333333", "running_highlight": "#FF0000", "selected_highlight": "#FFD700", "sidebar_border": "#EEEEEE", "wrapper_background": "#FFFFFF" }, "moveMenuLeft": true, 
"nav_menu": { "height": "138px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 1 }
 { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introducing Scikit-Learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several Python libraries which provide solid implementations of a range of machine learning algorithms.\n", "One of the best known is [Scikit-Learn](http://scikit-learn.org), a package that provides efficient versions of a large number of common algorithms _(and it's French!)_.\n", "Scikit-Learn is characterized by a clean, uniform, and streamlined API, as well as by very useful and complete online documentation.\n", "A benefit of this uniformity is that once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is very straightforward.\n", "\n", "This section provides an overview of the Scikit-Learn API; a solid understanding of these API elements will form the foundation for understanding the deeper practical discussion of machine learning algorithms and approaches in the following chapters.\n", "\n", "We will start by covering *data representation* in Scikit-Learn, followed by covering the *Estimator* API, and finally go through a more interesting example of using these tools for exploring a set of images of hand-written digits." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Representation in Scikit-Learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Machine learning is about creating models from data: for that reason, we'll start by discussing how data can be represented in order to be understood by the computer.\n", "The best way to think about data within Scikit-Learn is in terms of tables of data." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data as table\n", "\n", "A basic table is a two-dimensional grid of data, in which the rows represent individual elements of the dataset, and the columns represent quantities related to each of these elements.\n", "For example, consider the [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), famously analyzed by Ronald Fisher in 1936.\n", "We can download this dataset in the form of a Pandas DataFrame using the [seaborn](http://seaborn.pydata.org/) library:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", " <thead>\n", " <tr><th></th><th>sepal_length</th><th>sepal_width</th><th>petal_length</th><th>petal_width</th><th>species</th></tr>\n", " </thead>\n", " <tbody>\n",
 " <tr><th>0</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>\n",
 " <tr><th>1</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>\n",
 " <tr><th>2</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td><td>setosa</td></tr>\n",
 " <tr><th>3</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>setosa</td></tr>\n",
 " <tr><th>4</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>setosa</td></tr>\n",
 " </tbody>\n", "</table>" ], "text/plain": [ "   sepal_length  sepal_width  petal_length  petal_width species\n", "0           5.1          3.5           1.4          0.2  setosa\n", "1           4.9          3.0           1.4          0.2  setosa\n", "2           4.7          3.2           1.3          0.2  setosa\n", "3           4.6          3.1           1.5          0.2  setosa\n", "4           5.0          3.6           1.4          0.2  setosa" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import seaborn as sns\n", "iris = sns.load_dataset('iris')\n", "iris.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here each row of the data refers to a single observed flower, and the number of rows is the total number of flowers in the dataset.\n", "In general, we will refer to the rows of the matrix as *samples*, and the number of rows as `n_samples`.\n", "\n", "Likewise, each column of the data refers to a particular quantitative piece of information that describes each sample.\n", "In general, we will refer to the columns of the matrix as *features*, and the number of columns as `n_features`."