From 76dcfc431e6f6315876729f9c925a8ca2bd5c5f4 Mon Sep 17 00:00:00 2001 From: adrinjalali Date: Fri, 24 May 2024 20:26:56 +0200 Subject: [PATCH 01/11] DOC remove tutorials --- doc/tutorial/basic/tutorial.rst | 351 ----------- doc/tutorial/common_includes/info.txt | 3 - doc/tutorial/index.rst | 26 - doc/tutorial/machine_learning_map/README.md | 17 - doc/tutorial/machine_learning_map/index.rst | 75 --- doc/tutorial/statistical_inference/index.rst | 34 - .../statistical_inference/model_selection.rst | 315 ---------- .../putting_together.rst | 62 -- .../statistical_inference/settings.rst | 92 --- .../supervised_learning.rst | 528 ---------------- .../unsupervised_learning.rst | 297 --------- doc/tutorial/text_analytics/.gitignore | 25 - .../data/languages/fetch_data.py | 103 --- .../data/movie_reviews/fetch_data.py | 33 - .../exercise_01_language_train_model.py | 62 -- .../skeletons/exercise_02_sentiment.py | 63 -- .../exercise_01_language_train_model.py | 70 --- .../solutions/exercise_02_sentiment.py | 79 --- .../solutions/generate_skeletons.py | 38 -- .../text_analytics/working_with_text_data.rst | 586 ------------------ 20 files changed, 2859 deletions(-) delete mode 100644 doc/tutorial/basic/tutorial.rst delete mode 100644 doc/tutorial/common_includes/info.txt delete mode 100644 doc/tutorial/index.rst delete mode 100644 doc/tutorial/machine_learning_map/README.md delete mode 100644 doc/tutorial/machine_learning_map/index.rst delete mode 100644 doc/tutorial/statistical_inference/index.rst delete mode 100644 doc/tutorial/statistical_inference/model_selection.rst delete mode 100644 doc/tutorial/statistical_inference/putting_together.rst delete mode 100644 doc/tutorial/statistical_inference/settings.rst delete mode 100644 doc/tutorial/statistical_inference/supervised_learning.rst delete mode 100644 doc/tutorial/statistical_inference/unsupervised_learning.rst delete mode 100644 doc/tutorial/text_analytics/.gitignore delete mode 100644 doc/tutorial/text_analytics/data/languages/fetch_data.py delete mode 100644 doc/tutorial/text_analytics/data/movie_reviews/fetch_data.py delete mode 100644 doc/tutorial/text_analytics/skeletons/exercise_01_language_train_model.py delete mode 100644 doc/tutorial/text_analytics/skeletons/exercise_02_sentiment.py delete mode 100644 doc/tutorial/text_analytics/solutions/exercise_01_language_train_model.py delete mode 100644 doc/tutorial/text_analytics/solutions/exercise_02_sentiment.py delete mode 100644 doc/tutorial/text_analytics/solutions/generate_skeletons.py delete mode 100644 doc/tutorial/text_analytics/working_with_text_data.rst diff --git a/doc/tutorial/basic/tutorial.rst b/doc/tutorial/basic/tutorial.rst deleted file mode 100644 index 27dddb4e0e909..0000000000000 --- a/doc/tutorial/basic/tutorial.rst +++ /dev/null @@ -1,351 +0,0 @@ -.. _introduction: - -An introduction to machine learning with scikit-learn -===================================================== - -.. topic:: Section contents - - In this section, we introduce the `machine learning - `_ - vocabulary that we use throughout scikit-learn and give a - simple learning example. - - -Machine learning: the problem setting -------------------------------------- - -In general, a learning problem considers a set of n -`samples `_ of -data and then tries to predict properties of unknown data. If each sample is -more than a single number and, for instance, a multi-dimensional entry -(aka `multivariate `_ -data), it is said to have several attributes or **features**. 
- -Learning problems fall into a few categories: - -* `supervised learning `_, - in which the data comes with additional attributes that we want to predict - (:ref:`Click here ` - to go to the scikit-learn supervised learning page).This problem - can be either: - - * `classification - `_: - samples belong to two or more classes and we - want to learn from already labeled data how to predict the class - of unlabeled data. An example of a classification problem would - be handwritten digit recognition, in which the aim is - to assign each input vector to one of a finite number of discrete - categories. Another way to think of classification is as a discrete - (as opposed to continuous) form of supervised learning where one has a - limited number of categories and for each of the n samples provided, - one is to try to label them with the correct category or class. - - * `regression `_: - if the desired output consists of one or more - continuous variables, then the task is called *regression*. An - example of a regression problem would be the prediction of the - length of a salmon as a function of its age and weight. - -* `unsupervised learning `_, - in which the training data consists of a set of input vectors x - without any corresponding target values. The goal in such problems - may be to discover groups of similar examples within the data, where - it is called `clustering `_, - or to determine the distribution of data within the input space, known as - `density estimation `_, or - to project the data from a high-dimensional space down to two or three - dimensions for the purpose of *visualization* - (:ref:`Click here ` - to go to the Scikit-Learn unsupervised learning page). - -.. topic:: Training set and testing set - - Machine learning is about learning some properties of a data set - and then testing those properties against another data set. A common - practice in machine learning is to evaluate an algorithm by splitting a data - set into two. We call one of those sets the **training set**, on which we - learn some properties; we call the other set the **testing set**, on which - we test the learned properties. - - -.. _loading_example_dataset: - -Loading an example dataset --------------------------- - -`scikit-learn` comes with a few standard datasets, for instance the -`iris `_ and `digits -`_ -datasets for classification and the `diabetes dataset -`_ for regression. - -In the following, we start a Python interpreter from our shell and then -load the ``iris`` and ``digits`` datasets. Our notational convention is that -``$`` denotes the shell prompt while ``>>>`` denotes the Python -interpreter prompt:: - - $ python - >>> from sklearn import datasets - >>> iris = datasets.load_iris() - >>> digits = datasets.load_digits() - -A dataset is a dictionary-like object that holds all the data and some -metadata about the data. This data is stored in the ``.data`` member, -which is a ``n_samples, n_features`` array. In the case of supervised -problems, one or more response variables are stored in the ``.target`` member. More -details on the different datasets can be found in the :ref:`dedicated -section `. - -For instance, in the case of the digits dataset, ``digits.data`` gives -access to the features that can be used to classify the digits samples:: - - >>> print(digits.data) - [[ 0. 0. 5. ... 0. 0. 0.] - [ 0. 0. 0. ... 10. 0. 0.] - [ 0. 0. 0. ... 16. 9. 0.] - ... - [ 0. 0. 1. ... 6. 0. 0.] - [ 0. 0. 2. ... 12. 0. 0.] - [ 0. 0. 10. ... 12. 1. 
0.]] - -and ``digits.target`` gives the ground truth for the digit dataset, that -is the number corresponding to each digit image that we are trying to -learn:: - - >>> digits.target - array([0, 1, 2, ..., 8, 9, 8]) - -.. topic:: Shape of the data arrays - - The data is always a 2D array, shape ``(n_samples, n_features)``, although - the original data may have had a different shape. In the case of the - digits, each original sample is an image of shape ``(8, 8)`` and can be - accessed using:: - - >>> digits.images[0] - array([[ 0., 0., 5., 13., 9., 1., 0., 0.], - [ 0., 0., 13., 15., 10., 15., 5., 0.], - [ 0., 3., 15., 2., 0., 11., 8., 0.], - [ 0., 4., 12., 0., 0., 8., 8., 0.], - [ 0., 5., 8., 0., 0., 9., 8., 0.], - [ 0., 4., 11., 0., 1., 12., 7., 0.], - [ 0., 2., 14., 5., 10., 12., 0., 0.], - [ 0., 0., 6., 13., 10., 0., 0., 0.]]) - - The :ref:`simple example on this dataset - ` illustrates how starting - from the original problem one can shape the data for consumption in - scikit-learn. - -.. topic:: Loading from external datasets - - To load from an external dataset, please refer to :ref:`loading external datasets `. - -Learning and predicting ------------------------- - -In the case of the digits dataset, the task is to predict, given an image, -which digit it represents. We are given samples of each of the 10 -possible classes (the digits zero through nine) on which we *fit* an -`estimator `_ to be able to *predict* -the classes to which unseen samples belong. - -In scikit-learn, an estimator for classification is a Python object that -implements the methods ``fit(X, y)`` and ``predict(T)``. - -An example of an estimator is the class ``sklearn.svm.SVC``, which -implements `support vector classification -`_. The -estimator's constructor takes as arguments the model's parameters. - -For now, we will consider the estimator as a black box:: - - >>> from sklearn import svm - >>> clf = svm.SVC(gamma=0.001, C=100.) - -.. topic:: Choosing the parameters of the model - - In this example, we set the value of ``gamma`` manually. - To find good values for these parameters, we can use tools - such as :ref:`grid search ` and :ref:`cross validation - `. - -The ``clf`` (for classifier) estimator instance is first -fitted to the model; that is, it must *learn* from the model. This is -done by passing our training set to the ``fit`` method. For the training -set, we'll use all the images from our dataset, except for the last -image, which we'll reserve for our predicting. We select the training set with -the ``[:-1]`` Python syntax, which produces a new array that contains all but -the last item from ``digits.data``:: - - >>> clf.fit(digits.data[:-1], digits.target[:-1]) - SVC(C=100.0, gamma=0.001) - -Now you can *predict* new values. In this case, you'll predict using the last -image from ``digits.data``. By predicting, you'll determine the image from the -training set that best matches the last image. - - - >>> clf.predict(digits.data[-1:]) - array([8]) - -The corresponding image is: - -.. image:: /auto_examples/datasets/images/sphx_glr_plot_digits_last_image_001.png - :target: ../../auto_examples/datasets/plot_digits_last_image.html - :align: center - :scale: 50 - -As you can see, it is a challenging task: after all, the images are of poor -resolution. Do you agree with the classifier? - -A complete example of this classification problem is available as an -example that you can run and study: -:ref:`sphx_glr_auto_examples_classification_plot_digits_classification.py`. 
- -Conventions ------------ - -scikit-learn estimators follow certain rules to make their behavior more -predictive. These are described in more detail in the :ref:`glossary`. - -Type casting -~~~~~~~~~~~~ - -Where possible, input of type ``float32`` will maintain its data type. Otherwise -input will be cast to ``float64``:: - - >>> import numpy as np - >>> from sklearn import kernel_approximation - - >>> rng = np.random.RandomState(0) - >>> X = rng.rand(10, 2000) - >>> X = np.array(X, dtype='float32') - >>> X.dtype - dtype('float32') - - >>> transformer = kernel_approximation.RBFSampler() - >>> X_new = transformer.fit_transform(X) - >>> X_new.dtype - dtype('float32') - -In this example, ``X`` is ``float32``, and is unchanged by ``fit_transform(X)``. - -Using `float32`-typed training (or testing) data is often more -efficient than using the usual ``float64`` ``dtype``: it allows to -reduce the memory usage and sometimes also reduces processing time -by leveraging the vector instructions of the CPU. However it can -sometimes lead to numerical stability problems causing the algorithm -to be more sensitive to the scale of the values and :ref:`require -adequate preprocessing`. - -Keep in mind however that not all scikit-learn estimators attempt to -work in `float32` mode. For instance, some transformers will always -cast their input to `float64` and return `float64` transformed -values as a result. - -Regression targets are cast to ``float64`` and classification targets are -maintained:: - - >>> from sklearn import datasets - >>> from sklearn.svm import SVC - >>> iris = datasets.load_iris() - >>> clf = SVC() - >>> clf.fit(iris.data, iris.target) - SVC() - - >>> list(clf.predict(iris.data[:3])) - [0, 0, 0] - - >>> clf.fit(iris.data, iris.target_names[iris.target]) - SVC() - - >>> list(clf.predict(iris.data[:3])) - ['setosa', 'setosa', 'setosa'] - -Here, the first ``predict()`` returns an integer array, since ``iris.target`` -(an integer array) was used in ``fit``. The second ``predict()`` returns a string -array, since ``iris.target_names`` was for fitting. - -Refitting and updating parameters -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Hyper-parameters of an estimator can be updated after it has been constructed -via the :term:`set_params()` method. Calling ``fit()`` more than -once will overwrite what was learned by any previous ``fit()``:: - - >>> import numpy as np - >>> from sklearn.datasets import load_iris - >>> from sklearn.svm import SVC - >>> X, y = load_iris(return_X_y=True) - - >>> clf = SVC() - >>> clf.set_params(kernel='linear').fit(X, y) - SVC(kernel='linear') - >>> clf.predict(X[:5]) - array([0, 0, 0, 0, 0]) - - >>> clf.set_params(kernel='rbf').fit(X, y) - SVC() - >>> clf.predict(X[:5]) - array([0, 0, 0, 0, 0]) - -Here, the default kernel ``rbf`` is first changed to ``linear`` via -:func:`SVC.set_params()` after the estimator has -been constructed, and changed back to ``rbf`` to refit the estimator and to -make a second prediction. - -Multiclass vs. 
multilabel fitting -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -When using :class:`multiclass classifiers `, -the learning and prediction task that is performed is dependent on the format of -the target data fit upon:: - - >>> from sklearn.svm import SVC - >>> from sklearn.multiclass import OneVsRestClassifier - >>> from sklearn.preprocessing import LabelBinarizer - - >>> X = [[1, 2], [2, 4], [4, 5], [3, 2], [3, 1]] - >>> y = [0, 0, 1, 1, 2] - - >>> classif = OneVsRestClassifier(estimator=SVC(random_state=0)) - >>> classif.fit(X, y).predict(X) - array([0, 0, 1, 1, 2]) - -In the above case, the classifier is fit on a 1d array of multiclass labels and -the ``predict()`` method therefore provides corresponding multiclass predictions. -It is also possible to fit upon a 2d array of binary label indicators:: - - >>> y = LabelBinarizer().fit_transform(y) - >>> classif.fit(X, y).predict(X) - array([[1, 0, 0], - [1, 0, 0], - [0, 1, 0], - [0, 0, 0], - [0, 0, 0]]) - -Here, the classifier is ``fit()`` on a 2d binary label representation of ``y``, -using the :class:`LabelBinarizer `. -In this case ``predict()`` returns a 2d array representing the corresponding -multilabel predictions. - -Note that the fourth and fifth instances returned all zeroes, indicating that -they matched none of the three labels ``fit`` upon. With multilabel outputs, it -is similarly possible for an instance to be assigned multiple labels:: - - >>> from sklearn.preprocessing import MultiLabelBinarizer - >>> y = [[0, 1], [0, 2], [1, 3], [0, 2, 3], [2, 4]] - >>> y = MultiLabelBinarizer().fit_transform(y) - >>> classif.fit(X, y).predict(X) - array([[1, 1, 0, 0, 0], - [1, 0, 1, 0, 0], - [0, 1, 0, 1, 0], - [1, 0, 1, 0, 0], - [1, 0, 1, 0, 0]]) - -In this case, the classifier is fit upon instances each assigned multiple labels. -The :class:`MultiLabelBinarizer ` is -used to binarize the 2d array of multilabels to ``fit`` upon. As a result, -``predict()`` returns a 2d array with multiple predicted labels for each instance. diff --git a/doc/tutorial/common_includes/info.txt b/doc/tutorial/common_includes/info.txt deleted file mode 100644 index f8e44fec90f2f..0000000000000 --- a/doc/tutorial/common_includes/info.txt +++ /dev/null @@ -1,3 +0,0 @@ -Meant to share common RST file snippets that we want to reuse by inclusion -in the real tutorial in order to lower the maintenance burden -of redundant sections. diff --git a/doc/tutorial/index.rst b/doc/tutorial/index.rst deleted file mode 100644 index bd4b8997f5f39..0000000000000 --- a/doc/tutorial/index.rst +++ /dev/null @@ -1,26 +0,0 @@ -.. _tutorial_menu: - -====================== -scikit-learn Tutorials -====================== - -.. toctree:: - :maxdepth: 2 - - basic/tutorial.rst - statistical_inference/index.rst - text_analytics/working_with_text_data.rst - machine_learning_map/index - ../presentations - -.. note:: **Doctest Mode** - - The code-examples in the above tutorials are written in a - *python-console* format. If you wish to easily execute these examples - in **IPython**, use:: - - %doctest_mode - - in the IPython-console. You can then simply copy and paste the examples - directly into IPython without having to worry about removing the **>>>** - manually. 
diff --git a/doc/tutorial/machine_learning_map/README.md b/doc/tutorial/machine_learning_map/README.md deleted file mode 100644 index 006b1e5e1a38c..0000000000000 --- a/doc/tutorial/machine_learning_map/README.md +++ /dev/null @@ -1,17 +0,0 @@ -The scikit-learn machine learning cheat sheet was originally created by Andreas Mueller: -https://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html - -The current version of the chart is located at `doc/images/ml_map.svg` in SVG+XML -format, created using [draw.io](https://draw.io/). To edit the chart, open the file in -draw.io, make changes, and export as SVG with the same filename. Export configurations -are: - -- Zoom: 100% -- Border width: 15 -- Size: Diagram -- Transparent Background: False -- Appearance: Light - -Each node in the chart that contains an estimator should have a link, where the root -directory is at `../../`. Note that after exporting the SVG, the links may be prefixed -with e.g. `https://app.diagrams.net/`. Remember to check and remove them. diff --git a/doc/tutorial/machine_learning_map/index.rst b/doc/tutorial/machine_learning_map/index.rst deleted file mode 100644 index 5fd6879563489..0000000000000 --- a/doc/tutorial/machine_learning_map/index.rst +++ /dev/null @@ -1,75 +0,0 @@ -:html_theme.sidebar_secondary.remove: - -.. _ml_map: - -Choosing the right estimator -============================ - -Often the hardest part of solving a machine learning problem can be finding the right -estimator for the job. Different estimators are better suited for different types of -data and different problems. - -The flowchart below is designed to give users a bit of a rough guide on how to approach -problems with regard to which estimators to try on your data. Click on any estimator in -the chart below to see its documentation. Use scroll wheel to zoom in and out, and click -and drag to pan around. You can also download the chart: -:download:`ml_map.svg <../../images/ml_map.svg>`. - -.. raw:: html - - - - - - -
- -.. raw:: html - :file: ../../images/ml_map.svg - -.. raw:: html - -
diff --git a/doc/tutorial/statistical_inference/index.rst b/doc/tutorial/statistical_inference/index.rst deleted file mode 100644 index 358bf16512254..0000000000000 --- a/doc/tutorial/statistical_inference/index.rst +++ /dev/null @@ -1,34 +0,0 @@ -.. _stat_learn_tut_index: - -========================================================================== -A tutorial on statistical-learning for scientific data processing -========================================================================== - -.. topic:: Statistical learning - - `Machine learning `_ is - a technique with a growing importance, as the - size of the datasets experimental sciences are facing is rapidly - growing. Problems it tackles range from building a prediction function - linking different observations, to classifying observations, or - learning the structure in an unlabeled dataset. - - This tutorial will explore *statistical learning*, the use of - machine learning techniques with the goal of `statistical inference - `_: - drawing conclusions on the data at hand. - - Scikit-learn is a Python module integrating classic machine - learning algorithms in the tightly-knit world of scientific Python - packages (`NumPy `_, `SciPy - `_, `matplotlib - `_). - -.. toctree:: - :maxdepth: 2 - - settings - supervised_learning - model_selection - unsupervised_learning - putting_together diff --git a/doc/tutorial/statistical_inference/model_selection.rst b/doc/tutorial/statistical_inference/model_selection.rst deleted file mode 100644 index 7d7d5f69f18c4..0000000000000 --- a/doc/tutorial/statistical_inference/model_selection.rst +++ /dev/null @@ -1,315 +0,0 @@ -.. _model_selection_tut: - -============================================================ -Model selection: choosing estimators and their parameters -============================================================ - -Score, and cross-validated scores -================================== - -As we have seen, every estimator exposes a ``score`` method that can judge -the quality of the fit (or the prediction) on new data. **Bigger is -better**. - -:: - - >>> from sklearn import datasets, svm - >>> X_digits, y_digits = datasets.load_digits(return_X_y=True) - >>> svc = svm.SVC(C=1, kernel='linear') - >>> svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:]) - 0.98 - -To get a better measure of prediction accuracy (which we can use as a -proxy for goodness of fit of the model), we can successively split the -data in *folds* that we use for training and testing:: - - >>> import numpy as np - >>> X_folds = np.array_split(X_digits, 3) - >>> y_folds = np.array_split(y_digits, 3) - >>> scores = list() - >>> for k in range(3): - ... # We use 'list' to copy, in order to 'pop' later on - ... X_train = list(X_folds) - ... X_test = X_train.pop(k) - ... X_train = np.concatenate(X_train) - ... y_train = list(y_folds) - ... y_test = y_train.pop(k) - ... y_train = np.concatenate(y_train) - ... scores.append(svc.fit(X_train, y_train).score(X_test, y_test)) - >>> print(scores) - [0.934..., 0.956..., 0.939...] - -.. currentmodule:: sklearn.model_selection - -This is called a :class:`KFold` cross-validation. - -.. _cv_generators_tut: - -Cross-validation generators -============================= - -Scikit-learn has a collection of classes which can be used to generate lists of -train/test indices for popular cross-validation strategies. 
- -They expose a ``split`` method which accepts the input -dataset to be split and yields the train/test set indices for each iteration -of the chosen cross-validation strategy. - -This example shows an example usage of the ``split`` method. - - >>> from sklearn.model_selection import KFold, cross_val_score - >>> X = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"] - >>> k_fold = KFold(n_splits=5) - >>> for train_indices, test_indices in k_fold.split(X): - ... print('Train: %s | test: %s' % (train_indices, test_indices)) - Train: [2 3 4 5 6 7 8 9] | test: [0 1] - Train: [0 1 4 5 6 7 8 9] | test: [2 3] - Train: [0 1 2 3 6 7 8 9] | test: [4 5] - Train: [0 1 2 3 4 5 8 9] | test: [6 7] - Train: [0 1 2 3 4 5 6 7] | test: [8 9] - -The cross-validation can then be performed easily:: - - >>> [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test]) - ... for train, test in k_fold.split(X_digits)] - [0.963..., 0.922..., 0.963..., 0.963..., 0.930...] - -The cross-validation score can be directly calculated using the -:func:`cross_val_score` helper. Given an estimator, the cross-validation object -and the input dataset, the :func:`cross_val_score` splits the data repeatedly into -a training and a testing set, trains the estimator using the training set and -computes the scores based on the testing set for each iteration of cross-validation. - -By default the estimator's ``score`` method is used to compute the individual scores. - -Refer the :ref:`metrics module ` to learn more on the available scoring -methods. - - >>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1) - array([0.96388889, 0.92222222, 0.9637883 , 0.9637883 , 0.93036212]) - -`n_jobs=-1` means that the computation will be dispatched on all the CPUs -of the computer. - -Alternatively, the ``scoring`` argument can be provided to specify an alternative -scoring method. - - >>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, - ... scoring='precision_macro') - array([0.96578289, 0.92708922, 0.96681476, 0.96362897, 0.93192644]) - -**Cross-validation generators** - - -.. list-table:: - - * - - - :class:`KFold` **(n_splits, shuffle, random_state)** - - - :class:`StratifiedKFold` **(n_splits, shuffle, random_state)** - - - :class:`GroupKFold` **(n_splits)** - - - * - - - Splits it into K folds, trains on K-1 and then tests on the left-out. - - - Same as K-Fold but preserves the class distribution within each fold. - - - Ensures that the same group is not in both testing and training sets. - - -.. list-table:: - - * - - - :class:`ShuffleSplit` **(n_splits, test_size, train_size, random_state)** - - - :class:`StratifiedShuffleSplit` - - - :class:`GroupShuffleSplit` - - * - - - Generates train/test indices based on random permutation. - - - Same as shuffle split but preserves the class distribution within each iteration. - - - Ensures that the same group is not in both testing and training sets. - - -.. list-table:: - - * - - - :class:`LeaveOneGroupOut` **()** - - - :class:`LeavePGroupsOut` **(n_groups)** - - - :class:`LeaveOneOut` **()** - - - - * - - - Takes a group array to group observations. - - - Leave P groups out. - - - Leave one observation out. - - - -.. list-table:: - - * - - - :class:`LeavePOut` **(p)** - - - :class:`PredefinedSplit` - - * - - - Leave P observations out. - - - Generates train/test indices based on predefined splits. - - -.. currentmodule:: sklearn.svm - -.. 
topic:: **Exercise** - - On the digits dataset, plot the cross-validation score of a :class:`SVC` - estimator with a linear kernel as a function of parameter ``C`` (use a - logarithmic grid of points, from 1 to 10). - - :: - - >>> import numpy as np - >>> from sklearn import datasets, svm - >>> from sklearn.model_selection import cross_val_score - >>> X, y = datasets.load_digits(return_X_y=True) - >>> svc = svm.SVC(kernel="linear") - >>> C_s = np.logspace(-10, 0, 10) - >>> scores = list() - >>> scores_std = list() - - .. dropdown:: Solution - - .. plot:: - :context: close-figs - :align: center - - import numpy as np - from sklearn import datasets, svm - from sklearn.model_selection import cross_val_score - X, y = datasets.load_digits(return_X_y=True) - svc = svm.SVC(kernel="linear") - C_s = np.logspace(-10, 0, 10) - scores = list() - scores_std = list() - for C in C_s: - svc.C = C - this_scores = cross_val_score(svc, X, y, n_jobs=1) - scores.append(np.mean(this_scores)) - scores_std.append(np.std(this_scores)) - - import matplotlib.pyplot as plt - - plt.figure() - plt.semilogx(C_s, scores) - plt.semilogx(C_s, np.array(scores) + np.array(scores_std), "b--") - plt.semilogx(C_s, np.array(scores) - np.array(scores_std), "b--") - locs, labels = plt.yticks() - plt.yticks(locs, list(map(lambda x: "%g" % x, locs))) - plt.ylabel("CV score") - plt.xlabel("Parameter C") - plt.ylim(0, 1.1) - plt.show() - -Grid-search and cross-validated estimators -============================================ - -Grid-search -------------- - -.. currentmodule:: sklearn.model_selection - -scikit-learn provides an object that, given data, computes the score -during the fit of an estimator on a parameter grid and chooses the -parameters to maximize the cross-validation score. This object takes an -estimator during the construction and exposes an estimator API:: - - >>> from sklearn.model_selection import GridSearchCV, cross_val_score - >>> Cs = np.logspace(-6, -1, 10) - >>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), - ... n_jobs=-1) - >>> clf.fit(X_digits[:1000], y_digits[:1000]) # doctest: +SKIP - GridSearchCV(cv=None,... - >>> clf.best_score_ # doctest: +SKIP - 0.925... - >>> clf.best_estimator_.C # doctest: +SKIP - 0.0077... - - >>> # Prediction performance on test set is not as good as on train set - >>> clf.score(X_digits[1000:], y_digits[1000:]) # doctest: +SKIP - 0.943... - - -By default, the :class:`GridSearchCV` uses a 5-fold cross-validation. However, -if it detects that a classifier is passed, rather than a regressor, it uses -a stratified 5-fold. - -.. topic:: Nested cross-validation - - :: - - >>> cross_val_score(clf, X_digits, y_digits) # doctest: +SKIP - array([0.938..., 0.963..., 0.944...]) - - Two cross-validation loops are performed in parallel: one by the - :class:`GridSearchCV` estimator to set ``gamma`` and the other one by - ``cross_val_score`` to measure the prediction performance of the - estimator. The resulting scores are unbiased estimates of the - prediction score on new data. - -.. warning:: - - You cannot nest objects with parallel computing (``n_jobs`` different - than 1). - -.. _cv_estimators_tut: - -Cross-validated estimators ----------------------------- - -Cross-validation to set a parameter can be done more efficiently on an -algorithm-by-algorithm basis. 
This is why, for certain estimators, -scikit-learn exposes :ref:`cross_validation` estimators that set their -parameter automatically by cross-validation:: - - >>> from sklearn import linear_model, datasets - >>> lasso = linear_model.LassoCV() - >>> X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True) - >>> lasso.fit(X_diabetes, y_diabetes) - LassoCV() - >>> # The estimator chose automatically its lambda: - >>> lasso.alpha_ - 0.00375... - -These estimators are called similarly to their counterparts, with 'CV' -appended to their name. - -.. topic:: **Exercise** - - On the diabetes dataset, find the optimal regularization parameter - alpha. - - **Bonus**: How much can you trust the selection of alpha? - - .. literalinclude:: ../../auto_examples/exercises/plot_cv_diabetes.py - :lines: 17-24 - - **Solution:** :ref:`sphx_glr_auto_examples_exercises_plot_cv_diabetes.py` diff --git a/doc/tutorial/statistical_inference/putting_together.rst b/doc/tutorial/statistical_inference/putting_together.rst deleted file mode 100644 index b28ba77bfac33..0000000000000 --- a/doc/tutorial/statistical_inference/putting_together.rst +++ /dev/null @@ -1,62 +0,0 @@ -========================= -Putting it all together -========================= - -.. Imports - >>> import numpy as np - -Pipelining -============ - -We have seen that some estimators can transform data and that some estimators -can predict variables. We can also create combined estimators: - -.. literalinclude:: ../../auto_examples/compose/plot_digits_pipe.py - :lines: 23-63 - -.. image:: ../../auto_examples/compose/images/sphx_glr_plot_digits_pipe_001.png - :target: ../../auto_examples/compose/plot_digits_pipe.html - :scale: 65 - :align: center - -Face recognition with eigenfaces -================================= - -The dataset used in this example is a preprocessed excerpt of the -"Labeled Faces in the Wild", also known as LFW_: - -http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB) - -.. _LFW: http://vis-www.cs.umass.edu/lfw/ - -.. literalinclude:: ../../auto_examples/applications/plot_face_recognition.py - -.. figure:: ../../images/plot_face_recognition_1.png - :scale: 50 - - **Prediction** - -.. figure:: ../../images/plot_face_recognition_2.png - :scale: 50 - - **Eigenfaces** - -Expected results for the top 5 most represented people in the dataset:: - - precision recall f1-score support - - Gerhard_Schroeder 0.91 0.75 0.82 28 - Donald_Rumsfeld 0.84 0.82 0.83 33 - Tony_Blair 0.65 0.82 0.73 34 - Colin_Powell 0.78 0.88 0.83 58 - George_W_Bush 0.93 0.86 0.90 129 - - avg / total 0.86 0.84 0.85 282 - - -Open problem: Stock Market Structure -===================================== - -Can we predict the variation in stock prices for Google over a given time frame? - -:ref:`stock_market` diff --git a/doc/tutorial/statistical_inference/settings.rst b/doc/tutorial/statistical_inference/settings.rst deleted file mode 100644 index 422972fbd6cb4..0000000000000 --- a/doc/tutorial/statistical_inference/settings.rst +++ /dev/null @@ -1,92 +0,0 @@ - -========================================================================== -Statistical learning: the setting and the estimator object in scikit-learn -========================================================================== - -Datasets -========= - -Scikit-learn deals with learning information from one or more -datasets that are represented as 2D arrays. They can be understood as a -list of multi-dimensional observations. 
We say that the first axis of -these arrays is the **samples** axis, while the second is the -**features** axis. - -.. topic:: A simple example shipped with scikit-learn: iris dataset - - :: - - >>> from sklearn import datasets - >>> iris = datasets.load_iris() - >>> data = iris.data - >>> data.shape - (150, 4) - - It is made of 150 observations of irises, each described by 4 - features: their sepal and petal length and width, as detailed in - ``iris.DESCR``. - -When the data is not initially in the ``(n_samples, n_features)`` shape, it -needs to be preprocessed in order to be used by scikit-learn. - -.. topic:: An example of reshaping data would be the digits dataset - - The digits dataset is made of 1797 8x8 images of hand-written - digits :: - - >>> digits = datasets.load_digits() - >>> digits.images.shape - (1797, 8, 8) - >>> import matplotlib.pyplot as plt - >>> plt.imshow(digits.images[-1], - ... cmap=plt.cm.gray_r) - <...> - - .. image:: /auto_examples/datasets/images/sphx_glr_plot_digits_last_image_001.png - :target: ../../auto_examples/datasets/plot_digits_last_image.html - :align: center - - To use this dataset with scikit-learn, we transform each 8x8 image into a - feature vector of length 64 :: - - >>> data = digits.images.reshape( - ... (digits.images.shape[0], -1) - ... ) - -Estimators objects -=================== - -.. Some code to make the doctests run - - >>> from sklearn.base import BaseEstimator - >>> class Estimator(BaseEstimator): - ... def __init__(self, param1=0, param2=0): - ... self.param1 = param1 - ... self.param2 = param2 - ... def fit(self, data): - ... pass - >>> estimator = Estimator() - -**Fitting data**: the main API implemented by scikit-learn is that of the -`estimator`. An estimator is any object that learns from data; -it may be a classification, regression or clustering algorithm or -a *transformer* that extracts/filters useful features from raw data. - -All estimator objects expose a ``fit`` method that takes a dataset -(usually a 2-d array): - - >>> estimator.fit(data) - -**Estimator parameters**: All the parameters of an estimator can be set -when it is instantiated or by modifying the corresponding attribute:: - - >>> estimator = Estimator(param1=1, param2=2) - >>> estimator.param1 - 1 - -**Estimated parameters**: When data is fitted with an estimator, -parameters are estimated from the data at hand. All the estimated -parameters are attributes of the estimator object ending by an -underscore:: - - >>> estimator.estimated_param_ #doctest: +SKIP diff --git a/doc/tutorial/statistical_inference/supervised_learning.rst b/doc/tutorial/statistical_inference/supervised_learning.rst deleted file mode 100644 index 41adf60c44fc7..0000000000000 --- a/doc/tutorial/statistical_inference/supervised_learning.rst +++ /dev/null @@ -1,528 +0,0 @@ -.. _supervised_learning_tut: - -======================================================================================= -Supervised learning: predicting an output variable from high-dimensional observations -======================================================================================= - - -.. topic:: The problem solved in supervised learning - - :ref:`Supervised learning ` - consists in learning the link between two - datasets: the observed data ``X`` and an external variable ``y`` that we - are trying to predict, usually called "target" or "labels". Most often, - ``y`` is a 1D array of length ``n_samples``. 
- - All supervised `estimators `_ - in scikit-learn implement a ``fit(X, y)`` method to fit the model - and a ``predict(X)`` method that, given unlabeled observations ``X``, - returns the predicted labels ``y``. - -.. topic:: Vocabulary: classification and regression - - If the prediction task is to classify the observations in a set of - finite labels, in other words to "name" the objects observed, the task - is said to be a **classification** task. On the other hand, if the goal - is to predict a continuous target variable, it is said to be a - **regression** task. - - When doing classification in scikit-learn, ``y`` is a vector of integers - or strings. - - Note: See the :ref:`Introduction to machine learning with scikit-learn - Tutorial ` for a quick run-through on the basic machine - learning vocabulary used within scikit-learn. - -Nearest neighbor and the curse of dimensionality -================================================= - -.. topic:: Classifying irises: - - The iris dataset is a classification task consisting in identifying 3 - different types of irises (Setosa, Versicolour, and Virginica) from - their petal and sepal length and width:: - - >>> import numpy as np - >>> from sklearn import datasets - >>> iris_X, iris_y = datasets.load_iris(return_X_y=True) - >>> np.unique(iris_y) - array([0, 1, 2]) - - .. image:: /auto_examples/datasets/images/sphx_glr_plot_iris_dataset_001.png - :target: ../../auto_examples/datasets/plot_iris_dataset.html - :align: center - :scale: 50 - -k-Nearest neighbors classifier -------------------------------- - -The simplest possible classifier is the -`nearest neighbor `_: -given a new observation ``X_test``, find in the training set (i.e. the data -used to train the estimator) the observation with the closest feature vector. -(Please see the :ref:`Nearest Neighbors section` of the online -Scikit-learn documentation for more information about this type of classifier.) - -.. topic:: Training set and testing set - - While experimenting with any learning algorithm, it is important not to - test the prediction of an estimator on the data used to fit the - estimator as this would not be evaluating the performance of the - estimator on **new data**. This is why datasets are often split into - *train* and *test* data. - -**KNN (k nearest neighbors) classification example**: - -.. image:: /auto_examples/neighbors/images/sphx_glr_plot_classification_001.png - :target: ../../auto_examples/neighbors/plot_classification.html - :align: center - :scale: 70 - -:: - - >>> # Split iris data in train and test data - >>> # A random permutation, to split the data randomly - >>> np.random.seed(0) - >>> indices = np.random.permutation(len(iris_X)) - >>> iris_X_train = iris_X[indices[:-10]] - >>> iris_y_train = iris_y[indices[:-10]] - >>> iris_X_test = iris_X[indices[-10:]] - >>> iris_y_test = iris_y[indices[-10:]] - >>> # Create and fit a nearest-neighbor classifier - >>> from sklearn.neighbors import KNeighborsClassifier - >>> knn = KNeighborsClassifier() - >>> knn.fit(iris_X_train, iris_y_train) - KNeighborsClassifier() - >>> knn.predict(iris_X_test) - array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0]) - >>> iris_y_test - array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0]) - -.. _curse_of_dimensionality: - -The curse of dimensionality -------------------------------- - -For an estimator to be effective, you need the distance between neighboring -points to be less than some value :math:`d`, which depends on the problem. -In one dimension, this requires on average :math:`n \sim 1/d` points. 
-In the context of the above :math:`k`-NN example, if the data is described by -just one feature with values ranging from 0 to 1 and with :math:`n` training -observations, then new data will be no further away than :math:`1/n`. -Therefore, the nearest neighbor decision rule will be efficient as soon as -:math:`1/n` is small compared to the scale of between-class feature variations. - -If the number of features is :math:`p`, you now require :math:`n \sim 1/d^p` -points. Let's say that we require 10 points in one dimension: now :math:`10^p` -points are required in :math:`p` dimensions to pave the :math:`[0, 1]` space. -As :math:`p` becomes large, the number of training points required for a good -estimator grows exponentially. - -For example, if each point is just a single number (8 bytes), then an -effective :math:`k`-NN estimator in a paltry :math:`p \sim 20` dimensions would -require more training data than the current estimated size of the entire -internet (±1000 Exabytes or so). - -This is called the -`curse of dimensionality `_ -and is a core problem that machine learning addresses. - -Linear model: from regression to sparsity -========================================== - -.. topic:: Diabetes dataset - - The diabetes dataset consists of 10 physiological variables (age, - sex, weight, blood pressure) measured on 442 patients, and an - indication of disease progression after one year:: - - >>> diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True) - >>> diabetes_X_train = diabetes_X[:-20] - >>> diabetes_X_test = diabetes_X[-20:] - >>> diabetes_y_train = diabetes_y[:-20] - >>> diabetes_y_test = diabetes_y[-20:] - - The task at hand is to predict disease progression from physiological - variables. - -Linear regression ------------------- - -.. currentmodule:: sklearn.linear_model - -:class:`LinearRegression`, -in its simplest form, fits a linear model to the data set by adjusting -a set of parameters in order to make the sum of the squared residuals -of the model as small as possible. - -Linear models: :math:`y = X\beta + \epsilon` - -* :math:`X`: data -* :math:`y`: target variable -* :math:`\beta`: Coefficients -* :math:`\epsilon`: Observation noise - -.. image:: /auto_examples/linear_model/images/sphx_glr_plot_ols_001.png - :target: ../../auto_examples/linear_model/plot_ols.html - :scale: 50 - :align: center - -:: - - >>> from sklearn import linear_model - >>> regr = linear_model.LinearRegression() - >>> regr.fit(diabetes_X_train, diabetes_y_train) - LinearRegression() - >>> print(regr.coef_) # doctest: +SKIP - [ 0.30349955 -237.63931533 510.53060544 327.73698041 -814.13170937 - 492.81458798 102.84845219 184.60648906 743.51961675 76.09517222] - - - >>> # The mean square error - >>> np.mean((regr.predict(diabetes_X_test) - diabetes_y_test)**2) - 2004.5... - - >>> # Explained variance score: 1 is perfect prediction - >>> # and 0 means that there is no linear relationship - >>> # between X and y. - >>> regr.score(diabetes_X_test, diabetes_y_test) - 0.585... - - -.. _shrinkage: - -Shrinkage ----------- - -If there are few data points per dimension, noise in the observations -induces high variance: - -:: - - >>> X = np.c_[ .5, 1].T - >>> y = [.5, 1] - >>> test = np.c_[ 0, 2].T - >>> regr = linear_model.LinearRegression() - - >>> import matplotlib.pyplot as plt - >>> plt.figure() - <...> - >>> np.random.seed(0) - >>> for _ in range(6): - ... this_X = .1 * np.random.normal(size=(2, 1)) + X - ... regr.fit(this_X, y) - ... plt.plot(test, regr.predict(test)) - ... 
plt.scatter(this_X, y, s=3) - LinearRegression... - -.. image:: /auto_examples/linear_model/images/sphx_glr_plot_ols_ridge_variance_001.png - :target: ../../auto_examples/linear_model/plot_ols_ridge_variance.html - :align: center - -A solution in high-dimensional statistical learning is to *shrink* the -regression coefficients to zero: any two randomly chosen set of -observations are likely to be uncorrelated. This is called :class:`Ridge` -regression: - -:: - - >>> regr = linear_model.Ridge(alpha=.1) - - >>> plt.figure() - <...> - >>> np.random.seed(0) - >>> for _ in range(6): - ... this_X = .1 * np.random.normal(size=(2, 1)) + X - ... regr.fit(this_X, y) - ... plt.plot(test, regr.predict(test)) - ... plt.scatter(this_X, y, s=3) - Ridge... - -.. image:: /auto_examples/linear_model/images/sphx_glr_plot_ols_ridge_variance_002.png - :target: ../../auto_examples/linear_model/plot_ols_ridge_variance.html - :align: center - -This is an example of **bias/variance tradeoff**: the larger the ridge -``alpha`` parameter, the higher the bias and the lower the variance. - -We can choose ``alpha`` to minimize left out error, this time using the -diabetes dataset rather than our synthetic data:: - - >>> alphas = np.logspace(-4, -1, 6) - >>> print([regr.set_params(alpha=alpha) - ... .fit(diabetes_X_train, diabetes_y_train) - ... .score(diabetes_X_test, diabetes_y_test) - ... for alpha in alphas]) - [0.585..., 0.585..., 0.5854..., 0.5855..., 0.583..., 0.570...] - - -.. note:: - - Capturing in the fitted parameters noise that prevents the model to - generalize to new data is called - `overfitting `_. The bias introduced - by the ridge regression is called a - `regularization `_. - -.. _sparsity: - -Sparsity ----------- - - -.. |diabetes_ols_1| image:: /auto_examples/linear_model/images/sphx_glr_plot_ols_3d_001.png - :target: ../../auto_examples/linear_model/plot_ols_3d.html - :scale: 65 - -.. |diabetes_ols_3| image:: /auto_examples/linear_model/images/sphx_glr_plot_ols_3d_003.png - :target: ../../auto_examples/linear_model/plot_ols_3d.html - :scale: 65 - -.. |diabetes_ols_2| image:: /auto_examples/linear_model/images/sphx_glr_plot_ols_3d_002.png - :target: ../../auto_examples/linear_model/plot_ols_3d.html - :scale: 65 - - - - -.. rst-class:: centered - - **Fitting only features 1 and 2** - -.. centered:: |diabetes_ols_1| |diabetes_ols_3| |diabetes_ols_2| - -.. note:: - - A representation of the full diabetes dataset would involve 11 - dimensions (10 feature dimensions and one of the target variable). It - is hard to develop an intuition on such representation, but it may be - useful to keep in mind that it would be a fairly *empty* space. - - - -We can see that, although feature 2 has a strong coefficient on the full -model, it conveys little information on ``y`` when considered with feature 1. - -To improve the conditioning of the problem (i.e. mitigating the -:ref:`curse_of_dimensionality`), it would be interesting to select only the -informative features and set non-informative ones, like feature 2 to 0. Ridge -regression will decrease their contribution, but not set them to zero. Another -penalization approach, called :ref:`lasso` (least absolute shrinkage and -selection operator), can set some coefficients to zero. Such methods are -called **sparse methods** and sparsity can be seen as an -application of Occam's razor: *prefer simpler models*. - -:: - - >>> regr = linear_model.Lasso() - >>> scores = [regr.set_params(alpha=alpha) - ... .fit(diabetes_X_train, diabetes_y_train) - ... 
.score(diabetes_X_test, diabetes_y_test) - ... for alpha in alphas] - >>> best_alpha = alphas[scores.index(max(scores))] - >>> regr.alpha = best_alpha - >>> regr.fit(diabetes_X_train, diabetes_y_train) - Lasso(alpha=0.025118864315095794) - >>> print(regr.coef_) - [ 0. -212.4... 517.2... 313.7... -160.8... - -0. -187.1... 69.3... 508.6... 71.8... ] - -.. topic:: **Different algorithms for the same problem** - - Different algorithms can be used to solve the same mathematical - problem. For instance the ``Lasso`` object in scikit-learn - solves the lasso regression problem using a - `coordinate descent `_ method, - that is efficient on large datasets. However, scikit-learn also - provides the :class:`LassoLars` object using the *LARS* algorithm, - which is very efficient for problems in which the weight vector estimated - is very sparse (i.e. problems with very few observations). - -.. _clf_tut: - -Classification ---------------- - -For classification, as in the labeling -`iris `_ task, linear -regression is not the right approach as it will give too much weight to -data far from the decision frontier. A linear approach is to fit a sigmoid -function or **logistic** function: - -.. image:: /auto_examples/linear_model/images/sphx_glr_plot_logistic_001.png - :target: ../../auto_examples/linear_model/plot_logistic.html - :scale: 70 - :align: center - -.. math:: - - y = \textrm{sigmoid}(X\beta - \textrm{offset}) + \epsilon = - \frac{1}{1 + \textrm{exp}(- X\beta + \textrm{offset})} + \epsilon - -:: - - >>> log = linear_model.LogisticRegression(C=1e5) - >>> log.fit(iris_X_train, iris_y_train) - LogisticRegression(C=100000.0) - -This is known as :class:`LogisticRegression`. - -.. image:: /auto_examples/linear_model/images/sphx_glr_plot_iris_logistic_001.png - :target: ../../auto_examples/linear_model/plot_iris_logistic.html - :scale: 83 - :align: center - -.. topic:: Multiclass classification - - If you have several classes to predict, an option often used is to fit - one-versus-all classifiers and then use a voting heuristic for the final - decision. - -.. topic:: Shrinkage and sparsity with logistic regression - - The ``C`` parameter controls the amount of regularization in the - :class:`LogisticRegression` object: a large value for ``C`` results in - less regularization. - ``penalty="l2"`` gives :ref:`shrinkage` (i.e. non-sparse coefficients), while - ``penalty="l1"`` gives :ref:`sparsity`. - -.. topic:: **Exercise** - :class: green - - Try classifying the digits dataset with nearest neighbors and a linear - model. Leave out the last 10% and test prediction performance on these - observations. - - .. literalinclude:: ../../auto_examples/exercises/plot_digits_classification_exercise.py - :lines: 15-19 - - A solution can be downloaded :download:`here <../../auto_examples/exercises/plot_digits_classification_exercise.py>`. - - -Support vector machines (SVMs) -================================ - -Linear SVMs -------------- - - -:ref:`svm` belong to the discriminant model family: they try to find a combination of -samples to build a plane maximizing the margin between the two classes. -Regularization is set by the ``C`` parameter: a small value for ``C`` means the margin -is calculated using many or all of the observations around the separating line -(more regularization); -a large value for ``C`` means the margin is calculated on observations close to -the separating line (less regularization). - -.. currentmodule :: sklearn.svm - -.. 
figure:: /auto_examples/svm/images/sphx_glr_plot_svm_margin_001.png - :target: ../../auto_examples/svm/plot_svm_margin.html - - **Unregularized SVM** - -.. figure:: /auto_examples/svm/images/sphx_glr_plot_svm_margin_002.png - :target: ../../auto_examples/svm/plot_svm_margin.html - - **Regularized SVM (default)** - -.. rubric:: Examples - -- :ref:`sphx_glr_auto_examples_svm_plot_iris_svc.py` - - -SVMs can be used in regression --:class:`SVR` (Support Vector Regression)--, or in -classification --:class:`SVC` (Support Vector Classification). - -:: - - >>> from sklearn import svm - >>> svc = svm.SVC(kernel='linear') - >>> svc.fit(iris_X_train, iris_y_train) - SVC(kernel='linear') - - -.. warning:: **Normalizing data** - - For many estimators, including the SVMs, having datasets with unit - standard deviation for each feature is important to get good - prediction. - -.. _using_kernels_tut: - -Using kernels -------------- - -Classes are not always linearly separable in feature space. The solution is to -build a decision function that is not linear but may be polynomial instead. -This is done using the *kernel trick* that can be seen as -creating a decision energy by positioning *kernels* on observations: - -Linear kernel -^^^^^^^^^^^^^ - -:: - - >>> svc = svm.SVC(kernel='linear') - -.. image:: /auto_examples/svm/images/sphx_glr_plot_svm_kernels_002.png - :target: ../../auto_examples/svm/plot_svm_kernels.html - -Polynomial kernel -^^^^^^^^^^^^^^^^^ - -:: - - >>> svc = svm.SVC(kernel='poly', - ... degree=3) - >>> # degree: polynomial degree - -.. image:: /auto_examples/svm/images/sphx_glr_plot_svm_kernels_003.png - :target: ../../auto_examples/svm/plot_svm_kernels.html - -RBF kernel (Radial Basis Function) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -:: - - >>> svc = svm.SVC(kernel='rbf') - >>> # gamma: inverse of size of - >>> # radial kernel - -.. image:: /auto_examples/svm/images/sphx_glr_plot_svm_kernels_004.png - :target: ../../auto_examples/svm/plot_svm_kernels.html - -Sigmoid kernel -^^^^^^^^^^^^^^ - -:: - - >>> svc = svm.SVC(kernel='sigmoid') - -.. image:: /auto_examples/svm/images/sphx_glr_plot_svm_kernels_005.png - :target: ../../auto_examples/svm/plot_svm_kernels.html - - -.. topic:: **Exercise** - :class: green - - Try classifying classes 1 and 2 from the iris dataset with SVMs, with - the 2 first features. Leave out 10% of each class and test prediction - performance on these observations. - - **Warning**: the classes are ordered, do not leave out the last 10%, - you would be testing on only one class. - - **Hint**: You can use the ``decision_function`` method on a grid to get - intuitions. - - .. literalinclude:: ../../auto_examples/exercises/plot_iris_exercise.py - :lines: 18-23 - - .. 
image:: /auto_examples/datasets/images/sphx_glr_plot_iris_dataset_001.png - :target: ../../auto_examples/datasets/plot_iris_dataset.html - :align: center - :scale: 70 - - - A solution can be downloaded :download:`here <../../auto_examples/exercises/plot_iris_exercise.py>` diff --git a/doc/tutorial/statistical_inference/unsupervised_learning.rst b/doc/tutorial/statistical_inference/unsupervised_learning.rst deleted file mode 100644 index fd827cc75b212..0000000000000 --- a/doc/tutorial/statistical_inference/unsupervised_learning.rst +++ /dev/null @@ -1,297 +0,0 @@ -============================================================ -Unsupervised learning: seeking representations of the data -============================================================ - -Clustering: grouping observations together -============================================ - -.. topic:: The problem solved in clustering - - Given the iris dataset, if we knew that there were 3 types of iris, but - did not have access to a taxonomist to label them: we could try a - **clustering task**: split the observations into well-separated group - called *clusters*. - -:: - - >>> # Set the PRNG - >>> import numpy as np - >>> np.random.seed(1) - -K-means clustering -------------------- - -Note that there exist a lot of different clustering criteria and associated -algorithms. The simplest clustering algorithm is :ref:`k_means`. - -:: - - >>> from sklearn import cluster, datasets - >>> X_iris, y_iris = datasets.load_iris(return_X_y=True) - - >>> k_means = cluster.KMeans(n_clusters=3) - >>> k_means.fit(X_iris) - KMeans(n_clusters=3) - >>> print(k_means.labels_[::10]) - [1 1 1 1 1 2 0 0 0 0 2 2 2 2 2] - >>> print(y_iris[::10]) - [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2] - -.. figure:: /auto_examples/cluster/images/sphx_glr_plot_cluster_iris_001.png - :target: ../../auto_examples/cluster/plot_cluster_iris.html - :scale: 63 - -.. warning:: - - There is absolutely no guarantee of recovering a ground truth. First, - choosing the right number of clusters is hard. Second, the algorithm - is sensitive to initialization, and can fall into local minima, - although scikit-learn employs several tricks to mitigate this issue. - - For instance, on the image above, we can observe the difference between the - ground-truth (bottom right figure) and different clustering. We do not - recover the expected labels, either because the number of cluster was - chosen to be to large (top left figure) or suffer from a bad initialization - (bottom left figure). - - **It is therefore important to not over-interpret clustering results.** - -.. topic:: **Application example: vector quantization** - - Clustering in general and KMeans, in particular, can be seen as a way - of choosing a small number of exemplars to compress the information. - The problem is sometimes known as - `vector quantization `_. - For instance, this can be used to posterize an image:: - - >>> import scipy as sp - >>> try: - ... face = sp.face(gray=True) - ... except AttributeError: - ... from scipy import misc - ... face = misc.face(gray=True) - >>> X = face.reshape((-1, 1)) # We need an (n_sample, n_feature) array - >>> k_means = cluster.KMeans(n_clusters=5, n_init=1) - >>> k_means.fit(X) - KMeans(n_clusters=5, n_init=1) - >>> values = k_means.cluster_centers_.squeeze() - >>> labels = k_means.labels_ - >>> face_compressed = np.choose(labels, values) - >>> face_compressed.shape = face.shape - -**Raw image** - -.. 
figure:: /auto_examples/cluster/images/sphx_glr_plot_face_compress_001.png - :target: ../../auto_examples/cluster/plot_face_compress.html - -**K-means quantization** - -.. figure:: /auto_examples/cluster/images/sphx_glr_plot_face_compress_004.png - :target: ../../auto_examples/cluster/plot_face_compress.html - -**Equal bins** - -.. figure:: /auto_examples/cluster/images/sphx_glr_plot_face_compress_002.png - :target: ../../auto_examples/cluster/plot_face_compress.html - -Hierarchical agglomerative clustering: Ward ---------------------------------------------- - -A :ref:`hierarchical_clustering` method is a type of cluster analysis -that aims to build a hierarchy of clusters. In general, the various approaches -of this technique are either: - -* **Agglomerative** - bottom-up approaches: each observation starts in its - own cluster, and clusters are iteratively merged in such a way to - minimize a *linkage* criterion. This approach is particularly interesting - when the clusters of interest are made of only a few observations. When - the number of clusters is large, it is much more computationally efficient - than k-means. - -* **Divisive** - top-down approaches: all observations start in one - cluster, which is iteratively split as one moves down the hierarchy. - For estimating large numbers of clusters, this approach is both slow (due - to all observations starting as one cluster, which it splits recursively) - and statistically ill-posed. - -Connectivity-constrained clustering -..................................... - -With agglomerative clustering, it is possible to specify which samples can be -clustered together by giving a connectivity graph. Graphs in scikit-learn -are represented by their adjacency matrix. Often, a sparse matrix is used. -This can be useful, for instance, to retrieve connected regions (sometimes -also referred to as connected components) when clustering an image. - -.. image:: /auto_examples/cluster/images/sphx_glr_plot_coin_ward_segmentation_001.png - :target: ../../auto_examples/cluster/plot_coin_ward_segmentation.html - :scale: 40 - :align: center - -:: - - >>> from skimage.data import coins - >>> from scipy.ndimage import gaussian_filter - >>> from skimage.transform import rescale - >>> rescaled_coins = rescale( - ... gaussian_filter(coins(), sigma=2), - ... 0.2, mode='reflect', anti_aliasing=False - ... ) - >>> X = np.reshape(rescaled_coins, (-1, 1)) - -We need a vectorized version of the image. `'rescaled_coins'` is a down-scaled -version of the coins image to speed up the process:: - - >>> from sklearn.feature_extraction import grid_to_graph - >>> connectivity = grid_to_graph(*rescaled_coins.shape) - -Define the graph structure of the data. Pixels connected to their neighbors:: - - >>> n_clusters = 27 # number of regions - - >>> from sklearn.cluster import AgglomerativeClustering - >>> ward = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward', - ... connectivity=connectivity) - >>> ward.fit(X) - AgglomerativeClustering(connectivity=..., n_clusters=27) - >>> label = np.reshape(ward.labels_, rescaled_coins.shape) - -Feature agglomeration -...................... - -We have seen that sparsity could be used to mitigate the curse of -dimensionality, *i.e* an insufficient amount of observations compared to the -number of features. Another approach is to merge together similar -features: **feature agglomeration**. This approach can be implemented by -clustering in the feature direction, in other words clustering the -transposed data. - -.. 
image:: /auto_examples/cluster/images/sphx_glr_plot_digits_agglomeration_001.png - :target: ../../auto_examples/cluster/plot_digits_agglomeration.html - :align: center - :scale: 57 - -:: - - >>> digits = datasets.load_digits() - >>> images = digits.images - >>> X = np.reshape(images, (len(images), -1)) - >>> connectivity = grid_to_graph(*images[0].shape) - - >>> agglo = cluster.FeatureAgglomeration(connectivity=connectivity, - ... n_clusters=32) - >>> agglo.fit(X) - FeatureAgglomeration(connectivity=..., n_clusters=32) - >>> X_reduced = agglo.transform(X) - - >>> X_approx = agglo.inverse_transform(X_reduced) - >>> images_approx = np.reshape(X_approx, images.shape) - -.. topic:: ``transform`` and ``inverse_transform`` methods - - Some estimators expose a ``transform`` method, for instance to reduce - the dimensionality of the dataset. - -Decompositions: from a signal to components and loadings -=========================================================== - -.. topic:: **Components and loadings** - - If X is our multivariate data, then the problem that we are trying to solve - is to rewrite it on a different observational basis: we want to learn - loadings L and a set of components C such that *X = L C*. - Different criteria exist to choose the components - -Principal component analysis: PCA ------------------------------------ - -:ref:`PCA` selects the successive components that explain the maximum variance in the -signal. Let's create a synthetic 3-dimensional dataset. - -.. np.random.seed(0) - -:: - - >>> # Create a signal with only 2 useful dimensions - >>> x1 = np.random.normal(size=(100, 1)) - >>> x2 = np.random.normal(size=(100, 1)) - >>> x3 = x1 + x2 - >>> X = np.concatenate([x1, x2, x3], axis=1) - -The point cloud spanned by the observations above is very flat in one -direction: one of the three univariate features (i.e. z-axis) can almost be exactly -computed using the other two. - -.. plot:: - :context: close-figs - :align: center - - >>> import matplotlib.pyplot as plt - >>> fig = plt.figure() - >>> ax = fig.add_subplot(111, projection='3d') - >>> ax.scatter(X[:, 0], X[:, 1], X[:, 2]) - <...> - >>> _ = ax.set(xlabel="x", ylabel="y", zlabel="z") - - -PCA finds the directions in which the data is not *flat*. - -:: - - >>> from sklearn import decomposition - >>> pca = decomposition.PCA() - >>> pca.fit(X) - PCA() - >>> print(pca.explained_variance_) # doctest: +SKIP - [ 2.18565811e+00 1.19346747e+00 8.43026679e-32] - -Looking at the explained variance, we see that only the first two components -are useful. PCA can be used to reduce dimensionality while preserving -most of the information. It will project the data on the principal subspace. - -:: - - >>> pca.set_params(n_components=2) - PCA(n_components=2) - >>> X_reduced = pca.fit_transform(X) - >>> X_reduced.shape - (100, 2) - -.. Eigenfaces here? - -Independent Component Analysis: ICA -------------------------------------- - -:ref:`ICA` selects components so that the distribution of their loadings carries -a maximum amount of independent information. It is able to recover -**non-Gaussian** independent signals: - -.. image:: /auto_examples/decomposition/images/sphx_glr_plot_ica_blind_source_separation_001.png - :target: ../../auto_examples/decomposition/plot_ica_blind_source_separation.html - :scale: 70 - :align: center - -.. 
np.random.seed(0) - -:: - - >>> # Generate sample data - >>> import numpy as np - >>> from scipy import signal - >>> time = np.linspace(0, 10, 2000) - >>> s1 = np.sin(2 * time) # Signal 1 : sinusoidal signal - >>> s2 = np.sign(np.sin(3 * time)) # Signal 2 : square signal - >>> s3 = signal.sawtooth(2 * np.pi * time) # Signal 3: saw tooth signal - >>> S = np.c_[s1, s2, s3] - >>> S += 0.2 * np.random.normal(size=S.shape) # Add noise - >>> S /= S.std(axis=0) # Standardize data - >>> # Mix data - >>> A = np.array([[1, 1, 1], [0.5, 2, 1], [1.5, 1, 2]]) # Mixing matrix - >>> X = np.dot(S, A.T) # Generate observations - - >>> # Compute ICA - >>> ica = decomposition.FastICA() - >>> S_ = ica.fit_transform(X) # Get the estimated sources - >>> A_ = ica.mixing_.T - >>> np.allclose(X, np.dot(S_, A_) + ica.mean_) - True diff --git a/doc/tutorial/text_analytics/.gitignore b/doc/tutorial/text_analytics/.gitignore deleted file mode 100644 index 54c78634d9dd1..0000000000000 --- a/doc/tutorial/text_analytics/.gitignore +++ /dev/null @@ -1,25 +0,0 @@ -# cruft -.*.swp -*.pyc -.DS_Store -*.pdf - -# folder to be used for working on the exercises -workspace - -# output of the sphinx build of the documentation -tutorial/_build - -# datasets to be fetched from the web and cached locally -data/twenty_newsgroups/20news-bydate.tar.gz -data/twenty_newsgroups/20news-bydate-train -data/twenty_newsgroups/20news-bydate-test - -data/movie_reviews/txt_sentoken -data/movie_reviews/poldata.README.2.0 - -data/languages/paragraphs -data/languages/short_paragraphs -data/languages/html - -data/labeled_faces_wild/lfw_preprocessed/ diff --git a/doc/tutorial/text_analytics/data/languages/fetch_data.py b/doc/tutorial/text_analytics/data/languages/fetch_data.py deleted file mode 100644 index 2dd0f208ade86..0000000000000 --- a/doc/tutorial/text_analytics/data/languages/fetch_data.py +++ /dev/null @@ -1,103 +0,0 @@ - -# simple python script to collect text paragraphs from various languages on the -# same topic namely the Wikipedia encyclopedia itself - -import os -from urllib.request import Request, build_opener - -import lxml.html -from lxml.etree import ElementTree -import numpy as np - -import codecs - -pages = { - 'ar': 'http://ar.wikipedia.org/wiki/%D9%88%D9%8A%D9%83%D9%8A%D8%A8%D9%8A%D8%AF%D9%8A%D8%A7', # noqa: E501 - 'de': 'http://de.wikipedia.org/wiki/Wikipedia', - 'en': 'https://en.wikipedia.org/wiki/Wikipedia', - 'es': 'http://es.wikipedia.org/wiki/Wikipedia', - 'fr': 'http://fr.wikipedia.org/wiki/Wikip%C3%A9dia', - 'it': 'http://it.wikipedia.org/wiki/Wikipedia', - 'ja': 'http://ja.wikipedia.org/wiki/Wikipedia', - 'nl': 'http://nl.wikipedia.org/wiki/Wikipedia', - 'pl': 'http://pl.wikipedia.org/wiki/Wikipedia', - 'pt': 'http://pt.wikipedia.org/wiki/Wikip%C3%A9dia', - 'ru': 'http://ru.wikipedia.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%8F', # noqa: E501 -# u'zh': u'http://zh.wikipedia.org/wiki/Wikipedia', -} - -html_folder = 'html' -text_folder = 'paragraphs' -short_text_folder = 'short_paragraphs' -n_words_per_short_text = 5 - - -if not os.path.exists(html_folder): - os.makedirs(html_folder) - -for lang, page in pages.items(): - - text_lang_folder = os.path.join(text_folder, lang) - if not os.path.exists(text_lang_folder): - os.makedirs(text_lang_folder) - - short_text_lang_folder = os.path.join(short_text_folder, lang) - if not os.path.exists(short_text_lang_folder): - os.makedirs(short_text_lang_folder) - - opener = build_opener() - html_filename = os.path.join(html_folder, lang + '.html') - if not 
os.path.exists(html_filename): - print("Downloading %s" % page) - request = Request(page) - # change the User Agent to avoid being blocked by Wikipedia - # downloading a couple of articles should not be considered abusive - request.add_header('User-Agent', 'OpenAnything/1.0') - html_content = opener.open(request).read() - with open(html_filename, 'wb') as f: - f.write(html_content) - - # decode the payload explicitly as UTF-8 since lxml is confused for some - # reason - with codecs.open(html_filename,'r','utf-8') as html_file: - html_content = html_file.read() - tree = ElementTree(lxml.html.document_fromstring(html_content)) - i = 0 - j = 0 - for p in tree.findall('//p'): - content = p.text_content() - if len(content) < 100: - # skip paragraphs that are too short - probably too noisy and not - # representative of the actual language - continue - - text_filename = os.path.join(text_lang_folder, - '%s_%04d.txt' % (lang, i)) - print("Writing %s" % text_filename) - with open(text_filename, 'wb') as f: - f.write(content.encode('utf-8', 'ignore')) - i += 1 - - # split the paragraph into fake smaller paragraphs to make the - # problem harder e.g. more similar to tweets - if lang in ('zh', 'ja'): - # FIXME: whitespace tokenizing does not work on chinese and japanese - continue - words = content.split() - n_groups = len(words) / n_words_per_short_text - if n_groups < 1: - continue - groups = np.array_split(words, n_groups) - - for group in groups: - small_content = " ".join(group) - - short_text_filename = os.path.join(short_text_lang_folder, - '%s_%04d.txt' % (lang, j)) - print("Writing %s" % short_text_filename) - with open(short_text_filename, 'wb') as f: - f.write(small_content.encode('utf-8', 'ignore')) - j += 1 - if j >= 1000: - break - diff --git a/doc/tutorial/text_analytics/data/movie_reviews/fetch_data.py b/doc/tutorial/text_analytics/data/movie_reviews/fetch_data.py deleted file mode 100644 index 67def14889774..0000000000000 --- a/doc/tutorial/text_analytics/data/movie_reviews/fetch_data.py +++ /dev/null @@ -1,33 +0,0 @@ -"""Script to download the movie review dataset""" - -from pathlib import Path -from hashlib import sha256 -import tarfile -from urllib.request import urlopen - - -URL = "http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz" - -ARCHIVE_SHA256 = "fc0dccc2671af5db3c5d8f81f77a1ebfec953ecdd422334062df61ede36b2179" -ARCHIVE_NAME = Path(URL.rsplit("/", 1)[1]) -DATA_FOLDER = Path("txt_sentoken") - - -if not DATA_FOLDER.exists(): - - if not ARCHIVE_NAME.exists(): - print("Downloading dataset from %s (3 MB)" % URL) - opener = urlopen(URL) - with open(ARCHIVE_NAME, "wb") as archive: - archive.write(opener.read()) - - try: - print("Checking the integrity of the archive") - assert sha256(ARCHIVE_NAME.read_bytes()).hexdigest() == ARCHIVE_SHA256 - - print("Decompressing %s" % ARCHIVE_NAME) - with tarfile.open(ARCHIVE_NAME, "r:gz") as archive: - archive.extractall(path=".") - - finally: - ARCHIVE_NAME.unlink() diff --git a/doc/tutorial/text_analytics/skeletons/exercise_01_language_train_model.py b/doc/tutorial/text_analytics/skeletons/exercise_01_language_train_model.py deleted file mode 100644 index 438481120d126..0000000000000 --- a/doc/tutorial/text_analytics/skeletons/exercise_01_language_train_model.py +++ /dev/null @@ -1,62 +0,0 @@ -"""Build a language detector model - -The goal of this exercise is to train a linear classifier on text features -that represent sequences of up to 3 consecutive characters so as to be -recognize natural languages by using 
the frequencies of short character -sequences as 'fingerprints'. - -""" -# Author: Olivier Grisel -# License: Simplified BSD - -import sys - -from sklearn.feature_extraction.text import TfidfVectorizer -from sklearn.linear_model import Perceptron -from sklearn.pipeline import Pipeline -from sklearn.datasets import load_files -from sklearn.model_selection import train_test_split -from sklearn import metrics - - -# The training data folder must be passed as first argument -languages_data_folder = sys.argv[1] -dataset = load_files(languages_data_folder) - -# Split the dataset in training and test set: -docs_train, docs_test, y_train, y_test = train_test_split( - dataset.data, dataset.target, test_size=0.5) - - -# TASK: Build a vectorizer that splits strings into sequence of 1 to 3 -# characters instead of word tokens - -# TASK: Build a vectorizer / classifier pipeline using the previous analyzer -# the pipeline instance should stored in a variable named clf - -# TASK: Fit the pipeline on the training set - -# TASK: Predict the outcome on the testing set in a variable named y_predicted - -# Print the classification report -print(metrics.classification_report(y_test, y_predicted, - target_names=dataset.target_names)) - -# Plot the confusion matrix -cm = metrics.confusion_matrix(y_test, y_predicted) -print(cm) - -#import matplotlib.pyplot as plt -#plt.matshow(cm, cmap=plt.cm.jet) -#plt.show() - -# Predict the result on some short new sentences: -sentences = [ - 'This is a language detection test.', - 'Ceci est un test de d\xe9tection de la langue.', - 'Dies ist ein Test, um die Sprache zu erkennen.', -] -predicted = clf.predict(sentences) - -for s, p in zip(sentences, predicted): - print('The language of "%s" is "%s"' % (s, dataset.target_names[p])) diff --git a/doc/tutorial/text_analytics/skeletons/exercise_02_sentiment.py b/doc/tutorial/text_analytics/skeletons/exercise_02_sentiment.py deleted file mode 100644 index 23299f5f01b3d..0000000000000 --- a/doc/tutorial/text_analytics/skeletons/exercise_02_sentiment.py +++ /dev/null @@ -1,63 +0,0 @@ -"""Build a sentiment analysis / polarity model - -Sentiment analysis can be casted as a binary text classification problem, -that is fitting a linear classifier on features extracted from the text -of the user messages so as to guess whether the opinion of the author is -positive or negative. - -In this examples we will use a movie review dataset. 
- -""" -# Author: Olivier Grisel -# License: Simplified BSD - -import sys -from sklearn.feature_extraction.text import TfidfVectorizer -from sklearn.svm import LinearSVC -from sklearn.pipeline import Pipeline -from sklearn.model_selection import GridSearchCV -from sklearn.datasets import load_files -from sklearn.model_selection import train_test_split -from sklearn import metrics - - -if __name__ == "__main__": - # NOTE: we put the following in a 'if __name__ == "__main__"' protected - # block to be able to use a multi-core grid search that also works under - # Windows, see: http://docs.python.org/library/multiprocessing.html#windows - # The multiprocessing module is used as the backend of joblib.Parallel - # that is used when n_jobs != 1 in GridSearchCV - - # the training data folder must be passed as first argument - movie_reviews_data_folder = sys.argv[1] - dataset = load_files(movie_reviews_data_folder, shuffle=False) - print("n_samples: %d" % len(dataset.data)) - - # split the dataset in training and test set: - docs_train, docs_test, y_train, y_test = train_test_split( - dataset.data, dataset.target, test_size=0.25, random_state=None) - - # TASK: Build a vectorizer / classifier pipeline that filters out tokens - # that are too rare or too frequent - - # TASK: Build a grid search to find out whether unigrams or bigrams are - # more useful. - # Fit the pipeline on the training set using grid search for the parameters - - # TASK: print the cross-validated scores for the each parameters set - # explored by the grid search - - # TASK: Predict the outcome on the testing set and store it in a variable - # named y_predicted - - # Print the classification report - print(metrics.classification_report(y_test, y_predicted, - target_names=dataset.target_names)) - - # Print and plot the confusion matrix - cm = metrics.confusion_matrix(y_test, y_predicted) - print(cm) - - # import matplotlib.pyplot as plt - # plt.matshow(cm) - # plt.show() diff --git a/doc/tutorial/text_analytics/solutions/exercise_01_language_train_model.py b/doc/tutorial/text_analytics/solutions/exercise_01_language_train_model.py deleted file mode 100644 index 21cee0c80e00e..0000000000000 --- a/doc/tutorial/text_analytics/solutions/exercise_01_language_train_model.py +++ /dev/null @@ -1,70 +0,0 @@ -"""Build a language detector model - -The goal of this exercise is to train a linear classifier on text features -that represent sequences of up to 3 consecutive characters so as to be -recognize natural languages by using the frequencies of short character -sequences as 'fingerprints'. 
- -""" -# Author: Olivier Grisel -# License: Simplified BSD - -import sys - -from sklearn.feature_extraction.text import TfidfVectorizer -from sklearn.linear_model import Perceptron -from sklearn.pipeline import Pipeline -from sklearn.datasets import load_files -from sklearn.model_selection import train_test_split -from sklearn import metrics - - -# The training data folder must be passed as first argument -languages_data_folder = sys.argv[1] -dataset = load_files(languages_data_folder) - -# Split the dataset in training and test set: -docs_train, docs_test, y_train, y_test = train_test_split( - dataset.data, dataset.target, test_size=0.5) - - -# TASK: Build a vectorizer that splits strings into sequence of 1 to 3 -# characters instead of word tokens -vectorizer = TfidfVectorizer(ngram_range=(1, 3), analyzer='char', - use_idf=False) - -# TASK: Build a vectorizer / classifier pipeline using the previous analyzer -# the pipeline instance should stored in a variable named clf -clf = Pipeline([ - ('vec', vectorizer), - ('clf', Perceptron()), -]) - -# TASK: Fit the pipeline on the training set -clf.fit(docs_train, y_train) - -# TASK: Predict the outcome on the testing set in a variable named y_predicted -y_predicted = clf.predict(docs_test) - -# Print the classification report -print(metrics.classification_report(y_test, y_predicted, - target_names=dataset.target_names)) - -# Plot the confusion matrix -cm = metrics.confusion_matrix(y_test, y_predicted) -print(cm) - -#import matlotlib.pyplot as plt -#plt.matshow(cm, cmap=plt.cm.jet) -#plt.show() - -# Predict the result on some short new sentences: -sentences = [ - 'This is a language detection test.', - 'Ceci est un test de d\xe9tection de la langue.', - 'Dies ist ein Test, um die Sprache zu erkennen.', -] -predicted = clf.predict(sentences) - -for s, p in zip(sentences, predicted): - print('The language of "%s" is "%s"' % (s, dataset.target_names[p])) diff --git a/doc/tutorial/text_analytics/solutions/exercise_02_sentiment.py b/doc/tutorial/text_analytics/solutions/exercise_02_sentiment.py deleted file mode 100644 index 434bece341975..0000000000000 --- a/doc/tutorial/text_analytics/solutions/exercise_02_sentiment.py +++ /dev/null @@ -1,79 +0,0 @@ -"""Build a sentiment analysis / polarity model - -Sentiment analysis can be casted as a binary text classification problem, -that is fitting a linear classifier on features extracted from the text -of the user messages so as to guess whether the opinion of the author is -positive or negative. - -In this examples we will use a movie review dataset. 
- -""" -# Author: Olivier Grisel -# License: Simplified BSD - -import sys -from sklearn.feature_extraction.text import TfidfVectorizer -from sklearn.svm import LinearSVC -from sklearn.pipeline import Pipeline -from sklearn.model_selection import GridSearchCV -from sklearn.datasets import load_files -from sklearn.model_selection import train_test_split -from sklearn import metrics - - -if __name__ == "__main__": - # NOTE: we put the following in a 'if __name__ == "__main__"' protected - # block to be able to use a multi-core grid search that also works under - # Windows, see: http://docs.python.org/library/multiprocessing.html#windows - # The multiprocessing module is used as the backend of joblib.Parallel - # that is used when n_jobs != 1 in GridSearchCV - - # the training data folder must be passed as first argument - movie_reviews_data_folder = sys.argv[1] - dataset = load_files(movie_reviews_data_folder, shuffle=False) - print("n_samples: %d" % len(dataset.data)) - - # split the dataset in training and test set: - docs_train, docs_test, y_train, y_test = train_test_split( - dataset.data, dataset.target, test_size=0.25, random_state=None) - - # TASK: Build a vectorizer / classifier pipeline that filters out tokens - # that are too rare or too frequent - pipeline = Pipeline([ - ('vect', TfidfVectorizer(min_df=3, max_df=0.95)), - ('clf', LinearSVC(C=1000)), - ]) - - # TASK: Build a grid search to find out whether unigrams or bigrams are - # more useful. - # Fit the pipeline on the training set using grid search for the parameters - parameters = { - 'vect__ngram_range': [(1, 1), (1, 2)], - } - grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1) - grid_search.fit(docs_train, y_train) - - # TASK: print the mean and std for each candidate along with the parameter - # settings for all the candidates explored by grid search. - n_candidates = len(grid_search.cv_results_['params']) - for i in range(n_candidates): - print(i, 'params - %s; mean - %0.2f; std - %0.2f' - % (grid_search.cv_results_['params'][i], - grid_search.cv_results_['mean_test_score'][i], - grid_search.cv_results_['std_test_score'][i])) - - # TASK: Predict the outcome on the testing set and store it in a variable - # named y_predicted - y_predicted = grid_search.predict(docs_test) - - # Print the classification report - print(metrics.classification_report(y_test, y_predicted, - target_names=dataset.target_names)) - - # Print and plot the confusion matrix - cm = metrics.confusion_matrix(y_test, y_predicted) - print(cm) - - # import matplotlib.pyplot as plt - # plt.matshow(cm) - # plt.show() diff --git a/doc/tutorial/text_analytics/solutions/generate_skeletons.py b/doc/tutorial/text_analytics/solutions/generate_skeletons.py deleted file mode 100644 index 4729b976530c7..0000000000000 --- a/doc/tutorial/text_analytics/solutions/generate_skeletons.py +++ /dev/null @@ -1,38 +0,0 @@ -"""Generate skeletons from the example code""" -import os - -exercise_dir = os.path.dirname(__file__) -if exercise_dir == '': - exercise_dir = '.' 
- -skeleton_dir = os.path.abspath(os.path.join(exercise_dir, '..', 'skeletons')) -if not os.path.exists(skeleton_dir): - os.makedirs(skeleton_dir) - -solutions = os.listdir(exercise_dir) - -for f in solutions: - if not f.endswith('.py'): - continue - - if f == os.path.basename(__file__): - continue - - print("Generating skeleton for %s" % f) - - input_file = open(os.path.join(exercise_dir, f)) - output_file = open(os.path.join(skeleton_dir, f), 'w') - - in_exercise_region = False - - for line in input_file: - linestrip = line.strip() - if len(linestrip) == 0: - in_exercise_region = False - elif linestrip.startswith('# TASK:'): - in_exercise_region = True - - if not in_exercise_region or linestrip.startswith('#'): - output_file.write(line) - - output_file.close() diff --git a/doc/tutorial/text_analytics/working_with_text_data.rst b/doc/tutorial/text_analytics/working_with_text_data.rst deleted file mode 100644 index 43fd305c3b8b6..0000000000000 --- a/doc/tutorial/text_analytics/working_with_text_data.rst +++ /dev/null @@ -1,586 +0,0 @@ -.. _text_data_tutorial: - -====================== -Working With Text Data -====================== - -The goal of this guide is to explore some of the main ``scikit-learn`` -tools on a single practical task: analyzing a collection of text -documents (newsgroups posts) on twenty different topics. - -In this section we will see how to: - -- load the file contents and the categories - -- extract feature vectors suitable for machine learning - -- train a linear model to perform categorization - -- use a grid search strategy to find a good configuration of both - the feature extraction components and the classifier - - -Tutorial setup --------------- - -To get started with this tutorial, you must first install -*scikit-learn* and all of its required dependencies. - -Please refer to the :ref:`installation instructions ` -page for more information and for system-specific instructions. - -The source of this tutorial can be found within your scikit-learn folder:: - - scikit-learn/doc/tutorial/text_analytics/ - -The source can also be found `on Github -`_. - -The tutorial folder should contain the following sub-folders: - -* ``*.rst files`` - the source of the tutorial document written with sphinx - -* ``data`` - folder to put the datasets used during the tutorial - -* ``skeletons`` - sample incomplete scripts for the exercises - -* ``solutions`` - solutions of the exercises - - -You can already copy the skeletons into a new folder somewhere -on your hard-drive named ``sklearn_tut_workspace``, where you -will edit your own files for the exercises while keeping -the original skeletons intact: - -.. prompt:: bash $ - - cp -r skeletons work_directory/sklearn_tut_workspace - - -Machine learning algorithms need data. Go to each ``$TUTORIAL_HOME/data`` -sub-folder and run the ``fetch_data.py`` script from there (after -having read them first). - -For instance: - -.. prompt:: bash $ - - cd $TUTORIAL_HOME/data/languages - less fetch_data.py - python fetch_data.py - - -Loading the 20 newsgroups dataset ---------------------------------- - -The dataset is called "Twenty Newsgroups". Here is the official -description, quoted from the `website -`_: - - The 20 Newsgroups data set is a collection of approximately 20,000 - newsgroup documents, partitioned (nearly) evenly across 20 different - newsgroups. 
To the best of our knowledge, it was originally collected - by Ken Lang, probably for his paper "Newsweeder: Learning to filter - netnews," though he does not explicitly mention this collection. - The 20 newsgroups collection has become a popular data set for - experiments in text applications of machine learning techniques, - such as text classification and text clustering. - -In the following we will use the built-in dataset loader for 20 newsgroups -from scikit-learn. Alternatively, it is possible to download the dataset -manually from the website and use the :func:`sklearn.datasets.load_files` -function by pointing it to the ``20news-bydate-train`` sub-folder of the -uncompressed archive folder. - -In order to get faster execution times for this first example, we will -work on a partial dataset with only 4 categories out of the 20 available -in the dataset:: - - >>> categories = ['alt.atheism', 'soc.religion.christian', - ... 'comp.graphics', 'sci.med'] - -We can now load the list of files matching those categories as follows:: - - >>> from sklearn.datasets import fetch_20newsgroups - >>> twenty_train = fetch_20newsgroups(subset='train', - ... categories=categories, shuffle=True, random_state=42) - -The returned dataset is a ``scikit-learn`` "bunch": a simple holder -object with fields that can be both accessed as python ``dict`` -keys or ``object`` attributes for convenience, for instance the -``target_names`` holds the list of the requested category names:: - - >>> twenty_train.target_names - ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian'] - -The files themselves are loaded in memory in the ``data`` attribute. For -reference the filenames are also available:: - - >>> len(twenty_train.data) - 2257 - >>> len(twenty_train.filenames) - 2257 - -Let's print the first lines of the first loaded file:: - - >>> print("\n".join(twenty_train.data[0].split("\n")[:3])) - From: sd345@city.ac.uk (Michael Collier) - Subject: Converting images to HP LaserJet III? - Nntp-Posting-Host: hampton - - >>> print(twenty_train.target_names[twenty_train.target[0]]) - comp.graphics - -Supervised learning algorithms will require a category label for each -document in the training set. In this case the category is the name of the -newsgroup which also happens to be the name of the folder holding the -individual documents. - -For speed and space efficiency reasons, ``scikit-learn`` loads the -target attribute as an array of integers that corresponds to the -index of the category name in the ``target_names`` list. The category -integer id of each sample is stored in the ``target`` attribute:: - - >>> twenty_train.target[:10] - array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2]) - -It is possible to get back the category names as follows:: - - >>> for t in twenty_train.target[:10]: - ... print(twenty_train.target_names[t]) - ... - comp.graphics - comp.graphics - soc.religion.christian - soc.religion.christian - soc.religion.christian - soc.religion.christian - soc.religion.christian - sci.med - sci.med - sci.med - -You might have noticed that the samples were shuffled randomly when we called -``fetch_20newsgroups(..., shuffle=True, random_state=42)``: this is useful if -you wish to select only a subset of samples to quickly train a model and get a -first idea of the results before re-training on the complete dataset later. 
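
For instance, a quick dry run might slice the shuffled bunch before fitting
anything; the subset size of 500 below is an arbitrary illustration, not a
recommended value::

    >>> sample_docs = twenty_train.data[:500]       # first 500 shuffled documents
    >>> sample_targets = twenty_train.target[:500]  # the matching category ids
    >>> len(sample_docs) == len(sample_targets)
    True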
- - -Extracting features from text files ------------------------------------ - -In order to perform machine learning on text documents, we first need to -turn the text content into numerical feature vectors. - -.. currentmodule:: sklearn.feature_extraction.text - - -Bags of words -~~~~~~~~~~~~~ - -The most intuitive way to do so is to use a bags of words representation: - -1. Assign a fixed integer id to each word occurring in any document - of the training set (for instance by building a dictionary - from words to integer indices). - -2. For each document ``#i``, count the number of occurrences of each - word ``w`` and store it in ``X[i, j]`` as the value of feature - ``#j`` where ``j`` is the index of word ``w`` in the dictionary. - -The bags of words representation implies that ``n_features`` is -the number of distinct words in the corpus: this number is typically -larger than 100,000. - -If ``n_samples == 10000``, storing ``X`` as a NumPy array of type -float32 would require 10000 x 100000 x 4 bytes = **4GB in RAM** which -is barely manageable on today's computers. - -Fortunately, **most values in X will be zeros** since for a given -document less than a few thousand distinct words will be -used. For this reason we say that bags of words are typically -**high-dimensional sparse datasets**. We can save a lot of memory by -only storing the non-zero parts of the feature vectors in memory. - -``scipy.sparse`` matrices are data structures that do exactly this, -and ``scikit-learn`` has built-in support for these structures. - - -Tokenizing text with ``scikit-learn`` -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Text preprocessing, tokenizing and filtering of stopwords are all included -in :class:`CountVectorizer`, which builds a dictionary of features and -transforms documents to feature vectors:: - - >>> from sklearn.feature_extraction.text import CountVectorizer - >>> count_vect = CountVectorizer() - >>> X_train_counts = count_vect.fit_transform(twenty_train.data) - >>> X_train_counts.shape - (2257, 35788) - -:class:`CountVectorizer` supports counts of N-grams of words or consecutive -characters. Once fitted, the vectorizer has built a dictionary of feature -indices:: - - >>> count_vect.vocabulary_.get(u'algorithm') - 4690 - -The index value of a word in the vocabulary is linked to its frequency -in the whole training corpus. - -.. note: - - The method ``count_vect.fit_transform`` performs two actions: - it learns the vocabulary and transforms the documents into count vectors. - It's possible to separate these steps by calling - ``count_vect.fit(twenty_train.data)`` followed by - ``X_train_counts = count_vect.transform(twenty_train.data)``, - but doing so would tokenize and vectorize each text file twice. - - -From occurrences to frequencies -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Occurrence count is a good start but there is an issue: longer -documents will have higher average count values than shorter documents, -even though they might talk about the same topics. - -To avoid these potential discrepancies it suffices to divide the -number of occurrences of each word in a document by the total number -of words in the document: these new features are called ``tf`` for Term -Frequencies. - -Another refinement on top of tf is to downscale weights for words -that occur in many documents in the corpus and are therefore less -informative than those that occur only in a smaller portion of the -corpus. - -This downscaling is called `tf–idf`_ for "Term Frequency times -Inverse Document Frequency". - -.. 
_`tf–idf`: https://en.wikipedia.org/wiki/Tf-idf - - -Both **tf** and **tf–idf** can be computed as follows using -:class:`TfidfTransformer`:: - - >>> from sklearn.feature_extraction.text import TfidfTransformer - >>> tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts) - >>> X_train_tf = tf_transformer.transform(X_train_counts) - >>> X_train_tf.shape - (2257, 35788) - -In the above example-code, we firstly use the ``fit(..)`` method to fit our -estimator to the data and secondly the ``transform(..)`` method to transform -our count-matrix to a tf-idf representation. -These two steps can be combined to achieve the same end result faster -by skipping redundant processing. This is done through using the -``fit_transform(..)`` method as shown below, and as mentioned in the note -in the previous section:: - - >>> tfidf_transformer = TfidfTransformer() - >>> X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts) - >>> X_train_tfidf.shape - (2257, 35788) - - -Training a classifier ---------------------- - -Now that we have our features, we can train a classifier to try to predict -the category of a post. Let's start with a :ref:`naïve Bayes ` -classifier, which -provides a nice baseline for this task. ``scikit-learn`` includes several -variants of this classifier, and the one most suitable for word counts is the -multinomial variant:: - - >>> from sklearn.naive_bayes import MultinomialNB - >>> clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target) - -To try to predict the outcome on a new document we need to extract -the features using almost the same feature extracting chain as before. -The difference is that we call ``transform`` instead of ``fit_transform`` -on the transformers, since they have already been fit to the training set:: - - >>> docs_new = ['God is love', 'OpenGL on the GPU is fast'] - >>> X_new_counts = count_vect.transform(docs_new) - >>> X_new_tfidf = tfidf_transformer.transform(X_new_counts) - - >>> predicted = clf.predict(X_new_tfidf) - - >>> for doc, category in zip(docs_new, predicted): - ... print('%r => %s' % (doc, twenty_train.target_names[category])) - ... - 'God is love' => soc.religion.christian - 'OpenGL on the GPU is fast' => comp.graphics - - -Building a pipeline -------------------- - -In order to make the vectorizer => transformer => classifier easier -to work with, ``scikit-learn`` provides a :class:`~sklearn.pipeline.Pipeline` class that behaves -like a compound classifier:: - - >>> from sklearn.pipeline import Pipeline - >>> text_clf = Pipeline([ - ... ('vect', CountVectorizer()), - ... ('tfidf', TfidfTransformer()), - ... ('clf', MultinomialNB()), - ... ]) - - -The names ``vect``, ``tfidf`` and ``clf`` (classifier) are arbitrary. -We will use them to perform grid search for suitable hyperparameters below. -We can now train the model with a single command:: - - >>> text_clf.fit(twenty_train.data, twenty_train.target) - Pipeline(...) - - -Evaluation of the performance on the test set ---------------------------------------------- - -Evaluating the predictive accuracy of the model is equally easy:: - - >>> import numpy as np - >>> twenty_test = fetch_20newsgroups(subset='test', - ... categories=categories, shuffle=True, random_state=42) - >>> docs_test = twenty_test.data - >>> predicted = text_clf.predict(docs_test) - >>> np.mean(predicted == twenty_test.target) - 0.8348... - -We achieved 83.5% accuracy. 
Let's see if we can do better with a -linear :ref:`support vector machine (SVM) `, -which is widely regarded as one of -the best text classification algorithms (although it's also a bit slower -than naïve Bayes). We can change the learner by simply plugging a different -classifier object into our pipeline:: - - >>> from sklearn.linear_model import SGDClassifier - >>> text_clf = Pipeline([ - ... ('vect', CountVectorizer()), - ... ('tfidf', TfidfTransformer()), - ... ('clf', SGDClassifier(loss='hinge', penalty='l2', - ... alpha=1e-3, random_state=42, - ... max_iter=5, tol=None)), - ... ]) - - >>> text_clf.fit(twenty_train.data, twenty_train.target) - Pipeline(...) - >>> predicted = text_clf.predict(docs_test) - >>> np.mean(predicted == twenty_test.target) - 0.9101... - -We achieved 91.3% accuracy using the SVM. ``scikit-learn`` provides further -utilities for more detailed performance analysis of the results:: - - >>> from sklearn import metrics - >>> print(metrics.classification_report(twenty_test.target, predicted, - ... target_names=twenty_test.target_names)) - precision recall f1-score support - - alt.atheism 0.95 0.80 0.87 319 - comp.graphics 0.87 0.98 0.92 389 - sci.med 0.94 0.89 0.91 396 - soc.religion.christian 0.90 0.95 0.93 398 - - accuracy 0.91 1502 - macro avg 0.91 0.91 0.91 1502 - weighted avg 0.91 0.91 0.91 1502 - - - >>> metrics.confusion_matrix(twenty_test.target, predicted) - array([[256, 11, 16, 36], - [ 4, 380, 3, 2], - [ 5, 35, 353, 3], - [ 5, 11, 4, 378]]) - -As expected the confusion matrix shows that posts from the newsgroups -on atheism and Christianity are more often confused for one another than -with computer graphics. - -.. note: - - SGD stands for Stochastic Gradient Descent. This is a simple - optimization algorithms that is known to be scalable when the dataset - has many samples. - - By setting ``loss="hinge"`` and ``penalty="l2"`` we are configuring - the classifier model to tune its parameters for the linear Support - Vector Machine cost function. - - Alternatively we could have used ``sklearn.svm.LinearSVC`` (Linear - Support Vector Machine Classifier) that provides an alternative - optimizer for the same cost function based on the liblinear_ C++ - library. - -.. _liblinear: https://www.csie.ntu.edu.tw/~cjlin/liblinear/ - - -Parameter tuning using grid search ----------------------------------- - -We've already encountered some parameters such as ``use_idf`` in the -``TfidfTransformer``. Classifiers tend to have many parameters as well; -e.g., ``MultinomialNB`` includes a smoothing parameter ``alpha`` and -``SGDClassifier`` has a penalty parameter ``alpha`` and configurable loss -and penalty terms in the objective function (see the module documentation, -or use the Python ``help`` function to get a description of these). - -Instead of tweaking the parameters of the various components of the -chain, it is possible to run an exhaustive search of the best -parameters on a grid of possible values. We try out all classifiers -on either words or bigrams, with or without idf, and with a penalty -parameter of either 0.01 or 0.001 for the linear SVM:: - - >>> from sklearn.model_selection import GridSearchCV - >>> parameters = { - ... 'vect__ngram_range': [(1, 1), (1, 2)], - ... 'tfidf__use_idf': (True, False), - ... 'clf__alpha': (1e-2, 1e-3), - ... } - - -Obviously, such an exhaustive search can be expensive. 
If we have multiple -CPU cores at our disposal, we can tell the grid searcher to try these eight -parameter combinations in parallel with the ``n_jobs`` parameter. If we give -this parameter a value of ``-1``, grid search will detect how many cores -are installed and use them all:: - - >>> gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1) - -The grid search instance behaves like a normal ``scikit-learn`` -model. Let's perform the search on a smaller subset of the training data -to speed up the computation:: - - >>> gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400]) - -The result of calling ``fit`` on a ``GridSearchCV`` object is a classifier -that we can use to ``predict``:: - - >>> twenty_train.target_names[gs_clf.predict(['God is love'])[0]] - 'soc.religion.christian' - -The object's ``best_score_`` and ``best_params_`` attributes store the best -mean score and the parameters setting corresponding to that score:: - - >>> gs_clf.best_score_ - 0.9... - >>> for param_name in sorted(parameters.keys()): - ... print("%s: %r" % (param_name, gs_clf.best_params_[param_name])) - ... - clf__alpha: 0.001 - tfidf__use_idf: True - vect__ngram_range: (1, 1) - -A more detailed summary of the search is available at ``gs_clf.cv_results_``. - -The ``cv_results_`` parameter can be easily imported into pandas as a -``DataFrame`` for further inspection. - -.. note: - - A ``GridSearchCV`` object also stores the best classifier that it trained - as its ``best_estimator_`` attribute. In this case, that isn't much use as - we trained on a small, 400-document subset of our full training set. - - -Exercises -~~~~~~~~~ - -To do the exercises, copy the content of the 'skeletons' folder as -a new folder named 'workspace': - -.. prompt:: bash $ - - cp -r skeletons workspace - - -You can then edit the content of the workspace without fear of losing -the original exercise instructions. - -Then fire an ipython shell and run the work-in-progress script with:: - - [1] %run workspace/exercise_XX_script.py arg1 arg2 arg3 - -If an exception is triggered, use ``%debug`` to fire-up a post -mortem ipdb session. - -Refine the implementation and iterate until the exercise is solved. - -**For each exercise, the skeleton file provides all the necessary import -statements, boilerplate code to load the data and sample code to evaluate -the predictive accuracy of the model.** - - -Exercise 1: Language identification ------------------------------------ - -- Write a text classification pipeline using a custom preprocessor and - ``TfidfVectorizer`` set up to use character based n-grams, using data from Wikipedia articles as the training set. - -- Evaluate the performance on some held out test set. - -ipython command line:: - - %run workspace/exercise_01_language_train_model.py data/languages/paragraphs/ - - -Exercise 2: Sentiment Analysis on movie reviews ------------------------------------------------ - -- Write a text classification pipeline to classify movie reviews as either - positive or negative. - -- Find a good set of parameters using grid search. - -- Evaluate the performance on a held out test set. 
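
The skeleton for this exercise asks for a vectorizer that filters out tokens that
are too rare or too frequent; this is usually expressed with the ``min_df`` and
``max_df`` arguments of ``TfidfVectorizer``. A minimal sketch, with illustrative
(not tuned) thresholds::

    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> vect = TfidfVectorizer(
    ...     min_df=3,     # ignore terms that appear in fewer than 3 documents
    ...     max_df=0.95,  # ignore terms that appear in more than 95% of the documents
    ... )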
-
-ipython command line::
-
-  %run workspace/exercise_02_sentiment.py data/movie_reviews/txt_sentoken/
-
-
-Exercise 3: CLI text classification utility
--------------------------------------------
-
-Using the results of the previous exercises and the ``cPickle``
-module of the standard library, write a command line utility that
-detects the language of some text provided on ``stdin`` and estimate
-the polarity (positive or negative) if the text is written in
-English.
-
-Bonus point if the utility is able to give a confidence level for its
-predictions.
-
-
-Where to from here
-------------------
-
-Here are a few suggestions to help further your scikit-learn intuition
-upon the completion of this tutorial:
-
-
-* Try playing around with the ``analyzer`` and ``token normalisation`` under
-  :class:`CountVectorizer`.
-
-* If you don't have labels, try using
-  :ref:`Clustering `
-  on your problem.
-
-* If you have multiple labels per document, e.g. categories, have a look
-  at the :ref:`Multiclass and multilabel section `.
-
-* Try using :ref:`Truncated SVD ` for
-  `latent semantic analysis `_.
-
-* Have a look at using
-  :ref:`Out-of-core Classification
-  ` to
-  learn from data that would not fit into the computer main memory.
-
-* Have a look at the :ref:`Hashing Vectorizer `
-  as a memory efficient alternative to :class:`CountVectorizer`.

From da38895c5264b6174d77b2f1a45770765491ade9 Mon Sep 17 00:00:00 2001
From: adrinjalali
Date: Sat, 25 May 2024 00:18:28 +0200
Subject: [PATCH 02/11] remove exclude from pyproject.toml

---
 pyproject.toml | 2 --
 1 file changed, 2 deletions(-)

diff --git a/pyproject.toml b/pyproject.toml
index 9f1fd9ec3b1bb..80636a4dcaa50 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -115,7 +115,6 @@ exclude = '''
   | \.vscode
   | build
   | dist
-  | doc/tutorial
   | doc/_build
   | doc/auto_examples
   | sklearn/externals
@@ -134,7 +133,6 @@ exclude=[
     "sklearn/externals",
     "doc/_build",
     "doc/auto_examples",
-    "doc/tutorial",
     "build",
     "asv_benchmarks/env",
     "asv_benchmarks/html",

From 85eac1d767623849fc8b715d7cb158568f242787 Mon Sep 17 00:00:00 2001
From: adrinjalali
Date: Sat, 25 May 2024 13:13:52 +0200
Subject: [PATCH 03/11] DOC add back ML Map and reference from getting started

---
 doc/developers/contributing.rst |  3 --
 doc/getting_started.rst         |  2 +
 doc/machine_learning_map.rst    | 75 +++++++++++++++++++++++++++++++++
 3 files changed, 77 insertions(+), 3 deletions(-)
 create mode 100644 doc/machine_learning_map.rst

diff --git a/doc/developers/contributing.rst b/doc/developers/contributing.rst
index 402711dcd1bf3..2900ed02803d7 100644
--- a/doc/developers/contributing.rst
+++ b/doc/developers/contributing.rst
@@ -659,9 +659,6 @@ We are glad to accept any sort of documentation:
   `doc/ `_
   directory and `doc/modules/
   `_.
-* **tutorials** - these introduce various statistical learning and machine learning
-  concepts and are located in
-  `doc/tutorial `_.
 * **examples** - these provide full code examples that may demonstrate the use
   of scikit-learn modules, compare different algorithms or discuss their
   interpretation etc. Examples live in

diff --git a/doc/getting_started.rst b/doc/getting_started.rst
index cd4d953db1b8a..295671a2a2e0e 100644
--- a/doc/getting_started.rst
+++ b/doc/getting_started.rst
@@ -53,6 +53,8 @@ new data. You don't need to re-train the estimator::
    >>> clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data
    array([0, 1])
 
+You can check :ref:`ml_map` on how to choose the right model for your use case.
+
 Transformers and pre-processors
 -------------------------------
 
diff --git a/doc/machine_learning_map.rst b/doc/machine_learning_map.rst
new file mode 100644
index 0000000000000..0c1e811716648
--- /dev/null
+++ b/doc/machine_learning_map.rst
@@ -0,0 +1,75 @@
+:html_theme.sidebar_secondary.remove:
+
+.. _ml_map:
+
+Choosing the right estimator
+============================
+
+Often the hardest part of solving a machine learning problem can be finding the right
+estimator for the job. Different estimators are better suited for different types of
+data and different problems.
+
+The flowchart below is designed to give you a rough guide on how to approach
+problems with regard to which estimators to try on your data. Click on any estimator in
+the chart below to see its documentation. Use the scroll wheel to zoom in and out, and click
+and drag to pan around. You can also download the chart:
+:download:`ml_map.svg `.
+
+.. raw:: html
+
+  
+
+  
+
+  
+ +.. raw:: html + :file: ../../images/ml_map.svg + +.. raw:: html + +
From 7c41009a324bece48b57391c12d87e6701d4a7f7 Mon Sep 17 00:00:00 2001 From: adrinjalali Date: Sat, 25 May 2024 13:15:16 +0200 Subject: [PATCH 04/11] fix paths --- doc/machine_learning_map.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/machine_learning_map.rst b/doc/machine_learning_map.rst index 0c1e811716648..fb2bae2a53716 100644 --- a/doc/machine_learning_map.rst +++ b/doc/machine_learning_map.rst @@ -35,7 +35,7 @@ and drag to pan around. You can also download the chart: } - + +