@@ -36,8 +36,8 @@ description, quoted from the `website
To download the dataset, go to ``$TUTORIAL_HOME/twenty_newsgroups`` and
run the ``fetch_data.py`` script.

- Once the data is downloaded, fire an ipython shell in the
- ``$TUTORIAL_HOME`` folder and define a variable to hold the list
+ Once the data is downloaded, start a Python interpreter (or IPython shell)
+ in the ``$TUTORIAL_HOME`` folder and define a variable to hold the list
of categories to load. In order to get fast execution times for
this first example we will work on a partial dataset with only 4
categories out of the 20 available in the dataset::
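>>> # A sketch of this step: the four category names are the ones that
>>> # appear in the evaluation section below, and the load_files call
>>> # mirrors the test-set loading code further down (both are assumptions).
>>> categories = ['alt.atheism', 'comp.graphics',
...               'sci.med', 'soc.religion.christian']
>>> twenty_train = load_files('data/twenty_newsgroups/20news-bydate-train',
...                           categories=categories)
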
@@ -67,14 +67,14 @@ The files themselves are not loaded in memory yet::
>>> twenty_train.filenames[0]
'data/twenty_newsgroups/20news-bydate-train/comp.graphics/38244'

- Let us print the first 2 lines of the first file::
+ Let's print the first 2 lines of the first file::

>>> print "".join(open(twenty_train.filenames[0]).readlines()[:2]).strip()
From: clipper@mccarthy.csd.uwo.ca (Khun Yee Fung)
Subject: Re: looking for circle algorithm faster than Bresenhams

- Supervised learning algorithms will require the category to predict
- for each document. In this case the category is the name of the
+ Supervised learning algorithms will require a category label for each
+ document in the training set. In this case the category is the name of the
newsgroup which also happens to be the name of the folder holding the
individual documents.

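To make the mapping concrete, the label of the first document can be read
back through the ``target`` and ``target_names`` attributes used later in
this tutorial; a minimal sketch, assuming the first file is the
``comp.graphics`` post shown above::

>>> twenty_train.target_names[twenty_train.target[0]]
'comp.graphics'
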
@@ -111,8 +111,8 @@ before re-training on the complete dataset later.
Extracting features from text files
-----------------------------------

- In order to perform machine learning on text documents, one first
- needs to turn the text content into numerical feature vectors.
+ In order to perform machine learning on text documents, we first need to
+ turn the text content into numerical feature vectors.


Bags of words
@@ -175,12 +175,14 @@ instead of single words::
[u'ai', u'bien', u'mange', u'ai bien', u'bien mange']

These tools are wrapped into a higher level component that is able to build a
- dictionary of features::
+ dictionary of features and transform documents to feature vectors::

>>> from scikits.learn.feature_extraction.text import CountVectorizer
>>> count_vect = CountVectorizer()
>>> docs_train = [open(f).read() for f in twenty_train.filenames]
- >>> _ = count_vect.fit(docs_train)
+ >>> X_train_counts = count_vect.fit_transform(docs_train)
+ >>> X_train_counts.shape
+ (2257, 33881)


Once fitted, the vectorizer has built a dictionary of feature indices::
@@ -190,18 +192,14 @@ Once fitted, the vectorizer has built a dictionary of feature indices::
The index value of a word in the vocabulary is linked to its frequency
in the whole training corpus.

- Once the vocabulary is built, it is possible to rescan the training
- set so as to perform the actual feature extraction::
-
- >>> X_train_counts = count_vect.transform(docs_train)
- >>> X_train_counts.shape
- (2257, 33881)
-
.. note::

- to avoid reading and tokenizing each text file twice it is possible
- to call ``count_vect.fit_transform(documents)`` and get the
- same output as ``count_vect.fit(documents).transform(documents)``.
+ The method ``count_vect.fit_transform`` performs two actions:
+ it learns the vocabulary and transforms the documents into count vectors.
+ It's possible to separate these steps by calling
+ ``count_vect.fit(docs_train)`` followed by
+ ``X_train_counts = count_vect.transform(docs_train)``,
+ but doing so would read and tokenize each text file twice.

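As a quick illustration of the note above, a sketch of the two-pass variant
with a fresh vectorizer; it should produce a count matrix of the same shape,
at the cost of reading and tokenizing the files a second time::

>>> count_vect_two_pass = CountVectorizer()
>>> _ = count_vect_two_pass.fit(docs_train)        # first pass: learn the vocabulary
>>> X_counts_two_pass = count_vect_two_pass.transform(docs_train)  # second pass: count words
>>> X_counts_two_pass.shape
(2257, 33881)
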
From occurrences to frequencies
@@ -213,45 +211,50 @@ even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the
number of occurrences of each word in a document by the total number
- of words in the document: these new features are called "TF" for Term
+ of words in the document: these new features are called "tf" for Term
Frequencies.

- Another refinement on top of TF is to downscale weights for words
+ Another refinement on top of tf is to downscale weights for words
that occur in many documents in the corpus and are therefore less
informative than those that occur only in a smaller portion of the
corpus.

- This downscaling is called `TF-IDF`_ for "Term Frequency times
+ This downscaling is called `tf–idf`_ for "Term Frequency times
Inverse Document Frequency".

- .. _`TF-IDF`: http://en.wikipedia.org/wiki/Tf-idf
+ .. _`tf–idf`: http://en.wikipedia.org/wiki/Tf–idf

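To make the downscaling concrete: the classical idf weight of a term is the
logarithm of the total number of documents divided by the number of documents
containing the term, and it multiplies the term frequency. A rough sketch
with made-up numbers (the exact weighting used by ``TfidfTransformer``, e.g.
its smoothing and normalization, may differ)::

>>> import numpy as np
>>> n_documents = 4.0
>>> doc_freq = np.array([4.0, 2.0, 1.0])   # how many documents contain each term
>>> idf = np.log(n_documents / doc_freq)   # 0.0 for a term present in every document
>>> tf = np.array([0.05, 0.05, 0.05])      # frequencies of the terms in one document
>>> tfidf = tf * idf                       # rarer terms end up with larger weights
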
- Both TF and TF-IDF can be computed as follows::
+ Both tf and tf–idf can be computed as follows::

>>> from scikits.learn.feature_extraction.text import TfidfTransformer
>>> tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
>>> X_train_tf = tf_transformer.transform(X_train_counts)
>>> X_train_tf.shape
(2257, 33881)

- >>> tfidf_transformer = TfidfTransformer().fit(X_train_counts)
- >>> X_train_tfidf = tfidf_transformer.transform(X_train_counts)
+ >>> tfidf_transformer = TfidfTransformer()
+ >>> X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
>>> X_train_tfidf.shape
(2257, 33881)


Training a linear classifier
----------------------------

- Now that we have our feature, we can train a linear classifier to
- try to predict the category of a post::
+ Now that we have our features, we can train a classifier to try to predict
+ the category of a post. Let's start with a naïve Bayes classifier, which
+ provides a nice baseline for this task. ``scikit-learn`` includes several
+ variants of this classifier; the one most suitable for word counts is the
+ multinomial variant::

- >>> from scikits.learn.svm.sparse import LinearSVC
- >>> clf = LinearSVC(C=1000).fit(X_train_tfidf, twenty_train.target)
+ >>> from scikits.learn.naive_bayes import MultinomialNB
+ >>> clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

To try to predict the outcome on a new document we need to extract
- the features using the same feature extracting chain::
+ the features using almost the same feature extracting chain as before.
+ The difference is that we call ``transform`` instead of ``fit_transform``
+ on the transformers, since they have already been fit to the training set::

>>> docs_new = ['God is love', 'OpenGL on the GPU is fast']
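>>> # A sketch of the remaining steps of the chain, assuming the count_vect
>>> # and tfidf_transformer objects fitted above:
>>> X_new_counts = count_vect.transform(docs_new)        # transform, not fit_transform
>>> X_new_tfidf = tfidf_transformer.transform(X_new_counts)
>>> predicted = clf.predict(X_new_tfidf)                 # one category index per document
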
@@ -271,16 +274,18 @@ Building a pipeline
-------------------

In order to make the vectorizer => transformer => classifier easier
- to work with, scikit-learn provides a ``Pipeline`` class that behaves
- like a compound estimator::
+ to work with, ``scikit-learn`` provides a ``Pipeline`` class that behaves
+ like a compound classifier::

>>> from scikits.learn.pipeline import Pipeline
>>> text_clf = Pipeline([
... ('vect', CountVectorizer()),
... ('tfidf', TfidfTransformer()),
- ... ('clf', LinearSVC(C=1000)),
+ ... ('clf', MultinomialNB()),
... ])

+ The names ``vect``, ``tfidf`` and ``clf`` (classifier) are arbitrary.
+ We shall see their use in the section on grid search, below.

We can now train the model with a single command::

>>> _ = text_clf.fit(docs_train, twenty_train.target)
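>>> # Once fitted, the pipeline predicts directly from raw text, e.g. on the
>>> # docs_new list defined earlier (an illustrative sketch):
>>> predicted_new = text_clf.predict(docs_new)
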
@@ -289,7 +294,7 @@ We can now train the model with a single command::
Evaluation of the performance on the test set
---------------------------------------------

- Evaluating the predictive accurracy of the model is equally easy::
+ Evaluating the predictive accuracy of the model is equally easy::

>>> import numpy as np
>>> twenty_test = load_files('data/twenty_newsgroups/20news-bydate-test',
@@ -298,50 +303,86 @@ Evaluating the predictive accurracy of the model is equally easy::
>>> docs_test = [open(f).read() for f in twenty_test.filenames]
>>> predicted = text_clf.predict(docs_test)
>>> np.mean(predicted == twenty_test.target)
- 0.93075898801597867
+ 0.86884154460719043
+
+ I.e., we achieved 86.9% accuracy. Let's see if we can do better with a
+ linear support vector machine (SVM), which is widely regarded as one of
+ the best text classification algorithms (although it's also a bit slower
+ than naïve Bayes). We can change the learner by just plugging a different
+ classifier object into our pipeline::
+
+ >>> from scikits.learn.svm.sparse import LinearSVC
+ >>> text_clf = Pipeline([
+ ... ('vect', CountVectorizer()),
+ ... ('tfidf', TfidfTransformer()),
+ ... ('clf', LinearSVC()),
+ ... ])
+ >>> _ = text_clf.fit(docs_train, twenty_train.target)
+ >>> predicted = text_clf.predict(docs_test)
+ >>> np.mean(predicted == twenty_test.target)
+ 0.92410119840213045

``scikit-learn`` further provides utilities for more detailed performance
analysis of the results::

>>> from scikits.learn import metrics
>>> print metrics.classification_report(
... twenty_test.target, predicted,
- ... class_names=twenty_test.target_names)
+ ... target_names=twenty_test.target_names)
...
+
                        precision    recall  f1-score   support
<BLANKLINE>
-            alt.atheism       0.93      0.85      0.89       319
-          comp.graphics       0.97      0.95      0.96       389
-                sci.med       0.94      0.95      0.95       396
- soc.religion.christian       0.88      0.95      0.92       398
+            alt.atheism       0.95      0.80      0.87       319
+          comp.graphics       0.96      0.97      0.96       389
+                sci.med       0.95      0.95      0.95       396
+ soc.religion.christian       0.86      0.96      0.90       398
<BLANKLINE>
-            avg / total       0.93      0.93      0.93      1502
+            avg / total       0.93      0.92      0.92      1502
<BLANKLINE>

>>> metrics.confusion_matrix(twenty_test.target, predicted)
- array([[271,   3,   9,  36],
-        [  4, 371,   9,   5],
-        [  4,   6, 377,   9],
-        [ 11,   4,   4, 379]])
+ array([[254,   4,  11,  50],
+        [  3, 376,   6,   4],
+        [  1,   9, 377,   9],
+        [  9,   4,   4, 381]])
+

+ .. note::
+
+ SVC stands for support vector classifier. ``scikit-learn`` also
+ includes support vector machines for regression tasks, which are
+ called SVR.


Parameter tuning using grid search
----------------------------------

+ We've already encountered some parameters such as ``use_idf`` in the
+ ``TfidfTransformer``. Classifiers tend to have many parameters as well;
+ e.g., ``MultinomialNB`` includes a smoothing parameter ``alpha``,
+ and ``LinearSVC`` has a penalty parameter ``C``
+ (see the module documentation, or use the Python ``help`` function,
+ to get a description of these).
+
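For instance, both parameters can be set explicitly when the classifiers are
constructed; a small sketch, with values chosen purely for illustration::

>>> clf_nb = MultinomialNB(alpha=0.1)   # smoothing strength for the count estimates
>>> clf_svm = LinearSVC(C=1000)         # penalty (regularization) parameter
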
Instead of tweaking the parameters of the various components of the
chain, it is possible to run an exhaustive search of the best
- parameters on a grid of possible values::
+ parameters on a grid of possible values. We try out all classifiers
+ on either words or bigrams, with or without idf, and with a penalty
+ parameter of either 100 or 1000 for the linear SVM::

>>> from scikits.learn.grid_search import GridSearchCV
>>> parameters = {
- ... 'vect__analyzer__max_n': (1, 2), # words or bigrams
+ ... 'vect__analyzer__max_n': (1, 2),
... 'tfidf__use_idf': (True, False),
... 'clf__C': (100, 1000),
... }
+
+ Obviously, such an exhaustive search can be expensive. If we have multiple
+ CPU cores at our disposal, we can tell the grid searcher to try these eight
+ parameter combinations in parallel with the ``n_jobs`` parameter. If we give
+ this parameter a value of ``-1``, grid search will detect how many cores are
+ installed and use them all::
+
>>> gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

The grid search instance behaves like a normal ``scikit-learn``
- model. Let us perform the search on a smaller subset of the dataset
+ model. Let's perform the search on a smaller subset of the training data
to speed up the computation::

>>> gs_clf = gs_clf.fit(docs_train[:400], twenty_train.target[:400])