Merge pull request #6 from larsmans/master · pprett/scikit-learn@5e57728 · GitHub

Commit 5e57728

Merge pull request #6 from larsmans/master

Revised text classification chapter

2 parents 7f6eb6e + dca2c71

File tree: 3 files changed, +95 -54 lines changed

tutorial/conf.py

Lines changed: 3 additions & 3 deletions

@@ -24,14 +24,14 @@
 
 # General information about the project.
 project = u'scikit-learn tutorial'
-copyright = u'2010, scikits.learn developers (BSD License)'
+copyright = u'2010–2011, scikits.learn developers (BSD License)'
 
 # The version info for the project you're documenting, acts as replacement for
 # |version| and |release|, also used in various other places throughout the
 # built documents.
 #
-# The short X.Y version.
-version = '0.1'
+# The short X.Y version. Should track scikit-learn version.
+version = '0.9'
 
 # The full version, including alpha/beta/rc tags.
 release = version

tutorial/themes/scikit-learn/layout.html

Lines changed: 2 additions & 2 deletions

@@ -85,8 +85,8 @@ <h3>{{ _('Contents') }}</h3>
 
 {%- block footer %}
 <div class="footer">
-<p style="text-align: center">This documentation is relative
-to {{project}} version {{ release|e }}<p>
+<p style="text-align: center">This tutorial is for scikit-learn
+version {{ release|e }}<p>
 
 {%- if show_copyright %}
 {%- if hasdoc('copyright') %}

tutorial/working_with_text_data.rst

Lines changed: 90 additions & 49 deletions

@@ -36,8 +36,8 @@ description, quoted from the `website
 To download the dataset, go to ``$TUTORIAL_HOME/twenty_newsgroups``
 and run the ``fetch_data.py`` script.
 
-Once the data is downloaded, fire an ipython shell in the
-``$TUTORIAL_HOME`` folder and define a variable to hold the list
+Once the data is downloaded, start a Python interpreter (or IPython shell)
+in the ``$TUTORIAL_HOME`` folder and define a variable to hold the list
 of categories to load. In order to get fast execution times for
 this first example we will work on a partial dataset with only 4
 categories out of the 20 available in the dataset::
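
The hunk cuts off just before that list is defined. The four categories used
throughout the chapter appear by name in the classification report further
down; a sketch of the variable the text calls for:

    >>> categories = ['alt.atheism', 'comp.graphics',
    ...               'sci.med', 'soc.religion.christian']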
@@ -67,14 +67,14 @@ The files themselves are not loaded in memory yet::
 >>> twenty_train.filenames[0]
 'data/twenty_newsgroups/20news-bydate-train/comp.graphics/38244'
 
-Let us print the first 2 lines of the first file::
+Let's print the first 2 lines of the first file::
 
 >>> print "".join(open(twenty_train.filenames[0]).readlines()[:2]).strip()
 From: clipper@mccarthy.csd.uwo.ca (Khun Yee Fung)
 Subject: Re: looking for circle algorithm faster than Bresenhams
 
-Supervised learning algorithms will require the category to predict
-for each document. In this case the category is the name of the
+Supervised learning algorithms will require a category label for each
+document in the training set. In this case the category is the name of the
 newsgroup which also happens to be the name of the folder holding the
 individual documents.
 
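The revised paragraph says each document's category is the name of its
folder. In the loaded bunch this shows up as integer labels that index into
``target_names``; a quick check (a sketch, reusing ``twenty_train`` from the
chapter):

    >>> # filenames[0] sits under comp.graphics, so its label maps back to that name
    >>> twenty_train.target_names[twenty_train.target[0]]
    'comp.graphics'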
@@ -111,8 +111,8 @@ before re-training on the complete dataset later.
 Extracting features from text files
 -----------------------------------
 
-In order to perform machine learning on text documents, one first
-needs to turn the text content into numerical feature vectors.
+In order to perform machine learning on text documents, we first need to
+turn the text content into numerical feature vectors.
 
 
 Bags of words
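
The "Bags of words" section the hunk leads into rests on one idea: keep
per-document word counts and discard word order. A minimal, self-contained
illustration in plain Python (toy document):

    from collections import Counter

    # Count each word; the order of words in the document is thrown away.
    bag = Counter("the cat sat on the mat".split())
    print(bag["the"], bag["cat"])   # 2 1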
@@ -175,12 +175,14 @@ instead of single words::
 [u'ai', u'bien', u'mange', u'ai bien', u'bien mange']
 
 These tools are wrapped into a higher level component that is able to build a
-dictionary of features::
+dictionary of features and transform documents to feature vectors::
 
 >>> from scikits.learn.feature_extraction.text import CountVectorizer
 >>> count_vect = CountVectorizer()
 >>> docs_train = [open(f).read() for f in twenty_train.filenames]
->>> _ = count_vect.fit(docs_train)
+>>> X_train_counts = count_vect.fit_transform(docs_train)
+>>> X_train_counts.shape
+(2257, 33881)
 
 Once fitted, the vectorizer has built a dictionary of feature indices::
 
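What the revised snippet does, replayed on a toy corpus (a sketch using the
modern ``sklearn`` package name and the ``vocabulary_`` attribute; the 2011
tutorial imports from ``scikits.learn``):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat", "the dog sat on the mat"]   # made-up corpus
    count_vect = CountVectorizer()
    X = count_vect.fit_transform(docs)   # learns the vocabulary and vectorizes in one pass
    print(X.shape)                       # (2, 6): 2 documents, 6 distinct words
    print(count_vect.vocabulary_)        # term -> column index dictionary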
@@ -190,18 +192,14 @@ Once fitted, the vectorizer has built a dictionary of feature indices::
 The index value of a word in the vocabulary is linked to its frequency
 in the whole training corpus.
 
-Once the vocabulary is built, it is possible to rescan the training
-set so as to perform the actual feature extraction::
-
->>> X_train_counts = count_vect.transform(docs_train)
->>> X_train_counts.shape
-(2257, 33881)
-
 .. note:
 
-  to avoid reading and tokenizing each text file twice it is possible
-  to call ``count_vect.fit_transform(documents)`` and get the
-  same output as ``count_vect.fit(documents).transform(documents)``.
+  The method ``count_vect.fit_transform`` performs two actions:
+  it learns the vocabulary and transforms the documents into count vectors.
+  It's possible to separate these steps by calling
+  ``count_vect.fit(docs_train)`` followed by
+  ``X_train_counts = count_vect.transform(docs_train)``,
+  but doing so would read and tokenize each text file twice.
 
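A quick check of the equivalence the revised note describes (a sketch under
the modern ``sklearn`` package name; the toy corpus is made up):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat", "the dog sat on the mat"]

    X_one = CountVectorizer().fit_transform(docs)         # one pass over the corpus
    X_two = CountVectorizer().fit(docs).transform(docs)   # reads and tokenizes twice

    assert (X_one != X_two).nnz == 0   # identical sparse matrices either way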
 From occurrences to frequencies
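
The tf and idf quantities this section renames can be written out in a few
lines of numpy (a sketch of the idea only; scikit-learn's actual
``TfidfTransformer`` adds smoothing and normalization on top):

    import numpy as np

    # Toy counts: 2 documents (rows) x 3 terms (columns).
    counts = np.array([[3., 0., 1.],
                       [2., 2., 0.]])

    tf = counts / counts.sum(axis=1, keepdims=True)  # occurrences / words in document
    df = (counts > 0).sum(axis=0)                    # how many documents contain each term
    idf = np.log(len(counts) / df)                   # downscales corpus-wide terms
    tfidf = tf * idf                                 # a term found in every document gets weight 0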
@@ -213,45 +211,50 @@ even though they might talk about the same topics.
 
 To avoid these potential discrepancies it suffices to divide the
 number of occurrences of each word in a document by the total number
-of words in the document: these new features are called "TF" for Term
+of words in the document: these new features are called "tf" for Term
 Frequencies.
 
-Another refinement on top of TF is to downscale weights for words
+Another refinement on top of tf is to downscale weights for words
 that occur in many documents in the corpus and are therefore less
 informative than those that occur only in a smaller portion of the
 corpus.
 
-This downscaling is called `TF-IDF`_ for "Term Frequency times
+This downscaling is called `tf–idf`_ for "Term Frequency times
 Inverse Document Frequency".
 
-.. _`TF-IDF`: http://en.wikipedia.org/wiki/Tf-idf
+.. _`tf–idf`: http://en.wikipedia.org/wiki/Tfidf
 
 
-Both TF and TF-IDF can be computed as follows::
+Both tf and tf–idf can be computed as follows::
 
 >>> from scikits.learn.feature_extraction.text import TfidfTransformer
 >>> tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
 >>> X_train_tf = tf_transformer.transform(X_train_counts)
 >>> X_train_tf.shape
 (2257, 33881)
 
->>> tfidf_transformer = TfidfTransformer().fit(X_train_counts)
->>> X_train_tfidf = tfidf_transformer.transform(X_train_counts)
+>>> tfidf_transformer = TfidfTransformer()
+>>> X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
 >>> X_train_tfidf.shape
 (2257, 33881)
 
 
 Training a linear classifier
 ----------------------------
 
-Now that we have our feature, we can train a linear classifier to
-try to predict the category of a post::
+Now that we have our features, we can train a classifier to try to predict
+the category of a post. Let's start with a naïve Bayes classifier, which
+provides a nice baseline for this task. ``scikit-learn`` includes several
+variants of this classifier; the one most suitable for word counts is the
+multinomial variant::
 
->>> from scikits.learn.svm.sparse import LinearSVC
->>> clf = LinearSVC(C=1000).fit(X_train_tfidf, twenty_train.target)
+>>> from scikits.learn.naive_bayes import MultinomialNB
+>>> clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
 
 To try to predict the outcome on a new document we need to extract
-the features using the same feature extracting chain::
+the features using almost the same feature extracting chain as before.
+The difference is that we call ``transform`` instead of ``fit_transform``
+on the transformers, since they have already been fit to the training set::
 
 
 >>> docs_new = ['God is love', 'OpenGL on the GPU is fast']
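
The hunk stops where the prediction snippet begins. From the revised prose
above (``transform``, not ``fit_transform``), the chain it refers to looks
roughly like this (a sketch reusing the fitted ``count_vect``,
``tfidf_transformer`` and ``clf``; these lines are not part of the diff):

    >>> X_new_counts = count_vect.transform(docs_new)            # vocabulary already learnt
    >>> X_new_tfidf = tfidf_transformer.transform(X_new_counts)  # idf weights already fitted
    >>> predicted = clf.predict(X_new_tfidf)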
@@ -271,16 +274,18 @@ Building a pipeline
 -------------------
 
 In order to make the vectorizer => transformer => classifier easier
-to work with, scikit-learn provides a ``Pipeline`` class that behaves
-like a compound estimator::
+to work with, ``scikit-learn`` provides a ``Pipeline`` class that behaves
+like a compound classifier::
 
 >>> from scikits.learn.pipeline import Pipeline
 >>> text_clf = Pipeline([
 ...   ('vect', CountVectorizer()),
 ...   ('tfidf', TfidfTransformer()),
-...   ('clf', LinearSVC(C=1000)),
+...   ('clf', MultinomialNB()),
 ... ])
 
+The names ``vect``, ``tfidf`` and ``clf`` (classifier) are arbitrary.
+We shall see their use in the section on grid search, below.
 We can now train the model with a single command::
 
 >>> _ = text_clf.fit(docs_train, twenty_train.target)
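
The step names matter beyond readability: ``Pipeline`` addresses nested
parameters as ``<step name>__<parameter>``, the same double-underscore
convention the grid search parameters below use. A sketch (``set_params``
as in current scikit-learn):

    >>> # Reach the use_idf parameter of the 'tfidf' step through its name:
    >>> _ = text_clf.set_params(tfidf__use_idf=False)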
@@ -289,7 +294,7 @@ We can now train the model with a single command::
 Evaluation of the performance on the test set
 ---------------------------------------------
 
-Evaluating the predictive accurracy of the model is equally easy::
+Evaluating the predictive accuracy of the model is equally easy::
 
 >>> import numpy as np
 >>> twenty_test = load_files('data/twenty_newsgroups/20news-bydate-test',
@@ -298,50 +303,86 @@ Evaluating the predictive accurracy of the model is equally easy::
 >>> docs_test = [open(f).read() for f in twenty_test.filenames]
 >>> predicted = text_clf.predict(docs_test)
 >>> np.mean(predicted == twenty_test.target)
-0.93075898801597867
+0.86884154460719043
+
+I.e., we achieved 86.9% accuracy. Let's see if we can do better with a
+linear support vector machine (SVM), which is widely regarded as one of
+the best text classification algorithms (although it's also a bit slower
+than naïve Bayes). We can change the learner by just plugging a different
+classifier object into our pipeline::
+
+>>> from scikits.learn.svm.sparse import LinearSVC
+>>> text_clf = Pipeline([
+...   ('vect', CountVectorizer()),
+...   ('tfidf', TfidfTransformer()),
+...   ('clf', LinearSVC()),
+... ])
+>>> _ = text_clf.fit(docs_train, twenty_train.target)
+>>> predicted = text_clf.predict(docs_test)
+>>> np.mean(predicted == twenty_test.target)
+0.92410119840213045
 
 ``scikit-learn`` further provides utilities for more detailed performance
 analysis of the results::
 
 >>> from scikits.learn import metrics
 >>> print metrics.classification_report(
 ...     twenty_test.target, predicted,
-...     class_names=twenty_test.target_names)
+...     target_names=twenty_test.target_names)
 ...
+
                         precision    recall  f1-score   support
 <BLANKLINE>
-           alt.atheism       0.93      0.85      0.89       319
-         comp.graphics       0.97      0.95      0.96       389
-               sci.med       0.94      0.95      0.95       396
-soc.religion.christian       0.88      0.95      0.92       398
+           alt.atheism       0.95      0.80      0.87       319
+         comp.graphics       0.96      0.97      0.96       389
+               sci.med       0.95      0.95      0.95       396
+soc.religion.christian       0.86      0.96      0.90       398
 <BLANKLINE>
-           avg / total       0.93      0.93      0.93      1502
+           avg / total       0.93      0.92      0.92      1502
 <BLANKLINE>
 
 >>> metrics.confusion_matrix(twenty_test.target, predicted)
-array([[271,   3,   9,  36],
-       [  4, 371,   9,   5],
-       [  4,   6, 377,   9],
-       [ 11,   4,   4, 379]])
+array([[254,   4,  11,  50],
+       [  3, 376,   6,   4],
+       [  1,   9, 377,   9],
+       [  9,   4,   4, 381]])
+
 
+.. note:
+
+  SVC stands for support vector classifier. ``scikit-learn`` also
+  includes support vector machines for regression tasks, which are
+  called SVR.
328354
Parameter tuning using grid search
329355
----------------------------------
330356

357+
We've already encountered some parameters such as ``use_idf`` in the
358+
``TfidfTransformer``. Classifiers tend to have many parameters as well;
359+
e.g., ``MultinomialNB`` includes a smoothing parameter ``alpha``
360+
and ``LinearSVC`` has a penalty parameter ``C``
361+
(see the module documentation, or use the Python ``help`` function,
362+
to get a description of these).
363+
331364
Instead of tweaking the parameters of the various components of the
332365
chain, it is possible to run an exhaustive search of the best
333-
parameters on a grid of possible values::
366+
parameters on a grid of possible values. We try out all classifiers
367+
on either words or bigrams, with or without idf, and with a penalty
368+
parameter of either 100 or 1000 for the linear SVM::
334369

335370
>>> from scikits.learn.grid_search import GridSearchCV
336371
>>> parameters = {
337-
... 'vect__analyzer__max_n': (1, 2), # words or bigrams
372+
... 'vect__analyzer__max_n': (1, 2),
338373
... 'tfidf__use_idf': (True, False),
339374
... 'clf__C': (100, 1000),
340375
... }
376+
377+
Obviously, such an exhaustive search can be expensive. If we have multiple
378+
CPU cores at our disposal, we can tell the grid searcher to try these eight
379+
parameter combinations in parallel with the ``n_jobs`` parameter. If we give
380+
this parameter a value of ``-1``, grid search will detect how many cores are installed and uses them all::
381+
341382
>>> gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
342383

343384
The grid search instance behaves like a normal ``scikit-learn``
344-
model. Let us perform the search on a smaller subset of the dataset
385+
model. Let's perform the search on a smaller subset of the training data
345386
to speed up the computation::
346387

347388
>>> gs_clf = gs_clf.fit(docs_train[:400], twenty_train.target[:400])
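
Once fitted, the search object exposes the winning combination. The
attribute names have varied across releases; in current scikit-learn they
are ``best_score_`` and ``best_params_`` (a sketch, not guaranteed to match
scikits.learn 0.9):

    >>> best_score = gs_clf.best_score_    # mean cross-validated score of the best setting
    >>> best_params = gs_clf.best_params_  # which of the eight combinations won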
