@@ -36,8 +36,8 @@ description, quoted from the `website
To download the dataset, go to ``$TUTORIAL_HOME/twenty_newsgroups`` and
run the ``fetch_data.py`` script.

- Once the data is downloaded, fire an ipython shell in the
- ``$TUTORIAL_HOME`` folder and define a variable to hold the list
+ Once the data is downloaded, start a Python interpreter (or IPython shell)
+ in the ``$TUTORIAL_HOME`` folder and define a variable to hold the list
of categories to load. In order to get fast execution times for
this first example we will work on a partial dataset with only 4
categories out of the 20 available in the dataset::
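>>> # A sketch of this step: the four category names are the ones that
>>> # appear in the evaluation section below, and the load_files call
>>> # mirrors the test-set loading code further down (both are assumptions).
>>> categories = ['alt.atheism', 'comp.graphics',
...               'sci.med', 'soc.religion.christian']
>>> twenty_train = load_files('data/twenty_newsgroups/20news-bydate-train',
...                           categories=categories)
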
@@ -67,14 +67,14 @@ The files themselves are not loaded in memory yet::
>>> twenty_train.filenames[0]
'data/twenty_newsgroups/20news-bydate-train/comp.graphics/38244'

- Let us print the first 2 lines of the first file::
+ Let's print the first 2 lines of the first file::

>>> print "".join(open(twenty_train.filenames[0]).readlines()[:2]).strip()
From: clipper@mccarthy.csd.uwo.ca (Khun Yee Fung)
Subject: Re: looking for circle algorithm faster than Bresenhams

- Supervised learning algorithms will require the category to predict
- for each document. In this case the category is the name of the
+ Supervised learning algorithms will require a category label for each
+ document in the training set. In this case the category is the name of the
newsgroup which also happens to be the name of the folder holding the
individual documents.

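To make the mapping concrete, the label of the first document can be read
back through the ``target`` and ``target_names`` attributes used later in
this tutorial; a minimal sketch, assuming the first file is the
``comp.graphics`` post shown above::

>>> twenty_train.target_names[twenty_train.target[0]]
'comp.graphics'
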
@@ -111,8 +111,8 @@ before re-training on the complete dataset later.
Extracting features from text files
-----------------------------------

- In order to perform machine learning on text documents, one first
- needs to turn the text content into numerical feature vectors.
+ In order to perform machine learning on text documents, we first need to
+ turn the text content into numerical feature vectors.


Bags of words
@@ -175,12 +175,14 @@ instead of single words::
[u'ai', u'bien', u'mange', u'ai bien', u'bien mange']

These tools are wrapped into a higher level component that is able to build a
- dictionary of features::
+ dictionary of features and transform documents to feature vectors::

>>> from scikits.learn.feature_extraction.text import CountVectorizer
>>> count_vect = CountVectorizer()
>>> docs_train = [open(f).read() for f in twenty_train.filenames]
- >>> _ = count_vect.fit(docs_train)
+ >>> X_train_counts = count_vect.fit_transform(docs_train)
+ >>> X_train_counts.shape
+ (2257, 33881)


Once fitted, the vectorizer has built a dictionary of feature indices::
@@ -190,18 +192,14 @@ Once fitted, the vectorizer has built a dictionary of feature indices::
The index value of a word in the vocabulary is linked to its frequency
in the whole training corpus.

- Once the vocabulary is built, it is possible to rescan the training
- set so as to perform the actual feature extraction::
-
- >>> X_train_counts = count_vect.transform(docs_train)
- >>> X_train_counts.shape
- (2257, 33881)
-
.. note::

- to avoid reading and tokenizing each text file twice it is possible
- to call ``count_vect.fit_transform(documents)`` and get the
- same output as ``count_vect.fit(documents).transform(documents)``.
+ The method ``count_vect.fit_transform`` performs two actions:
+ it learns the vocabulary and transforms the documents into count vectors.
+ It's possible to separate these steps by calling
+ ``count_vect.fit(docs_train)`` followed by
+ ``X_train_counts = count_vect.transform(docs_train)``,
+ but doing so would read and tokenize each text file twice.

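As a quick illustration of the note above, a sketch of the two-pass variant
with a fresh vectorizer; it should produce a count matrix of the same shape,
at the cost of reading and tokenizing the files a second time::

>>> count_vect_two_pass = CountVectorizer()
>>> _ = count_vect_two_pass.fit(docs_train)        # first pass: learn the vocabulary
>>> X_counts_two_pass = count_vect_two_pass.transform(docs_train)  # second pass: count words
>>> X_counts_two_pass.shape
(2257, 33881)
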
From occurrences to frequencies
@@ -213,45 +211,50 @@ even though they might talk about the same topics.

To avoid these potential discrepancies it suffices to divide the
number of occurrences of each word in a document by the total number
- of words in the document: these new features are called "TF" for Term
+ of words in the document: these new features are called "tf" for Term
Frequencies.

- Another refinement on top of TF is to downscale weights for words
+ Another refinement on top of tf is to downscale weights for words
that occur in many documents in the corpus and are therefore less
informative than those that occur only in a smaller portion of the
corpus.

- This downscaling is called `TF-IDF`_ for "Term Frequency times
+ This downscaling is called `tf–idf`_ for "Term Frequency times
Inverse Document Frequency".

- .. _`TF-IDF`: http://en.wikipedia.org/wiki/Tf-idf
+ .. _`tf–idf`: http://en.wikipedia.org/wiki/Tf–idf

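To make the downscaling concrete: the classical idf weight of a term is the
logarithm of the total number of documents divided by the number of documents
containing the term, and it multiplies the term frequency. A rough sketch
with made-up numbers (the exact weighting used by ``TfidfTransformer``, e.g.
its smoothing and normalization, may differ)::

>>> import numpy as np
>>> n_documents = 4.0
>>> doc_freq = np.array([4.0, 2.0, 1.0])   # how many documents contain each term
>>> idf = np.log(n_documents / doc_freq)   # 0.0 for a term present in every document
>>> tf = np.array([0.05, 0.05, 0.05])      # frequencies of the terms in one document
>>> tfidf = tf * idf                       # rarer terms end up with larger weights
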
- Both TF and TF-IDF can be computed as follows::
+ Both tf and tf–idf can be computed as follows::

>>> from scikits.learn.feature_extraction.text import TfidfTransformer
>>> tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
>>> X_train_tf = tf_transformer.transform(X_train_counts)
>>> X_train_tf.shape
(2257, 33881)

- >>> tfidf_transformer = TfidfTransformer().fit(X_train_counts)
- >>> X_train_tfidf = tfidf_transformer.transform(X_train_counts)
+ >>> tfidf_transformer = TfidfTransformer()
+ >>> X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
>>> X_train_tfidf.shape
(2257, 33881)


Training a linear classifier
----------------------------

- Now that we have our feature, we can train a linear classifier to
- try to predict the category of a post::
+ Now that we have our features, we can train a classifier to try to predict
+ the category of a post. Let's start with a naïve Bayes classifier, which
+ provides a nice baseline for this task. ``scikit-learn`` includes several
+ variants of this classifier; the one most suitable for word counts is the
+ multinomial variant::

- >>> from scikits.learn.svm.sparse import LinearSVC
- >>> clf = LinearSVC(C=1000).fit(X_train_tfidf, twenty_train.target)
+ >>> from scikits.learn.naive_bayes import MultinomialNB
+ >>> clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

To try to predict the outcome on a new document we need to extract
- the features using the same feature extracting chain::
+ the features using almost the same feature extracting chain as before.
+ The difference is that we call ``transform`` instead of ``fit_transform``
+ on the transformers, since they have already been fit to the training set::

>>> docs_new = ['God is love', 'OpenGL on the GPU is fast']
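>>> # A sketch of the remaining steps of the chain, assuming the count_vect
>>> # and tfidf_transformer objects fitted above:
>>> X_new_counts = count_vect.transform(docs_new)        # transform, not fit_transform
>>> X_new_tfidf = tfidf_transformer.transform(X_new_counts)
>>> predicted = clf.predict(X_new_tfidf)                 # one category index per document
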
@@ -271,16 +274,18 @@ Building a pipeline
-------------------

In order to make the vectorizer => transformer => classifier easier
- to work with, scikit-learn provides a ``Pipeline`` class that behaves
- like a compound estimator::
+ to work with, ``scikit-learn`` provides a ``Pipeline`` class that behaves
+ like a compound classifier::

>>> from scikits.learn.pipeline import Pipeline
>>> text_clf = Pipeline([
... ('vect', CountVectorizer()),
... ('tfidf', TfidfTransformer()),
- ... ('clf', LinearSVC(C=1000)),
+ ... ('clf', MultinomialNB()),
... ])

+ The names ``vect``, ``tfidf`` and ``clf`` (classifier) are arbitrary.
+ We shall see their use in the section on grid search, below.

We can now train the model with a single command::

>>> _ = text_clf.fit(docs_train, twenty_train.target)
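>>> # Once fitted, the pipeline predicts directly from raw text, e.g. on the
>>> # docs_new list defined earlier (an illustrative sketch):
>>> predicted_new = text_clf.predict(docs_new)
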
@@ -289,7 +294,7 @@ We can now train the model with a single command::
Evaluation of the performance on the test set
---------------------------------------------

- Evaluating the predictive accurracy of the model is equally easy::
+ Evaluating the predictive accuracy of the model is equally easy::

>>> import numpy as np
>>> twenty_test = load_files('data/twenty_newsgroups/20news-bydate-test',
@@ -298,50 +303,86 @@ Evaluating the predictive accurracy of the model is equally easy::
>>> docs_test = [open(f).read() for f in twenty_test.filenames]
>>> predicted = text_clf.predict(docs_test)
>>> np.mean(predicted == twenty_test.target)
- 0.93075898801597867
+ 0.86884154460719043
+
+ I.e., we achieved 86.9% accuracy. Let's see if we can do better with a
+ linear support vector machine (SVM), which is widely regarded as one of
+ the best text classification algorithms (although it's also a bit slower
+ than naïve Bayes). We can change the learner by just plugging a different
+ classifier object into our pipeline::
+
+ >>> from scikits.learn.svm.sparse import LinearSVC
+ >>> text_clf = Pipeline([
+ ... ('vect', CountVectorizer()),
+ ... ('tfidf', TfidfTransformer()),
+ ... ('clf', LinearSVC()),
+ ... ])
+ >>> _ = text_clf.fit(docs_train, twenty_train.target)
+ >>> predicted = text_clf.predict(docs_test)
+ >>> np.mean(predicted == twenty_test.target)
+ 0.92410119840213045

``scikit-learn`` further provides utilities for more detailed performance
analysis of the results::

>>> from scikits.learn import metrics
>>> print metrics.classification_report(
... twenty_test.target, predicted,
- ... class_names=twenty_test.target_names)
+ ... target_names=twenty_test.target_names)
...
+
                        precision    recall  f1-score   support
<BLANKLINE>
-            alt.atheism       0.93      0.85      0.89       319
-          comp.graphics       0.97      0.95      0.96       389
-                sci.med       0.94      0.95      0.95       396
- soc.religion.christian       0.88      0.95      0.92       398
+            alt.atheism       0.95      0.80      0.87       319
+          comp.graphics       0.96      0.97      0.96       389
+                sci.med       0.95      0.95      0.95       396
+ soc.religion.christian       0.86      0.96      0.90       398
<BLANKLINE>
-            avg / total       0.93      0.93      0.93      1502
+            avg / total       0.93      0.92      0.92      1502
<BLANKLINE>

>>> metrics.confusion_matrix(twenty_test.target, predicted)
- array([[271,   3,   9,  36],
-        [  4, 371,   9,   5],
-        [  4,   6, 377,   9],
-        [ 11,   4,   4, 379]])
+ array([[254,   4,  11,  50],
+        [  3, 376,   6,   4],
+        [  1,   9, 377,   9],
+        [  9,   4,   4, 381]])
+

+ .. note::
+
+ SVC stands for support vector classifier. ``scikit-learn`` also
+ includes support vector machines for regression tasks, which are
+ called SVR.


Parameter tuning using grid search
----------------------------------

+ We've already encountered some parameters such as ``use_idf`` in the
+ ``TfidfTransformer``. Classifiers tend to have many parameters as well;
+ e.g., ``MultinomialNB`` includes a smoothing parameter ``alpha``,
+ and ``LinearSVC`` has a penalty parameter ``C``
+ (see the module documentation, or use the Python ``help`` function,
+ to get a description of these).
+
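For instance, both parameters can be set explicitly when the classifiers are
constructed; a small sketch, with values chosen purely for illustration::

>>> clf_nb = MultinomialNB(alpha=0.1)   # smoothing strength for the count estimates
>>> clf_svm = LinearSVC(C=1000)         # penalty (regularization) parameter
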
Instead of tweaking the parameters of the various components of the
chain, it is possible to run an exhaustive search of the best
- parameters on a grid of possible values::
+ parameters on a grid of possible values. We try out all classifiers
+ on either words or bigrams, with or without idf, and with a penalty
+ parameter of either 100 or 1000 for the linear SVM::

>>> from scikits.learn.grid_search import GridSearchCV
>>> parameters = {
- ... 'vect__analyzer__max_n': (1, 2), # words or bigrams
+ ... 'vect__analyzer__max_n': (1, 2),
... 'tfidf__use_idf': (True, False),
... 'clf__C': (100, 1000),
... }
+
+ Obviously, such an exhaustive search can be expensive. If we have multiple
+ CPU cores at our disposal, we can tell the grid searcher to try these eight
+ parameter combinations in parallel with the ``n_jobs`` parameter. If we give
+ this parameter a value of ``-1``, grid search will detect how many cores are
+ installed and use them all::
+
>>> gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

The grid search instance behaves like a normal ``scikit-learn``
- model. Let us perform the search on a smaller subset of the dataset
+ model. Let's perform the search on a smaller subset of the training data
to speed up the computation::

>>> gs_clf = gs_clf.fit(docs_train[:400], twenty_train.target[:400])