|
=========================================================================

The following example illustrates the effect of scaling the
regularization parameter when using :ref:`svm` for
:ref:`classification <svm_classification>`.
For SVC classification, we are interested in a risk minimization for the
equation:
|

.. math::

    C \sum_{i=1, n} \mathcal{L} (f(x_i), y_i) + \Omega (w)

where

- :math:`C` is used to set the amount of regularization
- :math:`\mathcal{L}` is a `loss` function of our samples
  and our model parameters.
- :math:`\Omega` is a `penalty` function of our model parameters
|
If we consider the loss function to be the individual error per
sample, then the data-fit term, or the sum of the error for each sample, will
increase as we add more samples. The penalization term, however, will not
increase.
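
A minimal numeric sketch of this effect (the hinge loss and the numbers
below are illustrative assumptions, not part of this example's code)::

    import numpy as np

    rng = np.random.RandomState(0)
    w = rng.randn(3)                         # some fixed model parameters
    penalty = np.sum(w ** 2)                 # penalty term: independent of n
    for n in (10, 100, 1000):
        X = rng.randn(n, 3)
        y = np.sign(rng.randn(n))
        data_fit = np.sum(np.maximum(0, 1 - y * np.dot(X, w)))  # grows with n
        print("n=%d  data-fit=%.1f  penalty=%.1f" % (n, data_fit, penalty))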
|
When using, for example, :ref:`cross validation <cross_validation>`, to
set the amount of regularization with `C`, there will be a different
number of samples between every problem that we use for model
selection, as well as for the final problem that we want to use for
training.
|
Since our loss function is dependent on the number of samples, the latter
will influence the selected value of `C`.
The question that arises is `How do we optimally adjust C to
account for the different numbers of training samples?`
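
The two candidate adjustments compared in the code below (see the `scales`
list) are leaving `C` untouched and multiplying it by the number of training
samples. A small sketch of the idea, with hypothetical training-set sizes::

    import numpy as np

    cs = np.logspace(-2.2, -1.2, 10)      # a raw grid of C values
    for n_train in (30, 50, 70):          # e.g. different training-set sizes
        scaled_cs = cs * n_train          # candidate: grow C with n_train
        print("n_train=%d  unscaled max C=%.3f  scaled max C=%.2f"
              % (n_train, cs.max(), scaled_cs.max()))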
|
The figures below are used to illustrate the effect of scaling our
`C` to compensate for the change in the number of samples, in the
case of using an `L1` penalty, as well as the `L2` penalty.

L1-penalty case
-----------------
In the `L1` case, theory says that prediction consistency
(i.e. that under a given hypothesis, the estimator
learned predicts as well as a model knowing the true distribution)
is not possible because of the bias of the `L1` penalty. It does say, however,
that model consistency, in terms of finding the right set of non-zero
parameters as well as their signs, can be achieved by scaling
`C`.
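
As a rough sketch of how one could check this, count the non-zero
coefficients recovered for different training-set sizes. The snippet reuses
the `LinearSVC` settings and the sparse data `X_1`, `y_1` (and `np`) defined
in the code below; `base_C` and the sizes are hypothetical choices, not a
prescription::

    base_C = 0.05                              # hypothetical base value
    for n_train in (30, 50, 70):
        clf = LinearSVC(penalty='L1', loss='L2', dual=False, tol=1e-3,
                        C=base_C * n_train)    # grow C with the sample count
        clf.fit(X_1[:n_train], y_1[:n_train])
        print("n_train=%d  non-zero coefs=%d"
              % (n_train, int(np.sum(clf.coef_ != 0))))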
53 | 53 |
|
54 | 54 | L2-penalty case
|
55 | 55 | -----------------
|
|
59 | 59 | Simulations
|
60 | 60 | ------------
|
The two figures below plot the values of `C` on the `x-axis` and the
corresponding cross-validation scores on the `y-axis`, for several different
fractions of a generated data-set.
|
In the `L1` penalty case, the results are best when scaling our `C` with
the number of samples, `n`, which can be seen in the third plot of the first figure.
|
For the `L2` penalty case, the best result comes from the case where `C`
is not scaled.
|
.. topic:: Note:

    Two separate datasets are used for the two different plots. The reason
    behind this is that the `L1` case works better on sparse data, while
    `L2` is better suited to the non-sparse case.
"""
|
print __doc__
|
|
# set up dataset
n_samples = 100
n_features = 300

# L1 data (only 5 informative features)
X_1, y_1 = datasets.make_classification(n_samples=n_samples, n_features=n_features,
                                         n_informative=5, random_state=1)

# L2 data: non-sparse, but fewer features
y_2 = np.sign(.5 - rnd.rand(n_samples))
X_2 = rnd.randn(n_samples, n_features / 5) + y_2[:, np.newaxis]
X_2 += 5 * rnd.randn(n_samples, n_features / 5)
|
clf_sets = [(LinearSVC(penalty='L1', loss='L2', dual=False,
                       tol=1e-3),
             np.logspace(-2.2, -1.2, 10), X_1, y_1),
            (LinearSVC(penalty='L2', loss='L2', dual=True,
                       tol=1e-4),
             np.logspace(-4.5, -2, 10), X_2, y_2)]

colors = ['b', 'g', 'r', 'c']
|
for fignum, (clf, cs, X, y) in enumerate(clf_sets):
    # set up the plot for each classifier
    pl.figure(fignum, figsize=(9, 10))

    for k, train_size in enumerate(np.linspace(0.3, 0.7, 3)[::-1]):
        param_grid = dict(C=cs)
        # To get nice curve, we need a large number of iterations to
|
|
                                      n_iterations=250, random_state=1))
        grid.fit(X, y)
        scores = [x[1] for x in grid.grid_scores_]
|
        # compare the raw C values with C scaled by the number of
        # training samples
        scales = [(1, 'No scaling'),
                  ((n_samples * train_size), '1/n_samples'),
                  ]
|
        for subplotnum, (scaler, name) in enumerate(scales):
            pl.subplot(2, 1, subplotnum + 1)
            pl.xlabel('C')
            pl.ylabel('CV Score')
            grid_cs = cs * float(scaler)  # scale the C's
            pl.semilogx(grid_cs, scores, label="fraction %.2f" %
                        train_size)
            pl.title('scaling=%s, penalty=%s, loss=%s' % (name, clf.penalty, clf.loss))
|
    pl.legend(loc="best")
pl.show()
|