MRG add log loss (cross-entropy loss) to metrics by larsmans · Pull Request #2013 · scikit-learn/scikit-learn

Closed
wants to merge 4 commits into from

Conversation

larsmans
Member

Reissue of #1125 with a clearer implementation, more documentation and more tests.

@arjoly
Member
arjoly commented May 28, 2013

Can you add your function to the common tests?

@arjoly
Member
arjoly commented May 28, 2013

Can you add some narrative doc?

predictions. For a single sample with true label yₜ ∈ {0,1} and
estimated probability yₚ that yₜ = 1, the log loss is

-log P(yₜ|yₚ) = -(yₜ log(yₚ) + (1 - yₜ) log(1 - yₚ))
Member

Is the character ₜ intentional?

Member

The character doesn't render on my screen. I guess I'm missing some unicode font.

Member Author

I'll change it to yt and yp.
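
For illustration, a minimal worked example of the binary formula quoted above, using plain yt/yp names (the helper function is hypothetical, not part of the PR):

    import numpy as np

    def binary_log_loss(yt, yp):
        # hypothetical helper, not part of scikit-learn:
        # -log P(yt|yp) = -(yt*log(yp) + (1 - yt)*log(1 - yp))
        return -(yt * np.log(yp) + (1 - yt) * np.log(1 - yp))

    print(binary_log_loss(1, 0.9))  # ~0.105: confident and correct
    print(binary_log_loss(1, 0.1))  # ~2.303: confident and wrong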

@larsmans
Member Author

The common tests don't work for this loss since the input format is different; it's labels (or indicators, but usually 1-d) and predict_proba output (2-d).

@larsmans
Member Author

Added narrative docs. I hope they build correctly; I tested with rst2pdf because make doc is broken on my box.

y_pred = np.array([[0.5, 0.5], [0.1, 0.9], [0.01, 0.99],
                   [0.9, 0.1], [0.75, 0.25], [0.001, 0.999]])
loss = log_loss(y_true, y_pred)
assert_almost_equal(loss, 1.8817970689982668)
Member

AFAIK assert_almost_equal checks for 7 decimal places by default, hence it would probably make more sense to truncate the excess digits.

Member Author

This is a leftover from @ephes' original test, but you're right.
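
A quick illustration of the 7-decimal default mentioned above, with the value from the test (numpy.testing.assert_almost_equal uses decimal=7 by default):

    from numpy.testing import assert_almost_equal

    # both pass: at decimal=7 the tolerance is roughly 1.5e-7,
    # so the extra digits in the expected value carry no weight
    assert_almost_equal(1.8817970689982668, 1.8817971)
    assert_almost_equal(1.8817970689982668, 1.8817971, decimal=7)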

@arjoly
Member
arjoly commented May 28, 2013

Can you add log_loss to the list of classification metrics in doc/module/module_evalutation.rst?
Can you also add a link in the reference?

@arjoly
Member
arjoly commented May 28, 2013

The common tests don't work for this loss since the input format is different; it's labels (or indicators, but usually 1-d) and predict_proba output (2-d).

Too bad :-(

There is no test for the normalize option. If you want, you can add invariance tests for that kind of metric.

@arjoly
Member
arjoly commented May 28, 2013

I think that this might lead to a bug

y_true = [1, 0, 2]
y_pred = [[0, 0, 1], [0.6, 0.2, 0.2], [0.6, 0.1, 0.3]]
loss = log_loss(y_true, y_pred)

@GaelVaroquaux
Member

I think that this might lead to a bug

y_true = [1, 0, 2]
y_pred = [[0, 0, 1], [0.6, 0.2, 0.2], [0.6, 0.1, 0.3]]
loss = log_loss(y_true, y_pred)

I am indeed quite wary of the attempt to make input arguments too flexible in scikit-learn. It can easily open the door to errors or bugs going unnoticed.

@larsmans
Member Author

I'm not really sure what you mean. There is a test for normalization: the lines starting with

y_true *= 2
y_pred *= 2

extend the test before that to a larger (non-square!) matrix of probabilities, then check that the total loss is indeed 6 * the mean loss established earlier. See also the comment.

Or should log_loss only accept indicator matrices? Those are not objects we usually pass around, so I stuck in the binarizer in between.
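
For context, the invariance being tested can be sketched as follows, assuming the normalize keyword discussed here (sum of per-sample losses vs. their mean):

    import numpy as np
    from sklearn.metrics import log_loss

    y_true = [0, 1, 1, 0]
    y_pred = np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.6, 0.4]])

    mean_loss = log_loss(y_true, y_pred)                    # normalize=True (default)
    total_loss = log_loss(y_true, y_pred, normalize=False)  # summed over samples
    assert np.isclose(total_loss, len(y_true) * mean_loss)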

@GaelVaroquaux
Member

Sorry, @larsmans : this remark was completely out of place and not useful. I was reviewing other PRs and dealing with mail in parallel, and simply got confused.

@larsmans
Member Author

Ok! Then @arjoly what exactly do you mean? And regarding a link, Bishop's book is not available online (I had to borrow it from the library today -- haven't done that in quite some time).

@arjoly
Member
arjoly commented May 29, 2013

Ok! Then @arjoly what exactly do you mean? And regarding a link, Bishop's book is not available online (I had to borrow it from the library today -- haven't done that in quite some time).

Sorry, I was wrong. I thought it would raise an issue with np.log, but you clip everything.
Maybe add a test to check that case?

edit: there is a test
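
Roughly, the clipping referred to here looks like this (the eps value is an assumption for illustration, not necessarily what the PR uses):

    import numpy as np

    y_pred = np.array([[0.0, 0.0, 1.0], [0.6, 0.2, 0.2], [0.6, 0.1, 0.3]])

    eps = 1e-15  # assumed clipping threshold
    clipped = np.clip(y_pred, eps, 1 - eps)
    # log(clipped) is finite everywhere, so a predicted probability of exactly 0
    # for the true class gives a very large but finite loss instead of -inf/NaN
    assert np.isfinite(np.log(clipped)).all()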


Parameters
----------
y_true : array-like or list of labels or label indicator matrix
Member

I think you mean y_true : array-like.

Member Author

Actually anything that goes into a LabelBinarizer is accepted. The question is whether we should advertise that (probably not without a test for all cases).

Member

Agreed!

Member

But "list of labels" might not be the right term, even if it's used elsewhere. Seems ambiguous to me.

@arjoly
Member
arjoly commented May 29, 2013

Just to let you know that if one label is missing, an error is raised: log_loss([0, 0, 1], [[0, 0, 1], [0.6, 0.2, 0.2], [0.6, 0.1, 0.3]]). Not a bad behavior.

@mblondel
Member

I would like to review this PR over the weekend.

-------
loss : float

References
Member

Could we get an example here?

@jnothman
Member
jnothman commented Jun 2, 2013

When new metrics are added, should they not also be added to scorer.py?

@mblondel
Member
mblondel commented Jun 2, 2013

For the record, I have also implemented multiclass log loss here:
https://github.com/mblondel/lightning/blob/master/lightning/loss_fast.pyx#L158

It takes the output of decision_function as input. It doesn't implement the same formula as in this PR but I would assume that the two should give the same results.

@larsmans
Member Author

@jnothman A scorer would get predict output, right? That doesn't work here since predict_proba output is expected.

@jnothman
Member

A major intention of the Scorer abstraction is that it takes (est, X, y) rather than (y_true, y_pred) and so can call whatever it likes. The only implementation so far will use predict_proba (or decision_function) if its needs_threshold constructor argument is True.
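
In other words, a scorer is a callable over the estimator and the data rather than over precomputed predictions; a rough hand-rolled sketch (an illustration only, not the actual Scorer class):

    from sklearn.metrics import log_loss

    def neg_log_loss_scorer(estimator, X, y):
        # illustration: takes (estimator, X, y), calls predict_proba itself,
        # and negates the loss so that greater is better, as grid search expects
        proba = estimator.predict_proba(X)
        return -log_loss(y, proba)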

@amueller
Member
amueller commented Jul 1, 2013

This definitely needs to be added to scorer.py. @larsmans allowing both outputs is the main reason for having the scorer interface.

@larsmans
Member Author
larsmans commented Jul 1, 2013

The whole Scorer thing mostly went by me because I was busy, but I see why we have it now :)
I'll make a scorer out of this.

@larsmans
Member Author
larsmans commented Jul 1, 2013

I can make a Scorer, but only if I modify that class. This function won't work with a decision_function, it really needs predict_proba.

@amueller
Member
amueller commented Jul 1, 2013

The problem is then that the output in GridSearchCV is also the negative score. That is somewhat suboptimal.
It would be great to get rid of this tie between GridSearchCV and the scorers, but I think printing negative likelihoods is pretty counter-intuitive.

@larsmans
Member Author
larsmans commented Jul 1, 2013

True. Another option would be to return a pair from Scorer.__call__, say (-1, score) or (1, score) where <0 means "minimize this" (greater_is_better=False) and >=0 means "maximize this". (Or the strings "min" and "max", or something like that.) If we do it this way, it becomes much easier to write a scorer as a simple Python function instead of a class or an actual Scorer-typed object.
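
A minimal sketch of that proposal as a plain function (purely hypothetical; this convention was never adopted):

    from sklearn.metrics import log_loss

    def log_loss_scorer(estimator, X, y):
        # hypothetical: a plain function instead of a Scorer object; the first
        # element of the pair (< 0 here) tells the caller to minimize the second
        return (-1, log_loss(y, estimator.predict_proba(X)))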

@larsmans
Member Author
larsmans commented Jul 1, 2013

Shall I send a separate PR with the latter idea?

@jnothman
Member
jnothman commented Jul 1, 2013

@larsmans, I think this might be part of the solution to getting Scorers that return multiple scores (#1837, #1850): we need to separate the concept of the (vector/dict of) scores as returned to the user and the (scalar) objective to be maximised.

@jnothman
Member
jnothman commented Jul 1, 2013

(That's fine conceptually, of course, but it's harder to get the API right. And I'd be very happy to see a PR that does! I agree that checking the greater_is_better attribute of a callable smells fishy.)

@larsmans
Member Author

Ok, rebased on master, stuff seems to work. I renamed the scorer "log_likelihood" because that's actually what it's computing. Should I change the log_loss function as well? It makes the connection clearer, but users would get a metric that returns strictly negative values, which might be confusing.

I can haz reviews?

@amueller
Member

I'm unsure about the naming issue :-/

@@ -1031,6 +1077,7 @@ Scoring Function
'accuracy' :func:`sklearn.metrics.accuracy_score`
'average_precision' :func:`sklearn.metrics.average_precision_score`
'f1' :func:`sklearn.metrics.f1_score`
'log_likelihood' :func:`sklearn.metrics.log_loss`
Member

We should put a comment that the scorer returns the opposite of the loss function. Or introduce an alias function named log_likelihood_score(y_true, y_pred) that returns - log_loss(y_true, y_pred).

I am fine with having both functions, as in practice you sometimes want one or the other, and it makes the code more readable to avoid having to throw a - in the middle of an expression. We would also increase google-ability by having both, since the reference API apparently has a very good page-rank on such keywords.
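
The suggested alias would amount to a one-liner like this sketch (as the later discussion shows, it was ultimately dropped in favour of plain log_loss):

    from sklearn.metrics import log_loss

    def log_likelihood_score(y_true, y_pred):
        # hypothetical alias: the greater-is-better counterpart, simply the negated log loss
        return -log_loss(y_true, y_pred)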

Member

Looks like we should add _score, _loss or _error to all the score strings, to be clear.

What do you think?

Member Author

You mean log_likelihood_score? That's pretty verbose. I'd like to just say scoring="log_likelihood". Users can look this up.

Member

How about log_loss? I'd really prefer if the metric function and the scorer name were consistent.

Member

+1 for @larsmans' proposal: the scoring name should be short, "log_likelihood", and the function name should be log_likelihood_score to follow the convention.

@larsmans
Member Author

Ok, any votes for adding a log_likelihood function as well?

... or rather, the other way around.
@ogrisel
Member
ogrisel commented Jul 26, 2013

Looks good. +1 for merging if the tests pass once it is rebased with the conflicts fixed.

@larsmans
Member Author

Ok, merged by rebase!

@larsmans larsmans closed this Jul 26, 2013
@GaelVaroquaux
Member

Ok, merged by rebase!

Coming late to the battle (things have been going too fast for me). I have a question: in the name 'log_likelihood', what is the model underlying this likelihood? I am not thrilled by this name, as I don't find it very explicit.

@mblondel
Member

I agree with @GaelVaroquaux. One can compute the log likelihood of any model.

@larsmans larsmans deleted the log-loss branch July 27, 2013 07:55
@larsmans
Member Author

It's the log-likelihood of a categorical distribution, i.e. a classifier. We could still rename it classifier_log_likelihood_score, but then I'd rather revert the last commit and call everything log-loss.

@mblondel
Member

How do you plan to handle the hinge loss, for instance? (sorry I haven't had the time to follow the discussion regarding the new API)

@arjoly
Member
arjoly commented Jul 27, 2013

@larsmans can you add your name to the authors of the metrics module? :-)

Thanks !!!

@GaelVaroquaux
Member

It's the log-likelihood of a categorical distribution, i.e. a classifier.

Isn't it the same thing as the multinomial distribution? It's a name that I know better. I would personally feel more at ease with 'multinomial_loss'.

@ogrisel
Member
ogrisel commented Jul 27, 2013

"Log loss" is the name from Bishop so I think it's pretty standard. The problem is the name of the "score" (which is the negative of the log loss). I see two options:

  • just keep the log_loss function, remove the log_likelihood_score function.
  • rename the log_likelihood_score to something more explicit like classification_log_likelihood_score.

Then we need to agree on the name of the scoring string to be consistent with other scoring names which are all positive (greater is better):

  • 'accuracy' sklearn.metrics.accuracy_score
  • 'average_precision' sklearn.metrics.average_precision_score
  • 'f1' sklearn.metrics.f1_score
  • 'precision' sklearn.metrics.precision_score
  • 'recall' sklearn.metrics.recall_score
  • 'roc_auc' sklearn.metrics.auc_score

@GaelVaroquaux
Member

"Log loss" is the name from Bishop so I think it's pretty standard.

I am fine with log_loss

@larsmans
Member Author

Ok, reverting in a bit.

@larsmans
Member Author

Hadn't seen @ogrisel's comment. Actually we have the same problem with MSE, which is also a loss function with a scorer that flips the sign.
