Ensemble models (and maybe others?) don't check for negative sample_weight · Issue #3774 · scikit-learn/scikit-learn · GitHub

Ensemble models (and maybe others?) don't check for negative sample_weight #3774

Open
larsmans opened this issue Oct 15, 2014 · 40 comments

@larsmans
Member

When sample weights are negative, the probabilities can come out negative as well:

>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> rng = np.random.RandomState(10)
>>> X = rng.randn(10, 4)
>>> y = rng.randint(0, 2, 10)
>>> sample_weight = rng.randn(10)
>>> clf = RandomForestClassifier().fit(X, y, sample_weight)
>>> clf.predict_proba(X)
array([[ 0.56133774,  0.43866226],
       [ 1.03235924, -0.03235924],
       [ 1.03235924, -0.03235924],
       [ 1.03235924, -0.03235924],
       [ 1.03235924, -0.03235924],
       [ 1.03235924, -0.03235924],
       [ 0.98071868,  0.01928132],
       [ 0.56133774,  0.43866226],
       [ 1.03235924, -0.03235924],
       [ 1.03235924, -0.03235924]])
@larsmans larsmans added the Bug label Oct 15, 2014
@larsmans
Member Author

DecisionTreeClassifier doesn't have this problem. Should be dealt with in the common tests.

@MechCoder
Member

I think this should be a quick fix. Have you already fixed this?

@larsmans
Member Author

Nope, I just went back to my own code when I figured out the bug wasn't there. Be my guest!

@MechCoder
Member

Err. I cannot reproduce this. In any case, does a negative sample_weight have any meaning?

@larsmans
Member Author

Not even if you set np.random.seed(0)?

@larsmans
Member Author

I'm not sure if negative sample weight has any meaning, I produced them by accident. I can't really imagine a well-defined meaning, but I didn't think this through.

@MechCoder
Member

I can reproduce it if I set random_state to 42 :). Should we just raise an error in that case?

@larsmans
Member Author

I'm not sure. What do other estimators do?

@GaelVaroquaux
Member

Should we just raise an error in that case?

I'd say so.

@MechCoder
Member

Should I do it myself or leave the issue open for potential new contributors?

@larsmans
Member Author

If you have time, please do it. There will be other easy issues.

@glouppe
Contributor
glouppe commented Oct 15, 2014

Hi,

Negative sample weights are not checked on purpose. In some applications (high energy physics in particular) it may happen for weights to be negative. This has been discussed before with @ndawe during one of the rewritings.

@MechCoder
Member

Interesting, but just out of curiosity, what is a negative weight supposed to mean intuitively?

@glouppe
Contributor
glouppe commented Oct 15, 2014

Interesting, but just out of curiosity, what is a negative weight supposed to mean intuitively?

To be honest, I never understood either. Can you comment on this, Noel?

@ndawe
Member
ndawe commented Oct 18, 2014

The meaning of negative weight depends on the context of course. Here is our previous discussion:

#1488

The fact is, some MC event generators produce events with negative weights because of various corrections, cancellations, etc. I can only comment on this particular case in HEP. Who knows what other valid negative-weight situations are out there in other fields.

The conclusion seemed to be that if we can support negative weights (within reason...) then why not. Supporting them really means not preventing their use. Of course if the majority of your samples have negative weights, then you will get garbage... Garbage in, garbage out.

In the situations I've faced, only a very small fraction of the dataset had negative weights.

@ndawe
Member
ndawe commented Oct 18, 2014

The DT node splitting should prevent a child node from obtaining an overall negative weight. I'm not sure if this protection was implemented correctly everywhere for all forms of splitting. If the root node cannot be split because all possible splits would result in a child with negative weight, then I suppose an exception should be raised. Otherwise a DT should be able to deal with it.

@larsmans
Member Author

Coming back to this discussion: over at #1488, @ndawe said

The point is: weights aren't necessarily nonnegative. If we can accommodate them without adversely affecting the common case, then why not?

I think I've already proven that negative weights adversely affect the common case, being so arrogant as to label my use case as common. I also still have no intuition as to what negative weights represent (I know what they do, but not why I'd ever need them). Does your use case preclude doing some pre-processing to get rid of the negative samples?

@alexpearce

After a very nice talk given by @glouppe to the LHCb collaboration today, someone asked a question about the possibility of using negative weights for the training sample.

This is very common in high energy physics (HEP), and the feeling I got from the answer was that negative weights are not supported by sklearn, however @larsmans says that DecisionTreeClassifier shouldn't have a problem with negative weights. Boosted decision trees are also very common in HEP, so being able to use negative weights with them would be really useful!

So, I would like to know: is the use of training samples including negative weights ‘supported’ by sklearn? In the sense that ‘sensible’ input which includes negative weights can produce ‘sensible’ predictions.

Negative weights how?

I'll give a bit of background as to how negative weights crop up in HEP, in case it helps or anyone is interested.

The most common usage of classifiers that I know of is to discriminate between a ‘signal’ species and a ‘background’ species. You train a classifier, often a BDT, with a signal sample and a background sample, and then compute the responses of the trained BDT on your data and make some selection requirement on it (e.g. BDT response > 0.1).

Ideally you would use real data, rather than simulation, for both training samples, because then you don't have to worry about your simulation perfectly modelling the data. However obtaining a ‘pure’ dataset for each species is tricky, because the data you have to hand is some mix of the two.

What might be available to you though is a particular variable/feature with distinctly different behaviour for signal and background. Most often this is the mass of some particle, where the signal is a Gaussian distribution centered around the true mass, and the background some constant slope, for example, which might look like this:

[Figure: a discriminating variable, exhibiting visibly different behaviour for the two species, signal and background, which you can try to model with probability density functions.]

So, you can try to model this distribution with one probability density function per component, signal and background, and perform a maximum likelihood fit to obtain the relative yields (the amount) of each component in your dataset.

Enter the sPlot technique. This takes as input the result of the likelihood fit, and gives you one weight per species per input vector, so in this case one signal weight and one background weight.

These weights have a number of interesting statistical properties, the most important being that the sum of the weights for a given species returns the number of that species in the dataset, even if the dataset has been partitioned by some other feature (assuming that the partitioning feature is uncorrelated with the discriminating feature that was fitted).

This allows you to statistically unfold a feature's signal and background distributions as histograms simply by summing the appropriate weight within each histogram bin.
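
As a concrete sketch of that unfolding (toy data only; real sWeights come out of the maximum likelihood fit), the per-bin sums are just weighted histograms:

import numpy as np

# Toy stand-ins: one feature for 1000 events plus per-event sWeights for
# each species. These are drawn so that some signal weights are negative
# and the two species' weights sum to one per event, as sWeights do.
rng = np.random.RandomState(0)
feature = rng.randn(1000)
sw_signal = rng.uniform(-0.2, 1.2, 1000)
sw_background = 1.0 - sw_signal

# Summing the appropriate weight in each bin unfolds the two distributions.
hist_sig, edges = np.histogram(feature, bins=30, weights=sw_signal)
hist_bkg, _ = np.histogram(feature, bins=30, weights=sw_background)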

Due to the way these weights, often called sWeights, are computed, they can often be negative, but the “negativity” within an unfolded distribution cancels on a statistical basis.

With all that, you now want to give your BDT a signal training sample, which you have to hand! You can think of passing the negative weights as giving the BDT your signal "distributions", rather than individual negatively weighted vectors.

@glouppe
Contributor
glouppe commented Apr 24, 2015

Thanks for the feedback @alexpearce! After checking what we did back in the implementation, negative sample weights are in fact supported. However, nothing is done to check whether what is happening makes sense or not. In particular, it may happen that the overall weight of a node becomes negative, in which case how to split such a node or how to make a prediction out of it is undefined.

In addition, from a pure machine learning point of view, it should be understood by the HEP community how negative weights are handled, and whether it makes sense or not for their application -- which I am not so sure about. In the case of trees, negative weights affect the impurity criteria (which assume positive weights), the decrease of impurity computation (which also assumes positive weights) and the way predictions are computed.
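
A quick numeric sketch of the impurity point (numbers made up): once a node's weighted class count goes negative, the Gini formula produces values outside its valid range.

# Weighted class counts in a node; the first has gone negative.
n1, n2 = -0.2, 1.2
p1, p2 = n1 / (n1 + n2), n2 / (n1 + n2)   # "probabilities" -0.2 and 1.2
gini = 1.0 - (p1**2 + p2**2)              # -0.48; for valid probabilities
print(gini)                               # Gini is always >= 0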

@alexpearce

And thanks again @glouppe!

I agree that negative weights are something that should be better understood. I suspect the problem is that the most commonly used classification packages, TMVA and NeuroBayes, do support negative weights. And, again, because ‘support’ means that people who use negative weights see good separation power in the trained classifier, they do not care about the theoretical details. (This is of course a very bad thing.)

I'm not enough of an expert to know internally how either package actually deals with the theoretical problems of negative weights. It would be interesting to try, say, scikit-learn vs. TMVA, with negative weights and the same classifier, and see how the results differ.

@glouppe
Contributor
glouppe commented Apr 24, 2015

In fact, it is not so difficult to understand what is happening. Intuitively, in my opinion, having negative weights is like adding samples of the opposite class (e.g., having negative signal events is like having positive background events).

Take for example how predictions are computed within a leaf. Let's assume that we have N samples in this leaf, that the sample weight sum for the first class is N1-w (where N1 is the sum of the positive weights, and w is the absolute value of the sum of the negative weights), and that the sample weight sum for the second class is N2. The empirical probabilities for the classes are therefore (N1-w)/N for the first class and N2/N for the second class. The ratio between these two determines the prediction decision: if ((N1-w)/N) / (N2/N) = (N1-w)/N2 is greater than one, then the prediction is the first class, and if it is lower than one, then the prediction is the second class. Accordingly, we get the same ratio if we remove the negatively weighted samples from the first class and add them to the second class with a positive weight w2, if and only if (N1-w)/N2 = N1 / (N2 + w2), that is, if w2 = (N2 w) / (N1 - w).
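
A minimal numeric check of that algebra, with made-up leaf counts:

# Hypothetical leaf: N1 = 10 (positive weight, first class), w = 2
# (absolute value of the negative weight, first class), N2 = 5 (second
# class).
N1, w, N2 = 10.0, 2.0, 5.0

ratio_negative = (N1 - w) / N2     # keep the negative weight: 1.6
w2 = (N2 * w) / (N1 - w)           # equivalent positive weight: 1.25
ratio_flipped = N1 / (N2 + w2)     # move it to the second class: 1.6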

Similarly, the same kind of reasoning could be carried out for the impurity reduction, and I am sure we would end up with the same kind of conclusions.

So this is what is happening when you blindly apply negative sample weights to the same algorithm. My question then is: is this what physicists actually expect?

@ndawe
Member
ndawe commented Apr 25, 2015

Thanks for the feedback @alexpearce! After checking what we did back in the implementation, negative sample weights are in fact supported. However, nothing is done to check whether what is happening makes sense or not. In particular, it may happen that the overall weight of a node becomes negative, in which case how to split such a node or how to make a prediction out of it is undefined.

I know that at one point the node splitting did have a check that protected against the overall weight becoming negative. But that code has since changed and I think that protection was lost.

@ndawe
Member
ndawe commented Apr 25, 2015

Actually, the min_weight_fraction_leaf parameter that defaults to 0 will protect against the weight at a leaf becoming negative in the tree models.
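
For illustration, here is the reproduction from the top of the thread re-run with a strictly positive value (0.1 is arbitrary), which requires every leaf to retain at least that fraction of the total weight and so should rule out negative leaf weights:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(10)
X = rng.randn(10, 4)
y = rng.randint(0, 2, 10)
sample_weight = rng.randn(10)

# Every leaf must retain at least 10% of the total sample weight.
clf = RandomForestClassifier(min_weight_fraction_leaf=0.1)
clf.fit(X, y, sample_weight)
proba = clf.predict_proba(X)   # inspect: columns should now stay in [0, 1]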

@ndawe
Member
ndawe commented Apr 25, 2015

@glouppe yes, that's a good way of looking at it. In the case of binary classification, it should be identical to flipping the class label and modifying the weight accordingly. I'd like to see a test of this with each of the classification and regression criteria to check if we do in fact build the same trees.
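
A rough sketch of such a test (the transformation below is a naive sign flip on one sample's weight; glouppe's per-leaf algebra above suggests the exactly equivalent weight is different, so the trees may or may not match):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = rng.randint(0, 2, 50)

w_neg = np.ones(50)
w_neg[0] = -0.5                  # one negatively weighted sample

y_flip = y.copy()
y_flip[0] = 1 - y[0]             # the same sample with the opposite class...
w_flip = np.ones(50)
w_flip[0] = 0.5                  # ...and the weight made positive

t1 = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=w_neg)
t2 = DecisionTreeClassifier(random_state=0).fit(X, y_flip, sample_weight=w_flip)

# Crude structural comparison via the learned split thresholds.
same = np.array_equal(t1.tree_.threshold, t2.tree_.threshold)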

But I suppose this gets more complicated for multiclass problems. Although the majority of problems in HEP are binary signal vs background, multiclass problems can crop up in things like decay mode classification, or in general with analyses that consider multiple signal hypotheses. I wonder if it's then like replacing negative samples from one class with a sample from each of the other classes with the appropriate positive weight.

@ndawe
Member
ndawe commented Apr 25, 2015

We need to update the documentation here:

http://scikit-learn.org/stable/modules/tree.html#mathematical-formulation

Currently it doesn't mention sample weights, even though they are indeed considered.

@amueller
Member

I still don't quite grasp what the negative weights mean @alexpearce. Is there a way to understand them without reading the sPlot paper?

@alexpearce

Negative sWeights arise from negative components in the covariance matrix that is used to compute them. With both positive and negative weights, the sum of the weights of a given species gives the number of that species, whilst the sum of the squares gives the square of the statistical uncertainty on that number.

What the ‘physical’ meaning of these negative weights is when given to a classifier I am less sure about. Conceptually, as someone with little understanding of the technicalities, I think physicists in HEP ‘expect’ these sWeights to work because using them allows you to see some separation in signal and background distributions, as histograms, by weighting the input data with signal and background weights respectively. So, we think “if I can see the separation between S and B in the two histograms, then I can feed them to a classifier and it too should see this difference”.

How accurate this assumption is… Again I am less sure. What I know is that many classifiers have been trained with sWeights and good discriminatory power has been seen.

@GaelVaroquaux
Member

I am very uncomfortable with negative sample weights. They seem to be standard in a community, high energy physics, but this community is a niche with regards to the bigger picture in data processing / data science. Outside of this scope, negative sample weights do not mean anything. Inside this world, there doesn't seem to be a very simple way to convey the idea of what negative sample weights mean.

Given this, I don't think that it is realistic to hope that support for negative weights will be general, or that it will not surprise people and will work in a way that they expect.

I would personally advise trying to limit it as much as possible, to control the damage that it can do.

@ndawe
Member
ndawe commented Apr 30, 2015

Sure, but why must we explicitly prevent negative weights from being used? At the moment we aren't explicitly adding support for negative weights. We just have support for weights, and the user can use any values they desire. IMHO it's the responsibility of the user to be sure that the weights they are using make sense in their particular context. I don't think scikit-learn needs to hold their hand. Users ought to be aware that if they give an estimator garbage, they should expect garbage in return; in this case, feeding in a large portion of negative weight will result in negative class probabilities (unless the tree growth is properly controlled with stopping criteria like min_weight_fraction_leaf). Anyway, that's just my opinion.

@alexpearce

I agree with @ndawe. Perhaps I should try to explain in a different way.

A set of signal & background weights for a single event does not mean much, physically. The weights are computed from the result of a maximum likelihood fit and this can only discriminate between S and B on a statistical basis, that is across a ‘large enough’ sum of some subset. It cannot say “this is signal” or “this is background”.

With that, when you apply a requirement on your input data, as is done in a decision tree, the sum of the weights of the remaining data is still meaningful: it is the number of S & B events passing the requirement. (This assumes the training variables are uncorrelated with the variable used in the maximum likelihood fit, but satisfying this requirement is of course the responsibility of the user.)

If you adjust your hyperparameters such that you could end up with one of these sums being negative, or you otherwise use input data that is dominated by background (such that many sets of requirements are likely to lead to the sum of signal weights being negative), then the classifier will struggle to make sense of what's going on. As @ndawe says, giving the classifier something ‘reasonable’ to work with is the responsibility of the user, but it would be nice if scikit-learn supported at least such ‘reasonable’ weighted datasets. TMVA, the classification package often used in HEP, does not cope well regardless of the input; it too requires the user to be careful when using sWeighted inputs.

To clarify my earlier post, I'm not saying “because the HEP community do this and it works for them, it is justified, correct, and should be implemented”. As you say @GaelVaroquaux, scikit-learn is based on well-motivated machine learning principles, and shouldn't bend to a relatively small community who have found a corner that seems to work. But, if negative weights are not strictly disallowed by the theory, I think scikit-learn should support them. (And maybe that is already the case, I still need to test scikit-learn with sWeights. Perhaps you've already tested with negative weights from MC generators @ndawe?)

@ndawe
Member
ndawe commented Apr 30, 2015

Yes, I've had O(1-10%) of my dataset containing negative weights and had no problem at all with the tree models in scikit-learn.

It's important to note that negative weights don't only enter in through the MC generator, but can also be the result of certain subtraction techniques in background estimation. For those not familiar with this, it is rather common to have a background process estimated by real data where some portion of simulated data has been subtracted. Say we want to get a handle on some process we can't model very well in simulation, but we can box it up in a region of the real data that we observe. Say we know that this box also contains some other processes that are not of interest, but can be well-modelled by simulation. We then construct a dataset that is composed of the real data in this box, with positive weight, and the simulated data in this box, with negative weight. Then we have a nice handle on this difficult process and can accurately model the shape of various feature distributions (kinematics, event shapes, etc. in our context). We then also want the node-splitting to be aware of this subtraction (manifest in the negative weights) when learning our classification problem since it affects the shapes of the feature distributions and thus the locations of the optimal splits.
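
A sketch of building such a subtraction-weighted training set (shapes and names made up for illustration):

import numpy as np

rng = np.random.RandomState(0)
X_data = rng.randn(1000, 5)   # real events observed in the control box
X_sim = rng.randn(200, 5)     # well-modelled contamination to subtract

# Background model: real data with weight +1, simulation with weight -1.
X_bkg = np.vstack([X_data, X_sim])
w_bkg = np.concatenate([np.ones(len(X_data)), -np.ones(len(X_sim))])

# The weight sum estimates the yield of the hard-to-model process (800).
print(w_bkg.sum())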

@betatim
Member
betatim commented Apr 30, 2015

I think I agree with @GaelVaroquaux that HEP is a niche and you don't want to make things more complicated for the non-HEP-ers by introducing weird behaviour. However, if we can organise things so that sklearn doesn't stand in the way of using -ve weights, or come up with a recipe for using negative weights (@glouppe seems to suggest you can just flip the sign??), that would be super.

To potentially add to the confusion, @ndawe and @alexpearce, do either of you know if the negative weights that arise in generators like MC@NLO "mean" the same thing as the negative weights in sPlots? They come about in very different ways, but I think they "mean" the same, though I'm unsure.

I think negative weights only make sense if you think of a group of n_samples together. Basically when you ask questions like: what does the (binned) distribution of this feature look like? What is the mean value of this feature? If you try to think about a single sample with a negative weight, it gets tricky/meaningless.
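
A tiny illustration with toy numbers: aggregate quantities like a weighted mean stay well defined with signed weights, even though a lone negatively weighted sample has no standalone interpretation.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 1.0, -0.5, 1.0])   # one negative weight in the group

# np.average computes sum(x * w) / sum(w); signed weights are fine as
# long as the weights do not sum to zero.
print(np.average(x, weights=w))       # 2.2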

@glouppe
Contributor
glouppe commented Apr 30, 2015

If you try to think about a single sample with a negative weight, it gets tricky/meaningless.

But this is where it is important for ML methods. For us, the (normalized) sample weight of a sample should really be seen as (an estimate of) its probability of occurrence.

@betatim
Member
betatim commented Apr 30, 2015

If you try to think about a single sample with a negative weight, it gets tricky/meaningless.

But this is where it is important for ML methods. For us, the (normalized) sample weight of a sample should really be seen as (an estimate of) its probability of occurrence.

Exactly. Going slightly OT, I wonder if the right way to deal with this is to sample from your pool of samples (with a method that takes into account the negative weights somehow) and then use the re-sampled samples to train a ML method.

@arjoly
Member
arjoly commented Apr 30, 2015

Are there papers showing that using negative sample weights is significantly better than discarding those samples or putting them in a third or fourth class?

@alexpearce

For sWeights, throwing away the negative weights invalidates the properties (sum of weights for a species is the species count, etc.). I would suspect the same is true for "normalising" the weights into the [0, 1] range.

@ndawe
Member
ndawe commented Apr 30, 2015

@arjoly my attempt at describing a very common scenario is above. The short answer is yes.

I just don't see the point in discussing this endlessly. We keep repeating the same points. scikit-learn should just not care. As long as I can specify weights as I wish and not have some exception needlessly thrown if a weight is negative, then I am happy (and I know many others in my field would also be happy). This isn't supporting a niche, but just being indifferent.

@ndawe
Member
ndawe commented Apr 30, 2015

@arjoly You don't often read this level of detail (the treatment of negative weights in a classification problem, etc.) in a final published result out of some big experiment, but I was directly involved in producing this recent paper: http://link.springer.com/article/10.1007/JHEP04%282015%29117. We tried both including and excluding the events ("samples" in scikit-learn language) with negative weight, and we had the best expected sensitivity to seeing the Higgs boson coupling to tau leptons when we included the events in our background model with negative weights (I also used scikit-learn for my contribution to this result).

@arjoly
Member
arjoly commented Apr 30, 2015

Thanks @ndawe !

@amueller amueller modified the milestone: 0.19 Sep 29, 2016
@amueller amueller modified the milestone: 0.19 Jun 12, 2017
JohnStott added a commit to JohnStott/scikit-learn that referenced this issue Jul 9, 2018
Tree MAE is not considering sample_weights when calculating impurity!

In the proposed fix, you will see I have multiplied by the sample weight *after* applying the absolute value to the difference (not before). This is in line with the consensus / discussion found here, where negative sample weights are considered: scikit-learn#3774 (and also because during initialisation, self.weighted_n_node_samples is a summation of the sample weights with no "absolute" applied; this is used as the impurity divisor).
@rth
Member
rth commented Nov 4, 2019

#15531 proposes to address this by checking for positive sample weight by default (maybe starting with warnings instead of errors), but with a global parameter in sklearn.set_config which can be used to disable these checks for HEP users.
