Ensemble models (and maybe others?) don't check for negative sample_weight #3774
I think this should be a quick fix. Have you already fixed this?
Nope, I just went back to my own code when I figured out the bug wasn't there. Be my guest!
Err. I cannot reproduce this. In any case, does a negative sample_weight have any meaning?
Not even if you set `random_state`?
I'm not sure if negative sample weight has any meaning; I produced them by accident. I can't really imagine a well-defined meaning, but I didn't think this through.
I can reproduce it if I set random_state to 42 :). Should we just raise an error in that case?
I'm not sure. What do other estimators do?
I'd say so.
Should I do it myself or leave the issue open for potential new contributors?
If you have time, please do it. There will be other easy issues.
Hi, negative sample weights are not checked on purpose. In some applications they do arise.
Interesting, but just out of curiosity, what is a negative weight supposed to mean intuitively?
The meaning of negative weight depends on the context of course. Here is our previous discussion: The fact is, some MC event generators produce events with negative weights because of various corrections, cancellations, etc. I can only comment on this particular case in HEP. Who knows what other valid negative-weight situations are out there in other fields. The conclusion seemed to be that if we can support negative weights (within reason...) then why not. Supporting them really means not preventing their use. Of course if the majority of your samples have negative weights, then you will get garbage... Garbage in, garbage out. In the situations I've faced, only a very small fraction of the dataset had negative weights.
The DT node splitting should prevent a child node from obtaining an overall negative weight. I'm not sure if this protection was implemented correctly everywhere for all forms of splitting. If the root node cannot be split because all possible splits would result in a child with negative weight, then I suppose an exception should be raised. Otherwise a DT should be able to deal with it.
Coming back to this discussion: over at #1488, @ndawe said
I think I've already proven that negative weights adversely affect the common case, being so arrogant as to label my use case as common. I also still have no intuition as to what negative weights represent (I know what they do, but not why I'd ever need them). Does your use case preclude doing some pre-processing to get rid of the negative samples?
After a very nice talk given by @glouppe to the LHCb collaboration today, someone asked a question about the possibility of using negative weights for the training sample. This is very common in high energy physics (HEP), and the feeling I got from the answer was that negative weights are not supported by scikit-learn. So, I would like to know: is the use of training samples including negative weights ‘supported’ by scikit-learn?

Negative weights how?

I'll give a bit of background as to how negative weights crop up in HEP, in case it helps or anyone is interested.

The most common usage of classifiers that I know of is to discriminate between a ‘signal’ species and a ‘background’ species. You train a classifier, often a BDT, with a signal sample and a background sample, then compute the responses of the trained BDT on your data and make some selection requirement on it (e.g. a cut on the response).

Ideally you would use real data, rather than simulation, for both training samples, because then you don't have to worry about your simulation perfectly modelling the data. However, obtaining a ‘pure’ dataset for each species is tricky, because the data you have to hand is some mix of the two. What might be available to you, though, is a particular variable/feature with distinctly different behaviour for signal and background. Most often this is the mass of some particle, where the signal is a Gaussian distribution centered around the true mass and the background some constant slope, for example.

So, you can try to model this distribution with one probability density function per component, signal and background, and perform a maximum likelihood fit to obtain the relative yields (the amounts) of each component in your dataset.

Then enters the sPlot technique. This takes as input the result of the likelihood fit, and gives you one weight per species per input vector, so in this case one signal weight and one background weight. These weights have a number of interesting statistical properties, the most important being that the sum of the weights for a given species returns the number of that species in the dataset, even if the dataset has been partitioned by some other feature (assuming that the partitioning feature is uncorrelated with the discriminating feature that was fitted). This allows you to statistically unfold a feature's signal and background distributions as histograms simply by summing the appropriate weight within each histogram bin. Due to the way these weights, often called sWeights, are computed, they can often be negative, but the “negativity” within an unfolded distribution cancels on a statistical basis.

With all that, you now want to give your BDT a signal training sample, which you now have to hand! By passing negative weights, you are just thinking of giving the BDT your signal ‘distributions’, rather than individual negatively weighted vectors.
Thanks for the feedback @alexpearce! After checking what we did back in the implementation, negative sample weights are in fact supported. However, nothing is done to check whether what is happening makes sense or not. In particular, it may happen that the overall weight of a node becomes negative, in which case how to split such a node or how to make a prediction out of it is undefined. In addition, from a pure machine learning point of view, it should be understood by the HEP community how negative weights are handled, and whether it makes sense or not for their application -- which I am not so sure about. In the case of trees, negative weights affect the impurity criteria (which assume positive weights), the decrease of impurity computation (which also assumes positive weights) and the way predictions are computed.
And thanks again @glouppe! I agree that negative weights are something that should be better understood. I suspect the problem is that the most commonly used classification packages, TMVA and NeuroBayes, do support negative weights. And, again, because ‘support’ means that people who use negative weights see good separation power in the trained classifier, they do not care about the theoretical details. (This is of course a very bad thing.) I'm not enough of an expert to know internally how either package actually deals with the theoretical problems of negative weights. It would be interesting to try, say, scikit-learn vs. TMVA, with negative weights and the same classifier, and see how the results differ.
In fact, it is not so difficult to understand what is happening. Intuitively, in my opinion, having negative weights is like adding samples of the opposite class (e.g., having negative signal events is like having positive background events). Take for example how predictions are computed within a leaf: the predicted class probabilities are the per-class sample weight sums in that leaf, normalised by the total weight. If the weight sum for the first class is negative, its ‘probability’ becomes negative and the other class's exceeds one, so the negative weight effectively pushes the prediction towards the other class. Similarly, the same kind of reasoning could be carried out for the impurity reduction, and I am sure we would end up with the same kind of conclusions. So this is what is happening when you blindly apply negative sample weights to the same algorithm. My question then is: is this what physicists actually expect?
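To make the arithmetic concrete, here is a minimal sketch (the numbers are made up, not from the thread) of how a leaf's predicted probabilities behave once one class accumulates a negative weight sum; it mirrors how scikit-learn classification trees normalise per-class weight sums by the total weight in the leaf:

```python
# Hypothetical weight sums inside a single leaf (illustrative values only).
w_class_0 = -2.0   # negative overall weight for class 0
w_class_1 = 5.0    # positive overall weight for class 1

# Classification trees predict by normalising the per-class weight sums
# by the total weight in the leaf.
total = w_class_0 + w_class_1
p_0 = w_class_0 / total
p_1 = w_class_1 / total

print(p_0, p_1)  # -> -0.666... and 1.666...: no longer valid probabilities
```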
I know that at one point I did have protection in the node splitting against the overall weight becoming negative, but that code has since changed and I think that protection was lost.
Actually, the
@glouppe yes, that's a good way of looking at it. In the case of binary classification, it should be identical to flipping the class label and modifying the weight accordingly. I'd like to see a test of this with each of the classification and regression criteria to check whether we do in fact build the same trees. But I suppose this gets more complicated for multiclass problems. Although the majority of problems in HEP are binary signal vs background, multiclass problems can crop up in things like decay mode classification, or in general with analyses that consider multiple signal hypotheses. I wonder if it's then like replacing negative samples from one class with a sample from each of the other classes with the appropriate positive weight.
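A rough way to probe that equivalence could look like the sketch below. It is not a test from the scikit-learn suite, just an illustration with invented data, and it assumes a scikit-learn version that still accepts negative sample weights; whether the two trees actually match is exactly the open question.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] > 0).astype(int)

# Dataset A: a few samples carry a negative weight.
w_neg = np.ones(len(y))
w_neg[:10] = -0.5

# Dataset B: the same samples with the label flipped and a positive weight.
y_flip = y.copy()
y_flip[:10] = 1 - y_flip[:10]
w_flip = np.abs(w_neg)

tree_a = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=w_neg)
tree_b = DecisionTreeClassifier(random_state=0).fit(X, y_flip, sample_weight=w_flip)

# Compare the learned structures.
print(export_text(tree_a) == export_text(tree_b))
```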
We need to update the documentation here: http://scikit-learn.org/stable/modules/tree.html#mathematical-formulation

Currently it doesn't mention sample weights, even though they are indeed considered.
I still don't quite grasp what the negative weights mean @alexpearce. Is there a way to understand them without reading the sPlot paper?
The reason for getting negative sWeights is due to negative components in the covariance matrix that is used to compute them. With both positive and negative weights, the sum of the weights of a given species gives the number of that species, whilst the sum of the squares gives the square of the statistical uncertainty on that number. What the ‘physical’ meaning of these negative weights is when given to a classifier I am less sure. Conceptually, as someone with little understanding of the technicalities, I think physicists in HEP ‘expect’ these sWeights to work because using them allows you to see some separation in signal and background distributions, as histograms, by weighting the input data with signal and background weights respectively. So, we think “if I can see the separation between S and B in the two histograms, then I can feed them to a classifier and it too should see this difference”. How accurate this assumption is… Again I am less sure. What I know is that many classifiers have been trained with sWeights and good discriminatory power has been seen.
I am very uncomfortable with negative sample weights. They seem to be … Given this, I don't think that it is realistic to hope that support for … I would personally advise trying to limit it as much as possible, to …
Sure, but why must we explicitly prevent negative weights from being used? At the moment we aren't explicitly adding support for negative weights. We just have support for weights, and the user can use any values they desire. IMHO it's the responsibility of the user to be sure that the weights they are using make sense in their particular context. I don't think scikit-learn needs to hold their hand. Users ought to be aware that if they give an estimator garbage, they should expect garbage in return. In this case, feeding in a large portion of negative weight will result in negative class probabilities, unless the tree growth is properly controlled with appropriate stopping criteria.
I agree with @ndawe. Perhaps I should try to explain in a different way. A set of signal & background weights for a single event does not mean much, physically. The weights are computed from the result of a maximum likelihood fit and this can only discriminate between S and B on a statistical basis, that is across a ‘large enough’ sum of some subset. It cannot say “this is signal” or “this is background”. With that, when you apply a requirement on your input data, as is done in a decision tree, the sum of the weights of the remaining data is still meaningful: it is the number of S & B events passing the requirement. (This assumes the training variables are uncorrelated with the variable used in the maximum likelihood fit, but satisfying this requirement is of course the responsibility of the user.) If you adjust your hyperparameters such that you could end up with one of these sums being negative, or you otherwise use input data that is dominated by background (such that many sets of requirements are likely to lead to the sum of signal weights being negative), then the classifier will struggle to make sense of what's going on. As @ndawe says, giving the classifier something ‘reasonable’ to work with is the responsibility of the user, but it would be nice if scikit-learn supported at least such ‘reasonable’ weighted datasets. TMVA, the classification package often used in HEP, does not cope well regardless of the input; it too requires the user to be careful when using sWeighted inputs. To clarify my earlier post, I'm not saying “because the HEP community do this and it works for them, it is justified, correct, and should be implemented”. As you say @GaelVaroquaux, scikit-learn is based on well-motivated machine learning principles, and shouldn't bend to a relatively small community who have found a corner that seems to work. But, if negative weights are not strictly disallowed by the theory, I think scikit-learn should support them. (And maybe that is already the case, I still need to test scikit-learn with sWeights. Perhaps you've already tested with negative weights from MC generators @ndawe?)
Yes, I've had O(1-10%) of my dataset containing negative weights and had no problem at all with the tree models in scikit-learn. It's important to note that negative weights don't only enter in through the MC generator, but can also be the result of certain subtraction techniques in background estimation. For those not familiar with this, it is rather common to have a background process estimated by real data where some portion of simulated data has been subtracted. Say we want to get a handle on some process we can't model very well in simulation, but we can box it up in a region of the real data that we observe. Say we know that this box also contains some other processes that are not of interest, but can be well-modelled by simulation. We then construct a dataset that is composed of the real data in this box, with positive weight, and the simulated data in this box, with negative weight. Then we have a nice handle on this difficult process and can accurately model the shape of various feature distributions (kinematics, event shapes, etc. in our context). We then also want the node-splitting to be aware of this subtraction (manifest in the negative weights) when learning our classification problem since it affects the shapes of the feature distributions and thus the locations of the optimal splits.
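As a schematic illustration of that subtraction, the training set might be assembled along these lines. All arrays below are invented placeholders (random noise, not real physics data), and it assumes a scikit-learn version that accepts negative sample weights.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

X_signal = rng.randn(500, 4) + 1.0   # simulated signal sample
X_control = rng.randn(800, 4)        # real data in the control box
X_contam = rng.randn(100, 4)         # well-modelled contamination to subtract

X = np.vstack([X_signal, X_control, X_contam])
y = np.concatenate([np.ones(len(X_signal)),
                    np.zeros(len(X_control)),
                    np.zeros(len(X_contam))])

# Background model = control-box data (weight +1) minus the simulated
# contamination (weight -1); signal keeps weight +1.
w = np.concatenate([np.ones(len(X_signal)),
                    np.ones(len(X_control)),
                    -np.ones(len(X_contam))])

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X, y, sample_weight=w)
```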
I think I agree with @GaelVaroquaux that HEP is a niche and you don't want to make things more complicated for the non HEP-ers by introducing weird behaviour. However, if we can organise things so that sklearn doesn't stand in the way of you using -ve weights, or maybe come up with a recipe so that you can use negative weights (@glouppe seems to suggest you can just flip the sign??), that would be super. To potentially add to the confusion, @ndawe and @alexpearce, do either of you know if the negative weights that arise in generators like MC@NLO "mean" the same thing as the negative weights in sPlots? They come about in very different ways, but I think they "mean" the same, unsure though. I think negative weights only make sense if you think of a group of samples together, rather than one sample at a time.
But this is where it is important for ML methods. For us, the (normalized) sample weight of a sample should really be seen as (an estimate of) its probability of occurrence.
Are there papers showing that using negative sample weights is significantly better than discarding those samples or putting them in a third or a fourth class?
For sWeights, throwing away the negative weights invalidates the properties (sum of weights for a species is the species count, etc.). I would suspect the same is true for "normalising" the weights into the [0, 1] range.
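A toy numerical illustration of those two properties (the sWeights below are invented, not taken from any fit):

```python
import numpy as np

s_weights = np.array([0.9, 1.1, -0.2, 0.7, -0.1, 0.8])  # invented sWeights

yield_estimate = s_weights.sum()               # estimated species count
uncertainty = np.sqrt((s_weights ** 2).sum())  # statistical uncertainty on it

print(yield_estimate, uncertainty)             # ~3.2 +/- ~1.79

# Discarding the negative weights shifts the estimate, which is why
# simply dropping them is not an option.
print(s_weights[s_weights > 0].sum())          # 3.5
```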
@arjoly my attempt at describing a very common scenario is above. The short answer is yes. I just don't see the point in discussing this endlessly. We keep repeating the same points. scikit-learn should just not care. As long as I can specify weights as I wish and not have some exception needlessly thrown if a weight is negative, then I am happy (and I know many others in my field would also be happy). This isn't supporting a niche, but just being indifferent.
@arjoly You don't often read this level of detail in a final published result out of some big experiment (the treatment of negative weights in a classification problem, etc.), but having been directly involved in producing this recent paper: http://link.springer.com/article/10.1007/JHEP04%282015%29117 I know that we tried both including and excluding the events ("samples" in scikit-learn language) with negative weight, and we had the best expected sensitivity to seeing the Higgs boson coupling to tau leptons when we included the events in our background model with negative weights (I also used scikit-learn for my contribution to this result).
Thanks @ndawe!
Tree MAE is not considering sample_weights when calculating impurity! In the proposed fix, you will see I have multiplied by the sample weight *after* taking the absolute value of the difference (not before). This is in line with the consensus / discussion found here, where negative sample weights are considered: scikit-learn#3774 (and also because, during initialisation, self.weighted_n_node_samples is a summation of the sample weights with no "absolute" applied; this is used in the impurity divisor).
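For illustration only, the weighting order described above can be restated in plain Python. This is not scikit-learn's internal Cython criterion; the function name is hypothetical and a plain median stands in for the weighted median the real criterion uses.

```python
import numpy as np

def weighted_mae_impurity(y, sample_weight):
    """Apply the weight to |y_i - median|, i.e. after the absolute value,
    and divide by the plain (not absolute) sum of the weights, so negative
    weights flow through unchanged."""
    median = np.median(y)  # stand-in for the weighted median
    return (sample_weight * np.abs(y - median)).sum() / sample_weight.sum()

y = np.array([1.0, 2.0, 10.0])
w = np.array([1.0, 1.0, -0.5])
print(weighted_mae_impurity(y, w))  # can even come out negative (-2.0 here)
```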
#15531 proposes to address this by checking for positive sample weight by default (maybe starting with warnings instead of errors), but with a global parameter in |
When sample weights are negative, the probabilities can come out negative as well:
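The original reproduction snippet is not preserved above; a minimal example along the same lines (toy data, a single tree for simplicity, and assuming a scikit-learn version that still accepts negative sample weights) would be:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Constant feature, so the root node cannot be split and becomes a leaf.
X = np.zeros((3, 1))
y = np.array([0, 1, 1])
w = np.array([-1.0, 2.0, 2.0])   # class 0 ends up with a negative weight sum

clf = DecisionTreeClassifier().fit(X, y, sample_weight=w)
print(clf.predict_proba(X[:1]))  # approximately [[-0.333, 1.333]]
```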