It seems like half the bugs we've solved in the past couple of months surround problems with differing sets of classes, in the context of:
- cross-validation splitting that yields different subsets of classes in different training or testing subsets (and hence issues in aligning class-wise outputs from `predict_proba`, `decision_function` or metrics, or in normalising macro-averaged scores); see the sketch after this list
- `partial_fit`, where `classes` are specified upfront, but then repeated calls need matching to those classes
- `warm_start`, where `classes_` from the first fit must be identical to the set of classes in `y` in each call to `fit()`
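A minimal sketch of the cross-validation case (everything here is the standard scikit-learn API; the fold split is contrived to drop a class):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 1, 1, 2, 2, 2, 2])

# Simulate a training fold whose split happens to contain only classes {0, 1}.
clf = LogisticRegression().fit(X[:4], y[:4])

print(clf.classes_)                # [0 1] -- only the classes seen in this fold
print(clf.predict_proba(X).shape)  # (8, 2) -- two columns, not three

# Stacking this with another fold trained on all three classes misaligns
# columns unless the caller re-indexes against the full class set.
```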
These are all subtly different problems, but at the moment it seems like we're handling them on an ad-hoc (and too often a post-hoc) basis.
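To make the contrast concrete, the `partial_fit` case looks like this (standard scikit-learn API; the data is made up):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()

# The full set of classes must be declared on the first call, even though
# this batch only contains classes 0 and 1.
clf.partial_fit(np.array([[0.0], [1.0]]), np.array([0, 1]),
                classes=np.array([0, 1, 2]))

# Later batches may contain any subset of the declared classes, but each
# call has to be matched back against that upfront list.
clf.partial_fit(np.array([[2.0]]), np.array([2]))
```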
It would be amazing if someone could review these issues and identify where either API changes (`classes` as a constructor parameter to classifiers has been suggested) or helper utilities might help avoid such problems in the future.
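As one possible shape for such a helper (purely hypothetical; `align_proba` is not an existing utility, and it assumes `all_classes` is sorted, as `classes_` always is):

```python
import numpy as np

def align_proba(probs, fold_classes, all_classes):
    """Re-index probability columns (ordered by ``fold_classes``) onto the
    positions those classes occupy in the sorted array ``all_classes``,
    zero-filling classes the fold never saw."""
    aligned = np.zeros((probs.shape[0], len(all_classes)))
    aligned[:, np.searchsorted(all_classes, fold_classes)] = probs
    return aligned

# e.g. for a fold fit on classes [0, 1] out of [0, 1, 2]:
# align_proba(clf.predict_proba(X), clf.classes_, np.array([0, 1, 2]))
# -> shape (n_samples, 3), with a zero column for the unseen class 2
```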