Enforce positive sample_weight #15531
The global config is an interesting approach. Seems reasonable to me at first glance.
I think zero should always be allowed as a valid value for sample weights, but sum(sw) > 0 should hold. As for our dear HEP friends who want negative sw: in linear models, sample weights are often implemented by scaling X and y by sqrt(sw), which makes supporting negative values difficult.
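A minimal sketch of that rescaling trick, to show why negative weights break it: weighted least squares on `(X, y, sw)` is equivalent to ordinary least squares on `(sqrt(sw) * X, sqrt(sw) * y)`, but `sqrt` of a negative weight is undefined (NaN in floating point). All names here are illustrative, not scikit-learn internals.

```python
import numpy as np

# Weighted least squares: minimize sum(sw_i * (y_i - x_i @ w)**2).
# This equals ordinary least squares on sqrt(sw) * X and sqrt(sw) * y.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_coef = np.array([1.0, -2.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.01, size=50)
sw = rng.uniform(0.5, 2.0, size=50)  # strictly positive weights

sqrt_sw = np.sqrt(sw)
coef, *_ = np.linalg.lstsq(X * sqrt_sw[:, None], y * sqrt_sw, rcond=None)
# coef recovers true_coef up to the noise level

# A negative weight breaks the trick: sqrt returns NaN.
with np.errstate(invalid="ignore"):
    assert np.isnan(np.sqrt(-1.0))
```

Estimators that apply weights directly in the loss (e.g. tree-based models) do not share this particular obstacle, which is part of why support differs per estimator.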
I like the config, but do we currently have a need for the first parameter? If we don't, maybe it's not worth implementing it (for now). Also, erroring by default on negative sw would technically be a backward-incompatible change, though we may present it as a bug fix.
It would depend on the estimator. Some estimators should support negative sample weights, and some others currently error or should error (such as SVM, which was failing to validate the weights). We should probably have a common test and a specific tag that indicates whether or not an estimator supports negative weights. That said, I am not against being able to activate/deactivate support for negative sample weights. If negative weights are really uncommon, we should probably error most of the time and override this behaviour with the config, unless the estimator does not support negative weights (maybe via the tag).
If there are estimators which support negative sample weights, then it should be an estimator tag, not a global config. @glemaitre which ones logically support a negative sample weight now?
I think the config still makes sense, since negative sample weights don't make sense in the vast majority of applications. You really want to know what you're doing when using negative SW. Whether estimators can support them is another matter, independent of the application.
Another issue with the global config is that …
It's the same idea as the …
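The truncated reference above may be to scikit-learn's existing `assume_finite` global flag, which follows the same pattern: validation is on by default but can be switched off globally by users who know their data. A small demonstration of that existing mechanism:

```python
import numpy as np
import sklearn
from sklearn.utils import check_array

X = np.array([[1.0, 2.0], [3.0, np.nan]])

# By default, input validation rejects NaN/inf.
try:
    check_array(X)
    checked_by_default = False
except ValueError:
    checked_by_default = True

# Inside the context, the finiteness check is skipped entirely.
with sklearn.config_context(assume_finite=True):
    X_out = check_array(X)  # no error despite the NaN
```

The proposed `assume_positive_sample_weights` flag would plug into the same `set_config` / `config_context` machinery.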
See #3774 (comment) for more context. This issue is really about adding checks that would benefit the vast majority of users, while avoiding breaking existing functionality for a small fraction of users in HEP. We definitely don't want support for negative sample weights in more models (unless someone has a very compelling argument for it), and I expect that most models simply won't work with negative sample weights irrespective of this config parameter.
#20880 will add the parameter. However, we should make a larger audit of our estimators with common tests. Some estimators should not accept negative weights and will break explicitly (for instance due to taking the sqrt of negative weights), or perhaps more silently with nonsensical results.
There is a test, test_has_fit_parameter(), which makes assertions about the existence of the sample_weight parameter in the fit method of some estimators. This list (or something similar elsewhere) could be completed with all the estimators, so that we maintain where we have a sample_weight parameter (and where we don't). In addition, for every estimator there should be a decision about the negative-weight case: if it is not supported, an error would be raised in all cases, and this would also be tested. What do you think? I created a list of all estimators (and their inherited classes), helper functions, and metric functions which use a sample_weight or sample_weights named parameter anywhere. Using this, one can make decisions about each function or class in question.
Indeed, we should create a common test that by default tests all estimators. We should skip the test when …
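A sketch of what such a common test could look like, using a toy estimator in place of the real ones; the tag name `allow_negative_sample_weight` is hypothetical, not an existing scikit-learn tag:

```python
import numpy as np

class ToyWeightedMean:
    """Toy estimator standing in for real ones in this sketch."""
    _tags = {"allow_negative_sample_weight": False}  # hypothetical tag

    def fit(self, X, y, sample_weight=None):
        sw = np.ones(len(y)) if sample_weight is None else np.asarray(sample_weight, dtype=float)
        if not self._tags["allow_negative_sample_weight"] and (sw < 0).any():
            raise ValueError("Negative values in sample_weight are not supported")
        self.mean_ = np.average(y, weights=sw)
        return self

def check_negative_sample_weight_raises(estimator, X, y):
    """Common-test sketch: fitting with one negative weight must raise
    ValueError, unless the estimator's tag opts it out."""
    if estimator._tags.get("allow_negative_sample_weight", False):
        return "skipped"
    sw = np.ones(len(y))
    sw[0] = -1.0
    try:
        estimator.fit(X, y, sample_weight=sw)
    except ValueError:
        return "raised"
    return "did not raise"

X = np.zeros((4, 1))
y = np.array([1.0, 2.0, 3.0, 4.0])
print(check_negative_sample_weight_raises(ToyWeightedMean(), X, y))  # raised
```

In the real test suite this would be a parametrized check over `all_estimators()`, with the skip expressed via the estimator-tags mechanism rather than a class attribute.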
Could you paste the list directly in the issue and hide it with the …
The list is updated. This is a list of all estimators (and their inherited classes), helper functions, and metric functions which use a sample_weight or sample_weights named parameter anywhere. If there is () after a name it is a function; if there are no parentheses it is a class. Checks in test files are excluded.

Estimators (the indented classes are inherited):
```
BaseDecisionTree : sklearn/tree/_classes.py
- DecisionTreeClassifier : sklearn/tree/_classes.py
- ExtraTreeClassifier : sklearn/tree/_classes.py
- DecisionTreeRegressor : sklearn/tree/_classes.py
- ExtraTreeRegressor : sklearn/tree/_classes.py
LinearSVC : sklearn/svm/_classes.py
LinearSVR : sklearn/svm/_classes.py
BaseLibSVM : sklearn/svm/_base.py
- BaseSVC : sklearn/svm/_base.py
- SVC : sklearn/svm/_base.py
- NuSVC : sklearn/svm/_base.py
- SVR : sklearn/svm/_classes.py
- NuSVR : sklearn/svm/_classes.py
- OneClassSVM : sklearn/svm/_classes.py
SplineTransformer : sklearn/preprocessing/_polynomial.py
StandardScaler : sklearn/preprocessing/_data.py
KernelDensity : sklearn/neighbors/_kde.py
GaussianNB : sklearn/naive_bayes.py
_BaseDiscreteNB : sklearn/naive_bayes.py
- MultinomialNB : sklearn/naive_bayes.py
- ComplementNB : sklearn/naive_bayes.py
- BernoulliNB : sklearn/naive_bayes.py
- CategoricalNB : sklearn/naive_bayes.py
_MultiOutputEstimator : sklearn/multioutput.py
- MultiOutputRegressor : sklearn/multioutput.py
- MultiOutputClassifier : sklearn/multioutput.py
_BaseScorer : sklearn/metrics/_scorer.py
- _PredictScorer : sklearn/metrics/_scorer.py
- _ProbaScorer : sklearn/metrics/_scorer.py
- _ThresholdScorer : sklearn/metrics/_scorer.py
BaseSGDClassifier : sklearn/linear_model/_stochastic_gradient.py
- SGDClassifier : sklearn/linear_model/_stochastic_gradient.py
- PassiveAggressiveClassifier : sklearn/linear_model/_passive_aggressive.py
BaseSGDRegressor : sklearn/linear_model/_stochastic_gradient.py
- SGDRegressor : sklearn/linear_model/_stochastic_gradient.py
- PassiveAggressiveRegressor : sklearn/linear_model/_passive_aggressive.py
SGDOneClassSVM : sklearn/linear_model/_stochastic_gradient.py
_BaseRidge : sklearn/linear_model/_ridge.py
- Ridge : sklearn/linear_model/_ridge.py
- RidgeClassifier : sklearn/linear_model/_ridge.py
_RidgeGCV : sklearn/linear_model/_ridge.py
_BaseRidgeCV : sklearn/linear_model/_ridge.py
- RidgeCV : sklearn/linear_model/_ridge.py
- RidgeClassifierCV : sklearn/linear_model/_ridge.py
RANSACRegressor : sklearn/linear_model/_ransac.py
QuantileRegressor : sklearn/linear_model/_ransac.py
LogisticRegression : sklearn/linear_model/_logistic.py
- LogisticRegressionCV : sklearn/linear_model/_logistic.py
HuberRegressor : sklearn/linear_model/_huber.py
GeneralizedLinearRegressor : sklearn/linear_model/_glm/glm.py
- PoissonRegressor : sklearn/linear_model/_glm/glm.py
- GammaRegressor : sklearn/linear_model/_glm/glm.py
- TweedieRegressor : sklearn/linear_model/_glm/glm.py
ElasticNet : sklearn/linear_model/_coordinate_descent.py
- Lasso : sklearn/linear_model/_coordinate_descent.py
- MultiTaskElasticNet : sklearn/linear_model/_coordinate_descent.py
- MultiTaskLasso : sklearn/linear_model/_coordinate_descent.py
LinearModelCV : sklearn/linear_model/_coordinate_descent.py
- LassoCV : sklearn/linear_model/_coordinate_descent.py
- ElasticNetCV : sklearn/linear_model/_coordinate_descent.py
- MultiTaskElasticNetCV : sklearn/linear_model/_coordinate_descent.py
- MultiTaskLassoCV : sklearn/linear_model/_coordinate_descent.py
BayesianRidge : sklearn/linear_model/_bayes.py
LinearRegression : sklearn/linear_model/_base.py
KernelRidge : sklearn/kernel_ridge.py
IsotonicRegression : sklearn/isotonic.py
BaseWeightBoosting : sklearn/ensemble/_weight_boosting.py
- AdaBoostClassifier : sklearn/ensemble/_weight_boosting.py
- AdaBoostRegressor : sklearn/ensemble/_weight_boosting.py
_BaseVoting : sklearn/ensemble/_voting.py
- VotingClassifier : sklearn/ensemble/_voting.py
- VotingRegressor : sklearn/ensemble/_voting.py
_BaseStacking : sklearn/ensemble/_stacking.py
- StackingClassifier : sklearn/ensemble/_stacking.py
- StackingRegressor : sklearn/ensemble/_stacking.py
BaseBagging : sklearn/ensemble/_bagging.py
- BaggingClassifier : sklearn/ensemble/_bagging.py
- BaggingRegressor : sklearn/ensemble/_bagging.py
- IsolationForest : sklearn/ensemble/_iforest.py
BaseHistGradientBoosting : sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py
- HistGradientBoostingRegressor : sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py
- HistGradientBoostingClassifier : sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py
BaseGradientBoosting : sklearn/ensemble/_gb.py
- GradientBoostingClassifier : sklearn/ensemble/_gb.py
- GradientBoostingRegressor : sklearn/ensemble/_gb.py
BaseForest : sklearn/ensemble/_forest.py
- ForestClassifier : sklearn/ensemble/_forest.py
- RandomForestClassifier : sklearn/ensemble/_forest.py
- ExtraTreesClassifier : sklearn/ensemble/_forest.py
- ForestRegressor : sklearn/ensemble/_forest.py
- RandomForestRegressor : sklearn/ensemble/_forest.py
- ExtraTreesRegressor : sklearn/ensemble/_forest.py
- RandomTreesEmbedding : sklearn/ensemble/_forest.py
DummyClassifier : sklearn/dummy.py
DummyRegressor : sklearn/dummy.py
EllipticEnvelope : sklearn/covariance/_elliptic_envelope.py
KMeans : sklearn/cluster/_kmeans.py
- MiniBatchKMeans : sklearn/cluster/_kmeans.py
DBSCAN : sklearn/cluster/_dbscan.py
CalibratedClassifierCV : sklearn/calibration.py
_SigmoidCalibration : sklearn/calibration.py
ClassifierMixin : sklearn/base.py
RegressorMixin : sklearn/base.py
```

Helper functions (in sklearn/utils):
```
_check_sample_weight() : sklearn/utils/validation.py
_weighted_percentile() : sklearn/utils/stats.py
count_nonzero() : sklearn/utils/sparsefuncs.py
class_distribution() : sklearn/utils/multiclass.py
_incremental_mean_and_var() : sklearn/utils/extmath.py
```

Estimator helper functions and classes (the indented classes are inherited):
```
make_dataset() : sklearn/linear_model/_base.py
_preprocess_data() : sklearn/linear_model/_base.py
_rescale_data() : sklearn/linear_model/_base.py
_weights_scorer() : sklearn/inspection/_permutation_importance.py
_calculate_permutation_scores() : sklearn/inspection/_permutation_importance.py
permutation_importance() : sklearn/inspection/_permutation_importance.py
fit_binary() : sklearn/linear_model/_stochastic_gradient.py
sag_solver() : sklearn/linear_model/_sag.py
_solve_cholesky_kernel() : sklearn/linear_model/_ridge.py
ridge_regression() : sklearn/linear_model/_ridge.py
_ridge_regression() : sklearn/linear_model/_ridge.py
_average_binary_score() : sklearn/metrics/_base.py
_logistic_loss_and_grad() : sklearn/linear_model/_logistic.py
_logistic_loss() : sklearn/linear_model/_logistic.py
_multinomial_loss() : sklearn/linear_model/_logistic.py
_logistic_grad_hess() : sklearn/linear_model/_logistic.py
_multinomial_loss_grad() : sklearn/linear_model/_logistic.py
_multinomial_grad_hess() : sklearn/linear_model/_logistic.py
_logistic_regression_path() : sklearn/linear_model/_logistic.py
_log_reg_scoring_path() : sklearn/linear_model/_logistic.py
_huber_loss_and_gradient() : sklearn/linear_model/_huber.py
_pre_fit() : sklearn/linear_model/_base.py
isotonic_regression() : sklearn/isotonic.py
_fit_single_estimator() : sklearn/ensemble/_base.py
_parallel_build_estimators() : sklearn/ensemble/_bagging.py
BaseLoss : sklearn/ensemble/_hist_gradient_boosting/loss.py
- LeastSquares : sklearn/ensemble/_hist_gradient_boosting/loss.py
- LeastAbsoluteDeviation : sklearn/ensemble/_hist_gradient_boosting/loss.py
- Poisson : sklearn/ensemble/_hist_gradient_boosting/loss.py
- BinaryCrossEntropy : sklearn/ensemble/_hist_gradient_boosting/loss.py
- CategoricalCrossEntropy : sklearn/ensemble/_hist_gradient_boosting/loss.py
LossFunction : sklearn/ensemble/_gb_losses.py
- RegressionLossFunction : sklearn/ensemble/_gb_losses.py
- LeastSquaresError : sklearn/ensemble/_gb_losses.py
- LeastAbsoluteError : sklearn/ensemble/_gb_losses.py
- HuberLossFunction : sklearn/ensemble/_gb_losses.py
- QuantileLossFunction : sklearn/ensemble/_gb_losses.py
- ClassificationLossFunction : sklearn/ensemble/_gb_losses.py
- BinomialDeviance : sklearn/ensemble/_gb_losses.py
- MultinomialDeviance : sklearn/ensemble/_gb_losses.py
- ExponentialLoss : sklearn/ensemble/_gb_losses.py
_parallel_build_trees() : sklearn/ensemble/_forest.py
k_means() : sklearn/cluster/_kmeans.py
_kmeans_single_elkan() : sklearn/cluster/_kmeans.py
_kmeans_single_lloyd() : sklearn/cluster/_kmeans.py
_labels_inertia() : sklearn/cluster/_kmeans.py
_labels_inertia_threadpool_limit() : sklearn/cluster/_kmeans.py
_mini_batch_step() : sklearn/cluster/_kmeans.py
dbscan() : sklearn/cluster/_dbscan.py
_fit_classifier_calibrator_pair() : sklearn/calibration.py
_fit_calibrator() : sklearn/calibration.py
_sigmoid_calibration() : sklearn/calibration.py
_fit_estimator() : sklearn/multioutput.py
_partial_fit_estimator() : sklearn/multioutput.py
_path_residuals() : sklearn/linear_model/_coordinate_descent.py
_fit_liblinear() : sklearn/svm/_base.py
```

Metrics (in sklearn/metrics):
```
mean_absolute_error() : sklearn/metrics/_regression.py
mean_pinball_loss() : sklearn/metrics/_regression.py
mean_absolute_percentage_error() : sklearn/metrics/_regression.py
mean_squared_error() : sklearn/metrics/_regression.py
mean_squared_log_error() : sklearn/metrics/_regression.py
median_absolute_error() : sklearn/metrics/_regression.py
explained_variance_score() : sklearn/metrics/_regression.py
r2_score() : sklearn/metrics/_regression.py
mean_tweedie_deviance() : sklearn/metrics/_regression.py
mean_poisson_deviance() : sklearn/metrics/_regression.py
mean_gamma_deviance() : sklearn/metrics/_regression.py
d2_tweedie_score() : sklearn/metrics/_regression.py
average_precision_score() : sklearn/metrics/_ranking.py
_binary_uninterpolated_average_precision() : sklearn/metrics/_ranking.py
det_curve() : sklearn/metrics/_ranking.py
_binary_roc_auc_score() : sklearn/metrics/_ranking.py
roc_auc_score() : sklearn/metrics/_ranking.py
_multiclass_roc_auc_score() : sklearn/metrics/_ranking.py
_binary_clf_curve() : sklearn/metrics/_ranking.py
precision_recall_curve() : sklearn/metrics/_ranking.py
roc_curve() : sklearn/metrics/_ranking.py
label_ranking_average_precision_score() : sklearn/metrics/_ranking.py
coverage_error() : sklearn/metrics/_ranking.py
label_ranking_loss() : sklearn/metrics/_ranking.py
dcg_score() : sklearn/metrics/_ranking.py
ndcg_score() : sklearn/metrics/_ranking.py
top_k_accuracy_score() : sklearn/metrics/_ranking.py
accuracy_score() : sklearn/metrics/_classification.py
confusion_matrix() : sklearn/metrics/_classification.py
multilabel_confusion_matrix() : sklearn/metrics/_classification.py
cohen_kappa_score() : sklearn/metrics/_classification.py
jaccard_score() : sklearn/metrics/_classification.py
matthews_corrcoef() : sklearn/metrics/_classification.py
zero_one_loss() : sklearn/metrics/_classification.py
f1_score() : sklearn/metrics/_classification.py
fbeta_score() : sklearn/metrics/_classification.py
precision_recall_fscore_support() : sklearn/metrics/_classification.py
precision_score() : sklearn/metrics/_classification.py
recall_score() : sklearn/metrics/_classification.py
balanced_accuracy_score() : sklearn/metrics/_classification.py
classification_report() : sklearn/metrics/_classification.py
hamming_loss() : sklearn/metrics/_classification.py
log_loss() : sklearn/metrics/_classification.py
hinge_loss() : sklearn/metrics/_classification.py
brier_score_loss() : sklearn/metrics/_classification.py
_weighted_sum() : sklearn/metrics/_classification.py
```

Metric plot (in sklearn/metrics/_plot):
```
RocCurveDisplay : sklearn/metrics/_plot/roc_curve.py
plot_roc_curve() : sklearn/metrics/_plot/roc_curve.py
PrecisionRecallDisplay : sklearn/metrics/_plot/precision_recall_curve.py
plot_precision_recall_curve() : sklearn/metrics/_plot/precision_recall_curve.py
DetCurveDisplay : sklearn/metrics/_plot/det_curve.py
plot_det_curve() : sklearn/metrics/_plot/det_curve.py
ConfusionMatrixDisplay : sklearn/metrics/_plot/confusion_matrix.py
plot_confusion_matrix() : sklearn/metrics/_plot/confusion_matrix.py
```

Examples (in examples):
```
plot_decision_function() : examples/svm/plot_weighted_samples.py
_mean_frequency_by_risk_group() : examples/linear_model/plot_poisson_regression_non_normal_loss.py
```

Found "weights", "class_weights", etc. instead of "sample_weights", but with the use of _check_sample_weight or something related (the search for these is not complete):
```
has_fit_parameter() : sklearn/utils/validation.py
compute_sample_weight() : sklearn/utils/class_weight.py
mean_variance_axis() : sklearn/utils/sparsefuncs.py
incr_mean_variance_axis() : sklearn/utils/sparsefuncs.py
_ValidationScoreCallback : sklearn/linear_model/_stochastic_gradient.py
```
I think we could start with the estimators which have a `sample_weight` fit parameter. The code for listing them:
```python
from sklearn.utils.validation import has_fit_parameter
import sklearn

for estimator_name, estimator in sklearn.utils.all_estimators():
    if has_fit_parameter(estimator, "sample_weight"):
        print(estimator_name)
```

The list:

```
AdaBoostClassifier
AdaBoostRegressor
BaggingClassifier
BaggingRegressor
BayesianRidge
BernoulliNB
CalibratedClassifierCV
CategoricalNB
ComplementNB
DBSCAN
DecisionTreeClassifier
DecisionTreeRegressor
DummyClassifier
DummyRegressor
ElasticNet
ElasticNetCV
ExtraTreeClassifier
ExtraTreeRegressor
ExtraTreesClassifier
ExtraTreesRegressor
GammaRegressor
GaussianNB
GradientBoostingClassifier
GradientBoostingRegressor
HistGradientBoostingClassifier
HistGradientBoostingRegressor
HuberRegressor
IsolationForest
IsotonicRegression
KMeans
KernelDensity
KernelRidge
Lasso
LassoCV
LinearRegression
LinearSVC
LinearSVR
LogisticRegression
LogisticRegressionCV
MiniBatchKMeans
MultiOutputClassifier
MultiOutputRegressor
MultinomialNB
NuSVC
NuSVR
OneClassSVM
Perceptron
PoissonRegressor
QuantileRegressor
RANSACRegressor
RandomForestClassifier
RandomForestRegressor
RandomTreesEmbedding
Ridge
RidgeCV
RidgeClassifier
RidgeClassifierCV
SGDClassifier
SGDOneClassSVM
SGDRegressor
SVC
SVR
SplineTransformer
StackingClassifier
StackingRegressor
StandardScaler
TweedieRegressor
VotingClassifier
VotingRegressor
```
As discussed in #12464 (comment) and #15358 (comment), in some specific use cases negative `sample_weight` may be meaningful; however, in most cases it should never happen. So we may want to:

- add a `force_positive=None` parameter to `_check_sample_weights`;
- add an `assume_positive_sample_weights=True` config parameter to `sklearn.set_config` / `get_config`.

By default, `force_positive=None` would error on negative sample weights, but this check could be disabled globally with `sklearn.set_config(assume_positive_sample_weights=False)`. With `_check_sample_weights(..., force_positive=True)` the check would always be done irrespective of the config parameter.

If there are no objections to this, please tag this issue as "Help Wanted".