8000 MNT consolidate target type tags by adrinjalali · Pull Request #28927 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

MNT consolidate target type tags #28927

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

adrinjalali
Copy link
Member

This consolidates the following tags:

binary_only (default=False)
    whether estimator supports binary classification but lacks multi-class
    classification support.

multilabel (default=False)
    whether the estimator supports multilabel output

multioutput (default=False)
    whether a regressor supports multi-target outputs or a classifier supports
    multi-class multi-output.

multioutput_only (default=False)
    whether estimator supports only multi-output classification or regression.

into

target_type (default=["multiclass", "single-output"])
    the supported target types of the estimator, which can be a subset of:
        - 'binary': binary classification. All estimators supporting
          'multi-class' also support 'binary'; classifiers only;
        - 'multi-class': multiclass classification
        - 'multi-label': multilabel output
        - 'multi-output': multi-output regression or classification, i.e. `y`
          has shape `(n_samples, n_outputs)`
        - 'single-output': single-output regression or classification, i.e. `y`
          has shape `(n_samples,)`

Towards #28910

cc @glemaitre @thomasjpfan

Copy link
github-actions bot commented May 1, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 99790df. Link to the linter CI: here

Comment on lines 547 to 550
- 'multi-output': multi-output regression or classification, i.e. `y`
has shape `(n_samples, n_outputs)`
- 'single-output': single-output regression or classification, i.e. `y`
has shape `(n_samples,)`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should have an instance to explain that you can multi-output and not support a single output.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note:

Note that an estimator can support multi-output but not single-output, or vice
versa. The same goes for binary and multi-class support.

Comment on lines 543 to 545
- 'binary': binary classification. All estimators supporting
'multi-class' also support 'binary'; classifiers only;
- 'multi-class': multiclass classification
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Together with the other comment, we might benefit of an example to show what we mean.

Comment on lines 547 to 550
- 'multi-output': multi-output regression or classification, i.e. `y`
has shape `(n_samples, n_outputs)`
- 'single-output': single-output regression or classification, i.e. `y`
has shape `(n_samples,)`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should have an instance to explain that you can multi-output and not support a single output.

Copy link
Member
@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the logic behind it, but it's unfortunate we can use the names in type_of_target.


multioutput_only (default=False)
whether estimator supports only multi-output classification or regression.
target_type (default=["multiclass", "single-output"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the estimator is a regressor, then having a "multiclass" tag feels strange to me.

@adrinjalali
Copy link
Member Author

An alternative here is to have a set of binary flags, instead of a set of flags, as in, introduce 5 flags which are all binary. It makes updating the flags in inheritance slightly easier. Right now once we implement the inheritance for __sklearn_flags__, the user should do something like

    flags["target_type"] = set(flags["target_type"]) | {"multi-output"}

I'm not sure which one would be better.

@adrinjalali
Copy link
Member Author

Another alternative is to have two flags: classification-type and output-shape/type maybe?

@thomasjpfan
Copy link
Member

An alternative here is to have a set of binary flags, instead of a set of flags, as in, introduce 5 flags which are all binary.

I like the binary flags approach.

@adrinjalali
Copy link
Member Author

If we go back to binary flags, then the only change I see necessary is to:

multioutput_only (default=False)
    whether estimator supports only multi-output classification or regression.

to

single_output (default=True)
    whether estimator supports single output.

Note that an estimator can support `single_output` and/or `multi_output`.

@thomasjpfan
Copy link
Member

I'm okay with single_output. What do you think of having two binary tags for single-output classification? i.e. binary and multi-class.

  • binary=True, multi-class=False == binary_only
  • binary=False, multi-class=True == impossible
  • binary=False, multi-class=False == does not support single output classification
  • binary=True, multi-class=True == most classifiers

Although adding another tag is more complexity, I do not like how binary_only implies that multi-class is not support.

8000
@adrinjalali
Copy link
Member Author

binary=False, multi-class=False == does not support single output classification

I think single and multi output aspect is independent from binary/multi class aspect. You could be a multi output only classifier that only supports binary outputs for instance.

@glemaitre glemaitre self-requested a review May 18, 2024 10:34
@adrinjalali
Copy link
Member Author

So something which isn't making me comfortable with having a set of binary flags is that X_types flag is a list / set. I kinda feel like we should be consistent between the two. Do we want to make X_types also a set of binary flags? Or are we happy with that inconsistency?

A part of me wishes we had these as flags:

@dataclass
class InputFlags:
    two_d_array: bool
    sparse: bool
    categorical: bool
    string: bool
    positive_only: bool
    nan_allowed: bool
    pairwise: bool

@dataclass
class TargetFlags:
    required: bool
    positive_only: bool
    multi_output: bool
    single_output: bool

@dataclass
class TransformerFlags:
    preserve_dtype: list[str]

@dataclass
class ClassifierFlags:
    poor_score: bool
    binary: bool
    multiclass: bool
    multilabel: bool
    

@dataclass
class RegressorFlags:
    poor_score: bool

@dataclass
class Flags:
    array_api_support: bool
    no_validation: bool
    stateless: bool
    non_deterministic: bool
    poor_score: bool
    requires_fit: bool
    _skip_test: bool
    _xfail_checks: dict[str, str]
    input_flags: InputFlags
    target_flags: TargetFlags
    transformer_flags: TransformerFlags
    classifier_flags: ClassifierFlags
    regressor_flags: RegressorFlags

binary=False, multi-class=False == does not support single output classification

I'm actually not sure why we have this in type_of_target. A classifier can be binary only but support multi-output

WDYT @glemaitre @thomasjpfan

@thomasjpfan
Copy link
Member

I'm okay with having flags for everything. It makes sense for TransformerFlags .preserve_dtype to be the only list. Currently, preverse_dtype is a list of NumPy dtypes, I'm okay 8000 with using a list of strings.

For this PR, are you considering using dataclasses to represent the tags or are you keeping them as dicts?

@adrinjalali
Copy link
Member Author

If we're okay with data classes, I can open a separate PR implementing them. Users can then subclass these if they want to add more tags.

@adrinjalali
Copy link
Member Author

Solved with dataclasses.

@adrinjalali adrinjalali closed this Sep 5, 2024
@adrinjalali adrinjalali deleted the target_type branch September 5, 2024 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0