Add keyword parameter to scoring functions to support different types of data · Issue #18023 · scikit-learn/scikit-learn

Add keyword parameter to scoring functions to support different types of data #18023

Open
2 tasks
anhqngo opened this issue Jul 28, 2020 · 5 comments

Comments

@anhqngo
anhqngo commented Jul 28, 2020

Describe the workflow you want to enable

Currently, there is no way to use SelectKBest with ordinal data.

Describe your proposed solution

I would like to add a keyword parameter to the current f_regression (or f_classif) that specifies the type of the input data. For example, if our X is ordinal and y is continuous, we can run f_regression(X, y, input_type="ordinal"). The function will then calculate Spearman's rank correlation coefficient (as opposed to the Pearson correlation coefficient that f_regression currently uses) and output the scores and p-values.

The steps to add support for ordinal data are:

  • Write a wrapper for scipy.stats.spearmanr, or write our own function that calculates Spearman's coefficient
  • Integrate that wrapper into f_regression and add the keyword parameter input_type
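
As a rough illustration of the first step, here is a minimal sketch of a Spearman-based score function that SelectKBest can already accept through its score_func argument. The name spearman_score and the toy data are made up for this example, and using the absolute value of the coefficient as the score is an assumption, not something decided in this issue.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_selection import SelectKBest


def spearman_score(X, y):
    """Hypothetical score function: |Spearman's rho| and p-value per feature."""
    X = np.asarray(X)
    y = np.asarray(y)
    scores = np.empty(X.shape[1])
    pvalues = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        rho, p = spearmanr(X[:, j], y)
        # Assumption: strong negative association is as useful as positive.
        scores[j] = abs(rho)
        pvalues[j] = p
    return scores, pvalues


# Toy data: one ordinal feature that drives y monotonically, one pure noise.
rng = np.random.RandomState(0)
n = 200
x_informative = rng.randint(0, 5, size=n)
x_noise = rng.randint(0, 5, size=n)
X = np.column_stack([x_informative, x_noise])
y = np.exp(x_informative) + rng.normal(scale=0.1, size=n)

selector = SelectKBest(score_func=spearman_score, k=1).fit(X, y)
selected = selector.get_support()  # boolean mask over the two columns
```

SelectKBest only needs a callable that returns scores (optionally with p-values), so this works today without touching f_regression; the open question in this issue is whether it should live behind a keyword parameter instead.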

Now, I am not sure how to score one-hot encoded data yet, but hopefully by adding the keyword parameter, we can gradually expand the types of input data sklearn's scoring functions can support.

Describe alternatives you've considered, if relevant

Alternatively, we can also write a new function f_regression_ordinal to deal with ordinal X and continuous y.

Additional context

This feature request partially addresses #8480. There have also been discussions of the wrapper method, but no consensus has been reached: #6673, #8038.

This feature request was submitted per suggestions by @thomasjpfan and discussion with @yashika51 and @flosincapite

@jnothman
Member
jnothman commented Jul 28, 2020 via email

@anhqngo
Author
anhqngo commented Aug 3, 2020

Yeah, I wasn't sure which method would work better for homogeneous input. My concern with the separate-function approach is that we would end up with a lot of functions (as opposed to a small number of them with different keyword parameters). In any case, @yashika51 and I would love to work on this; let us know which method you prefer and we can make a draft PR for it!

For heterogeneous input, do you have any suggestions on how to tackle this? I am not aware of any functions that can score features regardless of types.

@thomasjpfan
Member

We could have f_regression(X, y, kind="spearmanr"), which would add another method of computing the correlation in f_regression. (But this does not resolve #8480.)
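
A minimal sketch of what such a kind switch might look like, written as a standalone wrapper rather than a change to scikit-learn itself. The function name f_regression_kind is hypothetical. It relies on the fact that Spearman's coefficient is the Pearson correlation of the ranks, so the existing f_regression can be reused on rank-transformed data.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr
from sklearn.feature_selection import f_regression


def f_regression_kind(X, y, kind="pearsonr"):
    """Hypothetical wrapper: F-test based on Pearson or Spearman correlation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if kind == "pearsonr":
        return f_regression(X, y)
    if kind == "spearmanr":
        # Rank each column (ties get average ranks), then reuse f_regression.
        return f_regression(rankdata(X, axis=0), rankdata(y))
    raise ValueError(f"Unknown kind: {kind!r}")


rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(100, 3))
y = X[:, 0] + rng.normal(scale=2.0, size=100)

f_stat, p_values = f_regression_kind(X, y, kind="spearmanr")

# Reference values straight from scipy, for comparison with the F statistics.
rho = np.array([spearmanr(X[:, j], y)[0] for j in range(X.shape[1])])
```

With n samples, the F statistic returned here relates to Spearman's rho by F = rho² / (1 - rho²) × (n - 2), so the feature ranking matches the one given by |rho|.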

@glemaitre
Member

Actually kind could be "auto" as well to switch to the right implementation. The only method I know of that could handle heterogeneous data would be the phi_k correlation: https://arxiv.org/abs/1811.11440
However, it would not strictly fulfill the inclusion criteria.

@thomasjpfan
Member

Actually kind could be "auto" as well to switch to the right implementation.

Are we expecting the input to be pandas DataFrames?
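
To make the question concrete: if the input were a pandas DataFrame, an "auto" mode could in principle pick the correlation per column from its dtype. The sketch below is only one possible reading; the helper infer_kind and the mapping from dtypes to kinds are assumptions made for illustration, not an agreed design.

```python
import pandas as pd


def infer_kind(column):
    """Hypothetical helper: choose a correlation kind from a pandas dtype."""
    dtype = column.dtype
    if isinstance(dtype, pd.CategoricalDtype):
        if dtype.ordered:
            # Ordered categories carry rank information only.
            return "spearmanr"
        raise ValueError("unordered categorical columns are not handled here")
    if pd.api.types.is_numeric_dtype(dtype):
        return "pearsonr"
    raise ValueError(f"cannot infer a correlation kind for dtype {dtype}")


df = pd.DataFrame(
    {
        "size": pd.Categorical(
            ["S", "M", "L", "M"], categories=["S", "M", "L"], ordered=True
        ),
        "weight": [1.0, 2.5, 3.1, 2.2],
    }
)

kinds = {name: infer_kind(df[name]) for name in df.columns}
```

With a plain NumPy array there is no such dtype information, so ordinal integer codes and continuous values would be indistinguishable and "auto" would have nothing to switch on.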
