Describe the workflow you want to enable
Currently, there is no way to use SelectKBest with ordinal data.
Describe your proposed solution
I would want to add a keyword parameter to the current f_regression (or f_classif) that takes in the type of input data. For example, if our X is ordinal and y is continuous, we can run f_regression(X, y, input_type="ordinal"). The function will then calculate the Spearman's coefficient (as opposed to the current implementation of Pearson's coefficient in f_regression) and output the scores and pvalues.
The steps to add support for ordinal data are:
Write a wrapper for scipy.stats.spearmanr, or write our own function that calculates Spearman's coefficient
Integrate that wrapper into f_regression and add the keyword parameter input_type
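A rough sketch of those two steps, not a proposed final API: `spearman_scores` and `f_regression_typed` are hypothetical names, and the `input_type` dispatch lives in a standalone wrapper here rather than inside `f_regression` itself. It assumes the absolute Spearman coefficient is an acceptable score (only the ranking matters to SelectKBest) and that no column is constant.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression


def spearman_scores(X, y):
    """Score each column of X by |Spearman's rho| against y (hypothetical helper)."""
    X = np.asarray(X)
    y = np.asarray(y)
    n_features = X.shape[1]
    scores = np.empty(n_features)
    pvalues = np.empty(n_features)
    for j in range(n_features):
        # spearmanr returns (correlation, pvalue); a constant column yields nan
        rho, pvalue = stats.spearmanr(X[:, j], y)
        scores[j] = abs(rho)
        pvalues[j] = pvalue
    return scores, pvalues


def f_regression_typed(X, y, input_type="continuous"):
    """Dispatch on input_type (hypothetical stand-in for the proposed keyword)."""
    if input_type == "continuous":
        # Existing Pearson-based scoring from scikit-learn
        return f_regression(X, y)
    if input_type == "ordinal":
        return spearman_scores(X, y)
    raise ValueError(f"Unsupported input_type: {input_type!r}")
```

Calling `f_regression_typed(X, y, input_type="ordinal")` returns a `(scores, pvalues)` pair in the same shape as `f_regression`, so the two branches can be swapped without changing the caller.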
Now, I am not sure how to score one-hot encoded data yet, but hopefully by adding the keyword parameter, we can gradually expand the types of input data sklearn's scoring functions can support.
Describe alternatives you've considered, if relevant
Alternatively, we can also write a new function f_regression_ordinal to deal with ordinal X and continuous y.
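For comparison, a minimal sketch of what the separate-function alternative could look like when plugged into SelectKBest. `f_regression_ordinal` does not exist in scikit-learn; the body below is only an assumption about how it might behave, and the data is synthetic.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_selection import SelectKBest


def f_regression_ordinal(X, y):
    """Hypothetical score function: |Spearman's rho| and p-value per column."""
    X = np.asarray(X)
    scores, pvalues = [], []
    for j in range(X.shape[1]):
        rho, pvalue = spearmanr(X[:, j], y)
        scores.append(abs(rho))
        pvalues.append(pvalue)
    return np.array(scores), np.array(pvalues)


# Synthetic example: three ordinal features, y depends monotonically
# (but not linearly) on the second one.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 3))
y = np.exp(X[:, 1]) + rng.normal(scale=0.1, size=200)

# SelectKBest accepts any callable returning (scores, pvalues)
selector = SelectKBest(score_func=f_regression_ordinal, k=1).fit(X, y)
selected = selector.get_support(indices=True)
```

Since SelectKBest already takes an arbitrary `score_func`, this variant needs no change to the selector itself, only a new scoring function.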
Additional context
This feature request partially addresses #8480. There have also been discussions of the wrapper method, but no consensus has been reached: #6673, #8038.
If the input is homogeneous, why not just use a different function altogether? If the input variables have heterogeneous types, then this solution is insufficient.
Yeah, I wasn't sure which method would work better for homogeneous input. My concern with the separate-function approach is that there will be a lot of functions (as opposed to a small number of them with different keyword parameters). In any case, @yashika51 and I would love to work on this; let us know which method you prefer and we can make a draft PR for it!
For heterogeneous input, do you have any suggestions on how to tackle this? I am not aware of any functions that can score features regardless of types.
Actually kind could be "auto" as well to switch to the right implementation. The only method that I know that could handle heterogeneous data would be the phi_k correlation: https://arxiv.org/abs/1811.11440
However, it would not fulfill the inclusion criterion strictly.
This feature request was submitted per suggestions by @thomasjpfan and discussion with @yashika51 and @flosincapite