-
-
Notifications
You must be signed in to change notification settings - Fork 18.7k
REF/BUG/API: factorizing categorical data #19938
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
18de376
9ef5be2
5e52b6f
e19ae86
121b682
a6bc405
0bfbc47
b25f383
2688c4f
2395a99
ab4f01c
65150f4
ad8173b
1e006d1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
- Loading branch information
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -30,7 +30,8 @@ | |
is_scalar, | ||
is_dict_like) | ||
|
||
from pandas.core.algorithms import factorize, take_1d, unique1d | ||
from pandas.core.algorithms import ( | ||
factorize, take_1d, unique1d, _factorize_array) | ||
from pandas.core.accessor import PandasDelegate | ||
from pandas.core.base import (PandasObject, | ||
NoNewAttributesMixin, _shared_docs) | ||
|
@@ -2069,13 +2070,11 @@ def unique(self): | |
take_codes = sorted(take_codes) | ||
return cat.set_categories(cat.categories.take(take_codes)) | ||
|
||
def factorize(self, sort=False, na_sentinel=-1): | ||
def factorize(self, na_sentinel=-1): | ||
"""Encode the Categorical as an enumerated type. | ||
|
||
Parameters | ||
---------- | ||
sort : boolean, default False | ||
Sort by values | ||
na_sentinel: int, default -1 | ||
Value to mark "not found" | ||
|
||
|
@@ -2110,21 +2109,16 @@ def factorize(self, sort=False, na_sentinel=-1): | |
[a, b] | ||
Categories (2, object): [a, b] | ||
""" | ||
from pandas.core.algorithms import _factorize_array, take_1d | ||
|
||
codes = self.codes.astype('int64') | ||
codes[codes == -1] = iNaT | ||
# We set missing codes, normally -1, to iNaT so that the | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. put the astype after the comment, looks awkward otherwise why do you think you need to do this? the point of the na_sentinel is to select the missing values There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. na_sentinel controls the missing marker for the output. We're modifying the input, since the Int64HashTable sees that they're missing, instead of the value -1. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this should be fixed generally, e.g. we should have a way to pass in the missing value. can you fix There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not familiar with the hashtable code, but at a glance it looks like the null condition is included in the class definition template?
I'm not sure how to pass expressions down to cython as a parameter. Anyway, do we actually need this to be parameterized? Do we have other cases where we've needed to pass the null condition down? |
||
# Int64HashTable treats them as missing values. | ||
codes[codes == -1] = iNaT | ||
labels, uniques = _factorize_array(codes, check_nulls=True, | ||
na_sentinel=na_sentinel) | ||
uniques = self._constructor(self.categories.take(uniques), | ||
categories=self.categories, | ||
ordered=self.ordered) | ||
if sort: | ||
order = uniques.argsort() | ||
labels = take_1d(order, labels, fill_value=na_sentinel) | ||
uniques = uniques.take(order) | ||
return labels, uniques | ||
|
||
def equals(self, other): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interface we have to
hashtable.get_labels()
is very odd right now, IOW we have acheck_null
flag which then makes the caller know to substitute values to iNaT (for int64) and know which are the sentinels. This is breaking the abstrastion. Rather would either like to be able to pass in the actual sentinel (not the output sentinel, but that's another confusion). e.g . you would simply pass -1 here.I think its worth re-factoring this (maybe before this PR), though I suppose could be after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#20328
Yes, that'd be nicer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we actually want this to be public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
factorize in general? I don’t see why not. It’s present on series and index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#19938 (comment) was in reference to the API docs. We whitelist the methods on Categorical that are included in the API docs (just
__array__
andfrom_codes
for now).