8000 REF/BUG/API: factorizing categorical data by TomAugspurger · Pull Request #19938 · pandas-dev/pandas · GitHub
[go: up one dir, main page]

Skip to content

REF/BUG/API: factorizing categorical data #19938

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Mar 15, 2018
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Clean : imports / remove sort
  • Loading branch information
TomAugspurger committed Mar 6, 2018
commit 5e52b6f60101bef3eb92bd1197d103fc71b4ba21
14 changes: 4 additions & 10 deletions pandas/core/arrays/categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,8 @@
is_scalar,
is_dict_like)

from pandas.core.algorithms import factorize, take_1d, unique1d
from pandas.core.algorithms import (
factorize, take_1d, unique1d, _factorize_array)
from pandas.core.accessor import PandasDelegate
from pandas.core.base import (PandasObject,
NoNewAttributesMixin, _shared_docs)
Expand Down Expand Up @@ -2069,13 +2070,11 @@ def unique(self):
take_codes = sorted(take_codes)
return cat.set_categories(cat.categories.take(take_codes))

def factorize(self, sort=False, na_sentinel=-1):
def factorize(self, na_sentinel=-1):
"""Encode the Categorical as an enumerated type.

Parameters
----------
sort : boolean, default False
Sort by values
na_sentinel: int, default -1
Value to mark "not found"

Expand Down Expand Up @@ -2110,21 +2109,16 @@ def factorize(self, sort=False, na_sentinel=-1):
[a, b]
Categories (2, object): [a, b]
"""
from pandas.core.algorithms import _factorize_array, take_1d

codes = self.codes.astype('int64')
codes[codes == -1] = iNaT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface we have to hashtable.get_labels() is very odd right now, IOW we have a check_null flag which then makes the caller know to substitute values to iNaT (for int64) and know which are the sentinels. This is breaking the abstrastion. Rather would either like to be able to pass in the actual sentinel (not the output sentinel, but that's another confusion). e.g . you would simply pass -1 here.

I think its worth re-factoring this (maybe before this PR), though I suppose could be after.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#20328

Yes, that'd be nicer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we actually want this to be public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

factorize in general? I don’t see why not. It’s present on series and index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#19938 (comment) was in reference to the API docs. We whitelist the methods on Categorical that are included in the API docs (just __array__ and from_codes for now).

# We set missing codes, normally -1, to iNaT so that the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the astype after the comment, looks awkward otherwise

why do you think you need to do this? the point of the na_sentinel is to select the missing values

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

na_sentinel controls the missing marker for the output. We're modifying the input, since the Int64HashTable sees that they're missing, instead of the value -1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be fixed generally, e.g. we should have a way to pass in the missing value. can you fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the hashtable code, but at a glance it looks like the null condition is included in the class definition template?

{{py:

# name, dtype, null_condition, float_group
dtypes = [('Float64', 'float64', 'val != val', True),
          ('UInt64', 'uint64', 'False', False),
          ('Int64', 'int64', 'val == iNaT', False)]

I'm not sure how to pass expressions down to cython as a parameter.

Anyway, do we actually need this to be parameterized? Do we have other cases where we've needed to pass the null condition down?

# Int64HashTable treats them as missing values.
codes[codes == -1] = iNaT
labels, uniques = _factorize_array(codes, check_nulls=True,
na_sentinel=na_sentinel)
uniques = self._constructor(self.categories.take(uniques),
categories=self.categories,
ordered=self.ordered)
if sort:
order = uniques.argsort()
labels = take_1d(order, labels, fill_value=na_sentinel)
uniques = uniques.take(order)
return labels, uniques

def equals(self, other):
Expand Down
0