Initial draft: from_dummies #41902
@@ -1101,7 +1101,7 @@ def from_dummies(
     data: DataFrame,
     subset: None | Index | list[Hashable] = None,
     sep: None | str | dict[str, str] = None,
-    dropped_first: None | str | dict[str, str] = None,
+    dropped_first: None | Hashable | dict[str, Hashable] = None,
 ) -> DataFrame:
     """
     Create a categorical `DataFrame` from a `DataFrame` of dummy variables.
@@ -1123,7 +1123,7 @@ def from_dummies(
         you can strip the underscore by specifying sep='_'.
         Alternatively, pass a dictionary to map prefix separators to prefixes if
         multiple and/or mixed separators are used in the column names.
-    dropped_fist : None, str or dict of str, default None
+    dropped_fist : None, Hashable or dict of Hashables, default None
         The implied value the dummy takes when all values are zero.
         Can be a a single value for all variables or a dict directly mapping the
         dropped value to a prefix of a variable.
@@ -1219,7 +1219,7 @@ def from_dummies(
                     f"First instance column: {col}"
                 )
     elif isinstance(sep, str):
-        variables_slice: dict[str, list] = {}
+        variables_slice = {}
Review comment: could remove
Reply: Awesome advice, thank you very much :)
         for col in data_to_decode.columns:
             prefix = col.split(sep)[0]
             if len(prefix) == len(col):
@@ -1250,24 +1250,24 @@ def check_len(item, name) -> None:
     if dropped_first:
         if isinstance(dropped_first, dict):
             check_len(dropped_first, "dropped_first")
-        elif isinstance(dropped_first, str):
+        elif isinstance(dropped_first, Hashable):
             dropped_first = dict(
                 zip(variables_slice, [dropped_first] * len(variables_slice))
             )
         else:
             raise TypeError(
-                f"Expected 'dropped_first' to be of type 'str' or 'dict'; "
+                f"Expected 'dropped_first' to be of type 'Hashable' or 'dict'; "
                 f"Received 'dropped_first' of type: {type(dropped_first).__name__}"
             )

     cat_data = {}
     for prefix, prefix_slice in variables_slice.items():
         if sep is None:
             cats = subset.copy()
-        elif isinstance(sep, str):
-            cats = [col[len(prefix + sep) :] for col in prefix_slice]
         elif isinstance(sep, dict):
             cats = [col[len(prefix + sep[prefix]) :] for col in prefix_slice]
+        else:
+            cats = [col[len(prefix + sep) :] for col in prefix_slice]
         assigned = data_to_decode[prefix_slice].sum(axis=1)
         if any(assigned > 1):
             raise ValueError(
Review comment: Couldn't you check this much earlier with a row sum after the conversion to boolean, e.g., `if (data_to_decode.sum(1) > 1).any()`?
Reply: Hmm, that only works if there are no prefixes/multiple variables as each prefix slice has to be checked individually and
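The point made in the reply can be illustrated with a small sketch. It is not the PR's implementation: the frame `dummies`, the column names, and the helper dict `prefixes` are made up for illustration, and `"_"` is assumed as the separator. With two encoded variables, every valid row already sums to 2 across the whole frame, so a global row sum cannot tell valid rows from double assignments, while a per-prefix sum can.

```python
import pandas as pd

# Hypothetical dummy-encoded frame with two variables, "col1" and "col2".
dummies = pd.DataFrame(
    {
        "col1_a": [1, 0, 1],
        "col1_b": [0, 1, 1],  # the last row sets both col1_a and col1_b
        "col2_x": [1, 0, 0],
        "col2_y": [0, 1, 1],
    }
)

# Global check: sums across all variables, so valid rows are flagged as well.
global_flags = dummies.sum(axis=1) > 1

# Per-prefix check: group the columns by prefix, then sum within each group.
sep = "_"  # assumed separator
prefixes = {}
for col in dummies.columns:
    prefixes.setdefault(col.split(sep)[0], []).append(col)

per_prefix_flags = {
    prefix: dummies[cols].sum(axis=1) > 1 for prefix, cols in prefixes.items()
}

print(global_flags.tolist())
print({prefix: flags.tolist() for prefix, flags in per_prefix_flags.items()})
```

Only the per-prefix result isolates the third row of `col1` as the actual double assignment.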
Review comment: Maybe it's me, but this is not very descriptive (also misspelled), and I am not sure the name is obvious.
Reply: I replaced it with `implied_category`, which makes way more sense in the context of dummy data. Thanks for the hint, I was once again too focused on inverting the `get_dummies` function and its arguments.
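
To make the naming discussion concrete, here is a minimal sketch of what an implied category means when decoding dummy data. It does not call the function under review; it decodes by hand, and the variable `implied_category` is borrowed from the reply purely as an illustrative name. The sketch assumes the caller knows which category `get_dummies(..., drop_first=True)` dropped.

```python
import pandas as pd

colors = pd.Series(["blue", "green", "red", "blue"])

# drop_first=True drops the first category in sorted order ("blue"),
# so a "blue" row is encoded as all zeros.
dummies = pd.get_dummies(colors, drop_first=True).astype(int)

# Assumption: the caller supplies the category implied by an all-zero row.
implied_category = "blue"

# idxmax picks the active column per row; all-zero rows are replaced
# by the implied category instead.
has_assignment = dummies.sum(axis=1) > 0
decoded = dummies.idxmax(axis=1).where(has_assignment, implied_category)

print(decoded.tolist())
```

Without the implied category, the all-zero rows could not be recovered, which is why the argument describes a category rather than the `drop_first` flag it inverts.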