-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
WIP CountFeaturizer #7765
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP CountFeaturizer #7765
Conversation
@@ -1956,3 +1956,66 @@ def transform(self, X): | |||
""" | |||
return _transform_selected(X, self._transform, | |||
self.categorical_features, copy=True) | |||
|
|||
|
|||
class CountFeaturizer(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should inherit from BaseEstimator, TransformerMixin
, I think...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nelson-liu Thank you for the feedback/guidance. I will add the things you mentioned in my next commit.
if data != None: | ||
self.fit(data, inclusion=inclusion) | ||
|
||
def get_data(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think this function is necessary
def get_data(self): | ||
return self.data | ||
|
||
def valid_data_type(self, type_check): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally these sort of checks are directly implemented in the methods that call them (so fit
in this case)
Also, since we are dealing with floats (and thus precision problems), I am also thinking of adding a parameter called "rounding_factor=16" or something along those lines and includes in the count if the rounded value up to rounding_factor decimal places is equal |
The input for this should be discrete features, and we can assume integer for now. They are represented as floats, but we can rely on them as being exactly equal, I think. |
I have added documentation to the CountFeaturizer (along with the tweaks suggested by (@nelson-liu), although I'm not quite sure if it's intended that the commits of all the other people are also showing here. I think it may have been a side effect of trying to update my fork without merging (stackoverflow), since merging will add in a "merge" commit which may be unwanted. |
hmm, the problem probably stems from the fact that you are working on the master branch. generally, you want to work on a new features in a non-master branch, then update the feature branch (after updating the master branch of your fork with the info in the link you provided above) with |
Yeah, you're right. I will be making my future commits on the branches. Apologies to those involved in this pull request unintentionally. |
I suggest you close this pull request, create a new branch locally to work on this feature, reset your master to our (upstream) master, and create a new pull request from your branch. The way it currently is, it looks like there are 42 files changed, which makes the changes hard to review. |
Reference Issue
#5853
What does this implement/fix? Explain your changes.
It adds the CountFeaturizer transformation class, which can help with getting better accuracy because it will use how often a particular data row occurs as a feature
Any other comments?
Currently work in progress, please let me know if there is something that I should add or if there is anything I can do in a better or faster way!
Currently there are no test cases and no documentation either, but I am planning on adding it later.