8000 WIP CountFeaturizer by chenhe95 · Pull Request #7765 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

WIP CountFeaturizer #7765

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 0 commits into from
Closed

WIP CountFeaturizer #7765

wants to merge 0 commits into from

Conversation

chenhe95
Copy link
Contributor

Reference Issue

#5853

What does this implement/fix? Explain your changes.

It adds the CountFeaturizer transformation class, which can help with getting better accuracy because it will use how often a particular data row occurs as a feature

Any other comments?

Currently work in progress, please let me know if there is something that I should add or if there is anything I can do in a better or faster way!

Currently there are no test cases and no documentation either, but I am planning on adding it later.

@@ -1956,3 +1956,66 @@ def transform(self, X):
"""
return _transform_selected(X, self._transform,
self.categorical_features, copy=True)


class CountFeaturizer(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should inherit from BaseEstimator, TransformerMixin, I think...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nelson-liu Thank you for the feedback/guidance. I will add the things you mentioned in my next commit.

if data != None:
self.fit(data, inclusion=inclusion)

def get_data(self):
Copy link
Cont 8000 ributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think this function is necessary

def get_data(self):
return self.data

def valid_data_type(self, type_check):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally these sort of checks are directly implemented in the methods that call them (so fit in this case)

@chenhe95
Copy link
Contributor Author
chenhe95 commented Oct 26, 2016

Also, since we are dealing with floats (and thus precision problems), I am also thinking of adding a parameter called "rounding_factor=16" or something along those lines and includes in the count if the rounded value up to rounding_factor decimal places is equal

@amueller
Copy link
Member

The input for this should be discrete features, and we can assume integer for now. They are represented as floats, but we can rely on them as being exactly equal, I think.

@chenhe95
Copy link
Contributor Author

I have added documentation to the CountFeaturizer (along with the tweaks suggested by (@nelson-liu), although I'm not quite sure if it's intended that the commits of all the other people are also showing here. I think it may have been a side effect of trying to update my fork without merging (stackoverflow), since merging will add in a "merge" commit which may be unwanted.

@nelson-liu
Copy link
Contributor

hmm, the problem probably stems from the fact that you are working on the master branch. generally, you want to work on a new features in a non-master branch, then update the feature branch (after updating the master branch of your fork with the info in the link you provided above) with git checkout feature_branch followed by git rebase master, then fixing any merge conflicts that arise.

@chenhe95
Copy link
Contributor Author

Yeah, you're right. I will be making my future commits on the branches. Apologies to those involved in this pull request unintentionally.

@amueller
Copy link
Member

I suggest you close this pull request, create a new branch locally to work on this feature, reset your master to our (upstream) master, and create a new pull request from your branch. The way it currently is, it looks like there are 42 files changed, which makes the changes hard to review.

@chenhe95
Copy link
Contributor Author
chenhe95 commented Nov 1, 2016

@amueller Okay, that sounds good!
#7803
Is the new pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0