WIP CountFeaturizer #7765

chenhe95 · 2016-10-26T22:30:04Z

Reference Issue

What does this implement/fix? Explain your changes.

It adds the CountFeaturizer transformation class, which can help improve accuracy by using how often a particular data row occurs as a feature.
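To make the idea concrete, here is a minimal sketch of counting row occurrences and appending the count as an extra feature. It is only an illustration using the standard library; the helper name `count_rows` is hypothetical and is not the code in this pull request.

```python
from collections import Counter


def count_rows(rows):
    """Append to each row the number of times that exact row occurs in `rows`."""
    # Rows are converted to tuples so they can be used as dictionary keys.
    keys = [tuple(row) for row in rows]
    counts = Counter(keys)
    return [list(key) + [counts[key]] for key in keys]


X = [[1, 0], [1, 0], [0, 1], [1, 0]]
augmented = count_rows(X)
print(augmented)  # [[1, 0, 3], [1, 0, 3], [0, 1, 1], [1, 0, 3]]
```

The row `[1, 0]` occurs three times, so each copy of it gets a count of 3 as its new last column.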

Any other comments?

This is currently a work in progress. Please let me know if there is something I should add, or anything I could do in a better or faster way!

There are currently no test cases or documentation, but I plan to add them later.

nelson-liu · 2016-10-26T22:32:52Z

sklearn/preprocessing/data.py

@@ -1956,3 +1956,66 @@ def transform(self, X):
        """
        return _transform_selected(X, self._transform,
                                   self.categorical_features, copy=True)
+
+
+class CountFeaturizer(object):


Should inherit from BaseEstimator and TransformerMixin, I think...

@nelson-liu Thank you for the feedback/guidance. I will add the things you mentioned in my next commit.
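As a rough sketch of what inheriting from `BaseEstimator` and `TransformerMixin` could look like: the class name `RowCountFeaturizer` and its behaviour are assumptions made for illustration, not the implementation in this pull request. The input check lives directly inside `fit`, and `fit_transform` comes for free from `TransformerMixin`.

```python
from collections import Counter

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class RowCountFeaturizer(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: appends the training-set count of each row."""

    def fit(self, X, y=None):
        X = np.asarray(X)
        # Input check done directly in fit rather than in a separate helper.
        if X.ndim != 2:
            raise ValueError("Expected a 2D array, got %dD" % X.ndim)
        # Trailing underscore marks an attribute learned during fit.
        self.counts_ = Counter(map(tuple, X.tolist()))
        return self

    def transform(self, X):
        X = np.asarray(X)
        # Rows never seen during fit get a count of 0.
        counts = np.array([self.counts_.get(tuple(row), 0) for row in X.tolist()])
        return np.column_stack([X, counts])


out = RowCountFeaturizer().fit_transform([[1, 0], [1, 0], [0, 1]])
```

Because the class follows the estimator interface, it can be dropped into a `Pipeline` like any other transformer.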

nelson-liu · 2016-10-26T22:34:13Z

sklearn/preprocessing/data.py

+        if data != None:
+            self.fit(data, inclusion=inclusion)
+
+    def get_data(self):


I don't think this function is necessary.

nelson-liu · 2016-10-26T22:34:47Z

sklearn/preprocessing/data.py

+    def get_data(self):
+        return self.data 
+
+    def valid_data_type(self, type_check):


Generally, these sorts of checks are implemented directly in the methods that call them (so fit in this case).

chenhe95 · 2016-10-26T23:20:12Z

Also, since we are dealing with floats (and thus precision problems), I am thinking of adding a parameter such as "rounding_factor=16", so that two values are counted as equal if they match after rounding to rounding_factor decimal places.

amueller · 2016-10-27T15:23:24Z

The input for this should be discrete features, and we can assume integer for now. They are represented as floats, but we can rely on them as being exactly equal, I think.
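A small sketch of the distinction being discussed, using only the standard library. Integer-valued floats are represented exactly, so they can be counted by plain equality, whereas computed non-integer floats may need rounding first. The value `rounding_factor = 10` below is an arbitrary choice for illustration, not the default proposed above.

```python
from collections import Counter

# Integer-valued floats are exact, so equal categories compare equal.
discrete = [1.0, 2.0, 1.0, float(3 - 2)]
exact_counts = Counter(discrete)

# Computed non-integer floats can differ by rounding error:
# 0.1 + 0.2 is not exactly equal to 0.3.
computed = [0.1 + 0.2, 0.3]
raw_counts = Counter(computed)

# Rounding first merges them (rounding_factor is a hypothetical parameter).
rounding_factor = 10
rounded_counts = Counter(round(v, rounding_factor) for v in computed)

print(exact_counts[1.0])    # 3
print(len(raw_counts))      # 2
print(len(rounded_counts))  # 1
```

If the input is restricted to discrete, integer-valued features, the first case applies and no rounding parameter is needed.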

chenhe95 · 2016-10-31T06:55:30Z

I have added documentation to the CountFeaturizer, along with the tweaks suggested by @nelson-liu, although I'm not sure whether the commits of all the other people are supposed to show up here. I think it may be a side effect of trying to update my fork without merging (stackoverflow), since merging adds a "merge" commit, which may be unwanted.

nelson-liu · 2016-10-31T07:01:23Z

Hmm, the problem probably stems from the fact that you are working on the master branch. Generally, you want to work on new features in a non-master branch, then update the feature branch (after updating the master branch of your fork using the info in the link you provided above) with git checkout feature_branch followed by git rebase master, and then fix any conflicts that arise.

chenhe95 · 2016-10-31T07:24:02Z

Yeah, you're right. I will make my future commits on feature branches. Apologies to those who were unintentionally pulled into this pull request.

amueller · 2016-10-31T19:58:25Z

I suggest you close this pull request, create a new branch locally to work on this feature, reset your master to our (upstream) master, and create a new pull request from your branch. The way it currently is, it looks like there are 42 files changed, which makes the changes hard to review.

chenhe95 · 2016-11-01T00:56:24Z

@amueller Okay, that sounds good! The new pull request is #7803.

nelson-liu reviewed Oct 26, 2016

chenhe95 closed this Nov 1, 2016

chenhe95 force-pushed the master branch from a2cf504 to 99342b6 Compare November 1, 2016 00:49
