Drop duplicate features #114 #144
Merged: solegalli merged 12 commits into feature-engine:develop from Tejash-Shah:drop_duplicate_features on Oct 9, 2020
Commits
54d3db5 add DropDuplicateFeatures in init (Tejash-Shah)
9a5f9e5 add fixture for duplicate features (Tejash-Shah)
bd89fe6 add DropDuplicateFeatures functionality (Tejash-Shah)
91e56a2 add test for DropDuplicateFeatures (Tejash-Shah)
e98b970 add DropDuplicateFeatures in init (Tejash-Shah)
9afcf7d add fixture for duplicate features (Tejash-Shah)
62fbb1f add DropDuplicateFeatures functionality (Tejash-Shah)
7dcefc7 add test for DropDuplicateFeatures (Tejash-Shah)
779e483 create drop duplicate transformer (solegalli)
634c0a1 Merge branch 'drop_duplicate_features' into drop_duplicates (Tejash-Shah)
b2b9b72 Merge pull request #3 from solegalli/drop_duplicates (Tejash-Shah)
decec24 delete extra fixture (Tejash-Shah)
@@ -0,0 +1,132 @@

    from sklearn.base import TransformerMixin, BaseEstimator
    from sklearn.utils.validation import check_is_fitted
    from feature_engine.dataframe_checks import (
        _is_dataframe,
        _check_input_matches_training_df,
    )
    from feature_engine.variable_manipulation import _find_all_variables, _define_variables


    class DropDuplicateFeatures(BaseEstimator, TransformerMixin):
        """
        DropDuplicateFeatures finds and removes duplicated features in a dataframe.

        Duplicated features are identical features, regardless of the variable or column name. If they
        show the same values for every observation, then they are considered duplicated.

        The transformer will first identify and store the duplicated variables. Next, the transformer
        will drop these variables from a dataframe.

        Parameters
        ----------

        variables: list, default=None
            The list of variables to evaluate. If None, the transformer will evaluate all variables in
            the dataset.

        """

        def __init__(self, variables=None):
            self.variables = _define_variables(variables)

        def fit(self, X, y=None):

            """
            Find duplicated features.

            Parameters
            ----------

            X: pandas dataframe of shape = [n_samples, n_features]
                The input dataframe.

            y: None
                y is not needed for this transformer. You can pass y or None.


            Attributes
            ----------

            duplicated_features_: set
                The duplicated features.

            duplicated_feature_sets_: list
                Groups of duplicated features. Or in other words, features that are duplicated with
                each other. Each list represents a group of duplicated features.
            """

            # check input dataframe
            X = _is_dataframe(X)

            # find all variables or check those entered are in the dataframe
            self.variables = _find_all_variables(X, self.variables)

            # create tuples of duplicated feature groups
            self.duplicated_feature_sets_ = []

            # set to collect features that are duplicated
            self.duplicated_features_ = set()

            # create set of examined features
            _examined_features = set()

            for feature in self.variables:

                # append so we can remove when we create the combinations
                _examined_features.add(feature)

                if feature not in self.duplicated_features_:

                    _temp_set = set([feature])

                    # features that have not been examined, are not currently examined and were
                    # not found duplicates
                    _features_to_compare = [
                        f
                        for f in self.variables
                        if f not in _examined_features.union(self.duplicated_features_)
                    ]

                    # create combinations:
                    for f2 in _features_to_compare:

                        if X[feature].equals(X[f2]):
                            self.duplicated_features_.add(f2)
                            _temp_set.add(f2)

                    # if there are duplicated features
                    if len(_temp_set) > 1:
                        self.duplicated_feature_sets_.append(_temp_set)

            self.input_shape_ = X.shape

            return self

        def transform(self, X):
            """
            Drops the duplicated features from a dataframe.

            Parameters
            ----------
            X: pandas dataframe of shape = [n_samples, n_features].
                The input samples.

            Returns
            -------
            X_transformed: pandas dataframe of shape = [n_samples, n_features - (duplicated features)]
                The transformed dataframe with the remaining subset of variables.

            """
            # check if fit is performed prior to transform
            check_is_fitted(self)

            # check if input is a dataframe
            X = _is_dataframe(X)

            # check if number of columns in test dataset matches to train dataset
            _check_input_matches_training_df(X, self.input_shape_[1])

            # returned non-duplicate features
            X = X.drop(columns=self.duplicated_features_)

            return X
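The duplicate search in `fit` and the column drop in `transform` can be illustrated without pandas or feature_engine. The following is a minimal sketch, not the code from this pull request: `find_duplicate_groups` and `drop_duplicates` are hypothetical helpers, a plain dict of lists stands in for the dataframe, and list equality stands in for `Series.equals`, so handling of missing values and dtypes may differ from the real transformer.

```python
def find_duplicate_groups(columns):
    """Group column names whose values are identical.

    `columns` maps a column name to its list of values (an assumed
    stand-in for a pandas dataframe). Returns (duplicated, groups).
    """
    duplicated = set()  # plays the role of duplicated_features_
    groups = []         # plays the role of duplicated_feature_sets_
    examined = set()    # features already used as the reference column

    for feature in columns:
        examined.add(feature)
        if feature in duplicated:
            # already assigned to an earlier group, nothing to compare
            continue

        group = {feature}
        for other in columns:
            if other in examined or other in duplicated:
                continue
            # list equality is an assumption replacing Series.equals
            if columns[feature] == columns[other]:
                duplicated.add(other)
                group.add(other)

        # only keep groups that actually contain duplicates
        if len(group) > 1:
            groups.append(group)

    return duplicated, groups


def drop_duplicates(columns, duplicated):
    """Return a copy of `columns` without the duplicated names."""
    return {name: values for name, values in columns.items() if name not in duplicated}


data = {
    "age": [20, 21, 19],
    "age_copy": [20, 21, 19],
    "city": ["London", "Paris", "Rome"],
    "years": [20, 21, 19],
    "town": ["London", "Paris", "Rome"],
}
duplicated, groups = find_duplicate_groups(data)
remaining = drop_duplicates(data, duplicated)
```

In this sketch the first column of each group is kept and the later copies are marked for removal, so `remaining` holds only `age` and `city`. This matches the way the loop in `fit` adds only `f2`, never `feature`, to `duplicated_features_`.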