Drop duplicate features #114 #144

Tejash-Shah · 2020-10-03T23:22:28Z

WRT #114, adding DropDuplicateFeatures() functionality along with tests. Let me know any typos/suggestions. Thanks:)

Already identified typo-> returned non-duplicate features in transform

Tejash-Shah · 2020-10-03T23:31:02Z

Also, if you notice py36/py37/py38 tests on circleci there is a runtime warning generated by importlib. I think it's harmless as mentioned here. Let me know if you know any workaround or we can leave as it is? or do you want suppress but it would not be a good idea.

solegalli

Hi @Tejash-Shah

Thanks for the PR. There are 2 main things I would like to think more about:

collecting duplicated features in a dictionary does not seem to be optimal. Most keys will be empty. I think an array or list of lists would be better, but only outputting duplicates, i.e., we would not output a list of empty lists.
the use of itertools need some thinking. Using combinations create all possible combinations, some of them were already examined and features added to the list, so not sure it is the fastest.

It would be great if you could play with the implementation I suggest, the on2 you wrote, the original code in the issue, and see which one is faster. Could you do that please?

Thanks a lot!
Sole

solegalli · 2020-10-05T08:33:03Z

feature_engine/selection/drop_duplicate_features.py

+        # check input dataframe
+        X = _is_dataframe(X)
+
+        duplicated_features_list = []


I think it would be better if this was a set instead of a list. It would run faster and handle the addition of features that are already there automatically

feature_engine/selection/drop_duplicate_features.py

solegalli · 2020-10-05T08:39:04Z

feature_engine/selection/drop_duplicate_features.py

+
+        for feat_1, feat_2 in col_pair:
+            if feat_1 not in duplicated_features_list and X[feat_1].equals(X[feat_2]):
+                self.duplicate_feature_dict_[feat_1].append(feat_2)


it should check if both feat_1 and feat_2 are in the duplicated list, otherwise skip, to speed things up?

itertools.combinations used this way doesn't optimize for the fact that we should be skipping columns that were found duplicates already, or fit well with the array or list of lists.

After some research / thinking this is a template of what the code could look like:

# list of lists or array of arrays: needs some thinking / proper design duplicated_feat_pairs = [] # create an empty set to collect features # that were found to be duplicated _duplicated_feat = () # create set of examined features _examined_features = () # pick first feature to compare for feature in X.columns: # append so we can remove when we create the combinations _examined_features.append(feature) if feature not in _duplicated_feat: # remove features being examined, already examined or duplicated features already identified remaining_cols = X.columns.drop(list(_duplicated_feat)+list(_examined_features)) #not too sure we need to convert the sets to lists to be dropped, can you check? # create combinations of that feature with the others: for f1, f2 in itertools.product([feature], remaining_cols): if data[f1].equals(data[f2]): add it to the array or list of lists.

A few things to notice:

How to create the array or list of lists needs some thinking. The result could be an empty array if no duplicated features are present in the dataframe. Or alternatively it could contain 1 or more lists / rows with different lenghts. Could I leave the design of this to you?

Another thing I would like to know is whether using itertools as we do here, would be faster than the code suggested in the issue. Could you also check?

feature_engine/selection/drop_duplicate_features.py

Drop duplicates

* add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * create drop duplicate transformer * delete extra fixture Co-authored-by: Soledad Galli <solegalli@protonmail.com>

* style formatting base scripts * reformat style creation modules * reformat style discretisers * reformat style imputers * rewords strings, minor changes * reformat codestyle outliers * reformat code style selection * reformat style transformers * reformat style wrappers * separate woe and ratio encoder closes issue #143 (#149) * Issue 143 * doc issue 143 documentation issue 143 * Update PRatioEncoder.rst 'C' Removed from RareLabelCEncoder whch was causing and error. Also enconder_dict_ ratio results added * Changes in docstrings #143 Changes requested by Sole in docstrings, after first pull request related to this issue. * More docstring changes related to #143 minor updates in docstrings * separate woe and ratio tests into functions #147 separate woe and ratio tests into functions #147 * move PRatioEncoder under WoEencoder in list as these are related transformers * add ' to log_ratio in docstring * fix docstring with intro to transformer funcion * add more detail about the encoding in docstrings * renamed test functions * reword tests Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reorganised folders with jupyter notebooks (#155) * Drop duplicate features #114 (#144) * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * create drop duplicate transformer * delete extra fixture Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reformat code style encoders * shorten dosctrings with flake8 Co-authored-by: NicoGalli <72278140+NicoGalli@users.noreply.github.com> Co-authored-by: Tejash Shah <stejash15@gmail.com>

* add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * create drop duplicate transformer * delete extra fixture Co-authored-by: Soledad Galli <solegalli@protonmail.com>

* style formatting base scripts * reformat style creation modules * reformat style discretisers * reformat style imputers * rewords strings, minor changes * reformat codestyle outliers * reformat code style selection * reformat style transformers * reformat style wrappers * separate woe and ratio encoder closes issue #143 (#149) * Issue 143 * doc issue 143 documentation issue 143 * Update PRatioEncoder.rst 'C' Removed from RareLabelCEncoder whch was causing and error. Also enconder_dict_ ratio results added * Changes in docstrings #143 Changes requested by Sole in docstrings, after first pull request related to this issue. * More docstring changes related to #143 minor updates in docstrings * separate woe and ratio tests into functions #147 separate woe and ratio tests into functions #147 * move PRatioEncoder under WoEencoder in list as these are related transformers * add ' to log_ratio in docstring * fix docstring with intro to transformer funcion * add more detail about the encoding in docstrings * renamed test functions * reword tests Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reorganised folders with jupyter notebooks (#155) * Drop duplicate features #114 (#144) * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * create drop duplicate transformer * delete extra fixture Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reformat code style encoders * shorten dosctrings with flake8 Co-authored-by: NicoGalli <72278140+NicoGalli@users.noreply.github.com> Co-authored-by: Tejash Shah <stejash15@gmail.com>

* sort modules in subfolders, separate classes in individual modules, rename encoding classes, rename few other classes * fix WoE name in doc * sort selection in subfolder, separate class into modules * fix selection imports * update test_selection * update VERSION to 1.0.0 * match doc name with submodule name, fix imports and class names (#129) * reorganise test submodules as per package submodules, create individual test files for each class (#130) * rename class init params and defo values (#131) * reorganise base transformers and functions, clean imports (#133) * fix style check (#141) * improve code style throughout (#142) * remove iterable from init MathematicalCombination * replace listcomp by genexp in arbitrary discretiser * create abstraction decision tree discretiser * create abstraction in base cat encoder * replace listcomp by genexp * create abstraction base numerical imputer * create baseOutlier * final cleanup of code * replace listcomp by genexp in randomsampleimputer * split tests into individual functions (#147) * split imputation tests * fix typo, unify test names in math combination test * reformat line space in sklearwrapper test * remove comment in drop constant tests * split discretisation tests * split encoding tests * split test outliers * split transformer tests * separate woe and ratio encoder closes issue #143 (#149) * Issue 143 * doc issue 143 documentation issue 143 * Update PRatioEncoder.rst 'C' Removed from RareLabelCEncoder whch was causing and error. Also enconder_dict_ ratio results added * Changes in docstrings #143 Changes requested by Sole in docstrings, after first pull request related to this issue. * More docstring changes related to #143 minor updates in docstrings * separate woe and ratio tests into functions #147 separate woe and ratio tests into functions #147 * move PRatioEncoder under WoEencoder in list as these are related transformers * add ' to log_ratio in docstring * fix docstring with intro to transformer funcion * add more detail about the encoding in docstrings * renamed test functions * reword tests Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reorganised folders with jupyter notebooks (#155) * Drop duplicate features #114 (#144) * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * create drop duplicate transformer * delete extra fixture Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reformat code style with black (#153) * style formatting base scripts * reformat style creation modules * reformat style discretisers * reformat style imputers * rewords strings, minor changes * reformat codestyle outliers * reformat code style selection * reformat style transformers * reformat style wrappers * separate woe and ratio encoder closes issue #143 (#149) * Issue 143 * doc issue 143 documentation issue 143 * Update PRatioEncoder.rst 'C' Removed from RareLabelCEncoder whch was causing and error. Also enconder_dict_ ratio results added * Changes in docstrings #143 Changes requested by Sole in docstrings, after first pull request related to this issue. * More docstring changes related to #143 minor updates in docstrings * separate woe and ratio tests into functions #147 separate woe and ratio tests into functions #147 * move PRatioEncoder under WoEencoder in list as these are related transformers * add ' to log_ratio in docstring * fix docstring with intro to transformer funcion * add more detail about the encoding in docstrings * renamed test functions * reword tests Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reorganised folders with jupyter notebooks (#155) * Drop duplicate features #114 (#144) * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * create drop duplicate transformer * delete extra fixture Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reformat code style encoders * shorten dosctrings with flake8 Co-authored-by: NicoGalli <72278140+NicoGalli@users.noreply.github.com> Co-authored-by: Tejash Shah <stejash15@gmail.com> * reformat test code style (#157) * expand style check in tox, add black to test_req (#154) * update docs, shorten lines, test code (#158) * update docs, shorten lines, test code * include v1.0.0 changes in changelog * fix linebreaks in changelog * Add type hints, docstrings, and expand test. Introduce bug fix in _define_variables (#159) * Add .vscode in gitignore * add type hints and docstrings * Add type hints, docstrings, and introduce bug fix * add type hints and docstrings * add type hints, docstrings, and stylistic modifications * add type hints, docstrings, and stylistic modifications * some stylistic modifications * add test to check null values in dataframe * remove redundant docstring * stylistic modification * add new test and expanded existing one * fix flake8 suggestions * fix some indentation errors * fix indention error * add type hints and docstrings in boxcox transformer * add type hints and docstrings in log transformation * add type hints, docstrings, and recaftor constructor in power transformation * add type hints and docstrings in reciprocal transformer * add type hints * add type hints and docstrings * add test case to test the type of math operation argument * remove extra blank line * add test cases for parameter_checks * black it files * update type hints and docstring * black them and update type hints and docstrings * update type hints and docstrings * black it * update type hints * tidy imports Co-authored-by: NicoGalli <72278140+NicoGalli@users.noreply.github.com> Co-authored-by: Tejash Shah <stejash15@gmail.com> Co-authored-by: Nodar Okroshiashvili <n.okroshiashvili@gmail.com>

Tejash-Shah added 4 commits October 3, 2020 15:40

add DropDuplicateFeatures in init

54d3db5

add fixture for duplicate features

9a5f9e5

add DropDuplicateFeatures functionality

bd89fe6

add test for DropDuplicateFeatures

91e56a2

solegalli reviewed Oct 5, 2020

View reviewed changes

Tejash-Shah and others added 8 commits October 8, 2020 10:59

add DropDuplicateFeatures in init

e98b970

add fixture for duplicate features

9afcf7d

add DropDuplicateFeatures functionality

62fbb1f

add test for DropDuplicateFeatures

7dcefc7

create drop duplicate transformer

779e483

Merge branch 'drop_duplicate_features' into drop_duplicates

634c0a1

Merge pull request #3 from solegalli/drop_duplicates

b2b9b72

Drop duplicates

delete extra fixture

decec24

solegalli merged commit 5c632bd into feature-engine:develop Oct 9, 2020

Tejash-Shah deleted the drop_duplicate_features branch October 11, 2020 00:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Drop duplicate features #114 #144

Drop duplicate features #114 #144

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Drop duplicate features #114 #144

Drop duplicate features #114 #144

Uh oh!

Conversation

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!