-
-
Notifications
You must be signed in to change notification settings - Fork 324
Drop duplicate features #114 #144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop duplicate features #114 #144
Conversation
Also, if you notice py36/py37/py38 tests on circleci there is a runtime warning generated by importlib. I think it's harmless as mentioned here. Let me know if you know any workaround or we can leave as it is? or do you want suppress but it would not be a good idea. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Tejash-Shah
Thanks for the PR. There are 2 main things I would like to think more about:
-
collecting duplicated features in a dictionary does not seem to be optimal. Most keys will be empty. I think an array or list of lists would be better, but only outputting duplicates, i.e., we would not output a list of empty lists.
-
the use of itertools need some thinking. Using combinations create all possible combinations, some of them were already examined and features added to the list, so not sure it is the fastest.
It would be great if you could play with the implementation I suggest, the on2 you wrote, the original code in the issue, and see which one is faster. Could you do that please?
Thanks a lot!
Sole
# check input dataframe | ||
X = _is_dataframe(X) | ||
|
||
duplicated_features_list = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better if this was a set instead of a list. It would run faster and handle the addition of features that are already there automatically
|
||
for feat_1, feat_2 in col_pair: | ||
if feat_1 not in duplicated_features_list and X[feat_1].equals(X[feat_2]): | ||
self.duplicate_feature_dict_[feat_1].append(feat_2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should check if both feat_1 and feat_2 are in the duplicated list, otherwise skip, to speed things up?
itertools.combinations used this way doesn't optimize for the fact that we should be skipping columns that were found duplicates already, or fit well with the array or list of lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some research / thinking this is a template of what the code could look like:
# list of lists or array of arrays: needs some thinking / proper design
duplicated_feat_pairs = []
# create an empty set to collect features
# that were found to be duplicated
_duplicated_feat = ()
# create set of examined features
_examined_features = ()
# pick first feature to compare
for feature in X.columns:
# append so we can remove when we create the combinations
_examined_features.append(feature)
if feature not in _duplicated_feat:
# remove features being examined, already examined or duplicated features already identified
remaining_cols = X.columns.drop(list(_duplicated_feat)+list(_examined_features))
#not too sure we need to convert the sets to lists to be dropped, can you check?
# create combinations of that feature with the others:
for f1, f2 in itertools.product([feature], remaining_cols):
if data[f1].equals(data[f2]):
add it to the array or list of lists.
A few things to notice:
How to create the array or list of lists needs some thinking. The result could be an empty array if no duplicated features are present in the dataframe. Or alternatively it could contain 1 or more lists / rows with different lenghts. Could I leave the design of this to you?
Another thing I would like to know is whether using itertools as we do here, would be faster than the code suggested in the issue. Could you also check?
Drop duplicates
* add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * create drop duplicate transformer * delete extra fixture Co-authored-by: Soledad Galli <solegalli@protonmail.com>
* style formatting base scripts * reformat style creation modules * reformat style discretisers * reformat style imputers * rewords strings, minor changes * reformat codestyle outliers * reformat code style selection * reformat style transformers * reformat style wrappers * separate woe and ratio encoder closes issue #143 (#149) * Issue 143 * doc issue 143 documentation issue 143 * Update PRatioEncoder.rst 'C' Removed from RareLabelCEncoder whch was causing and error. Also enconder_dict_ ratio results added * Changes in docstrings #143 Changes requested by Sole in docstrings, after first pull request related to this issue. * More docstring changes related to #143 minor updates in docstrings * separate woe and ratio tests into functions #147 separate woe and ratio tests into functions #147 * move PRatioEncoder under WoEencoder in list as these are related transformers * add ' to log_ratio in docstring * fix docstring with intro to transformer funcion * add more detail about the encoding in docstrings * renamed test functions * reword tests Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reorganised folders with jupyter notebooks (#155) * Drop duplicate features #114 (#144) * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * create drop duplicate transformer * delete extra fixture Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reformat code style encoders * shorten dosctrings with flake8 Co-authored-by: NicoGalli <72278140+NicoGalli@users.noreply.github.com> Co-authored-by: Tejash Shah <stejash15@gmail.com>
* add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * create drop duplicate transformer * delete extra fixture Co-authored-by: Soledad Galli <solegalli@protonmail.com>
* style formatting base scripts * reformat style creation modules * reformat style discretisers * reformat style imputers * rewords strings, minor changes * reformat codestyle outliers * reformat code style selection * reformat style transformers * reformat style wrappers * separate woe and ratio encoder closes issue #143 (#149) * Issue 143 * doc issue 143 documentation issue 143 * Update PRatioEncoder.rst 'C' Removed from RareLabelCEncoder whch was causing and error. Also enconder_dict_ ratio results added * Changes in docstrings #143 Changes requested by Sole in docstrings, after first pull request related to this issue. * More docstring changes related to #143 minor updates in docstrings * separate woe and ratio tests into functions #147 separate woe and ratio tests into functions #147 * move PRatioEncoder under WoEencoder in list as these are related transformers * add ' to log_ratio in docstring * fix docstring with intro to transformer funcion * add more detail about the encoding in docstrings * renamed test functions * reword tests Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reorganised folders with jupyter notebooks (#155) * Drop duplicate features #114 (#144) * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * create drop duplicate transformer * delete extra fixture Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reformat code style encoders * shorten dosctrings with flake8 Co-authored-by: NicoGalli <72278140+NicoGalli@users.noreply.github.com> Co-authored-by: Tejash Shah <stejash15@gmail.com>
* sort modules in subfolders, separate classes in individual modules, rename encoding classes, rename few other classes * fix WoE name in doc * sort selection in subfolder, separate class into modules * fix selection imports * update test_selection * update VERSION to 1.0.0 * match doc name with submodule name, fix imports and class names (#129) * reorganise test submodules as per package submodules, create individual test files for each class (#130) * rename class init params and defo values (#131) * reorganise base transformers and functions, clean imports (#133) * fix style check (#141) * improve code style throughout (#142) * remove iterable from init MathematicalCombination * replace listcomp by genexp in arbitrary discretiser * create abstraction decision tree discretiser * create abstraction in base cat encoder * replace listcomp by genexp * create abstraction base numerical imputer * create baseOutlier * final cleanup of code * replace listcomp by genexp in randomsampleimputer * split tests into individual functions (#147) * split imputation tests * fix typo, unify test names in math combination test * reformat line space in sklearwrapper test * remove comment in drop constant tests * split discretisation tests * split encoding tests * split test outliers * split transformer tests * separate woe and ratio encoder closes issue #143 (#149) * Issue 143 * doc issue 143 documentation issue 143 * Update PRatioEncoder.rst 'C' Removed from RareLabelCEncoder whch was causing and error. Also enconder_dict_ ratio results added * Changes in docstrings #143 Changes requested by Sole in docstrings, after first pull request related to this issue. * More docstring changes related to #143 minor updates in docstrings * separate woe and ratio tests into functions #147 separate woe and ratio tests into functions #147 * move PRatioEncoder under WoEencoder in list as these are related transformers * add ' to log_ratio in docstring * fix docstring with intro to transformer funcion * add more detail about the encoding in docstrings * renamed test functions * reword tests Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reorganised folders with jupyter notebooks (#155) * Drop duplicate features #114 (#144) * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * create drop duplicate transformer * delete extra fixture Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reformat code style with black (#153) * style formatting base scripts * reformat style creation modules * reformat style discretisers * reformat style imputers * rewords strings, minor changes * reformat codestyle outliers * reformat code style selection * reformat style transformers * reformat style wrappers * separate woe and ratio encoder closes issue #143 (#149) * Issue 143 * doc issue 143 documentation issue 143 * Update PRatioEncoder.rst 'C' Removed from RareLabelCEncoder whch was causing and error. Also enconder_dict_ ratio results added * Changes in docstrings #143 Changes requested by Sole in docstrings, after first pull request related to this issue. * More docstring changes related to #143 minor updates in docstrings * separate woe and ratio tests into functions #147 separate woe and ratio tests into functions #147 * move PRatioEncoder under WoEencoder in list as these are related transformers * add ' to log_ratio in docstring * fix docstring with intro to transformer funcion * add more detail about the encoding in docstrings * renamed test functions * reword tests Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reorganised folders with jupyter notebooks (#155) * Drop duplicate features #114 (#144) * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * add DropDuplicateFeatures in init * add fixture for duplicate features * add DropDuplicateFeatures functionality * add test for DropDuplicateFeatures * create drop duplicate transformer * delete extra fixture Co-authored-by: Soledad Galli <solegalli@protonmail.com> * reformat code style encoders * shorten dosctrings with flake8 Co-authored-by: NicoGalli <72278140+NicoGalli@users.noreply.github.com> Co-authored-by: Tejash Shah <stejash15@gmail.com> * reformat test code style (#157) * expand style check in tox, add black to test_req (#154) * update docs, shorten lines, test code (#158) * update docs, shorten lines, test code * include v1.0.0 changes in changelog * fix linebreaks in changelog * Add type hints, docstrings, and expand test. Introduce bug fix in _define_variables (#159) * Add .vscode in gitignore * add type hints and docstrings * Add type hints, docstrings, and introduce bug fix * add type hints and docstrings * add type hints, docstrings, and stylistic modifications * add type hints, docstrings, and stylistic modifications * some stylistic modifications * add test to check null values in dataframe * remove redundant docstring * stylistic modification * add new test and expanded existing one * fix flake8 suggestions * fix some indentation errors * fix indention error * add type hints and docstrings in boxcox transformer * add type hints and docstrings in log transformation * add type hints, docstrings, and recaftor constructor in power transformation * add type hints and docstrings in reciprocal transformer * add type hints * add type hints and docstrings * add test case to test the type of math operation argument * remove extra blank line * add test cases for parameter_checks * black it files * update type hints and docstring * black them and update type hints and docstrings * update type hints and docstrings * black it * update type hints * tidy imports Co-authored-by: NicoGalli <72278140+NicoGalli@users.noreply.github.com> Co-authored-by: Tejash Shah <stejash15@gmail.com> Co-authored-by: Nodar Okroshiashvili <n.okroshiashvili@gmail.com>
Hi @solegalli,
WRT #114, adding
DropDuplicateFeatures()
functionality along with tests. Let me know any typos/suggestions. Thanks:)ednon-duplicate features intransform