Add Filter and Pre Selection Components by samborba · Pull Request #65 · platiagro/projects · GitHub

Add Filter and Pre Selection Components #65


Merged: 10 commits, Jun 7, 2020

Conversation

samborba
Contributor

As a temporary measure for handling custom transformers, the transformer code is wrapped so that it can be imported in both the Training and Inference notebooks.

  • Filter Selection
    Removes selected features from the dataset.

  • Pre Selection
    Removes features with low variance and features that are highly correlated with others.
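
The two components described above can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: the `variance_cutoff` and `correlation_cutoff` parameters, their default values, and the choice to skip non-numeric columns are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def filter_selection(df, features_to_filter):
    """Drop the explicitly selected columns from the dataset."""
    return df.drop(columns=features_to_filter)


def pre_selection(df, variance_cutoff=0.0, correlation_cutoff=0.9):
    """Drop low-variance and highly correlated numeric columns.

    The cutoffs are hypothetical parameters chosen for this sketch.
    Non-numeric (categorical) columns are left untouched.
    """
    numeric = df.select_dtypes(include="number")

    # Step 1: columns whose variance does not exceed the cutoff.
    variances = numeric.var()
    low_variance = variances[variances <= variance_cutoff].index.tolist()
    kept = numeric.drop(columns=low_variance)

    # Step 2: for each correlated pair, drop the later column.
    # Only the strict upper triangle is inspected so each pair is seen once.
    corr = kept.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    highly_correlated = [
        col for col in upper.columns if (upper[col] > correlation_cutoff).any()
    ]

    return df.drop(columns=low_variance + highly_correlated)


df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 10],      # perfectly correlated with "a"
        "c": [7, 7, 7, 7, 7],       # zero variance
        "d": [5, 1, 4, 2, 3],
        "name": ["v", "w", "x", "y", "z"],  # categorical, ignored
    }
)

selected = pre_selection(df)
filtered = filter_selection(df, ["name"])
```

Here `selected` keeps `a`, `d` and `name`, while `filtered` keeps every column except `name`.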

@samborba samborba requested review from fberanizo, lucaslzl and lborro May 29, 2020 14:26
Pre Selection:
- Change correlation method
- Implementation of the fit method so that we can call the transform method in the Inference file
- Import features_after_pipeline

Filter:
- Changes in contract.json
- Implementation of the fit method so that we can call the transform method in the Inference file
- Import features_after_pipeline
Member
@fberanizo fberanizo left a comment

@samborba you need to copy the cell containing the CustomTransformer.py class into Inference.ipynb.
In the platform flow, Training.ipynb and Inference.ipynb run at different times and in different containers.
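
A small sketch of why the class definition has to exist on both sides, assuming the fitted transformer is persisted with a pickle-based mechanism (how the platform's `save_model` actually serializes objects is not shown in this thread). The `CustomTransformer` body below is a hypothetical stand-in that works on plain dictionaries.

```python
import pickle


class CustomTransformer:
    """Hypothetical stand-in for the custom transformer cell."""

    def __init__(self, columns_to_drop):
        self.columns_to_drop = columns_to_drop

    def fit(self, rows, y=None):
        # Nothing to learn in this sketch; return self by convention.
        return self

    def transform(self, rows):
        # Remove the configured keys from every row.
        return [
            {key: value for key, value in row.items()
             if key not in self.columns_to_drop}
            for row in rows
        ]


# "Training" side: serialize the fitted object.
payload = pickle.dumps(CustomTransformer(["cabin"]).fit([]))

# The payload stores the instance attributes plus a reference to the class
# by name, not the class source code.
class_is_referenced_by_name = b"CustomTransformer" in payload

# "Inference" side: loading only works because the class is defined here too.
restored = pickle.loads(payload)
result = restored.transform([{"age": 22, "cabin": "C85"}])
```

Because only a reference to the class travels with the payload, a container that never ran the cell defining `CustomTransformer` would have nothing to resolve that reference against.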

@fberanizo
Member

It fails when I use a dataset with categorical features, e.g. titanic.
[Image: error]

- Add Wrapping Custom Transformer step in Inference.ipynb
- Remove target param from Pre Selection
- Handle categorical features in Pre Selection
@samborba samborba requested review from lborro and fberanizo June 2, 2020 04:12
- Bring back the target variable for Pre Selection
- Change features_to_filter to feature type
@samborba samborba requested a review from fberanizo June 4, 2020 00:26
- Add target variable in Filter Selection
- Refactoring Pre Selection: improving code documentation; remove imports and unused variables; remove target column from features_after_pipeline
- Refactoring in attribute engineering components: the numerical_indexes list values no longer need to be saved with the save_model method, as the model will be able to remember the type of each column
- Change in Inference targets
- Clear Normalizer output
- Add parameter tag in Filter
- Get new numerical features indexes after make_column_transformer
@samborba samborba requested a review from fberanizo June 6, 2020 08:04
@fberanizo
Member
fberanizo commented Jun 6, 2020

@lucaslzl @lborro
@samborba apparently scikit-learn has, for some time now, accepted pandas.DataFrame in calls to .fit, .transform, .predict ...

Does anyone want to run some tests and replace the places that use ndarray with DataFrame?
This column-ordering part seems unnecessarily complicated, with all these saves and reorderings, etc.
It looks like with pandas.DataFrame scikit-learn itself takes care of this.

Edit: I looked into it further and it is not quite what I thought. It is still an open problem in scikit-learn:
scikit-learn/scikit-learn#7242
scikit-learn/scikit-learn#12627

The suggestion above does not help. I tested it and it did not work.

The way to go seems to be something like:

save_model(
     ...,
     feature_names_in=feature_names_in,
     feature_names_out=feature_names_out,
)
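
One way the saved names could be used at inference time is sketched below. It only illustrates the idea of restoring the training column order from a stored `feature_names_in` list; the `align_columns` helper and the example column names are hypothetical, and nothing here reflects how `save_model` or its loading counterpart actually work.

```python
import pandas as pd


def align_columns(df, feature_names_in):
    """Return df restricted to, and ordered by, the training-time columns.

    Hypothetical helper: raises if a column seen in training is absent.
    """
    missing = [name for name in feature_names_in if name not in df.columns]
    if missing:
        raise ValueError(f"missing columns at inference time: {missing}")
    return df[feature_names_in]


# Column order recorded during training (illustrative values).
feature_names_in = ["age", "fare", "sex"]

# Inference data arrives in a different order, with an extra column.
inference_df = pd.DataFrame(
    {"sex": ["male"], "fare": [7.25], "age": [22], "extra": [1]}
)

aligned = align_columns(inference_df, feature_names_in)
```

After this step `aligned.to_numpy()` yields values in the same positional order the model saw during training, which is what matters when downstream steps index columns by position.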

- Save features_after_pipeline and original column
@samborba samborba requested a review from fberanizo June 6, 2020 17:21
Member
@fberanizo fberanizo left a comment

The only remaining issue is the drop in the normalizer. The rest worked.

- make_column_transformer remainder change: columns that are not specified should no longer be dropped after the transformation
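
The behaviour this commit refers to can be reproduced with scikit-learn alone. The sketch below is illustrative and not taken from the notebooks: the data and the use of `StandardScaler` are assumptions. By default `make_column_transformer` uses `remainder="drop"`, so unspecified columns disappear; with `remainder="passthrough"` they are kept but moved after the transformed columns, which is also why the numerical indexes have to be recomputed.

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

X = np.array(
    [
        [1.0, 10.0, 100.0],
        [2.0, 20.0, 200.0],
        [3.0, 30.0, 300.0],
    ]
)
numerical_indexes = [1]  # only the middle column is scaled in this example

# Default remainder="drop": columns 0 and 2 are discarded.
dropping = make_column_transformer((StandardScaler(), numerical_indexes))
out_drop = dropping.fit_transform(X)

# remainder="passthrough": columns 0 and 2 are kept, appended after the
# transformed column, so the output order is [col 1 scaled, col 0, col 2].
keeping = make_column_transformer(
    (StandardScaler(), numerical_indexes), remainder="passthrough"
)
out_pass = keeping.fit_transform(X)

# The transformed columns now occupy the leading positions of the output.
new_numerical_indexes = list(range(len(numerical_indexes)))
```

With the passthrough setting the output keeps all three columns, and the scaled column sits at index 0 instead of its original index 1.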
@samborba samborba requested a review from fberanizo June 6, 2020 23:35
@fberanizo
Member

LGTM

@fberanizo fberanizo merged commit df40935 into platiagro:master Jun 7, 2020
@samborba samborba deleted the feature/add-new-components branch June 8, 2020 12:21