feat: Allow DataFrame.join for self-join on Null index by TrevorBergeron · Pull Request #860 · googleapis/python-bigquery-dataframes · GitHub

@TrevorBergeron
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@TrevorBergeron TrevorBergeron requested review from a team as code owners July 24, 2024 23:24
@product-auto-label product-auto-label bot added the size: s Pull request size is small. label Jul 24, 2024
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. label Jul 24, 2024
Collaborator
@tswast tswast left a comment

Thanks Trevor!

Should we add an e2e test for the .fit(X, y) table with null index / unordered mode, too?

@TrevorBergeron
Contributor Author

> Thanks Trevor!
>
> Should we add an e2e test for the .fit(X, y) table with null index / unordered mode, too?

Good idea. This revealed that the ml modules were caching before the join, which is incompatible with a row-identity join. I changed them to cache after the join instead.
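To illustrate why the order of caching and joining can matter, here is a toy sketch in plain Python. It is not the bigframes implementation: `LazyFrame`, `source_id`, and the rule that `cache()` produces a fresh source are all assumptions made for illustration. The idea is that, without an index, two frames can only be joined by row identity when they are views of the same source, and caching either side first severs that link.

```python
from dataclasses import dataclass
from itertools import count

_ids = count()  # hands out a fresh id for every new underlying source


@dataclass(frozen=True)
class LazyFrame:
    """Toy stand-in for a lazily evaluated frame (hypothetical, not the bigframes API)."""

    source_id: int   # identity of the underlying table
    columns: tuple   # column names visible in this view

    @staticmethod
    def read(columns):
        return LazyFrame(next(_ids), tuple(columns))

    def select(self, *cols):
        # Selecting columns keeps the same source, so row identity is preserved.
        return LazyFrame(self.source_id, tuple(cols))

    def cache(self):
        # Assumption: caching materializes a new table with no shared lineage.
        return LazyFrame(next(_ids), self.columns)

    def join(self, other):
        # With no index to match on, rows can only be paired by identity,
        # which requires both sides to come from the same source.
        if self.source_id != other.source_id:
            raise ValueError("no index and no shared row identity")
        return LazyFrame(self.source_id, self.columns + other.columns)


table = LazyFrame.read(["a", "b", "label"])
X, y = table.select("a", "b"), table.select("label")

# Cache post-join: both sides are still views of one source, so the join works.
joined = X.join(y).cache()

# Cache pre-join: each cache() creates a new source, so the join is rejected.
try:
    X.cache().join(y.cache())
    pre_join_ok = True
except ValueError:
    pre_join_ok = False
```

In this model, `X.join(y).cache()` succeeds while `X.cache().join(y.cache())` raises, which matches the direction of the change described in the reply.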

@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: s Pull request size is small. labels Jul 25, 2024
@TrevorBergeron TrevorBergeron requested a review from tswast July 26, 2024 23:06
         input_data = X_train.cache()
     else:
-        input_data = X_train.cache().join(y_train.cache(), how="outer")
+        input_data = X_train.join(y_train.cache(), how="outer").cache()
Contributor

this should be y_train without cache() as well?

Contributor Author

yes, fixed
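A minimal sketch of the shape the code would take once `y_train` is also left uncached, based only on the diff lines and the exchange above. The helper name `prepare_input`, the `y_train is None` condition, and the `RecordingFrame` stub are hypothetical; the stub only records call order so the "join first, cache once" behaviour can be checked without a BigQuery session.

```python
def prepare_input(X_train, y_train=None):
    """Hypothetical helper mirroring the diff; not the library's actual function."""
    if y_train is None:  # assumed condition; the diff only shows the else branch
        return X_train.cache()
    # Join first, then cache once, so neither side is cached before the join.
    return X_train.join(y_train, how="outer").cache()


class RecordingFrame:
    """Stub frame that appends every call made on it to a shared log."""

    def __init__(self, log):
        self.log = log

    def join(self, other, how):
        self.log.append(f"join:{how}")
        return self

    def cache(self):
        self.log.append("cache")
        return self


calls = []
# Both stubs share one log, so a stray y_train.cache() would show up in it.
prepare_input(RecordingFrame(calls), RecordingFrame(calls))

features_only_calls = []
prepare_input(RecordingFrame(features_only_calls))
```

With both inputs supplied, the log contains a single join followed by a single cache; with only `X_train`, it contains just the cache call.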

def test_unordered_mode_logistic_regression_configure_fit_score(
    unordered_session, penguins_table_id, dataset_id
):
    model = bigframes.ml.linear_model.LogisticRegression()
Contributor

Not a problem, but when we pick a single model to test shared functionality, the usual choice is LinearRegression.

Contributor Author

ok, used linear regression model instead

@tswast tswast merged commit e950533 into main Jul 30, 2024
@tswast tswast deleted the null_join branch July 30, 2024 17:33
Labels

api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: m Pull request size is medium.
