feat: support array output in `remote_function` by shobsi · Pull Request #1057 · googleapis/python-bigquery-dataframes · GitHub

Conversation

@shobsi (Contributor) commented Oct 7, 2024

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)
    • remote_function: screen/5rMtCZVaUYKdqxP
    • Series.apply: screen/9HkKMuWxMvbbPgf
    • DataFrame.apply: screen/BoXH9A7d4hGpETu

Fixes internal issue 298876217 🦕

This is a feature request to support use cases like creating custom
feature vectors, embeddings, etc.
@shobsi shobsi requested review from a team as code owners October 7, 2024 17:28
@shobsi shobsi requested a review from GarrettWu October 7, 2024 17:28
@product-auto-label product-auto-label bot added the size: m Pull request size is medium. label Oct 7, 2024
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. label Oct 7, 2024
@shobsi shobsi marked this pull request as draft October 8, 2024 00:42
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Oct 10, 2024
shobsi added 12 commits December 6, 2024 08:04

# if the output is an array, reconstruct it from the json serialized
# string form
if bigframes.dtypes.is_array_like(func.output_dtype):
Contributor

Do we actually handle any array-like dtype?

Contributor Author

Um, in this PR we are looking to support types like list[int] on the output side? Or I didn't get you?
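The code comment under review mentions reconstructing an array from its JSON-serialized string form. A minimal stdlib-only sketch of that round trip is below; the helper names `serialize_array_output` and `reconstruct_array` are hypothetical and are not the PR's implementation, which (per the commit log) does the conversion on the BigQuery side.

```python
import json


def serialize_array_output(values):
    """Encode a list result as a JSON string (hypothetical helper).

    Stands in for a function whose transport type is a plain string.
    """
    return json.dumps(values)


def reconstruct_array(serialized, element_type):
    """Rebuild a typed list from its JSON-serialized string form.

    element_type is a callable such as int or str that is applied to
    each decoded element.
    """
    if serialized is None:
        # Propagate missing values unchanged.
        return None
    items = json.loads(serialized)
    if not isinstance(items, list):
        raise ValueError(f"expected a JSON array, got {type(items).__name__}")
    return [element_type(item) for item in items]


round_tripped = reconstruct_array(serialize_array_output([1, 2, 3]), int)
```

The element cast is kept separate from JSON decoding so the same path can serve both integer and string arrays.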


# if the output is an array, reconstruct it from the json serialized
# string form
if bigframes.dtypes.is_array_like(func.output_dtype):
Contributor

seems the code within this block assumes not just array_like, but specifically that it is a pyarrow list_ type

Contributor Author

That's exactly what the array_like implementation checks?

def is_array_like(type_: ExpressionType) -> bool:
    return isinstance(type_, pd.ArrowDtype) and isinstance(
        type_.pyarrow_dtype, pa.ListType
    )

Contributor

eh, probably fine then, I don't really see array_like definition expanding anytime soon

return None

try:
    python_output_type = eval(output_type)
Contributor

eval always makes me a bit uncomfortable - can we do this in a more constrained way?

Contributor Author

removed eval in the latest patch, PTAL
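One constrained alternative to `eval` for turning a stored type string back into a Python type is an explicit allowlist lookup. The sketch below is only an illustration of that idea; the `parse_output_type` helper and the set of accepted strings are assumptions, not the code that landed in the patch.

```python
# Allowlist of type strings that may be deserialized (an assumed set,
# chosen for illustration). Requires Python 3.9+ for list[...] syntax.
_SUPPORTED_OUTPUT_TYPES = {
    "bool": bool,
    "int": int,
    "float": float,
    "str": str,
    "list[bool]": list[bool],
    "list[int]": list[int],
    "list[float]": list[float],
    "list[str]": list[str],
}


def parse_output_type(output_type):
    """Map a serialized type name to a Python type without eval.

    Returns None when no type string is given and raises ValueError for
    anything outside the allowlist, so arbitrary expressions are never
    executed.
    """
    if output_type is None:
        return None
    key = output_type.replace(" ", "")
    try:
        return _SUPPORTED_OUTPUT_TYPES[key]
    except KeyError:
        raise ValueError(f"unsupported output type: {output_type!r}") from None
```

Because the lookup only ever returns objects already present in the table, a malicious string such as `__import__('os')` is rejected rather than evaluated.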

if typing.get_origin(python_output_type) is list:
    python_output_type_ser = repr(python_output_type)
else:
    python_output_type_ser = python_output_type.__name__
Contributor

should we bother with non-list types right now?

Contributor Author

throwing error for non-array and not-supported-array types in the latest patch, PTAL
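The behavior described in this reply (serialize only supported array types and raise for everything else) could look roughly like the following sketch. The `serialize_output_type` name, the supported element set, and the choice of `TypeError` are assumptions for illustration; the merged code may differ.

```python
import typing

# Element types accepted inside list[...] (an assumed set for this sketch).
_SUPPORTED_ELEMENT_TYPES = (bool, int, float, str)


def serialize_output_type(python_output_type):
    """Return a stable string form for a supported list[...] type.

    Non-array types and arrays with unsupported element types raise
    TypeError instead of being serialized.
    """
    if typing.get_origin(python_output_type) is list:
        args = typing.get_args(python_output_type)
        if len(args) == 1 and args[0] in _SUPPORTED_ELEMENT_TYPES:
            # Build the string explicitly rather than relying on repr().
            return f"list[{args[0].__name__}]"
    raise TypeError(f"unsupported array output type: {python_output_type!r}")


serialized = serialize_output_type(list[int])
```

Building the string from `typing.get_args` avoids depending on `repr`, whose output differs between `list[int]` and `typing.List[int]`.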

@shobsi shobsi merged commit bdee173 into main Jan 15, 2025
22 checks passed
@shobsi shobsi deleted the shobs-rf-array-out-1 branch January 15, 2025 23:15
shuoweil pushed a commit that referenced this pull request Jan 20, 2025
* feat: support array output in `remote_function`

This is a feature request to support use cases like creating custom
feature vectors, embeddings, etc.

* add multiindex test

* move array type conversion to bigquery module, test multiindex

* add `bigframes.bigquery.json_extract_string_array`, support int and str array outputs

* increase cleanup rate

* update input and output types doc

* support array output in DataFrame.apply

* support read_gbq_function on a remote function created for array output

* fix the json_set after variable renaming

* add tests for output_type in read_gbq_function

* temporarily exclude system 3.9 tests and include 3.10 and 3.11

* Revert "temporarily exclude system 3.9 tests and include 3.10 and 3.11"

This reverts commit 2485aa3.

* add more info in the unexpected exception

* more debug info

* use unique routine name across tests

* Revert "more debug info"

This reverts commit 86fe316.

* Revert "add more info in the unexpected exception"

This reverts commit fe010cb.

* support array output in binary remote function operations

* support array output in nary remote function operations

* preserve array output type in function description to avoid explicit output_type in read_gbq_function

* fix one failing read_gbq_function test

* make test parameterization order deterministic

* fix sorting of types for mypy

* remove test parameterization with sorting inside

* include partial ordering mode testing for read_gbq_function

* add remote function array out test in partial ordering mode

* avoid repr-eval for output type serialization/deserialization

* remove unsupported scenarios system tests, use common exception for unsupported
shuoweil pushed a commit that referenced this pull request Jan 24, 2025
Labels

api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.
