-
Notifications
You must be signed in to change notification settings - Fork 316
fix: empty record dtypes #2147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: empty record dtypes #2147
Conversation
# Avoid "ValueError: need at least one array to concatenate" on | ||
# older versions of pandas when converting empty RecordBatch to | ||
# DataFrame. See: https://github.com/pandas-dev/pandas/issues/41241 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas-dev/pandas#41052 has released with pandas 1.3.0 (as mentioned in above issue: pandas-dev/pandas#41241 (comment))
So if we can increase minimum pandas version to 1.3.0, we don't need this workaround
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should really come up with a consistent policy across all Google packages:
- pandas-gbq supports 1.1.4+ (https://github.com/googleapis/python-bigquery-pandas/blob/b3a9af27960c6e2736835d6a94e116f550f94114/setup.py#L27)
- db-dtypes supports 1.2.0+ (https://github.com/googleapis/python-db-dtypes-pandas/blob/44a5aa24ef3219849e9e0bedeeaebc4b9ee53d9e/setup.py#L34)
- bigframes supports 1.5.3+ (https://github.com/googleapis/python-bigquery-dataframes/blob/7d00be67cf50fdf713c40912f207d14f0f65538f/setup.py#L53)
- Colab and BigQuery Studio Python notebooks are both in the 2.x range now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, moving to a monorepo might help make the dependencies more consistent, ideally we might want any overlapped dependencies among the handwritten BigQuery projects to be consistent.
But just for the purpose of this PR, I think it's reasonable to increase the minimum pandas version to 1.3.0, as long as it's not breaking anything.
@@ -21,7 +21,7 @@ opentelemetry-api==1.1.0 | |||
opentelemetry-instrumentation==0.20b0 | |||
opentelemetry-sdk==1.1.0 | |||
packaging==24.2.0 | |||
pandas==1.1.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirmed that pandas==1.2.5
reproduces
FAILED tests/unit/test_table.py::TestRowIterator::test_to_dataframe_w_bqstorage_no_streams - ValueError: need at least one array to concatenate
and pandas==1.3.0
can pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the constraint file!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, but will wait for @chalmerlowe or @Linchin to approve.
# Avoid "ValueError: need at least one array to concatenate" on | ||
# older versions of pandas when converting empty RecordBatch to | ||
# DataFrame. See: https://github.com/pandas-dev/pandas/issues/41241 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should really come up with a consistent policy across all Google packages:
- pandas-gbq supports 1.1.4+ (https://github.com/googleapis/python-bigquery-pandas/blob/b3a9af27960c6e2736835d6a94e116f550f94114/setup.py#L27)
- db-dtypes supports 1.2.0+ (https://github.com/googleapis/python-db-dtypes-pandas/blob/44a5aa24ef3219849e9e0bedeeaebc4b9ee53d9e/setup.py#L34)
- bigframes supports 1.5.3+ (https://github.com/googleapis/python-bigquery-dataframes/blob/7d00be67cf50fdf713c40912f207d14f0f65538f/setup.py#L53)
- Colab and BigQuery Studio Python notebooks are both in the 2.x range now.
@@ -21,7 +21,7 @@ opentelemetry-api==1.1.0 | |||
opentelemetry-instrumentation==0.20b0 | |||
opentelemetry-sdk==1.1.0 | |||
packaging==24.2.0 | |||
pandas==1.1.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the constraint file!
I think we need to remove some liens in the unit test so coverage test can pass, could you fix it for us? Then we should be able to merge the PR
|
Indeed, pandas>=1.3.0 must have |
A different system test is failing, could you fix it too? Thanks |
# Result is dependent upon which version of pandas is being used. | ||
# Float64 was not introduced until pandas version 1.4. | ||
if PANDAS_INSTALLED_VERSION >= "1.4": | ||
assert df.dtypes["float64_col"].name == "Float64" | ||
else: | ||
assert df.dtypes["float64_col"].name == "string" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Float64
is introduced in pandas 1.2.0
, not 1.4
. So we can drop this handling now (pandas >= 1.3.0)
nox > python -m pip freeze
.
.
.
pandas==1.3.0
.
.
.
if PANDAS_INSTALLED_VERSION >= "1.4":
assert df.dtypes["float64_col"].name == "Float64"
else:
> assert df.dtypes["float64_col"].name == "string"
E AssertionError: assert 'Float64' == 'string'
E
E - string
E + Float64
@Linchin sorry to overlooking another error, I fixed it too. Can you try to run Ci test again ? I have trouble to run the system test on my local (maybe due to old numpy on apple silicon) but the tests should pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for your contribution!
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #2148 🦕
Can we increase the minimum supported pandas from
1.1.4
to1.3.0
?