8000 BigQuery Storage API sample for reading pandas dataframe by tswast · Pull Request #1994 · GoogleCloudPlatform/python-docs-samples · GitHub

BigQuery Storage API sample for reading pandas dataframe #1994


Merged: tswast merged 10 commits into master from tswast-bqstorage-pandas on Feb 7, 2019

Conversation

@tswast (Contributor) commented on Feb 4, 2019:

How to get a pandas DataFrame, fast!

The first two examples use the existing BigQuery client. These examples
create a thread pool and read in parallel. The final example shows using
just the new BigQuery Storage client, but only shows how to read with a
single thread.
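
For orientation, a rough sketch of the first approach (run a query, then pull the results through the BigQuery Storage API) written against current client-library versions rather than the sample's exact code; the query and the public table name are illustrative placeholders:

    # Sketch only: download query results to a pandas DataFrame, using the
    # BigQuery Storage API for the row download. Table and query are
    # placeholders, not necessarily what the sample uses.
    from google.cloud import bigquery
    from google.cloud import bigquery_storage

    bq_client = bigquery.Client()
    bqstorage_client = bigquery_storage.BigQueryReadClient()

    query = """
        SELECT name, state, year, number
        FROM `bigquery-public-data.usa_names.usa_1910_current`
    """

    # When a BigQuery Storage client is passed to to_dataframe(), the rows
    # are streamed through the Storage API (in parallel) instead of the
    # slower tabledata.list-based download.
    df = bq_client.query(query).to_dataframe(bqstorage_client=bqstorage_client)
    print(df.head())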

@tswast added the "do not merge" label (indicates a pull request not ready for merge, due to either quality or timing) on Feb 4, 2019
@googlebot added the "cla: yes" label (this human has signed the Contributor License Agreement) on Feb 4, 2019
@tswast removed the "do not merge" label on Feb 5, 2019
@tswast force-pushed the tswast-bqstorage-pandas branch from 0eaca4e to a3a48c0 on February 5, 2019 00:37
@tswast requested review from alixhami and shollyman on February 5, 2019 00:37
@tswast added the "bigquery" label on Feb 5, 2019
@shollyman (Contributor) left a comment:

Looks good, modulo the open question on small results

Inline thread on this snippet from the sample:

    # [START bigquerystorage_pandas_read_query_results]
    import uuid

    # Due to a known issue in the BigQuery Storage API (TODO: link to

@shollyman (Contributor) commented:

Should we consider simply running a large query that emits enough data to avoid the inline? Pros: better demonstrates the perf of the new API, and avoids us having to revisit the sample. Cons: test time overhead and potential pitfalls for people kicking tires with small results. Part of this is dependent on how the team will be maintaining their KI list.

@tswast (Contributor, Author) replied:

> Cons: test time overhead.

Long test time is my main reason for avoiding queries that return big results. I guess it's not so bad since this repo can test the different directories independently.

> Cons: potential pitfalls for people kicking tires with small results.

The current failure case is rather bad: it returns a successful response, but gives you an empty result set. That's a pretty big pitfall, because catching it requires noticing that you didn't get the data you thought you were going to get.

@shollyman (Contributor) replied:

Works for me. Let's confirm how the KIs will be maintained so you can link them.
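
As an aside for readers of this thread (not part of the sample), a minimal sketch of a guard against the failure mode tswast describes above; it reuses the hypothetical bq_client, bqstorage_client, and query names from the earlier sketch:

    # Guard against the known issue discussed above: for small results the
    # Storage API path can succeed yet hand back no rows.
    job = bq_client.query(query)
    df = job.to_dataframe(bqstorage_client=bqstorage_client)

    if df.empty and job.result().total_rows > 0:
        # The Storage API download came back empty even though the query
        # produced rows; fall back to the regular (slower) download path.
        df = job.to_dataframe()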

tswast added 10 commits February 7, 2019 10:46
How to get a pandas DataFrame, fast!

The first two examples use the existing BigQuery client. These examples
create a thread pool and read in parallel. The final example shows using
just the new BigQuery Storage client, but only shows how to read with a
single thread.
* Move imports inside region tags.
* Adjust query indentation to match region tags.
Move duplicate imports out of region tags.

Add region tag for the whole sample.
to just above the sample where it is used. This
makes the complete source code for the sample
make more sense (bigquerystorage_pandas_tutorial_all)
@tswast force-pushed the tswast-bqstorage-pandas branch from f064277 to 925fe3b on February 7, 2019 18:47
@tswast merged commit e9bc7de into master on Feb 7, 2019
@tswast deleted the tswast-bqstorage-pandas branch on February 7, 2019 18:53
plamut pushed a commit to plamut/python-bigquery-storage that referenced this pull request on Sep 2, 2020:
BigQuery Storage API sample for reading pandas dataframe (GoogleCloudPlatform/python-docs-samples#1994)

plamut pushed a commit to googleapis/python-bigquery-storage that referenced this pull request on Sep 10, 2020:
BigQuery Storage API sample for reading pandas dataframe (GoogleCloudPlatform/python-docs-samples#1994)

Labels: bigquery, cla: yes (This human has signed the Contributor License Agreement)
4 participants