Kaiyang expansion project 2022 #8224
Conversation
Ok I haven't even started reviewing but this PR description is 🔥 and because of that it has me excited to review it, well done @kaiyang-code
Looking great! Let me know if you'd like me to clarify any comments.
f"{BQ_DESTINATION_DATASET_NAME}.{BQ_PHX_PRCP_TABLE_NAME}", | ||
f"{BQ_DESTINATION_DATASET_NAME}.{BQ_PHX_SNOW_TABLE_NAME}", | ||
], | ||
|
Extra line? @leahecole to confirm DAG syntax
idk I'd probably run black on it though for formatting
```python
# BQ_DESTINATION_DATASET_NAME = "expansion_project"
# BQ_DESTINATION_TABLE_NAME = "ghcnd_stations_joined"
# BQ_NORMALIZED_TABLE_NAME = "ghcnd_stations_normalized"
# BQ_PRCP_MEAN_TABLE_NAME = "ghcnd_stations_prcp_mean"
# BQ_SNOW_MEAN_TABLE_NAME = "ghcnd_stations_snow_mean"
# BQ_PHX_PRCP_TABLE_NAME = "phx_annual_prcp"
# BQ_PHX_SNOW_TABLE_NAME = "phx_annual_snow"
```
Is this necessary to leave in for the purposes of the sample?
I left it on purpose so that the PySpark program can run independently without having to run together with the DAG. You can check out the "If you just want to run the PySpark code:" section in my description :)
```python
# BUCKET_NAME = "workshop_example_bucket"
# READ_TABLE = f"{BQ_DESTINATION_DATASET_NAME}.{BQ_DESTINATION_TABLE_NAME}"
# DF_WRITE_TABLE = f"{BQ_DESTINATION_DATASET_NAME}.{BQ_NORMALIZED_TABLE_NAME}"
# PRCP_MEAN_WRITE_TABLE = f"{BQ_DESTINATION_DATASET_NAME}.{BQ_PRCP_MEAN_TABLE_NAME}"
# SNOW_MEAN_WRITE_TABLE = f"{BQ_DESTINATION_DATASET_NAME}.{BQ_SNOW_MEAN_TABLE_NAME}"
# PHX_PRCP_WRITE_TABLE = f"{BQ_DESTINATION_DATASET_NAME}.{BQ_PHX_PRCP_TABLE_NAME}"
# PHX_SNOW_WRITE_TABLE = f"{BQ_DESTINATION_DATASET_NAME}.{BQ_PHX_SNOW_TABLE_NAME}"
```
Is this necessary to leave in?
Same reason as above
```python
phx_annual_prcp_df = (
    phx_annual_prcp_df.withColumn(f"PHX_PRCP_{year}", lit(phx_dw_compute(prcp_year)))
)
phx_annual_snow_df = (
    phx_annual_snow_df.withColumn(f"PHX_SNOW_{year}", lit(phx_dw_compute(snow_year)))
)
```
You can also just create the DataFrames here and populate them in one line instead of declaring them on lines 119/120. See Section 2.2: https://sparkbyexamples.com/pyspark/different-ways-to-create-dataframe-in-pyspark/
The reason I'm doing this is that they are in a for loop. If I didn't misunderstand your point, I think creating and populating them in one line would result in more than one DF, which is not what I want.
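For illustration, a self-contained sketch of the pattern in question, with hypothetical seed values standing in for `phx_dw_compute` and the loop's yearly data (nothing below comes from the actual diff): rebinding the same DataFrame variable inside the loop keeps a single DF and adds one column per year, whereas calling `createDataFrame` inside the loop would produce a new DF on each iteration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("withcolumn-loop-sketch").getOrCreate()

# Hypothetical per-year results, standing in for phx_dw_compute(prcp_year).
yearly_values = {1997: 203.4, 1998: 180.1, 1999: 220.7}

# Start from a one-row seed DataFrame, then add one column per year.
phx_annual_prcp_df = spark.createDataFrame([(1,)], ["seed"])
for year, value in yearly_values.items():
    # Rebinding the same variable keeps everything in a single DataFrame;
    # calling createDataFrame inside the loop would instead yield one DF per year.
    phx_annual_prcp_df = phx_annual_prcp_df.withColumn(f"PHX_PRCP_{year}", lit(value))

phx_annual_prcp_df.show()  # one row, plus one PHX_PRCP_<year> column per year
```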
+1 to what Brad says, and then in the data processing file, extra emphasis on adding comments. We aren't going to have time to turn this into a tutorial, so having it as a well-commented, runnable code sample is the next best thing.
```python
BQ_PRCP_MEAN_TABLE_NAME = "ghcnd_stations_prcp_mean"
BQ_SNOW_MEAN_TABLE_NAME = "ghcnd_stations_prcp_mean"
```
Are these supposed to be identical?
it's a typo! Just fixed!
Still showing up as identical
f"{BQ_DESTINATION_DATASET_NAME}.{BQ_PHX_PRCP_TABLE_NAME}", | ||
f"{BQ_DESTINATION_DATASET_NAME}.{BQ_PHX_SNOW_TABLE_NAME}", | ||
], | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idk I'd probably run black
on it though for formatting
couple of other small nits
```python
)

phx_annual_prcp_df = (
    phx_annual_prcp_df.withColumn(f"PHX_PRCP_{year_val}", lit(phx_dw_compute(prcp_year)))
```
Is there a reason you're introducing new columns each time? As opposed to making this statically have two columns that you're appending (Year, Value) to?
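For reference, a hedged sketch of the static (Year, Value) alternative being suggested here, with made-up values standing in for `phx_dw_compute` (none of this is from the actual diff): collect one tuple per iteration and create a single two-column DataFrame after the loop, so the schema stays fixed no matter how many years are processed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("long-format-sketch").getOrCreate()

# Hypothetical per-year results, standing in for phx_dw_compute(prcp_year).
yearly_values = {1997: 203.4, 1998: 180.1, 1999: 220.7}

# Accumulate plain (year, value) tuples inside the loop...
rows = [(year, value) for year, value in yearly_values.items()]

# ...then create a single DataFrame with a fixed two-column schema.
phx_annual_prcp_df = spark.createDataFrame(rows, ["YEAR", "PHX_PRCP"])
phx_annual_prcp_df.show()  # one row per year instead of one column per year
```

A long format like this also tends to be easier to query and plot than one column per year.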
```python
    phx_annual_prcp_df.withColumn(f"PHX_PRCP_{year_val}", lit(phx_dw_compute(prcp_year)))
)
phx_annual_snow_df = (
    phx_annual_snow_df.withColumn(f"PHX_SNOW_{year_val}", lit(phx_dw_compute(snow_year)))
```
Same as above.
@bradmiro I'm fine to merge this to a branch for now, wdyt?
LGTM for merge into staging upstream branch - rebase to happen before merging into upstream main.
* Kaiyang expansion project 2022 (#8224)
* chenged the dag to load ghcn dataset
* data preprocessing done
* modified preprocessing
* dataproc file added
* code runs great
* modifyed code based on Brad, still buggy
* finished modifying, haven't sync wit hDAG
* finished modifying DAG codes
* ready for draft PR
* pass lint
* addressed Brad and Leah's comments
* pass nox lint
* pass nox lint
* Fix: Retry CLI launch if needed (#8221)
* Fix: add region tags
* Fix: region tag typos
* Fix: urlpatterns moved to end
* Fix: typo
* Fix: cli retries to fix flakiness
* Fix: remove duplicate tags
* Fix: use backoff for retries
* Fix: lint import order error
* address Leah's comments about typo and comments
  Co-authored-by: Charles Engelke <engelke@google.com>
* run blacken on dag and dataproc code
* WIP: not working test for process job
* working test for expansion dataproc script
* move dataproc expansion files to separate directory
* add readme
* update readme
* run black
* ignore data file
* fix import order
* ignore one line of lint because it's being silly
* add check for Notfound for test
* add requirements files
* add noxfile config
* update try/except
* experiment - fully qualify path
* update filepath
* update path
* try different path
* remove the directory that was causing test problems
* fix typo in header checker
* tell folks to skip cleanup of prereq
* clean up hyperlinks for distance weighting and arithmetic mean
* fix math links again
* remove debug statements
* remove commented out variables
* Update composer/2022_airflow_summit/data_analytics_dag_expansion_test.py
  Co-authored-by: Dan Lee <71398022+dandhlee@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Dan Lee <71398022+dandhlee@users.noreply.github.com>
* Apply suggestions from code review
* update apache-beam version (#8302)
  Bumping the `apache-beam[gcp]` version to (indirectly) bump the `google-cloud-pubsub` version to accept the keyword argument `request` on `create_topic()`
* dataflow: replace job name underscores with hyphens (#8303)
  It looks like Dataflow no longer accepts underscores in the job names. Replacing them with hyphens should work.
  * fix test checks
  * improve error reporting
  * fix test name for exception handling
* chore(deps): update dependency datalab to v1.2.1 (#8309)
* fix: unsanitized output (#8316)
  * fix: unsanitized output
  * fix: add license to template
* chore(deps): update dependency cryptography to v38 (#8317)
  * lint
  Co-authored-by: Anthonios Partheniou <partheniou@google.com>
* Remove region tags to be consistent with other languages (#8322)
* fix lint in conftest (#8324)
* Pin perl version to 5.34.0 as latest doesn't work with the example. (#8319)
  Co-authored-by: Leah E. Cole <6719667+leahecole@users.noreply.github.com>
* refactor fixtures
* revert last change
* revert last change
* chore(deps): update dependency tensorflow to v2.7.2 [security] (#8329)
* remove backoff, add manual retry (#8328)
  * fix lint
  * remove unused import
  Co-authored-by: Anthonios Partheniou <partheniou@google.com>
* refactor test to match #8328
* update most write methods, fix test issue with comparing to exception
* Bmiro kaiyang edit (#8350)
  * modified code to more closely adhere to Spark best practices
  * remove unnecessary import
  * improved explanation of Inverse Distance Weighting
  * Apply suggestions from code review
  Co-authored-by: Leah E. Cole <6719667+leahecole@users.noreply.github.com>
* run black on process files
* fix relative import issue
* fixed jvm error (#8360)
* Add UDF type hinting (#8361)
  * fixed jvm error
  * add type hinting to UDF
* Update composer/2022_airflow_summit/data_analytics_process_expansion.py
* fix comment alignment
* change dataproc region to northamerica-northeast1
* refactor import
* switch other test to also use northamerica-northeast1

Co-authored-by: kaiyang-code <57576013+kaiyang-code@users.noreply.github.com>
Co-authored-by: Charles Engelke <engelke@google.com>
Co-authored-by: Maciej Strzelczyk <strzelczyk@google.com>
Co-authored-by: Dan Lee <71398022+dandhlee@users.noreply.github.com>
Co-authored-by: David Cavazos <dcavazos@google.com>
Co-authored-by: WhiteSource Renovate <bot@renovateapp.com>
Co-authored-by: Anthonios Partheniou <partheniou@google.com>
Co-authored-by: Averi Kitsch <akitsch@google.com>
Co-authored-by: mhenc <mhenc@google.com>
Co-authored-by: Brad Miro <bmiro@google.com>
Reviewers: leahecole, bradmiro, rachael-ds, rafalbiegacz
Description
@leahecole @bradmiro This is the draft PR for Kaiyang Yu's expansion project. The DAG script is an expansion of data_analytics_dag.py and the PySpark code is an expansion of data_analytics_process.py. The entire workflow tries to answer two questions: "How have rainfall and snowfall patterns in the western US changed over the past 25 years?" and "How have rainfall and snowfall patterns in Phoenix changed over the past 25 years?". The instructions are specifically designed for @leahecole and @bradmiro, as others may not have access to the resources.
How to run
The expansion project can be run in the following way:
1. Download the file from the `workshop_example_bucket` GCS bucket and upload it to your desired GCS bucket. This is the dataset after pre-processing.
2. Upload `data_analytics_process_expansion.py` to the same GCS bucket as the last step.
3. Create a Composer environment and set the following Airflow variables:
   - `dataproc_service_account`: #######-compute@developer.gserviceaccount.com
   - `gce_region`: us-central1
   - `gcp_project`: <your_gcp_project>
   - `gcs_bucket`: the bucket you created in step one
4. Upload `data_analytics_dag_expansion.py` to the Composer environment you just created and trigger the DAG (see the sketch after this list for how the DAG can read these variables).

Alternatively, you can also directly run it with the same environment that I'm using:
- `expansion_project`
- `data_analytics_dag`
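For reference, a minimal sketch of how a DAG can consume the Airflow variables listed above. The variable names match the list; the `models.Variable.get` pattern is an assumption about this DAG, not confirmed from the diff.

```python
from airflow import models

# Hedged sketch: read the variables set in the steps above from Airflow's
# Variable store (assumed mechanism, not taken from the actual DAG file).
PROJECT_NAME = models.Variable.get("gcp_project")
BUCKET_NAME = models.Variable.get("gcs_bucket")
GCE_REGION = models.Variable.get("gce_region")
DATAPROC_SERVICE_ACCOUNT = models.Variable.get("dataproc_service_account")
```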
If you just want to run the PySpark code:
1. In `data_analytics_process_expansion.py`, comment out lines 32 to 38 (inclusive) and uncomment lines 23 to 29 and 40 to 46 (inclusive).
2. Upload `data_analytics_process_expansion.py` to a GCS bucket.
3. Run `gcloud dataproc jobs submit pyspark gs://path_to_your_file_from_last_step --cluster=your_cluster --region=us-central1 --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar` using Cloud Shell.

Alternatively, you can run the PySpark code using my cluster:
1. In `data_analytics_process_expansion.py`, comment out lines 32 to 38 (inclusive) and uncomment lines 23 to 29 and 40 to 46 (inclusive).
2. Upload `data_analytics_process_expansion.py` to the `workshop_example_bucket` GCS bucket, overwriting the original file.
3. Run `gcloud dataproc jobs submit pyspark gs://workshop_example_bucket/data_analytics_process_expansion.py --cluster=cluster-d630 --region=us-central1 --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar`.
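Since both submit commands pass the spark-bigquery connector via `--jars`, here is a hedged sketch of how the PySpark script can read its input table through that connector. The table name follows the constants in this description; the exact calls in the sample are an assumption.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read-sketch").getOrCreate()

# Read the pre-processed input table through the spark-bigquery connector,
# which the --jars flag in the submit commands above makes available.
df = (
    spark.read.format("bigquery")
    .option("table", "expansion_project.ghcnd_stations_joined")
    .load()
)
df.printSchema()
```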
View results
In the `expansion_project` dataset:
- `ghcnd_stations_joined`: merged dataset after the BigQuery query job.
- `ghcnd_stations_normalization`: dataset after row filtering and unit normalization.
- `ghcnd_stations_prcp_mean`: arithmetic mean of annual precipitation in the western US over the past 25 years.
- `ghcnd_stations_snow_mean`: arithmetic mean of annual snowfall in the western US over the past 25 years.
- `phx_annual_prcp`: annual precipitation in Phoenix over the past 25 years (result of the distance weighting algorithm; see the sketch after this list).
- `phx_annual_snow`: annual snowfall in Phoenix over the past 25 years (result of the distance weighting algorithm).
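The distance-weighted results above come from the sample's `phx_dw_compute` helper. As a point of reference only, here is a hedged sketch of classic inverse distance weighting; the station coordinates and values are made up, and the sample's actual implementation may differ.

```python
import math

def inverse_distance_weighted(stations, target):
    """Estimate a value at `target` from (lat, lon, value) station tuples.

    Classic inverse distance weighting: each station contributes its value
    weighted by 1 / distance**2, so nearer stations dominate the estimate.
    """
    num, den = 0.0, 0.0
    for lat, lon, value in stations:
        d = math.hypot(lat - target[0], lon - target[1])
        if d == 0:
            return value  # target coincides with a station
        w = 1.0 / d**2
        num += w * value
        den += w
    return num / den

# Hypothetical stations around Phoenix: (lat, lon, annual precipitation).
print(inverse_distance_weighted(
    [(33.6, -112.0, 210.0), (33.2, -111.8, 190.0), (33.5, -112.3, 205.0)],
    (33.45, -112.07),
))
```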
Note: remember to clean up the `ghcnd-stations-joined` dataset between runs, since the DAG code features a `WRITE_APPEND` write disposition and the dataset will double in size every time the DAG runs. You don't have to worry about it if you are only running the PySpark program. However, be sure that the `ghcnd-stations-joined` dataset exists in BQ if you're only running the PySpark code.
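For context on that note, a minimal sketch of what a `WRITE_APPEND` write disposition looks like with the plain BigQuery client. This only illustrates the behavior described above; it is not the DAG's actual operator call, and the project id and query are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# WRITE_APPEND adds the query results to any existing rows on every run,
# which is why the joined table doubles in size each time the DAG executes.
job_config = bigquery.QueryJobConfig(
    destination="your-project.expansion_project.ghcnd_stations_joined",  # hypothetical
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
query = "SELECT * FROM `bigquery-public-data.ghcn_d.ghcnd_stations`"  # placeholder
client.query(query, job_config=job_config).result()
```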
Next step
- Add more comments to `data_analytics_process_expansion.py`.
- Remove the `print()` functions, as they are here only for debugging purposes.

Checklist
- Tests pass: `nox -s py-3.9` (see Test Environment Setup)
- Lint pass: `nox -s lint` (see Test Environment Setup)