# Run template

[`main.py`](main.py) - Script to run an [Apache Beam] template on [Google Cloud Dataflow].

The following examples show how to run the [`Word_Count` template], but you can run any other template.

The `Word_Count` template requires an `output` Cloud Storage path prefix, and optionally accepts an `inputFile` Cloud Storage file pattern for the inputs.
If `inputFile` is not passed, it defaults to `gs://apache-beam-samples/shakespeare/kinglear.txt`.

## Before you begin

1. Install the [Cloud SDK].

1. [Create a new project].

1. [Enable billing].

1. [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=dataflow,compute_component,logging,storage_component,storage_api,bigquery,pubsub,datastore.googleapis.com,cloudfunctions.googleapis.com,cloudresourcemanager.googleapis.com): Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, Cloud Storage JSON, BigQuery, Pub/Sub, Datastore, Cloud Functions, and Cloud Resource Manager.

1. Set up the Cloud SDK for your GCP project.

   ```bash
   gcloud init
   ```

1. Create a Cloud Storage bucket.

   ```bash
   gsutil mb gs://your-gcs-bucket
   ```

## Setup

The following instructions will help you prepare your development environment.

1. [Install Python and virtualenv].

1. Clone the `python-docs-samples` repository.

   ```bash
   git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
   ```

1. Navigate to the sample code directory.

   ```bash
   cd python-docs-samples/dataflow/run_template
   ```

1. Create a virtual environment and activate it.

   ```bash
   virtualenv env
   source env/bin/activate
   ```

   > Once you are done, you can deactivate the virtualenv and go back to your global Python environment by running `deactivate`.

1. Install the sample requirements.

   ```bash
   pip install -U -r requirements.txt
   ```

## Running locally

To run a Dataflow template from the command line:

> NOTE: To run locally, you'll need to [create a service account key] as a JSON file.
> Then export an environment variable called `GOOGLE_APPLICATION_CREDENTIALS` pointing to your service account key file.
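For example, assuming you saved the key as `credentials.json` in the current directory (your path will differ):

```bash
# Point Application Default Credentials at the service account key file.
export GOOGLE_APPLICATION_CREDENTIALS="$PWD/credentials.json"
```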

```bash
python main.py \
    --project <your-gcp-project> \
    --job wordcount-$(date +'%Y%m%d-%H%M%S') \
    --template gs://dataflow-templates/latest/Word_Count \
    --inputFile gs://apache-beam-samples/shakespeare/kinglear.txt \
    --output gs://<your-gcs-bucket>/wordcount/outputs
```

## Running in Python

To run a Dataflow template from Python:

> NOTE: To run locally, you'll need to [create a service account key] as a JSON file.
> Then export an environment variable called `GOOGLE_APPLICATION_CREDENTIALS` pointing to your service account key file.

```py
import main as run_template

run_template.run(
    project='your-gcp-project',
    job='unique-job-name',
    template='gs://dataflow-templates/latest/Word_Count',
    parameters={
        'inputFile': 'gs://apache-beam-samples/shakespeare/kinglear.txt',
        'output': 'gs://<your-gcs-bucket>/wordcount/outputs',
    }
)
```

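Under the hood, `main.py` presumably launches the job through the Dataflow `projects.templates.launch` REST method. The sketch below only shows the shape of the request body that method expects; `build_launch_body` is a hypothetical helper for illustration, not part of `main.py`:

```py
def build_launch_body(job, parameters=None):
    """Build the request body for Dataflow's projects.templates.launch."""
    return {
        'jobName': job,
        'parameters': parameters or {},
    }

# The project ID and the template's gcsPath are passed alongside this
# body when the launch method is actually called.
body = build_launch_body(
    job='unique-job-name',
    parameters={'output': 'gs://<your-gcs-bucket>/wordcount/outputs'},
)
```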
## Running in Cloud Functions

To deploy this as a Cloud Function and run a Dataflow template via an HTTP request:

```bash
PROJECT=$(gcloud config get-value project)
REGION=$(gcloud config get-value functions/region)

# Deploy the Cloud Function.
gcloud functions deploy run_template \
    --runtime python37 \
    --trigger-http \
    --region $REGION

# Call the Cloud Function via an HTTP request.
curl -X POST "https://$REGION-$PROJECT.cloudfunctions.net/run_template" \
  -d project=$PROJECT \
  -d job=wordcount-$(date +'%Y%m%d-%H%M%S') \
  -d template=gs://dataflow-templates/latest/Word_Count \
  -d inputFile=gs://apache-beam-samples/shakespeare/kinglear.txt \
  -d output=gs://<your-gcs-bucket>/wordcount/outputs
```

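The `-d` fields above map onto the launch call: `project`, `job`, and `template` identify the request, while every remaining field is forwarded as a template parameter. A hypothetical sketch of that split (for illustration only, not the actual handler code in `main.py`):

```py
def split_form_fields(form):
    """Separate the fixed launch arguments from the template parameters."""
    launch_args = {k: form[k] for k in ('project', 'job', 'template')}
    parameters = {k: v for k, v in form.items() if k not in launch_args}
    return launch_args, parameters

# Mirror the POST fields sent by the curl command above.
launch_args, parameters = split_form_fields({
    'project': 'your-gcp-project',
    'job': 'wordcount-20250101-000000',
    'template': 'gs://dataflow-templates/latest/Word_Count',
    'inputFile': 'gs://apache-beam-samples/shakespeare/kinglear.txt',
    'output': 'gs://<your-gcs-bucket>/wordcount/outputs',
})
```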
[Apache Beam]: https://beam.apache.org/
[Google Cloud Dataflow]: https://cloud.google.com/dataflow/docs/
[`Word_Count` template]: https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/java/com/google/cloud/teleport/templates/WordCount.java

[Cloud SDK]: https://cloud.google.com/sdk/docs/
[Create a new project]: https://console.cloud.google.com/projectcreate
[Enable billing]: https://cloud.google.com/billing/docs/how-to/modify-project
[Create a service account key]: https://console.cloud.google.com/apis/credentials/serviceaccountkey
[Creating and managing service accounts]: https://cloud.google.com/iam/docs/creating-managing-service-accounts
[GCP Console IAM page]: https://console.cloud.google.com/iam-admin/iam
[Granting roles to service accounts]: https://cloud.google.com/iam/docs/granting-roles-to-service-accounts

[Install Python and virtualenv]: https://cloud.google.com/python/setup