Add bigquery_kms_key Dataflow sample (#2402) · rmardiko/python-docs-samples@db332fd · GitHub


Commit db332fd

Add bigquery_kms_key Dataflow sample (GoogleCloudPlatform#2402)

* Add bigquery_kms_key Dataflow sample
* Clarified description on service accounts

1 parent 851525c commit db332fd

File tree: 4 files changed, +381 −0 lines changed


dataflow/README.md

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
# Getting started with Google Cloud Dataflow

[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/editor)

[Apache Beam](https://beam.apache.org/)
is an open source, unified model for defining both batch and streaming data-parallel processing pipelines.
This guide walks you through the steps needed to run an Apache Beam pipeline on the
[Google Cloud Dataflow](https://cloud.google.com/dataflow) runner.
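If you are new to Beam, the snippet below is a minimal, hypothetical sketch of what a pipeline looks like (it assumes the `apache-beam` package is installed); with no runner specified it executes locally on the `DirectRunner` rather than on Dataflow.

```py
import apache_beam as beam

# A tiny batch pipeline: create a few elements, count occurrences per value,
# and print the results. Without a --runner option it uses the DirectRunner.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Create words' >> beam.Create(['hello', 'world', 'hello'])
        | 'Count per word' >> beam.combiners.Count.PerElement()
        | 'Print counts' >> beam.Map(print)
    )
```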
## Setting up your Google Cloud project

The following instructions help you prepare your Google Cloud project.

1. Install the [Cloud SDK](https://cloud.google.com/sdk/docs/).
   > *Note:* This is not required in
   > [Cloud Shell](https://console.cloud.google.com/cloudshell/editor)
   > since it already has the Cloud SDK pre-installed.

1. Create a new Google Cloud project via the
   [*New Project* page](https://console.cloud.google.com/projectcreate),
   or via the `gcloud` command line tool.

   ```sh
   export PROJECT=your-google-cloud-project-id
   gcloud projects create $PROJECT
   ```

1. Set up the Cloud SDK for your GCP project.

   ```sh
   gcloud init
   ```

1. [Enable billing](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=dataflow,compute_component,storage_component,storage_api,logging,cloudresourcemanager.googleapis.com,iam.googleapis.com):
   Dataflow, Compute Engine, Cloud Storage, Cloud Storage JSON,
   Stackdriver Logging, Cloud Resource Manager, and IAM API.

1. Create a service account JSON key via the
   [*Create service account key* page](https://console.cloud.google.com/apis/credentials/serviceaccountkey),
   or via the `gcloud` command line tool.
   Here is how to do it through the *Create service account key* page.

   * From the **Service account** list, select **New service account**.
   * In the **Service account name** field, enter a name.
   * From the **Role** list, select **Project > Owner** **(*)**.
   * Click **Create**. A JSON file that contains your key is downloaded to your computer.

   Alternatively, you can use `gcloud` through the command line.

   ```sh
   export PROJECT=$(gcloud config get-value project)
   export SA_NAME=samples
   export IAM_ACCOUNT=$SA_NAME@$PROJECT.iam.gserviceaccount.com

   # Create the service account.
   gcloud iam service-accounts create $SA_NAME --display-name $SA_NAME

   # Set the role to Project Owner (*).
   gcloud projects add-iam-policy-binding $PROJECT \
     --member serviceAccount:$IAM_ACCOUNT \
     --role roles/owner

   # Create a JSON file with the service account credentials.
   gcloud iam service-accounts keys create path/to/your/credentials.json \
     --iam-account=$IAM_ACCOUNT
   ```

   > **(*)** *Note:* The **Role** field authorizes your service account to access resources.
   > You can view and change this field later by using the
   > [GCP Console IAM page](https://console.cloud.google.com/iam-admin/iam).
   > If you are developing a production app, specify more granular permissions than **Project > Owner**.
   > For more information, see
   > [Granting roles to service accounts](https://cloud.google.com/iam/docs/granting-roles-to-service-accounts).

   For more information, see
   [Creating and managing service accounts](https://cloud.google.com/iam/docs/creating-managing-service-accounts).

1. Set your `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point to your service account key file.

   ```sh
   export GOOGLE_APPLICATION_CREDENTIALS=path/to/your/credentials.json
   ```
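As a quick, optional sanity check (a hypothetical snippet, assuming the `google-auth` package is available, which `apache-beam[gcp]` installs), you can confirm that Application Default Credentials resolve to this key file:

```py
import google.auth

# Loads Application Default Credentials; with GOOGLE_APPLICATION_CREDENTIALS
# set, these come from the service account key file it points to.
credentials, project_id = google.auth.default()
print('Authenticated for project:', project_id)
```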
## Setting up a Python development environment

For instructions on how to install Python, virtualenv, and the Cloud SDK, see the
[Setting up a Python development environment](https://cloud.google.com/python/setup)
guide.

dataflow/encryption-keys/README.md

Lines changed: 202 additions & 0 deletions
@@ -0,0 +1,202 @@
# Using customer-managed encryption keys

[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/editor)

This sample demonstrates how to use
[customer-managed encryption keys](https://cloud.google.com/kms/)
for the I/O connectors in an
[Apache Beam](https://beam.apache.org) pipeline.
For more information, see the
[Using customer-managed encryption keys](https://cloud.google.com/dataflow/docs/guides/customer-managed-encryption-keys)
docs page.
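The connectors identify the key by its fully qualified Cloud KMS resource name. With the key ring and key created in the steps below, and a placeholder project ID, that name looks like this:

```py
# Fully qualified Cloud KMS key resource name used by the BigQuery connectors
# ('your-project-id' is a placeholder for your project ID).
kms_key = (
    'projects/your-project-id/locations/global'
    '/keyRings/samples-keyring/cryptoKeys/samples-key'
)
```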
## Before you begin

Follow the
[Getting started with Google Cloud Dataflow](../README.md)
page, and make sure you have a Google Cloud project with billing enabled
and a *service account JSON key* set up in your `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
Additionally, for this sample you need the following:

1. [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=bigquery,cloudkms.googleapis.com):
   BigQuery and Cloud KMS API.

1. Create a Cloud Storage bucket.

   ```sh
   export BUCKET=your-gcs-bucket
   gsutil mb gs://$BUCKET
   ```

1. [Create a symmetric key ring](https://cloud.google.com/kms/docs/creating-keys).
   For best results, use a [regional location](https://cloud.google.com/kms/docs/locations).
   This example uses a `global` key for simplicity.

   ```sh
   export KMS_KEYRING=samples-keyring
   export KMS_KEY=samples-key

   # Create a key ring.
   gcloud kms keyrings create $KMS_KEYRING --location global

   # Create a key.
   gcloud kms keys create $KMS_KEY --location global \
     --keyring $KMS_KEYRING --purpose encryption
   ```

   > *Note:* Although you can destroy the
   > [*key version material*](https://cloud.google.com/kms/docs/destroy-restore),
   > you [cannot delete keys and key rings](https://cloud.google.com/kms/docs/object-hierarchy#lifetime).
   > Key rings and keys do not have billable costs or quota limitations,
   > so their continued existence does not impact costs or production limits.

1. Grant Encrypter/Decrypter permissions to the *Dataflow*, *Compute Engine*, and *BigQuery*
   [service accounts](https://cloud.google.com/iam/docs/service-accounts).
   This grants your Dataflow, Compute Engine, and BigQuery service accounts the
   permission to encrypt and decrypt with the CMEK you specify.
   The Dataflow workers use these service accounts when running the pipeline,
   which is different from the *user* service account used to start the pipeline.

   ```sh
   export PROJECT=$(gcloud config get-value project)
   export PROJECT_NUMBER=$(gcloud projects list --filter $PROJECT --format "value(PROJECT_NUMBER)")

   # Grant Encrypter/Decrypter permissions to the Dataflow service account.
   gcloud projects add-iam-policy-binding $PROJECT \
     --member serviceAccount:service-$PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com \
     --role roles/cloudkms.cryptoKeyEncrypterDecrypter

   # Grant Encrypter/Decrypter permissions to the Compute Engine service account.
   gcloud projects add-iam-policy-binding $PROJECT \
     --member serviceAccount:service-$PROJECT_NUMBER@compute-system.iam.gserviceaccount.com \
     --role roles/cloudkms.cryptoKeyEncrypterDecrypter

   # Grant Encrypter/Decrypter permissions to the BigQuery service account.
   gcloud projects add-iam-policy-binding $PROJECT \
     --member serviceAccount:bq-$PROJECT_NUMBER@bigquery-encryption.iam.gserviceaccount.com \
     --role roles/cloudkms.cryptoKeyEncrypterDecrypter
   ```

1. Clone the `python-docs-samples` repository.

   ```sh
   git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
   ```

1. Navigate to the sample code directory.

   ```sh
   cd python-docs-samples/dataflow/encryption-keys
   ```

1. Create a virtual environment and activate it.

   ```sh
   virtualenv env
   source env/bin/activate
   ```

   > Once you are done, you can deactivate the virtualenv and go back to your global Python environment by running `deactivate`.

1. Install the sample requirements.

   ```sh
   pip install -U -r requirements.txt
   ```
## BigQuery KMS Key example

* [bigquery_kms_key.py](bigquery_kms_key.py)

The following sample gets some data from the
[NASA wildfires public BigQuery dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=nasa_wildfire&t=past_week&page=table)
using a customer-managed encryption key, and dumps that data into the specified `output_bigquery_table`
using the same customer-managed encryption key.

Make sure you have the following variables set up:

```sh
# Set the project ID, GCS bucket, and KMS key.
export PROJECT=$(gcloud config get-value project)
export BUCKET=your-gcs-bucket

# Set the region for the Dataflow job.
# https://cloud.google.com/compute/docs/regions-zones/
export REGION=us-central1

# Set the KMS key ID.
export KMS_KEYRING=samples-keyring
export KMS_KEY=samples-key
export KMS_KEY_ID=$(gcloud kms keys list --location global --keyring $KMS_KEYRING --filter $KMS_KEY --format "value(NAME)")

# Output BigQuery dataset and table name.
export DATASET=samples
export TABLE=dataflow_kms
```

Create the BigQuery dataset where the output table resides.

```sh
# Create the BigQuery dataset.
bq mk --dataset $PROJECT:$DATASET
```

Run the sample using the Dataflow runner.

```sh
python bigquery_kms_key.py \
  --output_bigquery_table $PROJECT:$DATASET.$TABLE \
  --kms_key $KMS_KEY_ID \
  --project $PROJECT \
  --runner DataflowRunner \
  --temp_location gs://$BUCKET/samples/dataflow/kms/tmp \
  --region $REGION
```

> *Note:* To run locally, you can omit the `--runner` command line argument; it then defaults to the `DirectRunner`.
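Equivalently (a hypothetical variant with placeholder values), since `bigquery_kms_key.py` guards its command-line entry point with `if __name__ == '__main__'`, you can import and call its `run()` function from Python:

```py
from bigquery_kms_key import run

# Placeholder values; mirror the command-line flags shown above.
run(
    output_bigquery_table='your-project:samples.dataflow_kms',
    kms_key=('projects/your-project/locations/global'
             '/keyRings/samples-keyring/cryptoKeys/samples-key'),
    beam_args=[
        '--project', 'your-project',
        '--runner', 'DataflowRunner',
        '--temp_location', 'gs://your-bucket/samples/dataflow/kms/tmp',
        '--region', 'us-central1',
    ],
)
```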
You can check your submitted Cloud Dataflow jobs in the
[GCP Console Dataflow page](https://console.cloud.google.com/dataflow) or by using `gcloud`.

```sh
gcloud dataflow jobs list
```

Finally, check the contents of the BigQuery table.

```sh
bq query --use_legacy_sql=false "SELECT * FROM \`$PROJECT.$DATASET.$TABLE\`"
```
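Optionally, you can also confirm that the output table is protected with your key. The snippet below is an illustrative sketch using the `google-cloud-bigquery` client library (the table ID is a placeholder):

```py
from google.cloud import bigquery

client = bigquery.Client()
# Placeholder table ID in 'project.dataset.table' form; substitute your values.
table = client.get_table('your-project.samples.dataflow_kms')

# For a CMEK-protected table, this prints the full Cloud KMS key resource name.
print(table.encryption_configuration.kms_key_name)
```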
## Cleanup

To avoid incurring charges to your GCP account for the resources used:

```sh
# Remove only the files created by this sample.
gsutil -m rm -rf "gs://$BUCKET/samples/dataflow/kms"

# [optional] Remove the Cloud Storage bucket.
gsutil rb gs://$BUCKET

# Remove the BigQuery table.
bq rm -f -t $PROJECT:$DATASET.$TABLE

# [optional] Remove the BigQuery dataset and all its tables.
bq rm -rf -d $PROJECT:$DATASET

# Revoke Encrypter/Decrypter permissions from the Dataflow service account.
gcloud projects remove-iam-policy-binding $PROJECT \
  --member serviceAccount:service-$PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com \
  --role roles/cloudkms.cryptoKeyEncrypterDecrypter

# Revoke Encrypter/Decrypter permissions from the Compute Engine service account.
gcloud projects remove-iam-policy-binding $PROJECT \
  --member serviceAccount:service-$PROJECT_NUMBER@compute-system.iam.gserviceaccount.com \
  --role roles/cloudkms.cryptoKeyEncrypterDecrypter

# Revoke Encrypter/Decrypter permissions from the BigQuery service account.
gcloud projects remove-iam-policy-binding $PROJECT \
  --member serviceAccount:bq-$PROJECT_NUMBER@bigquery-encryption.iam.gserviceaccount.com \
  --role roles/cloudkms.cryptoKeyEncrypterDecrypter
```
dataflow/encryption-keys/bigquery_kms_key.py

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
#!/usr/bin/env python
#
# Copyright 2019 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse


def run(output_bigquery_table, kms_key, beam_args):
    # [START dataflow_cmek]
    import apache_beam as beam

    # output_bigquery_table = '<project>:<dataset>.<table>'
    # kms_key = 'projects/<project>/locations/<kms-location>/keyRings/<kms-keyring>/cryptoKeys/<kms-key>'  # noqa
    # beam_args = [
    #     '--project', 'your-project-id',
    #     '--runner', 'DataflowRunner',
    #     '--temp_location', 'gs://your-bucket/samples/dataflow/kms/tmp',
    #     '--region', 'us-central1',
    # ]

    # Query from the NASA wildfires public dataset:
    # https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=nasa_wildfire&t=past_week&page=table
    query = """
        SELECT latitude,longitude,acq_date,acq_time,bright_ti4,confidence
        FROM `bigquery-public-data.nasa_wildfire.past_week`
        LIMIT 10
    """

    # Schema for the output BigQuery table.
    schema = {
        'fields': [
            {'name': 'latitude', 'type': 'FLOAT'},
            {'name': 'longitude', 'type': 'FLOAT'},
            {'name': 'acq_date', 'type': 'DATE'},
            {'name': 'acq_time', 'type': 'TIME'},
            {'name': 'bright_ti4', 'type': 'FLOAT'},
            {'name': 'confidence', 'type': 'STRING'},
        ],
    }

    options = beam.options.pipeline_options.PipelineOptions(beam_args)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | 'Read from BigQuery with KMS key' >>
                beam.io.Read(beam.io.BigQuerySource(
                    query=query,
                    use_standard_sql=True,
                    kms_key=kms_key,
                ))
            | 'Write to BigQuery with KMS key' >>
                beam.io.WriteToBigQuery(
                    output_bigquery_table,
                    schema=schema,
                    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                    kms_key=kms_key,
                )
        )
    # [END dataflow_cmek]


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--kms_key',
        required=True,
        help='Cloud Key Management Service key name',
    )
    parser.add_argument(
        '--output_bigquery_table',
        required=True,
        help="Output BigQuery table in the format 'PROJECT:DATASET.TABLE'",
    )
    args, beam_args = parser.parse_known_args()

    run(args.output_bigquery_table, args.kms_key, beam_args)
dataflow/encryption-keys/requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
apache-beam[gcp]

0 commit comments
