initial commit · renxiaoming/python-docs-samples@302a695 · GitHub

Commit 302a695
initial commit
1 parent c4a0e6d commit 302a695

9 files changed: +1218 -0 lines changed
tables/automl/pipeline/README.md

Lines changed: 108 additions & 0 deletions
# AutoML Tables Pipeline
- Launch training and prediction jobs for AutoML Tables with a single command.
- Define your pipelines with YAML configuration files for easy reuse.
- Log parameters, operations, and results.

## Before you begin
Install the most recent Google Cloud packages and the additional requirements.
```
pip install --upgrade google-cloud
pip install --upgrade google-cloud-automl
pip install -r requirements.txt
```
Set up service account authentication with an environment variable, or at run
time with the `--service_account_filename` arg.
```
export GOOGLE_APPLICATION_CREDENTIALS=path/to/json_key
```

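To confirm that authentication is picked up before launching jobs, a minimal
sanity check against the AutoML API might look like the following (a sketch
only, not part of the pipeline; "my_project" and the location are placeholders):
```
from google.cloud import automl_v1beta1

# Uses GOOGLE_APPLICATION_CREDENTIALS from the environment.
client = automl_v1beta1.AutoMlClient()
parent = client.location_path('my_project', 'us-central1')

# Listing datasets fails fast if credentials or permissions are wrong.
for dataset in client.list_datasets(parent):
    print(dataset.display_name)
```
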
## Defining a pipeline
YAML configuration files are used to manage an inventory of previously trained
models, or to act as a template for repeated jobs with shared parameters. As a
minimal example, consider a config file "my_config.yaml" with parameters:
```
dataset_display_name: my_dataset
dataset_input_path: bq://project.dataset.table
model_display_name: my_model
label_column: my_label
```
See the provided "example.yaml" and Configuration section below for more
details.

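The merge of config-file values and command-line values happens in
tables_config.TablesConfig, which is not shown in this excerpt; a minimal
sketch of the precedence described here (the load_params helper is
hypothetical), assuming PyYAML, might look like:
```
import yaml

def load_params(config_filename, cli_params):
    """Sketch: start from the YAML values, let command-line values win."""
    with open(config_filename) as f:
        params = yaml.safe_load(f) or {}
    params.update({k: v for k, v in cli_params.items() if v is not None})
    return params

# Reuse the config but override the model name from the command line.
params = load_params('config/my_config.yaml', {'model_display_name': 'my_model_v2'})
```
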
## Running a pipeline
Parameters can also be provided on the command line, and take priority over
parameters in the config. Together they support a number of usage patterns;
the two most common repeated jobs are shown below.

#### Training job
Import a new dataset "my_dataset", then train a new model "my_model" with
`--build_dataset` and `--build_model`:
```
python run_pipeline.py \
  --project=my_project \
  --config_filename=config/my_config.yaml \
  --build_dataset \
  --build_model
```

#### Batch prediction job
Load "my_dataset" and "my_model" (default behavior), then make a batch
prediction with `--make_prediction`:
```
python run_pipeline.py \
  --project=my_project \
  --config_filename=config/my_config.yaml \
  --predict_input_path=bq://project.dataset.table \
  --predict_output_path=bq://project \
  --make_prediction
```

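Under the hood, a batch prediction like this comes down to a single
long-running call on the AutoML v1beta1 prediction client. The pipeline's own
wrapper (tables_client) is not shown in this excerpt; a rough sketch with a
placeholder project and model ID:
```
from google.cloud import automl_v1beta1

prediction_client = automl_v1beta1.PredictionServiceClient()

# Placeholder project and model ID, for illustration only.
model_name = prediction_client.model_path(
    'my_project', 'us-central1', 'TBL0000000000000000000')
input_config = {'bigquery_source': {'input_uri': 'bq://project.dataset.table'}}
output_config = {'bigquery_destination': {'output_uri': 'bq://project'}}

# batch_predict returns a long-running operation; result() blocks until done.
response = prediction_client.batch_predict(model_name, input_config, output_config)
response.result()
```
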
## Project Structure
```
.
├── run_pipeline     # Script to run the Tables pipeline from the command line.
├── tables_config    # TablesConfig reads parameters from YAML and command line.
├── tables_client    # TablesClient adds helper functions for the AutoML client.
├── tables_pipeline  # TablesPipeline queues/executes operations with logging.
├── config/          # Directory to read YAML parameter config files from.
└── log/             # Directory to write logging files to.
```

## Configuration
YAML configuration files may be created in the config/ directory; an
example.yaml is provided as a basis, and detailed descriptions of all
parameters are given below (X denotes required).

| Parameter              | Default        | Type   | Comments                                                                        |
|------------------------|----------------|--------|---------------------------------------------------------------------------------|
| project                | X              | String | Recommend setting through command line.                                         |
| location               | us-central1    | String | Location of compute resources.                                                  |
| build_dataset          | false          | Bool   | true builds a new dataset, false loads an old one.                              |
| build_model            | false          | Bool   | true builds a new model, false loads an old one.                                |
| make_prediction        | false          | Bool   | Make a batch prediction after loads/builds.                                     |
| dataset_display_name   | X              | String | A unique and informative < 32 char name.                                        |
| dataset_input_path     | X              | String | bq://project.dataset.table or gs://path/to/train/data                           |
| label_column           | X              | String | Label dtype determines if regression or classification.                         |
| weight_column          | null           | String | Weights loss and evaluation metrics.                                            |
| split_column           | null           | String | Manually split data; time column is preferred.                                  |
| time_column            | null           | String | TIMESTAMP type column; data is automatically split on it.                       |
| columns_nullable       | null           | Dict   | Only modify columns detected differently than intended (display name to bool).  |
| columns_dtype          | null           | Dict   | Only modify columns detected differently than intended (display name to str).   |
| model_display_name     | X              | String | A unique and informative < 32 char name.                                        |
| train_hours            | 1.0            | Float  | Maximum time for training, must be >= 1.                                        |
| optimization_objective | null           | String | Recommend using defaults.                                                       |
| ignore_columns         | null           | List   | Columns (other than label/split/weight) to exclude in training.                 |
| predict_input_path     | X (if predict) | String | bq://project.dataset.table or gs://path/to/predict/data                         |
| predict_output_path    | X (if predict) | String | bq://project or gs://path/to/basedir (dataset.table or subdir generated).       |

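A sketch of how the required and conditionally required parameters above could
be checked before any operations run (illustrative only; the actual validation
presumably lives in tables_config, which is not shown here):
```
REQUIRED = ['project', 'dataset_display_name', 'dataset_input_path',
            'label_column', 'model_display_name']
REQUIRED_IF_PREDICT = ['predict_input_path', 'predict_output_path']

def check_required(params):
    """Raises if any required (or predict-required) parameter is missing."""
    missing = [key for key in REQUIRED if not params.get(key)]
    if params.get('make_prediction'):
        missing += [key for key in REQUIRED_IF_PREDICT if not params.get(key)]
    if missing:
        raise ValueError('Missing required parameters: {}'.format(missing))
```
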
## Logging
Log files are written to the log/ directory by default, but the directory can
be set explicitly with the `--log_dir` arg. Log levels can be set with the
`--console_log_level` and `--file_log_level` args; a sketch of this setup
follows the list below.

- Parameters are logged (in YAML format) at run time for reproducibility.
- Evaluation metrics and feature importance are logged during model load/build.
- The full output path is logged during prediction.
- Operation names are logged at INFO level, full responses at DEBUG.
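
A sketch of the console/file split described above, using the standard library
logging module (the logger name and log filename are illustrative, not taken
from the pipeline code):
```
import logging

logger = logging.getLogger('tables_pipeline')
logger.setLevel(logging.DEBUG)  # Handlers filter from here.

console_handler = logging.StreamHandler()
console_handler.setLevel(logging.WARN)  # --console_log_level default

file_handler = logging.FileHandler('log/session.log')
file_handler.setLevel(logging.INFO)  # --file_log_level; DEBUG adds full responses

formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
for handler in (console_handler, file_handler):
    handler.setFormatter(formatter)
    logger.addHandler(handler)
```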

tables/automl/pipeline/__init__.py

Whitespace-only changes.
tables/automl/pipeline/config/example.yaml

Lines changed: 33 additions & 0 deletions
# This YAML config is an example containing all possible parameters.
# Parameters can be set in this config, or through the command line.

# Required parameters.
project: my_project
dataset_display_name: my_dataset
dataset_input_path: bq://project.dataset.table
label_column: my_label_column
model_display_name: my_model

# Optional parameters.
build_dataset: true
build_model: true
make_prediction: true
location: us-central1
weight_column: my_weight_column
split_column: my_split_column
time_column: my_time_column
columns_nullable:
  my_non_nullable_column: false
  my_nullable_column: true
columns_dtype:
  my_categorical_column: CATEGORY
  my_numerical_column: FLOAT64
train_hours: 1.5
optimization_objective: MINIMIZE_RMSE
ignore_columns:
  - my_id_column_1
  - my_id_column_2

# Required if make_prediction is true.
predict_input_path: bq://project.dataset.table
predict_output_path: bq://project
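
As a quick check that the example parses the way the nested parameters suggest
(a sketch; the path assumes the config/ layout described in the README):
```
import yaml

with open('config/example.yaml') as f:
    config = yaml.safe_load(f)

print(config['columns_dtype'])   # {'my_categorical_column': 'CATEGORY', ...}
print(config['ignore_columns'])  # ['my_id_column_1', 'my_id_column_2']
```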

tables/automl/pipeline/log/.gitkeep

Whitespace-only changes.
tables/automl/pipeline/requirements.txt

Lines changed: 3 additions & 0 deletions
PyYAML>=5.1
futures>=3.1.0
DateTime>=4.3
tables/automl/pipeline/run_pipeline.py

Lines changed: 178 additions & 0 deletions
# Copyright 2019 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import sys
import logging
import argparse

import tables_config
import tables_client
import tables_pipeline


def parse_arguments(argv):
    """Parses command line arguments."""

    parser = argparse.ArgumentParser(description='Args for Tables pipeline.')
    parser.add_argument(
        '--config_filename',
        required=True,
        type=str,
        help='The filepath for the YAML configuration file.')
    parser.add_argument(
        '--log_dir',
        required=False,
        type=str,
        help='The directory to generate a session log file in.')
    parser.add_argument(
        '--service_account_filename',
        required=False,
        type=str,
        help='The filepath for the JSON key used for OAuth.')
    parser.add_argument(
        '--console_log_level',
        required=False,
        type=str,
        default=logging.WARN,
        help='Controls the log level for the console display.')
    parser.add_argument(
        '--file_log_level',
        required=False,
        type=str,
        default=logging.INFO,
        help='Controls the log level to write to file. Set to logging.DEBUG '
             'to write out the full AutoML service responses (very verbose).')
    args, _ = parser.parse_known_args(args=argv[1:])

    # Parser for parameters to pass to TablesConfig
    param_parser = argparse.ArgumentParser(description='Args for config params.')

    # Resource parameters
    param_parser.add_argument(
        '--project',
        required=False,
        type=str,
        help='GCP project ID to run AutoML Tables on.')
    param_parser.add_argument(
        '--location',
        required=False,
        default='us-central1',
        type=str,
        help='GCP location to run AutoML Tables in.')

    # Runtime parameters
    param_parser.add_argument(
        '--build_dataset',
        action='store_const',
        const=True,
        help='Builds a new dataset, loads an old dataset otherwise.')
    param_parser.add_argument(
        '--build_model',
        action='store_const',
        const=True,
        help='Builds a new model, loads an old model otherwise.')
    param_parser.add_argument(
        '--make_prediction',
        action='store_const',
        const=True,
        help='Makes a batch prediction.')

    # Dataset parameters
    # Note that columns_dtype and columns_nullable must be set in YAML config.
    param_parser.add_argument(
        '--dataset_display_name',
        required=False,
        type=str,
        help='Name of the Tables Dataset (32 character max).')
    param_parser.add_argument(
        '--dataset_input_path',
        required=False,
        type=str,
        help=('Path to import the training data from, one of '
              'bq://project.dataset.table or gs://path/to/csv'))
    param_parser.add_argument(
        '--label_column',
        required=False,
        type=str,
        help='Label to train the model on, for regression or classification.')
    param_parser.add_argument(
        '--split_column',
        required=False,
        type=str,
        help='Explicitly defines "TRAIN"/"VALIDATION"/"TEST" split.')
    param_parser.add_argument(
        '--weight_column',
        required=False,
        type=str,
        help='Weights loss and metrics.')
    param_parser.add_argument(
        '--time_column',
        required=False,
        type=str,
        help='Date/timestamp to automatically split data on.')

    # Model parameters
    # Note that ignore_columns must be set in YAML config.
    param_parser.add_argument(
        '--model_display_name',
        required=False,
        type=str,
        help='Name of the Tables Model (32 character max).')
    param_parser.add_argument(
        '--train_hours',
        required=False,
        type=float,
        help='The number of hours to train the model for.')
    param_parser.add_argument(
        '--optimization_objective',
        required=False,
        type=str,
        help='Metric to optimize for in training.')

    # Predict parameters
    param_parser.add_argument(
        '--predict_input_path',
        required=False,
        type=str,
        help=('Path to import the batch prediction data from, one of '
              'bq://project.dataset.table or gs://path/to/csv'))
    param_parser.add_argument(
        '--predict_output_path',
        required=False,
        type=str,
        help=('Path to export batch predictions to, one of '
              'bq://project or gs://path'))
    params, _ = param_parser.parse_known_args(args=argv[1:])
    return args, params


def main():
    args, params = parse_arguments(sys.argv)
    config = tables_config.TablesConfig(args.config_filename, vars(params))
    client = tables_client.TablesClient(args.service_account_filename)
    pipeline = tables_pipeline.TablesPipeline(
        tables_config=config,
        tables_client=client,
        log_dir=args.log_dir,
        console_log_level=args.console_log_level,
        file_log_level=args.file_log_level)
    pipeline.run()


if __name__ == '__main__':
    main()
