From Data To Insights Course Summary

The document discusses challenges faced by data analysts in processing large amounts of data and provides an overview of BigQuery as a solution. It outlines how BigQuery allows analysts to focus on insights rather than infrastructure management by providing a fully managed data warehouse that scales automatically and allows querying of massive datasets. The document also covers best practices for exploring datasets using SQL, cleaning and transforming data, optimizing queries for cost, ensuring data integrity, and accessing BigQuery via the web UI, CLI, or API.

Exploring and Preparing your Data with BigQuery

Challenges faced by data analysts

 Queries taking too long
 Difficult to combine and query all of the data collected
 Difficult to reconcile the data requirements with hardware capabilities
 Difficulty processing the data when clusters don't scale
 Too much data makes processing harder and drives up costs

Processing massive amounts of data at high speed is the objective

Why the cloud?

 Storage is cheap (the cost of 1 GB of storage has dropped dramatically)
 Focus on queries and not on the infrastructure
 Massive scalability

Big data infrastructure involves

 Hard drives for storing the data
 CPUs for processing the data
 Networking
 Staff to maintain the servers

Data analysts want to concentrate on creating insights, although the time dedicated to that is usually only 1/8 of the total big data process. Bigger
data requires more infrastructure. However, when using the cloud, analysts can concentrate on creating insights because the infrastructure and the
servers are externalized.

On-premises systems require data engineers to manage the hardware; when moving to the cloud, all of that is managed by the
platform itself.

It is difficult to balance capacity with on-premises systems; with cloud services the scalability problem is solved.

In on-premises systems, storage (hard drives) and processing units (CPUs) are coupled; in the cloud, storage and processing units are
decoupled, allowing better resource management and allocation.

BigQuery scales automatically and you only pay for what you use

Google Cloud platform project basics


 The interface is divided into projects, resources and billing.

Data analyst tasks


1. Ingest: Get data in / Challenges: Data volume, data variety, data velocity
Solution: BigQuery storage
2. Transform: Prepare, clean and transform the data / Challenges: Slow exploration, Slow processing, Unclear logic
Solution: BigQuery Analysis (SQL) / Cloud Dataprep
3. Store: Create, save and store datasets / Challenges: Storage cost, hard to scale, latency issues
Solution: Cloud Storage (buckets) / BigQuery Storage (tables)
4. Analyze: Derive insights from data / Challenges: Slow queries, Data volume, Siloed Data
Solution: BigQuery Analysis (SQL)
5. Visualize: Explore and present data insights / Challenges: Dataset size, tool latency
Solution: Google Data Studio / Third-party Tools (Tableau, Looker, Qlik)

BigQuery: Don’t manage the infrastructure, focus on finding insights

BigQuery in a nutshell:

1. Fully managed data warehouse.


2. Reliable
3. Economical, pay only for what you use
4. Secure
5. Auditable, every transaction can be traced
6. Scalable, parallel processing
7. Flexible
8. Easy to use with familiar SQL, no indexes
9. Public datasets available

Three ways to access BigQuery:

1. Web UI
2. Command-Line Interface (CLI): Cloud shell or the Google Cloud SDK
3. REST API: Programmatically run queries using languages like Java and Python over HTTP

BigQuery operates two services in one: the analytics engine (processing: the SQL engine, Dremel) and storage (hard drives: scalable data
storage).

In the past, MapReduce laid the foundation for Hadoop; later Google innovated with Dremel, a technology that processes
queries over smaller chunks of data massively in parallel. Paired with Colossus, the massive hard drive in the cloud, it formed
the basis of what became BigQuery and Google Cloud Storage.
Differences between data analysts, data scientists and data engineers

Data analyst: Derive data insights from queries and visualization


Background: Data analysis using SQL

Data scientist: Analyze data and model systems using statistics and machine learning.
Background: Statistical analysis using SQL, R, Python

Data engineer: Designs, builds and maintains data processing systems


Background: Computer engineering


Options for exploring a dataset are complementary


1. SQL + Web UI (Steps to explore data using SQL: 1.- Ask good questions 2.- Know the dataset 3.- Write good SQL)
2. Data preparation tools (Cloud Dataprep)
3. Visualization tools (Google Data Studio)

Enable standard SQL in your queries


 Add #standardSQL as a comment in the first line
 Show options -> disable legacy SQL
Add formatting

SQL Functions
String Manipulation Functions – FORMAT()

Aliases do not yet exist when filtering in the WHERE clause, so you cannot reference a SELECT alias there
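A minimal sketch of this, assuming a hypothetical `project.dataset.transactions` table: a SELECT alias is not visible to WHERE, so the expression must be repeated (or the query wrapped in a subquery).

#standardSQL
# Hypothetical table; shows that a SELECT alias cannot be referenced in WHERE
SELECT
  quantity * unit_price AS total_price   # alias defined here
FROM `project.dataset.transactions`
# WHERE total_price > 100                # would fail: alias not visible in WHERE
WHERE quantity * unit_price > 100        # repeat the expression instead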

Add new fields in SELECT to return more data

Avoid using * to return all columns


Selecting only the columns you need greatly increases query speed and helps with readability

Filter rows using the WHERE clause


Aggregation functions – SUM, COUNT, AVG, MAX

Embed functions (ROUND)

Create aggregation groupings with GROUP BY (Mixing non aggregated fields with functions that aggregate)
This is also useful to locate duplicate records with COUNT + GROUP BY

Don’t forget to use a GROUP BY if you are mixing aggregated and non-aggregated fields

In case of duplicate records, combine COUNT with GROUP BY (and HAVING) to locate them, as in the sketch below.

Filter aggregations with HAVING clause
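A hedged sketch of that pattern, against a hypothetical `project.dataset.orders` table: group on the candidate key and keep only the groups with more than one row.

#standardSQL
# Hypothetical table; locate duplicate order_id values
SELECT
  order_id,
  COUNT(*) AS num_records
FROM `project.dataset.orders`
GROUP BY order_id
HAVING num_records > 1      # HAVING filters on the aggregated value
ORDER BY num_records DESC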


Explore further by filtering on one case

If we just want 2015 data, a solution would involve parsing, converting and filtering for 2015

Parsing and converting dates can be hard


 Data type conversion functions – CAST()
 Date functions – PARSE_DATETIME()
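For example (a sketch assuming a hypothetical table whose `launch_date` column is stored as an 'MM/DD/YYYY' string and whose `revenue_string` column holds numbers as text):

#standardSQL
# Hypothetical string columns; parse the date and cast the numeric string
SELECT
  PARSE_DATETIME('%m/%d/%Y', launch_date) AS launch_datetime,
  EXTRACT(YEAR FROM PARSE_DATETIME('%m/%d/%Y', launch_date)) AS launch_year,
  CAST(revenue_string AS FLOAT64) AS revenue
FROM `project.dataset.filings`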

Data types
 Numeric data (Integer [INT64], Float)
 String data [STRING]
 Dates [DATE]
 Other (Boolean, Array, Struct)

NULL value
NULL is the absence of data and this is not the same as a blank or “”

Handle NULL values with extreme care


YYYY-MM-DD is the expected format for dates

Parsing string values with string functions

Searching for nonprofits with ‘help’ in their name

Modifying the condition (e.g., normalizing case) will fix the results, as sketched below
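A hedged sketch of the fix, using a hypothetical `project.dataset.nonprofits` table: a plain LIKE is case-sensitive, so normalizing the case first also catches 'HELP', 'Help', and so on.

#standardSQL
# Hypothetical table; the first condition misses differently-cased names
SELECT name
FROM `project.dataset.nonprofits`
WHERE name LIKE '%help%';

# Modified condition: normalize case before matching
SELECT name
FROM `project.dataset.nonprofits`
WHERE LOWER(name) LIKE '%help%';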


Use the Right Function for the Right Job

Pricing in BigQuery
The unit of work in BigQuery is called a Job
Each job:

 Is given a unique ID
 Can run concurrently
 Can perform tasks
 History is stored for 6 months

Job types
 Query – Charged by bytes processed
 Load data into a Table - Free
 Extract to GCS - Free
 Copy existing table – Free
Storing data in BigQuery is a separate cost

Three categories of BigQuery pricing

 Query – $5 per TB (First TB each month is free)


 Storage – $0.02 per GB, per month
 Streaming inserts - $0.05 per GB

The price calculator calculates prices considering queries and storage separately

Pricing model – reserved slots


Slots are the amount of total query throughput (many workers)

Slots can be reserved with separate pricing; they are not shared with other users

The more you use BigQuery, the more slots you get automatically
Quotas
Quotas are used to protect all BigQuery tenants
 50 concurrent queries
 Query timeout: 6 hours
 Max result size: 128 MB compressed

Custom quotas by project and user can also be used

Query just what you need

Optimize queries for cost


 Filter early, filter often
 Use cached results
 Limit the use of User-Defined functions (UDFs)

Dataset integrity

Garbage in – Garbage out

High quality datasets conform to strict integrity rules

1. Valid data follows constraints on uniqueness (telephone numbers, addresses, etc.) and constraints on range (a die
with a value of 7 is out of range)
2. Accurate data matches a known source of truth (a U.S. state name of "Hot Dog" is not accurate)
3. Complete data has enough context to be explained (the clock alone vs. the full Big Ben tower picture)
4. Consistent data ensures harmony across systems (Who owns the house?)
5. Uniform data is measured the same way (English vs. metric system)

Datashape and skewness


Two problems:
 A lot of observations but only a few fields covered
 Many fields but very few observations

Clean and transform data with SQL


Two ways to clean data, with SQL and with a tool called Cloud Dataprep.

For validity
 Setup field data type constraints
 Specify fields as NULLABLE or REQUIRED
 Proactively check for NULL values
 Check and Filter for allowable range values – SQL conditionals: CASE WHEN, IF()
 Require primary keys / Relational constraints in upstream source system
For accuracy
 Create test cases or calculated fields to check values – SQL: (quantity_ordered * item_price) AS sub_total
 Lookup values against an objective reference dataset – SQL: IN() with a subquery or JOIN

For completeness
 Explore the data shape and skew and look for missing values – SQL: NULLIF(), IFNULL(), COALESCE()
 Enrich the existing dataset with others using UNIONs and JOINs

For consistency
 Store one fact in one place and use IDs to lookup
 Use String functions to clean data
o PARSE_DATE()
o SUBSTR()
o REPLACE()

For uniformity
 Document and comment your approach
 Use FORMAT() to clearly indicate units
 CAST() data types to the same formats and digits
 Label all visualizations appropriately
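A hedged sketch combining a few of these functions, against a hypothetical `project.dataset.sales` table (the column names are assumptions):

#standardSQL
# Hypothetical table; validity, accuracy, completeness and uniformity checks in one pass
SELECT
  CASE
    WHEN quantity BETWEEN 1 AND 1000 THEN quantity
    ELSE NULL                                  # out-of-range values treated as invalid
  END AS quantity_validated,
  (quantity * item_price) AS sub_total,        # calculated field to cross-check accuracy
  IFNULL(region, 'UNKNOWN') AS region,         # handle missing values explicitly
  FORMAT('%.2f USD', item_price) AS item_price_labeled   # make units explicit (item_price assumed FLOAT64)
FROM `project.dataset.sales`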

Clean and transform data using Cloud Dataprep


Cloud dataflow

 The data flow (Diagrams)


 Wranglers (Transformations)
 Recipes (Multiple transformations to the data)

Recipe

Preparing data with Wrangler in Cloud Data Fusion


Data integration is all about your data. When working with data it’s handy to be able to see what the raw data looks like so that you can use
it as a starting point for your transformation. With Wrangler, you can take a data-first approach to your data integration workflow.

The most common source of data for ETL (Extract-Transform-Load) applications is typically data stored in comma separated value (CSV)
format text files, as many database systems export and import data in this fashion. For the purposes of this lab you’ll be using a CSV file,
but the same techniques can be applied to database sources as well as any other data source that you have available in Cloud Data Fusion.
Building a pipeline

In Data Fusion the studio is the user interface

Creating New BigQuery Datasets and Visualizing insights


Objective: How to connect and bring new datasets into BigQuery, and how they can be joined together
to create a proper data analytics warehouse.

Creating permanent tables


1. Write SQL query
2. Click show options
3. Specify the destination table (Can be existing)
4. Choose the write preference (if the table already exists)
5. Run query

If the destination table exists:

 Write if empty
 Append records
 Overwrite table
What is the key insight behind how BigQuery caches your query results?
BigQuery automatically stores all your query results in temporary tables, which you can then promote to permanent tables
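The same result can also be reached in SQL with a CREATE TABLE ... AS SELECT statement instead of Show Options (a sketch with hypothetical dataset, table and column names):

#standardSQL
# Materialize query results into a permanent table (hypothetical names)
CREATE OR REPLACE TABLE `mydataset.large_nonprofits` AS
SELECT
  ein,
  name,
  totrevenue
FROM `project.dataset.irs_990`
WHERE totrevenue > 1000000

CREATE OR REPLACE roughly corresponds to the "Overwrite table" preference, while a plain CREATE TABLE fails if the table already exists, similar to "Write if empty".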

Ingesting New Data into BigQuery


What is the meaning of manage in BigQuery?
As we mentioned in one of the earlier courses, BigQuery is two components. It's the query engine that will process your queries, and it's also
the data management piece behind the scenes that handles, stores and optimizes all of your data. So things like caching your data, storing
it in column format and compressing those columns (which we're going to talk a little bit more about in the advanced course on the
architecture of BigQuery), expanding the data and making sure that it's replicated: all these things that a traditional database
administrator would handle for you, the BigQuery team at Google manages for you behind the scenes.

Pitfalls from querying from external sources directly


 Strong performance disadvantages
 Data consistency not guaranteed
 Can't use table wildcards

There is also a streaming option available to ingest data into BigQuery through the API

Ingesting external data into BigQuery

External table performance and data quality considerations


Linking external tables to BigQuery (e.g. Google Spreadsheets or directly from Google Cloud Storage) has several limitations. Two of the
most significant are:

Data consistency is not guaranteed if the data values in the source are changed while querying.
Data sources stored outside of BigQuery lose the performance benefits of having BigQuery manage your data storage (including, but not
limited to, auto-optimization of the query execution path; certain wildcard functions are also disabled).

Partitioned tables
A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By
dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of
bytes read by a query.

 Ingestion time: Tables are partitioned based on the data's ingestion (load) time or arrival time.
 Date/timestamp/datetime: Tables are partitioned based on a TIMESTAMP, DATE, or DATETIME column.
 Integer range: Tables are partitioned based on an integer column.
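A sketch of the second option, a table partitioned on a DATE column (hypothetical names):

#standardSQL
# Create a table partitioned on the order_date DATE column (hypothetical names)
CREATE OR REPLACE TABLE `mydataset.orders_partitioned`
PARTITION BY order_date AS
SELECT
  order_id,
  order_date,
  total_amount
FROM `project.dataset.orders`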

Common use-cases for date-partitioned tables


Scanning through the entire dataset every time to compare rows against a WHERE condition is wasteful. This is especially true if you only
really care about records for a specific period of time like:

 All transactions for the last year


 All visitor interactions within the last 7 days
 All products sold in the last month
Instead of scanning the entire dataset and filtering on a date field like we did in the earlier queries, we will now set up a date-partitioned
table. This will allow us to completely skip scanning records in partitions that are irrelevant to our query.
Introducing JOINs and UNIONs
Now in this module, we'll tackle how to append additional historical datasets together through UNIONs, as well as how to JOIN together
different datasets horizontally through SQL JOINs.
SQL UNION
Table wildcard

Filtering with a Table Wildcard and a _TABLE_SUFFIX


 Use Table Wildcard * vs writing many UNIONs
 Use _TABLE_SUFFIX to filter the tables included by the wildcard
 Use _TABLE_SUFFIX in your SELECT statements with CONCAT()

Check tables in which the table suffix is greater than 1999


Check if the table suffix equals a given year (see the sketch below)
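A hedged sketch of both checks, assuming yearly sharded tables named `project.dataset.events_1998`, `events_1999`, and so on (hypothetical names):

#standardSQL
# Query all yearly shards at once with a wildcard, keeping only years after 1999
SELECT
  CONCAT('events_', _TABLE_SUFFIX) AS source_table,   # surface which shard each row came from
  COUNT(*) AS row_count
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX > '1999'        # string comparison on the suffix
GROUP BY source_table
ORDER BY source_table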

Avoid Union Pitfalls like Brittle Schemas


 Duplicate records among tables (Use UNION DISTINCT vs UNION ALL)
 Changing schemas and Field names over time
 Start with the most inclusive Table first (most fields)

SQL JOIN

Joining data tables can provide meaningful insight into your dataset. However, when you join your data, there are common pitfalls that
could corrupt your results. This lab focuses on avoiding those pitfalls. Types of joins:

Different types of joins


 Cross join: combines each row of the first dataset with each row of the second dataset, where every combination is represented
in the output.
 Inner join: requires that key values exist in both tables for the records to appear in the results table. Records appear in the merge
only if there are matches in both tables for the key values.
 Left join: Each row in the left table appears in the results, regardless of whether there are matches in the right table.
 Right join: the reverse of a left join. Each row in the right table appears in the results, regardless of whether there are matches in
the left table.

Pitfalls when merging datasets


 Doing a many-to-many JOIN could result in more rows than either of your initial tables
 This is a primary reason for exceeding your resource cap in BigQuery
 Know your dataset and the relationships between your tables before joining

Understand your data model and relationships


 Understand your data relationship before joining 1:1, N:1, 1:N, N:N
 Use CONCAT() to create composite key fields if no unique fields exist or join on more than one field
 Ensure your key fields are distinct (Deduplicate)
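A hedged sketch of a composite key built with CONCAT(), assuming hypothetical `order_lines` and `item_details` tables where neither order_id nor line_number is unique on its own (order_id assumed STRING, line_number assumed INT64):

#standardSQL
# Hypothetical tables; join on a composite key built from two fields
SELECT
  o.order_id,
  o.line_number,
  i.item_description
FROM `project.dataset.order_lines` AS o
JOIN `project.dataset.item_details` AS i
  ON CONCAT(o.order_id, '-', CAST(o.line_number AS STRING))
   = CONCAT(i.order_id, '-', CAST(i.line_number AS STRING))

Joining directly on both fields (ON o.order_id = i.order_id AND o.line_number = i.line_number) is equivalent here; the CONCAT() form is mainly useful when you need a single reusable key field.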

STRING_AGG

Hmmm, you have 1,909 distinct SKUs, which is less than the 2,273 total products on the website. The first results
probably contain products with duplicate SKUs.

Take an even closer look at the records. Determine which products have more than one SKU and which SKUs have more than one product.

What if you wanted one query that listed all products missing from either the website or inventory?

#standardSQL
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
FULL JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
WHERE website.productSKU IS NULL OR inventory.SKU IS NULL

Learning Review
 Use Union Wildcards to treat multiple tables as a single group
 Use _TABLE_SUFFIX to filter wildcard tables and create calculated fields with the table name
 FULL JOINs (also called FULL OUTER JOINs) include all records from each joined table regardless of whether there are
matches on the join key
 Having non-unique join keys can result in an unintentional CROSS JOIN (more output rows than input rows) which should be
avoided
 Use COUNT() and GROUP BY to determine if a key field is indeed unique

Data visualization principles


Dimensions and measures

A dimension is a field that you can classify, that is, an independent variable. Normally, I like to think of that as your qualitative or
categorical information. So, if you had your IRS 990 data, this could be the name of the charity or its address, anything that's not
necessarily mathematical, whereas your measure is going to be dependent, so that's generally your quantitative information.
Google Data Studio UI

Advanced insights
Summary of previous functions

Statistical functions

Approximation functions

Say you have an absurd number of rows. What you can use is APPROX_COUNT_DISTINCT(). Much like the exact COUNT(DISTINCT ...) you have seen
before, you can get the exact count, but you can also get the approximation, which can be good enough to get the
job done.
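A sketch of that comparison, against a hypothetical `project.dataset.page_views` table:

#standardSQL
# Compare the exact and approximate distinct counts (hypothetical table)
SELECT
  COUNT(DISTINCT visitor_id) AS exact_distinct_visitors,
  APPROX_COUNT_DISTINCT(visitor_id) AS approx_distinct_visitors
FROM `project.dataset.page_views`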
Analytic functions: WITH clause and RANK()
Real example

Now if you recall, you cannot reference the alias of a field that you just created in the SELECT list from the WHERE clause of the same query.

To solve this, we will use a WITH clause as a temporary named result to save the first part of the query and then query that result, applying the filter there.
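A hedged sketch of that workaround, with hypothetical table and column names: the WITH clause names the first result, so the computed field becomes a real column that the outer query can filter and rank.

#standardSQL
# Hypothetical table; filter and rank on a computed field via a WITH clause
WITH store_revenue AS (
  SELECT
    store_id,
    product_id,
    SUM(quantity * unit_price) AS revenue
  FROM `project.dataset.sales`
  GROUP BY store_id, product_id
)
SELECT
  store_id,
  product_id,
  revenue,
  RANK() OVER (PARTITION BY store_id ORDER BY revenue DESC) AS revenue_rank
FROM store_revenue
WHERE revenue > 10000     # the alias is now a normal column, so this filter works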
User Defined Functions (UDF)
UDF limits
Traditional Relational Database Architecture
 Wide tables are not scalable
 Instead of storing everything in one massive table, you can break apart those fields into rows in another table.

Normalization benefit: Scalability

Normalization drawback: Joins are now a necessity


Denormalized column-based storage

BigQuery Column-Oriented Storage is built for speed


Break apart tables into pieces

Store nested fields within a table


Nested schemas bring performance benefits
 Avoid costly joins
 No performance penalty for SELECT DISTINCT company_id

Working with Arrays and Structs in BigQuery


Arrays (Nested Rows)
Arrays are ordered lists of zero or more data values that must have the same data type and are supported natively in BigQuery, and that's all
just one field value.

Working with SQL arrays

BigQuery implicitly flattens arrays

So how do you access elements in that array? To do that, we use the function called UNNEST().
Explicitly flatten arrays with UNNEST()

Aggregate into an Array with ARRAY_AGG

Sorting elements in an Array

Filter arrays using WHERE IN
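A self-contained sketch of those operations (the inline data is made up for illustration): build an ordered array per customer with ARRAY_AGG, then keep only the customers whose array contains a given value.

#standardSQL
# Inline, made-up data; ARRAY_AGG builds the arrays, IN UNNEST filters them
WITH orders AS (
  SELECT 'C001' AS customer_id, 'apples' AS product UNION ALL
  SELECT 'C001', 'bananas' UNION ALL
  SELECT 'C002', 'apples'
),
per_customer AS (
  SELECT
    customer_id,
    ARRAY_AGG(product ORDER BY product) AS products   # ordered array per customer
  FROM orders
  GROUP BY customer_id
)
SELECT customer_id, products
FROM per_customer
WHERE 'bananas' IN UNNEST(products)    # keep customers whose array contains 'bananas'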


STRUCTs (Flexible containers)
A STRUCT is a container of ordered fields (like a dictionary), each with a type (required) and a field
name (optional).

You can store multiple data types in a STRUCT (even Arrays!)

STRUCTs can even contain ARRAY values

ARRAYS can contain STRUCTs as Values


Filter for Customers who bought Ice cream

Nested (Repeated) Records are Arrays of Structs

 Nested records in BigQuery are Arrays of Structs.


 Instead of Joining with a sql_on: expression, the join relationship is built into the table.
 UNNESTing an ARRAY of STRUCTs is similar to joining a table.
Legacy VS Standard SQL repeated record differences

ARRAY / STRUCT example
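A self-contained sketch of an ARRAY of STRUCTs, reusing the runner example from the lab below (Rudisha's splits come from the lab; the second runner's splits are made-up illustrative values):

#standardSQL
# Inline data; each participant is a STRUCT whose splits field is itself an ARRAY
WITH races AS (
  SELECT
    '800M' AS race,
    [STRUCT('Rudisha' AS name, [23.4, 26.3, 26.4, 26.1] AS splits),
     STRUCT('Makhloufi' AS name, [24.5, 25.4, 26.6, 26.1] AS splits)] AS participants
)
SELECT
  race,
  p.name,
  SUM(split_time) AS total_time
FROM races
CROSS JOIN UNNEST(participants) AS p        # unpack the array of structs
CROSS JOIN UNNEST(p.splits) AS split_time   # unpack each runner's splits array
GROUP BY race, p.name
ORDER BY total_time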


Summary

Lab

As you saw earlier, arrays can only have one data type. There is, however, a data type which supports having both multiple field names and
types: the STRUCT data type.

To recap:
Structs are containers that can have multiple field names and data types nested inside.
An array can be one of the field types inside of a Struct (as shown above with the splits field).
In traditional relational SQL, if you had a races table and a participants table what would you do to get information from both tables? You
would JOIN them together. Here the participant STRUCT (which is conceptually very similar to a table) is already part of your races table
but is not yet correlated correctly with your non-STRUCT field "race".

Can you think of what two word SQL command you would use to correlate the 800M race with each of the racers in the first table?

Answer: CROSS JOIN

If you have more than one race type (800M, 100M, 200M), wouldn't a CROSS JOIN just associate every racer name with every possible
race like a cartesian product?
Answer: No. This is a correlated cross join which only unpacks the elements associated with a single row. For a greater discussion, see
working with ARRAYs and STRUCTs

Recap of STRUCTs:
 A SQL STRUCT is simply a container of other data fields which can be of different data types. The word struct means data structure.
Recall the example from earlier:
 STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits) AS runner
 STRUCTs are given an alias (like runner above) and can conceptually be thought of as a table inside of your main table.
 STRUCTs (and ARRAYs) must be unpacked before you can operate over their elements. Wrap an UNNEST() around the name of the
struct itself or the struct field that is an array in order to unpack and flatten it.

Recall that the UNNEST operator takes an ARRAY and returns a table, with one row for each element in the ARRAY. This will allow you
to perform normal SQL operations like:

Aggregating values within an ARRAY

Filtering arrays for particular values

Ordering and sorting arrays

As a reminder, an Array is an ordered list of elements that share a data type. Here is a string array of the 8 racer names.

['Rudisha','Makhloufi','Murphy','Bosse','Rotich','Lewandowski','Kipketer','Berian']

You can create arrays in BigQuery by adding brackets [ ] and comma separating values.

In order to find the racers whose names begin with the letter M, you need to unpack the above array into individual rows so you can use a
WHERE clause.

Unpacking the array is done by wrapping the array (or the name of the array) with UNNEST() as shown below. Run the below query and
note how many rows are returned.

#standardSQL
SELECT * FROM
UNNEST(['Rudisha','Makhloufi','Murphy','Bosse','Rotich','Lewandowski','Kipketer','Berian']) AS unnested_array_of_names

Now add a normal WHERE clause to filter these rows, and run the query:

#standardSQL
SELECT * FROM
UNNEST(['Rudisha','Makhloufi','Murphy','Bosse','Rotich','Lewandowski','Kipketer','Berian']) AS unnested_array_of_names
WHERE unnested_array_of_names LIKE 'M%'

Write a query that will list the total race time for racers whose names begin with R. Order the results with the fastest total time first. Use the
UNNEST() operator and start with the partially written query below.

#standardSQL
SELECT
p.name,
SUM(split_times) as total_race_time
FROM racing.race_results AS r
, UNNEST(r.participants) AS p
, UNNEST(p.splits) AS split_times
WHERE p.name LIKE 'R%'
GROUP BY p.name
ORDER BY total_race_time ASC;

Filtering within ARRAY values


You happened to see that the fastest lap time recorded for the 800 M race was 23.2 seconds, but you did not see which runner ran that
particular lap. Create a query that returns that result.

#standardSQL
SELECT
p.name,
split_time
FROM racing.race_results AS r
, UNNEST(r.participants) AS p
, UNNEST(p.splits) AS split_time
WHERE split_time = 23.2;
Advanced Data Studio
Case statements and filters

Can also be created directly in SQL functions upstream

Avoiding visualization performance and cache pitfalls

Sharing and collaborating on Dashboards

As a quick summary:
1. You can create calculated fields directly in Data Studio, as you saw.
2. You can disable the prefetch cache for more real-time data needs.
3. And lastly, be aware that sharing reports does not automatically add your audience as viewers to the underlying dataset in
BigQuery. You'll have to do that separately, and we'll cover data access later in this course.
Avoid BigQuery Performance Pitfalls

Avoid Input / Output Wastefulness


 Do not SELECT *, use only the columns you need
 Denormalize your schemas and take advantage of nested and repeated fields
 Use granular suffixes in your table wildcards for more specificity

Use BigQuery native storage for the best performance


 External direct data connections can never be cached
 Live edits to underlying external sources (e.g. spreadsheets) could
create race conditions
 Native BigQuery tables have intelligence built-in like automatic
predicate pushdown

Optimize communication between slots (via shuffle)


 Pre-filter your data before doing JOINs
 Many shuffle stages can indicate data partition issues (Skew)

Do not use WITH clauses in place of materializing results


 Commonly filtering and transforming the same results? Store them
into a permanent table
 WITH clause queries are not materialized and are re-queried if
referenced more than once

Now the main takeaway with WITH clauses is that the results of those queries,
the subqueries that you've named in the WITH clause, are not materialized; they're actually
re-queried every single time you reference them more than once in one of your long queries. So it's a trade-off between
readability of your code and performance or speed when you store those results as permanent tables.

Materialize your transformed results instead of re-querying


 Commonly filtering and transforming the same results? Store them into a permanent table

Be careful using GROUP BY across many distinct values


 Best when the number of distinct groups is small (fewer shuffles of data).
 Grouping by a high-cardinality unique ID is a bad idea.

Reduce Javascript UDFs to reduce computational load


 JavaScript UDFs require BigQuery to launch a separate subprocess to run them
 Use native SQL functions whenever possible
 Concurrent rate limits:
 For non-UDF queries: 50
 For UDF-queries: 6

Understand your data model before applying Joins and Unions


 Know your join conditions and if they are unique – no accidental cross joins
 Filter wildcard UNIONs with the _TABLE_SUFFIX filter
 Do not use self-joins (consider window functions instead)

Push intensive operations to the end of the query


 Large sorts should be the last operation in your query
 If you need to sort early, filter or limit your data before sorting

Data Skew in your datasets


Shuffle wisely: Be aware of data skew in your dataset
 Filter your dataset as early as possible (this avoids overloading workers on JOINs)
 Hint: Use the query explanation map and compare the Max vs the Avg times to highlight
skew
 BigQuery will automatically attempt to reshuffle data away from workers that are overloaded

Diagnose performance issues with the query explanation map

 First up are the yellow bars, which represent the amount of time your workers spent waiting to actually receive work. One
of the potential pitfalls here is that you have a highly skewed dataset and one of your workers is being bottlenecked,
while the rest are just sitting around idly waiting for the work to be performed on that one worker before the rest of the work can
be distributed out.
 Second up is read time, which represents the amount of time it takes your workers to actually read your data from disk.
 Third, in orange, is the amount of computational work that is done as part of your query.
 Fourth is the amount of time that it takes to write your data to disk, for example when you're creating something like a
permanent table.

Each stage represents a shuffling of information and data between those workers. If you have many stages (1, 2, 3, 4, 5, 6), that
represents a lot of communication, or passing of data, between the workers.

The key thing to remember is that each individual worker (and sometimes you can have upwards of 2,000 workers working on a query
at once) is selfish: it likes to have a partition of your data where it has all of the IDs in its own local memory, so it can perform aggregation
operations, or any other kind of query operations, across that subset of your data very efficiently.

Table sharding – Then and Now

So all the way on the left, what you see is that traditionally in relational database systems, you'd have a database administrator, or DBA,
who would take a look at your database and then manually partition, or pre-allocate, that space in very well-defined partitions on disk. As you
can imagine, you'd pretty much have to be a hardware master to deal with that, to get those individual partitions set up and performing along with
your queries. Second, later on down the road, what came out is that you can manually shard your own tables.
The last and best way, if you came across this issue and you just really wanted to query the most recent data, like 2016, 2017 and 2018, is
to set up a special version of your table that we call a date-partitioned table. It can be just like a normal table except that it
has what we call a partition column in it, which is just a date or time field, as you see all the way on the right. The
example I'm going to go with is if you have a massive amount of customer orders historically, going back say 30 years, but you're really
only interested in querying things from the last 3 years. You want to partition off the table itself, so that you're not
scanning over everything, as is typically the case when you're doing a SELECT query.

Filtering on date-partitioned tables in SQL
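A hedged sketch of such a filter, reusing the hypothetical `mydataset.orders_partitioned` table sketched earlier: the WHERE clause on the partition column lets BigQuery prune partitions instead of scanning the full table.

#standardSQL
# Only the partitions covering the last 3 years are scanned (hypothetical table)
SELECT
  COUNT(*) AS orders_last_3_years,
  SUM(total_amount) AS revenue_last_3_years
FROM `mydataset.orders_partitioned`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 YEAR)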

Monitor performance with Stackdriver

Recap
 Since you are paying for the total amount of bytes that you are processing, you want to first limit the amount of columns that
you're returning and also consider filtering out your rows by using that where filter whenever you can. Now this doesn't mean
you can't write queries that process your entire data set, BigQuery of course was built for petabyte scale. You just want to be
mindful of the resources that you're consuming.
 We then cover the query explanation map, where we get those cool visual cues of the types of work that is most demanded by
your queries.
 Note here that a large difference between the average, and the max, could indicate a heavy data skew, which can be helped by
filtering your data as early as possible.
 And we covered some SQL bad behavior, like selecting every record and ordering it, without using a limit clause.
 Finally, if you only access recent data, like the most recent events, consider using table partitioning to reduce the total amount
of bytes that are scanned by your query.

BigQuery and normalized datasets


BigQuery supports large JOINs, and JOIN performance is good. However, BigQuery is a columnar datastore, and maximum performance is
achieved on denormalized datasets. Because BigQuery storage is inexpensive and scalable, it's a good practice to denormalize and pre-JOIN
datasets into homogeneous tables. In other words, you exchange compute resources for storage resources (the latter being more performant
and cost-effective).
Creating tables using the shell
bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines liquor_sales.category gs://data-insights-course/labs/optimizing-for-performance/category.csv category:STRING,category_name:STRING

bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines liquor_sales.convenience_store gs://data-insights-course/labs/optimizing-for-performance/convenience_store.csv store:STRING

Some observations
 The denormalized query (#2) is faster and uses less slot time to achieve the same result
 The relational query (#1) has many more input stages and spends the most worker time in joining the datasets together
 The denormalized query (#2) spends the most time reading data input and outputting the results. There is minimal time spent in
aggregations and joins.
 Neither query resulted in bytes spilled to disk, which suggests our datasets are likely not skewed (or not large enough to
spill out of memory on an individual worker)
Jupyter Notebooks
One of the BIGGEST advantages for Data Analysts is the ability to have both your SQL code AND your documentation (text, images, bash
scripts) in a single version-controlled file. I've included an optional lab following this reading item for those who want to get practice using
Notebooks with BigQuery ahead of the ML course content.

Data access roles

And one of the really cool things that you can do is invoke some BigQuery reserved functions like SESSION_USER(), which lets you pull the
email address of the person that's logged in. Assuming that you have your data structured in such a way that you have a column called
allowed_viewer or allowed_group that you can then map that email address onto, you can set up some pretty neat row-level filtering on your
underlying data, as sketched below.
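A hedged sketch of that pattern, assuming a hypothetical `mydataset.sales_all` table with an `allowed_viewer` column holding the permitted user's email address: an authorized view filters rows with SESSION_USER().

#standardSQL
# Hypothetical names; the view only returns rows whose allowed_viewer matches the logged-in user
CREATE OR REPLACE VIEW `mydataset.sales_filtered` AS
SELECT *
FROM `mydataset.sales_all`
WHERE allowed_viewer = SESSION_USER()

Sharing the view as an authorized view, without granting access to the underlying table, is what makes this row-level filtering effective.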

So the two key areas to really focus on for access permissions are at the project level, which includes BigQuery, your Cloud Storage buckets
and everything else, and then individually at the BigQuery dataset level. And you can access and share datasets by clicking on the triangle
beneath each of your different datasets.

So let's review three of the key concepts


 The first thing is permissions are at the dataset level. So you share datasets as such.
 Second, you can have multiple datasets within the same project ID, and that's how you can do neat things like creating authorized
views as bridges between different datasets.
 And third, you can actually share datasets with groups, users, authorized views, or even entire domains as well. So you can get as
granular as you would like.
Roles
Setting up proper access controls to your data is probably one of the most important topics in data analysis. As you've seen, Google Cloud
Platform provides you with a variety of roles at the project level that can then be inherited down to the BigQuery dataset level. It's ultimately
up to you to determine which individuals and which groups should have access to which parts of your data. Be sure to set up regular access
control audits, and look at those Stackdriver logs to spot any strange usage patterns. Lastly, as we mentioned before, consider using
authorized views in tandem with a WHERE clause filter on the current user to limit row-level access to a particular table based on who's logged in.

Summary of third course


Let's summarize some of the key points that we walked through. You learned how and when to use advanced SQL functions like
APPROX_COUNT_DISTINCT as we worked our way through our large-scale GitHub example, estimating the number of users over time. We
also illustrated using the WITH clause and how it can break apart complex queries into easy logical steps. We also covered how to
correctly aggregate over a subset of rows in your result set with analytical window functions like RANK() OVER (PARTITION BY ...). We then turned to
BigQuery's internal architecture and learned what sets BigQuery apart from traditional relational databases. Recall that BigQuery breaks
apart datasets into smaller chunks called shards, which are then stored on persistent disk and massively processed in parallel by BigQuery's
workers once a query is executed. We then explored one of my favorite topics, which is working with datasets that have structs and arrays as
part of repeated fields. Structs, again, are your flexible data containers, and an array of structs is how repeated fields are set up. Repeated
fields offer the performance benefit of pre-joined tables. When in doubt, don't forget to UNNEST and CROSS JOIN your structs and
arrays. In our performance lab, we practiced how you can optimize poorly-written SQL queries and use the query explanation map, where
you get visual cues of what types of work are most in demand by your queries. Lastly, we compared different data access scenarios and how
you can safeguard your own data. Remember that, as of right now, BigQuery shares permissions at the dataset level, and also inherits
permissions that are set at your project level as well. Well, that's the end of the Advanced BigQuery course. Stick around for the last course
in the Data to Insights Specialization, where we'll cover the data analyst's hands-on guide to machine learning and how you can start
accessing things like the pre-trained ML APIs, where you can do cool things like photo recognition and language translation as well
as building your own ML models. We'll see you there.
Introduction to machine learning
Machine learning is a discipline inside artificial intelligence

Machine learning, at its core, labels things for you

Machine learning is not about providing the model with rules like a hard-coded recipe; the beauty of ML is that the recipe is created by the
algorithm itself.

An example about Google Search: there were a ton of rules in the search engine code base to decide which sports team to
show and where, based on where the user was. If the query is "giants" and the user is in the Bay Area, show
them results about the San Francisco Giants. If the user is in the New York area, show them results about the
New York Giants.

The complexity of the queries and the number of rules that were needed to make the algorithm work is
unsustainable; ML, however, is scalable.

Hard-coded rules are hard to maintain.

Machine learning is about leading with examples, not with instructions.

We use deep learning when we cannot explain the labelling rules.

Understanding where ML is applied is the first step towards building an ML solution.

After separating training and testing data, we need to select the ML model

Examples:

1. Forecasting a number: Try linear regression


2. Classify a label: Try logistic regression

The ultimate goal is the same. If we're given labeled training data, predict the label for unseen or unknown data.

Now, no topic on ML models would be complete without first discussing the difference between supervised and unsupervised learning.
Supervised learning is everything that we've covered so far. It's where you already know the correct answer as a label in your dataset or
your training dataset, and spoiler alert, these problems are typically easier for prediction. Unsupervised learning is where you have the data
but you don't have any labels.
Pre-trained vs trained ML models
Now, there are two broad categories of ML models. You have models that are already built, called pre-trained, and then those custom
models that you build yourself using code, like TensorFlow.

Cloud NLP
Collaborative notebooks in the Cloud
So as you'll see in the later modules, oftentimes data scientists, which you're quickly on the road to becoming, will develop locally within
Datalab. That means you've got your box, a Google Compute Engine instance, and you've got Datalab up and running,
much like in the Google Analytics demo that we were showing a little bit earlier. You're bringing in data and you're processing it,
you're using something like a Pandas dataframe (if you're familiar with that) operating on a data store, and your data store is a CSV
file or a small subset or a sample of that data. Then once you've got, in the case of ML, an ML prototype or sample
working really, really well and it's fast and it's iterating fine, what you can do is scale out and really get all the access to the
cloud resources and replace a lot of those local stamps with cloud stamps, right? So instead of your CSV files, you could
be pulling data in directly from BigQuery or Cloud Storage. And you're up-leveling your Apache Beam workflows into a managed
workflow service like Cloud Dataflow. And instead of using Pandas dataframes, you're using BigQuery, that kind of
petabyte-scale data warehouse, as well. And the same thing with TensorFlow: instead of running individual TensorFlow lines of code,
you could use the Cloud Machine Learning Engine APIs and get that kind of managed TensorFlow aspect there. So what you'll see
in a lot of these examples, especially in the later modules in this course, is you'll do a lot of the initial development locally within your
Cloud Datalab notebook. And then you'll reach out to a lot of these other services as you get more towards what we'll call productionalizing
or serving your ML model.

Import BigQuery as a library into Cloud Datalab to use it later

import google.datalab.bigquery as bq

Use %%bq to execute BigQuery commands


example
%%bq tables describe -n "Google.com:analytics-bigquery.LondonCyclehelmet.ga_session_20130910"

Describing tables is a great way to explore schemas within Cloud Datalab

Use -n to provide a name for the query as you can have more than one query in a single notebook

%%bq query -n sessions

%bq execute --query sessions

BigQuery operations have defined parameters


BigQuery commands summary

Charting commands
What makes a good dataset for ML

Splitting training and testing data set


ML process in BigQuery

Steps in building the model


Now for classification, one of the key metrics that we're going to focus on here is a combination of both maximizing the true positive rate,
which is where we predicted the user will return to the website and they actually did, and minimizing the false positive rate, where we
predicted that they would come back but they never showed up. A very popular way to visualize this relationship is by using an ROC
curve (pronounced "rock" curve).

An ROC curve, or receiver operating characteristic curve, is just a graph that shows the performance of a classification model at all
classification thresholds. The curve plots the two parameters that you see here: the true positive rate against the false positive rate.

LAB
Scenario: Your data analyst team exported the Google Analytics logs for an ecommerce website into BigQuery and created a new table of
all the raw ecommerce visitor session data for you to explore. Using this data, you'll try to answer a few questions.

Question: Out of the total visitors who visited our website, what % made a purchase?

#standardSQL
WITH visitors AS(
SELECT
COUNT(DISTINCT fullVisitorId) AS total_visitors
FROM `data-to-insights.ecommerce.web_analytics`
),

purchasers AS(
SELECT
COUNT(DISTINCT fullVisitorId) AS total_purchasers
FROM `data-to-insights.ecommerce.web_analytics`
WHERE totals.transactions IS NOT NULL
)

SELECT
total_visitors,
total_purchasers,
total_purchasers / total_visitors AS conversion_rate
FROM visitors, purchasers

The result: 2.69%

Question: What are the top 5 selling products?

#standardSQL
SELECT
p.v2ProductName,
p.v2ProductCategory,
SUM(p.productQuantity) AS units_sold,
ROUND(SUM(p.localProductRevenue/1000000),2) AS revenue
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h,
UNNEST(h.product) AS p
GROUP BY 1, 2
ORDER BY revenue DESC
LIMIT 5;

Question: How many visitors bought on subsequent visits to the website?

#standardSQL

# visitors who bought on a return visit (could have bought on first as well)
WITH all_visitor_stats AS (
SELECT
fullvisitorid, # 741,721 unique visitors
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)

SELECT
COUNT(DISTINCT fullvisitorid) AS total_visitors,
will_buy_on_return_visit
FROM all_visitor_stats
GROUP BY will_buy_on_return_visit

Select features and create your training dataset


Google Analytics captures a wide variety of dimensions and measures about a user's visit on this ecommerce website. Browse the complete
list of fields here and then preview the demo dataset to find useful features that will help a machine learning model understand the
relationship between data about a visitor's first time on your website and whether they will return and make a purchase.

Your team decides to test whether these two fields are good inputs for your classification model:

totals.bounces (whether the visitor left the website immediately)


totals.timeOnSite (how long the visitor was on our website)

Question: What are the risks of only using the above two fields?

Answer: Machine learning is only as good as the training data that is fed into it. If there isn't enough information for the model to determine
and learn the relationship between your input features and your label (in this case, whether the visitor bought in the future) then you will not
have an accurate model. While training a model on just these two fields is a start, you will see if they're good enough to produce an accurate
model.

#standardSQL
SELECT
* EXCEPT(fullVisitorId)
FROM

# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1)
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
ORDER BY time_on_site DESC
LIMIT 10;

Which fields are the model features? What is the label (correct answer)?

The inputs are bounces and time_on_site. The label is will_buy_on_return_visit.

Question: Which two fields are known after a visitor's first session?

Answer: bounces and time_on_site are known after a visitor's first session.

Question: Which field isn't known until later in the future?

Answer: will_buy_on_return_visit is not known after the first visit. Again, you're predicting for a subset of users who returned to your
website and purchased. Since you don't know the future at prediction time, you cannot say with certainty whether a new visitor will come back
and purchase. The value of building a ML model is to get the probability of future purchase based on the data gleaned about their first
session.

Question: Looking at the initial data results, do you think time_on_site and bounces will be a good indicator of whether the user will return
and purchase or not?

Answer: It's often too early to tell before training and evaluating the model, but at first glance out of the top 10 time_on_site, only 1
customer returned to buy, which isn't very promising. Let's see how well the model does.
Note: There are many additional model types used in Machine Learning (like Neural Networks and decision trees) available using
libraries like [TensorFlow](https://www.tensorflow.org/tutorials/). At the time of writing, BQML supports the two listed earlier (linear and logistic regression).

Which model type should you choose?

Since you are bucketing visitors into "will buy in future" or "won't buy in future", use logistic_reg in a classification model.

Add this query in the Query editor to create a model and specify model options:

#standardSQL

CREATE OR REPLACE MODEL `ecommerce.classification_model`


OPTIONS
(
model_type='logistic_reg',
labels = ['will_buy_on_return_visit']
)
AS

#standardSQL
SELECT
* EXCEPT(fullVisitorId)
FROM

# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430') # train on first 9 months
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
;

Select your performance criteria


For classification problems in ML, you want to minimize the False Positive Rate (predict that the user will return and purchase and they
don't) and maximize the True Positive Rate (predict that the user will return and purchase and they do).

This relationship is visualized with a ROC (Receiver Operating Characteristic) curve like the one shown here, where you try to maximize
the area under the curve or AUC:
In BQML, roc_auc is simply a queryable field when evaluating your trained ML model.

Now that training is complete, you can evaluate how well the model performs with this query using ML.EVALUATE.

Add this query in the Query editor, then Run the query:
#standardSQL
SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'decent'
WHEN roc_auc > .6 THEN 'not great'
ELSE 'poor' END AS model_quality
FROM
ML.EVALUATE(MODEL ecommerce.classification_model, (

SELECT
* EXCEPT(fullVisitorId)
FROM

# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1
AND date BETWEEN '20170501' AND '20170630') # eval on 2 months
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)

));
Improve model performance with Feature Engineering
As was hinted at earlier, there are many more features in the dataset that may help the model better understand the relationship between a
visitor's first session and the likelihood that they will purchase on a subsequent visit.

Let's add some new features and create a second machine learning model which will be called classification_model_2:

How far the visitor got in the checkout process on their first visit
Where the visitor came from (traffic source: organic search, referring site etc..)
Device category (mobile, tablet, desktop)
Geographic information (country)
Create this second model by adding this query in the Query editor, then Run the query:

#standardSQL
CREATE OR REPLACE MODEL `ecommerce.classification_model_2`
OPTIONS
(model_type='logistic_reg', labels = ['will_buy_on_return_visit']) AS

WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)

# add in new features


SELECT * EXCEPT(unique_session_id) FROM (

SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,

# labels
will_buy_on_return_visit,

MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,

# behavior on the site


IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
IFNULL(totals.pageviews, 0) AS pageviews,

# where the visitor came from


trafficSource.source,
trafficSource.medium,
channelGrouping,

# mobile or desktop
device.deviceCategory,

# geographic
IFNULL(geoNetwork.country, "") AS country

FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)

WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430' # train 9 months

GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
);

Note: You are still training on the same first 9 months of data, even with this new model. It's important to have the same training dataset so
you can be certain a better model output is attributable to better input features and not new or different training data.

A key new feature that was added to the training dataset query is the maximum checkout progress each visitor reached in their session,
which is recorded in the field hits.eCommerceAction.action_type. If you search for that field in the field definitions you will see the field
mapping of 6 = Completed Purchase.

As an aside, the web analytics dataset has nested and repeated fields like ARRAYs, which need to be broken apart into separate rows in your
dataset. This is accomplished by using the UNNEST() function, which you can see in the above query.

Evaluate this new model to see if there is better predictive power.

#standardSQL
SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'decent'
WHEN roc_auc > .6 THEN 'not great'
ELSE 'poor' END AS model_quality
FROM
ML.EVALUATE(MODEL ecommerce.classification_model_2, (

WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)

# add in new features


SELECT * EXCEPT(unique_session_id) FROM (

SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,

# labels
will_buy_on_return_visit,

MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,

# behavior on the site


IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,

# where the visitor came from


trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,

# geographic
IFNULL(geoNetwork.country, "") AS country

FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h

JOIN all_visitor_stats USING(fullvisitorid)

WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20170501' AND '20170630' # eval 2 months

GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
));

Predict which new visitors will come back and purchase


Next you will write a query to predict which new visitors will come back and make a purchase. The prediction query below uses the
improved classification model we trained above to predict the probability that a first-time visitor to the Google Merchandise Store will make
a purchase in a later visit. The predictions are made on the last 1 month (out of 12 months) of the dataset.

#standardSQL
SELECT
*
FROM
ml.PREDICT(MODEL `ecommerce.classification_model_2`,
(

WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)

SELECT
CONCAT(fullvisitorid, '-',CAST(visitId AS STRING)) AS unique_session_id,

# labels
will_buy_on_return_visit,

MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,


# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,

# where the visitor came from


trafficSource.source,
trafficSource.medium,
channelGrouping,

# mobile or desktop
device.deviceCategory,

# geographic
IFNULL(geoNetwork.country, "") AS country

FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h

JOIN all_visitor_stats USING(fullvisitorid)

WHERE
# only predict for new visits
totals.newVisits = 1
AND date BETWEEN '20170701' AND '20170801' # test 1 month

GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)

ORDER BY
predicted_will_buy_on_return_visit DESC;
Results
Of the top 6% of first-time visitors (sorted in decreasing order of predicted probability), more than 6% make a purchase in a later visit.

These users represent nearly 50% of all first-time visitors who make a purchase in a later visit.

Overall, only 0.7% of first-time visitors make a purchase in a later visit.

Targeting the top 6% of first-time visitors increases marketing ROI by 9x vs targeting them all!

Additional information
Tip: add warm_start = true to your model options if you are retraining new data on an existing model for faster training times. Note that you
cannot change the feature columns (this would necessitate a new model).

roc_auc is just one of the performance metrics available during model evaluation. Also available are accuracy, precision, and recall.
Knowing which performance metric to rely on is highly dependent on what your overall objective or goal is.
