From Data To Insights Course Summary
Why the cloud?
Data analysts want to concentrate on creating insights, yet the time dedicated to that is usually only about 1/8 of the total big data process. Bigger data requires more infrastructure. When using the cloud, however, analysts can focus on creating insights because the infrastructure and servers are managed externally.
On-premise systems require data engineers to manage the hardware; when moving to the cloud, all of that is managed by the platform itself.
It is difficult to balance capacity with on-premise systems; with cloud services, the scalability problem is solved.
In on-premise systems, storage (hard drives) and processing units (CPUs) are coupled; in the cloud, storage and processing are decoupled, allowing better resource management and allocation.
BigQuery scales automatically and you only pay for what you use
BigQuery in a nutshell:
1. Web UI
2. Command-Line Interface (CLI): Cloud shell or the Google Cloud SDK
3. REST API: Programmatically run queries using languages like Java and Python over HTTP
BigQuery operates as two services in one: the analytics engine (processing: the Dremel SQL engine) and storage (scalable data storage on distributed hard drives)
In the past, MapReduce laid the foundation for Hadoop. Google later innovated with Dremel, a technology that processes queries over smaller chunks of data massively in parallel. Paired with Colossus, the massive distributed storage layer in the cloud, it formed the basis of what became BigQuery and Google Cloud Storage.
Differences between data analysts, data scientists and data engineers
Data scientist: Analyze data and model systems using statistics and machine learning.
Background: Statistical analysis using SQL, R, Python
SQL Functions
String Manipulation Functions – FORMAT()
Create aggregation groupings with GROUP BY (mixing non-aggregated fields with aggregation functions)
This is also useful to locate duplicate records with COUNT + GROUP BY (see the sketch below)
Don’t forget to use a GROUP BY if you are mixing aggregated and non-aggregated fields
If we just want 2015 data, a solution would involve parsing, converting, and filtering for 2015
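For example, a minimal sketch of both ideas, assuming a hypothetical ecommerce.orders table with an order_id field and a created_at date string:
#standardSQL
SELECT
  order_id,
  COUNT(*) AS record_count  # more than 1 means a duplicate record
FROM ecommerce.orders  # hypothetical table
WHERE EXTRACT(YEAR FROM PARSE_DATE('%Y-%m-%d', created_at)) = 2015  # parse, convert, then filter
GROUP BY order_id
HAVING COUNT(*) > 1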
Data types
Numeric data (Integer [INT64], Float)
String data [STRING]
Dates [DATE]
Other (Boolean, Array, Struct)
NULL value
NULL is the absence of data; it is not the same as a blank or empty string (“”)
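A quick sketch of the difference, assuming a hypothetical customers table with a phone_number column: IS NULL catches missing data, while comparing to '' only catches blanks.
#standardSQL
SELECT
  COUNTIF(phone_number IS NULL) AS missing_values,  # true absence of data
  COUNTIF(phone_number = '') AS blank_values        # a value exists, but it is an empty string
FROM dataset.customers  # hypothetical table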
Pricing in BigQuery
The unit of work in BigQuery is called a Job
Each job:
Is given a unique ID
Can run concurrently
Can perform tasks
History is stored for 6 months
Job types
Query – Charged by bytes processed
Load data into a Table - Free
Extract to GCS - Free
Copy existing table – Free
Storing data in BigQuery is a separate cost
The pricing calculator estimates query and storage costs separately
Slots can be reserved with separate pricing; reserved slots are not shared with other users
The more you use BigQuery, the more slots you get automatically
Quotas
Quotas are used to protect all BigQuery tenants
50 concurrent queries
Query timeout: 6 hours
Max result size: 128 MB compressed
Dataset integrity
1. Valid data follows constraints on uniqueness (telephone numbers, addresses, etc.) and on allowable ranges (e.g., a die cannot show a value of 7)
2. Accurate data matches a known source of truth (e.g., “Hot Dog” is not a valid U.S. state name)
3. Complete data has enough context to be explained (e.g., a close-up of a clock vs. a picture of the whole Big Ben tower)
4. Consistent data ensures harmony across systems (e.g., every system agrees on who owns the house)
5. Uniform data is measured the same way everywhere (e.g., English vs. metric system)
For validity
Setup field data type constraints
Specify fields as NULLABLE or REQUIRED
Proactively check for NULL values
Check and filter for allowable range values – SQL conditionals: CASE WHEN, IF() (see the sketch after this list)
Require primary keys / Relational constraints in upstream source system
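As a sketch of the range check above (the dice_roll field and the rolls table are hypothetical):
#standardSQL
SELECT
  dice_roll,
  CASE
    WHEN dice_roll BETWEEN 1 AND 6 THEN 'valid'
    ELSE 'out of allowable range'  # e.g. a die showing 7
  END AS validity_check
FROM dataset.rolls  # hypothetical table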
For accuracy
Create test cases or calculated fields to check values – SQL: (quantity_ordered * item_price) AS sub_total
Lookup values against an objective reference dataset – SQL: IN() with a subquery or JOIN
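For instance, a hedged sketch that combines both checks (the orders table, the us_states reference table, and their columns are hypothetical):
#standardSQL
SELECT
  state_name,
  (quantity_ordered * item_price) AS sub_total  # calculated field used as a sanity check
FROM dataset.orders  # hypothetical table
WHERE state_name NOT IN (SELECT name FROM dataset.us_states)  # flags values like 'Hot Dog' that fail the lookup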
For completeness
Explore the data shape and skew and look for missing values – SQL: NULLIF(), IFNULL(), COALESCE()
Enrich the existing dataset with others using UNIONs and JOINs
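A minimal sketch of those functions (all table and column names are hypothetical):
#standardSQL
SELECT
  NULLIF(country, '') AS country,                                    # turn blanks back into NULL
  IFNULL(total_visits, 0) AS visits,                                 # replace NULL with a default
  COALESCE(shipping_address, billing_address, 'unknown') AS address  # first non-NULL value wins
FROM dataset.customers  # hypothetical table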
For consistency
Store one fact in one place and use IDs to lookup
Use String functions to clean data
o PARSE_DATE()
o SUBSTR()
o REPLACE()
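A brief sketch of those three functions at work (table and column names are hypothetical):
#standardSQL
SELECT
  PARSE_DATE('%Y%m%d', date_string) AS order_date,  # convert a string such as '20170801' into a DATE
  SUBSTR(product_sku, 1, 4) AS sku_prefix,          # keep only part of a string
  REPLACE(category, 'Home/', '') AS category_clean  # strip an unwanted prefix
FROM dataset.orders  # hypothetical table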
For uniformity
Document and comment your approach
Use FORMAT() to clearly indicate units
CAST() data types to the same formats and digits
Label all visualizations appropriately
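A short sketch of FORMAT() and CAST() for uniform units and digits (table and column names are hypothetical, and distance_km is assumed to be a FLOAT64):
#standardSQL
SELECT
  FORMAT('%.2f km', distance_km) AS distance_labeled,  # units are explicit in the output
  CAST(revenue AS NUMERIC) AS revenue_numeric,          # align data types across sources
  ROUND(CAST(revenue AS FLOAT64), 2) AS revenue_2dp     # same number of digits everywhere
FROM dataset.trips  # hypothetical table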
Recipe
The most common source of data for ETL (Extract-Transform-Load) applications is typically data stored in comma separated value (CSV)
format text files, as many database systems export and import data in this fashion. For the purposes of this lab you’ll be using a CSV file,
but the same techniques can be applied to database sources as well as any other data source that you have available in Cloud Data Fusion.
Building a pipeline
Write if empty
Append records
Overwrite table
What is the key insight behind how BigQuery caches your query results?
BigQuery automatically stores all your query results in temporary tables, which you can then promote to permanent tables
There is also a streaming option available to ingest data into BigQuery through the API
Data consistency is not guaranteed if the data values in the source are changed while querying.
Data sources stored outside of BigQuery lose the performance benefits of having BigQuery manage your data storage (for example, auto-optimization of your query execution path is lost and certain wildcard functions are disabled).
Partitioned tables
A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By
dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of
bytes read by a query.
Ingestion time: Tables are partitioned based on the data's ingestion (load) time or arrival time.
Date/timestamp/datetime: Tables are partitioned based on a TIMESTAMP, DATE, or DATETIME column.
Integer range: Tables are partitioned based on an integer column.
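A hedged sketch of creating a column-partitioned table (the table name and schema are hypothetical):
#standardSQL
CREATE TABLE dataset.orders_partitioned
(
  order_id STRING,
  order_date DATE,
  amount NUMERIC
)
PARTITION BY order_date  # one partition per day; queries filtering on order_date read fewer bytes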
SQL JOIN
Joining data tables can provide meaningful insight into your dataset. However, when you join your data, there are common pitfalls that could corrupt your results. This lab focuses on avoiding those pitfalls. Types of joins: INNER, LEFT, RIGHT, FULL OUTER, and CROSS.
STRING_AGG
Hmmm, you have 1,909 distinct SKUs, which is less than the 2,273 total products on the website. The first results probably contain products with duplicate SKUs.
Take an even closer look at the records. Determine which products have more than one SKU and which SKUs have more than one product.
What if you wanted one query that listed all products missing from either the website or inventory?
#standardSQL
SELECT DISTINCT
website.productSKU AS website_SKU,
inventory.SKU AS inventory_SKU
FROM `data-to-insights.ecommerce.all_sessions_raw` AS website
FULL JOIN `data-to-insights.ecommerce.products` AS inventory
ON website.productSKU = inventory.SKU
WHERE website.productSKU IS NULL OR inventory.SKU IS NULL
Learning Review
Use Union Wildcards to treat multiple tables as a single group
Use _TABLE_SUFFIX to filter wildcard tables and create calculated fields with the table name
FULL JOINs (also called FULL OUTER JOINs) include all records from each joined table regardless of whether there are
matches on the join key
Having non-unique join keys can result in an unintentional CROSS JOIN (more output rows than input rows) which should be
avoided
Use COUNT() and GROUP BY to determine if a key field is indeed unique
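A small sketch of the wildcard pattern, assuming hypothetical yearly tables such as project.dataset.sales_2016 and project.dataset.sales_2017:
#standardSQL
SELECT
  _TABLE_SUFFIX AS table_year,  # calculated field derived from the table name
  SUM(amount) AS total_sales
FROM `project.dataset.sales_*`  # union wildcard treats the yearly tables as a single group
WHERE _TABLE_SUFFIX BETWEEN '2016' AND '2017'
GROUP BY table_year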
A dimension is a field that you can classify, that is, an independent variable. Normally, I like to think of that as your qualitative or categorical information. So, if you had your IRS 990 data, this could be the name of the charity or its address, anything that's not necessarily mathematical. A measure, by contrast, is dependent, so that's generally going to be quantitative information.
Google Data Studio UI
Advanced insights
Summary of previous functions
Statistical functions
Approximation functions
Say you have an absurd number of rows. What you can use is APPROX_COUNT_DISTINCT. COUNT(DISTINCT ...) gives you the exact count, but the approximation is usually good enough to get the job done and is cheaper to compute (see the sketch below).
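A sketch comparing the two on the ecommerce table used later in these notes (the exact counts are illustrative):
#standardSQL
SELECT
  COUNT(DISTINCT fullVisitorId) AS exact_count,         # exact, but more expensive at scale
  APPROX_COUNT_DISTINCT(fullVisitorId) AS approx_count  # statistical approximation, often good enough
FROM `data-to-insights.ecommerce.web_analytics`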
Analytic functions: WITH clause and RANK()
Real example
Now if you recall, you cannot reference the alias of a field that you just created in the SELECT list from within that same query's WHERE clause.
To solve this, we can use a WITH clause to save the first part of the query as a named temporary result and then query that result, applying the filter there (as sketched below).
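A hedged sketch of that pattern, reusing the ecommerce fields from the labs later in these notes:
#standardSQL
WITH product_revenue AS (
  SELECT
    p.v2ProductName,
    ROUND(SUM(p.localProductRevenue / 1000000), 2) AS revenue
  FROM `data-to-insights.ecommerce.web_analytics`,
    UNNEST(hits) AS h,
    UNNEST(h.product) AS p
  GROUP BY p.v2ProductName
)
SELECT
  v2ProductName,
  revenue,
  RANK() OVER (ORDER BY revenue DESC) AS revenue_rank  # analytic function over the temporary result
FROM product_revenue
WHERE revenue > 1000  # filtering on the calculated field now works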
User Defined Functions (UDF)
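A minimal, hedged sketch of a temporary SQL UDF (the function name and logic are hypothetical):
#standardSQL
CREATE TEMP FUNCTION cleanCurrency(amount STRING)  # temporary UDF, exists only for this query
RETURNS FLOAT64
AS (
  CAST(REPLACE(REPLACE(amount, '$', ''), ',', '') AS FLOAT64)
);
SELECT cleanCurrency('$1,234.50') AS cleaned_amount;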
UDF limits
Traditional Relational Database Architecture
Wide tables are not scalable
Instead of storing everything in one massive table, you can break apart those fields into rows in another table.
So how do you access elements in that array? To do that, we use the operator called UNNEST.
Explicitly flatten arrays with UNNEST()
Lab
As you saw earlier, arrays can only have one data type. There is, however, a data type which supports having both multiple field names and
types: the STRUCT data type.
To recap:
Structs are containers that can have multiple field names and data types nested inside.
An array can be one of the field types inside of a STRUCT (as shown above with the splits field).
In traditional relational SQL, if you had a races table and a participants table what would you do to get information from both tables? You
would JOIN them together. Here the participant STRUCT (which is conceptually very similar to a table) is already part of your races table
but is not yet correlated correctly with your non-STRUCT field "race".
Can you think of what two word SQL command you would use to correlate the 800M race with each of the racers in the first table?
If you have more than one race type (800M, 100M, 200M), wouldn't a CROSS JOIN just associate every racer name with every possible
race like a cartesian product?
Answer: No. This is a correlated cross join which only unpacks the elements associated with a single row. For a greater discussion, see
working with ARRAYs and STRUCTs
Recap of STRUCTs:
A SQL STRUCT is simply a container of other data fields which can be of different data types. The word struct means data structure.
Recall the example from earlier:
STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits) AS runner
STRUCTs are given an alias (like runner above) and can conceptually be thought of as a table inside of your main table.
STRUCTs (and ARRAYs) must be unpacked before you can operate over their elements. Wrap an UNNEST() around the name of the
struct itself or the struct field that is an array in order to unpack and flatten it.
Recall that the UNNEST operator takes an ARRAY and returns a table, with one row for each element in the ARRAY. This will allow you to perform normal SQL operations like aggregating values within an array, filtering arrays for particular values, and ordering and sorting arrays.
As a reminder, an Array is an ordered list of elements that share a data type. Here is a string array of the 8 racer names.
['Rudisha','Makhloufi','Murphy','Bosse','Rotich','Lewandowski','Kipketer','Berian']
You can create arrays in BigQuery by adding brackets [ ] and comma separating values.
In order to find the racers whose names begin with the letter M, you need to unpack the above array into individual rows so you can use a
WHERE clause.
Unpacking the array is done by wrapping the array (or the name of the array) with UNNEST() as shown below. Run the below query and
note how many rows are returned.
#standardSQL
SELECT * FROM
UNNEST(['Rudisha','Makhloufi','Murphy','Bosse','Rotich','Lewandowski','Kipketer','Berian']) AS unnested_array_of_names
Now add a normal WHERE clause to filter these rows, and run the query:
#standardSQL
SELECT * FROM
UNNEST(['Rudisha','Makhloufi','Murphy','Bosse','Rotich','Lewandowski','Kipketer','Berian']) AS unnested_array_of_names
WHERE unnested_array_of_names LIKE 'M%'
Write a query that will list the total race time for racers whose names begin with R. Order the results with the fastest total time first. Use the UNNEST() operator and start with the partially written query below.
#standardSQL
SELECT
p.name,
SUM(split_times) as total_race_time
FROM racing.race_results AS r
, UNNEST(r.participants) AS p
, UNNEST(p.splits) AS split_times
WHERE p.name LIKE 'R%'
GROUP BY p.name
ORDER BY total_race_time ASC;
#standardSQL
SELECT
p.name,
split_time
FROM racing.race_results AS r
, UNNEST(r.participants) AS p
, UNNEST(p.splits) AS split_time
WHERE split_time = 23.2;
Advanced Data Studio
Case statements and filters
As a quick summary:
1. You can create calculated fields directly in Data Studio, as you saw.
2. You can disable the prefetch cache for more real-time data needs.
3. And lastly, be aware that sharing reports does not automatically add your audience as viewers to the underlying dataset in BigQuery. You'll have to do that separately, and we'll cover data access later in this course.
Avoid BigQuery Performance Pitfalls
The main takeaway with WITH clauses is that the results of the named subqueries are not materialized: they are re-queried every single time you reference them, which matters if you reference those subqueries more than once in a long query. So it's a trade-off between readability of your code (WITH clauses) and performance or speed (storing those results as permanent tables).
First up are the yellow bars, which represent the amount of time your workers spend waiting to receive work. One potential pitfall here is a highly skewed dataset, where one worker becomes the bottleneck and the rest sit idle, waiting for the work on that one worker to finish before the remaining work can be distributed.
Second is read time, which represents the amount of time it takes your workers to read your data from disk.
Third, in orange, is the amount of computational work done as part of your query.
Fourth is the amount of time it takes to write your data to disk, for example when you're creating something like a permanent table.
Each stage represents a shuffling of data between those workers. If you have many, many stages, that represents a lot of communication, or passing of data, between the workers. The key thing to remember is that each individual worker (and you can sometimes have upwards of 2,000 workers on a query at once) is selfish: it likes to have a partition of your data with all of the relevant IDs in its own local memory, so it can perform aggregations or any other query operations across that subset of your data very efficiently.
Traditionally, in relational database systems, a database administrator (DBA) would look at your database and then manually partition, or pre-allocate, that space in very narrowly defined partitions on disk. As you can imagine, you pretty much had to be a hardware master to get those individual partitions set up and performing well alongside your queries. Later on down the road, the ability to manually shard your own tables came along.
The last and best option, if you come across this issue and you really only want to query the most recent data (say 2016, 2017, and 2018), is to set up a special version of your table called a date-partitioned table. It behaves just like a normal table except that it has a partition column, which is a date or timestamp field. The example I'm going to go with is a massive history of customer orders going back, say, 30 years, where you're really only interested in querying things from the last 3 years. You want to partition the table itself so that you're not scanning over everything, as is typically the case with an ordinary SELECT query.
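A hedged sketch of querying such a date-partitioned table (the table and column names are hypothetical); because the filter is on the partition column, only the recent partitions are scanned instead of all 30 years:
#standardSQL
SELECT
  customer_id,
  SUM(order_total) AS total_spent
FROM dataset.customer_orders_partitioned  # hypothetical table partitioned on order_date
WHERE order_date >= DATE '2016-01-01'     # partition filter: only the last few years are read
GROUP BY customer_id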
Recap
Since you are paying for the total number of bytes processed, you first want to limit the number of columns you return, and also consider filtering out rows with a WHERE clause whenever you can. This doesn't mean you can't write queries that process your entire dataset (BigQuery was, of course, built for petabyte scale); you just want to be mindful of the resources you're consuming.
We then covered the query explanation map, where we get visual cues about the types of work that your queries demand the most.
Note that a large difference between the average and the max could indicate a heavy data skew, which can be helped by filtering your data as early as possible.
We also covered some SQL bad behavior, like selecting every record and ordering it without using a LIMIT clause.
Finally, if you only access recent data, like the most recent events, consider using table partitioning to reduce the total number of bytes scanned by your query.
Some observations
The denormalized query (#2) is faster and uses less slot time to achieve the same result
The relational query (#1) has many more input stages and spends the most worker time in joining the datasets together
The denormalized query (#2) spends the most time reading data input and outputting the results. There is minimal time spent in
aggregations and joins.
Neither query resulted in bytes spilled to disk, which suggests our datasets are likely not skewed (or not large enough to spill out of an individual worker's memory)
Jupyter Notebooks
One of the BIGGEST advantages for Data Analysts is the ability to have both your SQL code AND your documentation (text, images, bash
scripts) in a single version-controlled file. I've included an optional lab following this reading item for those who want to get practice using
Notebooks with BigQuery ahead of the ML course content.
One of the really cool things you can do is invoke BigQuery functions like SESSION_USER(), which returns the email address of the person who is logged in. Assuming your data is structured with a column such as allowed_viewer or allowed_group that you can map that email address against, you can set up some pretty neat row-level filtering on your underlying data.
The two key areas to focus on for access permissions are the project level, which includes BigQuery, your Cloud Storage buckets, and everything else, and then, individually, the BigQuery dataset level. You can access and share datasets by clicking the triangle next to each of your datasets.
Machine learning is not about providing the model with rules like a hard-coded recipe; the beauty of ML is that the recipe is created by the algorithm itself.
An example from Google Search: there were a ton of rules in the search engine codebase to decide which sports team to show, and where, based on the user's location. If the query is "giants" and the user is in the Bay Area, show them results for the San Francisco Giants. If the user is in the New York area, show them results for the New York Giants.
After separating training and testing data, we need to select the ML model
Examples:
The ultimate goal is the same. If we're given labeled training data, predict the label for unseen or unknown data.
Now, no topic on ML models would be complete without first discussing the difference between supervised and unsupervised learning.
Supervised learning is everything that we've covered so far. It's where you already know the correct answer as a label in your dataset or your training dataset, and, spoiler alert, these problems are typically easier for prediction. Unsupervised learning is where you have the data but you don't have any labels.
Pre-trained vs trained ML models
Now, there are two broad categories of ML models. You have models that are already built, called pre-trained, and then custom models that you build yourself using code such as TensorFlow.
Cloud NLP
Collaborative notebooks in the Cloud
As you'll see in the later modules, data scientists (which you're quickly on the road to becoming) will often develop "locally" within Datalab. That means you have your own Google Compute Engine instance with Datalab up and running, much like in the Google Analytics demo shown a little earlier. You bring in data and process it, using something like a Pandas DataFrame, with a CSV file or a small sample of the data as your data store. Then, in the case of ML, once you've got a prototype working really well and iterating quickly, you can scale out, get access to all the cloud resources, and replace a lot of those "local stamps" with "cloud stamps". So instead of CSV files, you pull data directly from BigQuery or Cloud Storage; you up-level your Apache Beam workflows into a managed service like Cloud Dataflow; instead of Pandas DataFrames, you use BigQuery, that petabyte-scale data warehouse; and the same with TensorFlow: instead of running individual lines of TensorFlow code yourself, you can use the Cloud Machine Learning Engine APIs to get that managed TensorFlow experience. So what you'll see in a lot of these examples, especially in the later modules of this course, is that you'll do a lot of the initial development locally within your Cloud Datalab notebook and then reach out to these other services as you get closer to productionalizing, or serving, your ML model.
import google.datalab.bigquery as bq
Use -n to provide a name for the query as you can have more than one query in a single notebook
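As I recall from Datalab's documentation, the %%bq query cell magic is how you name a query; a hedged sketch (the query name is hypothetical, and the table is the ecommerce one used elsewhere in these notes):
%%bq query -n total_visitors
SELECT COUNT(DISTINCT fullVisitorId) AS visitors
FROM `data-to-insights.ecommerce.web_analytics`
The named query can then be referenced from later cells, for example when charting.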
Charting commands
What makes a good dataset for ML
A ROC curve, or receiver operating characteristic curve, is a graph that shows the performance of a classification model at all classification thresholds. The curve plots the two parameters you see here: the true positive rate against the false positive rate.
LAB
Scenario: Your data analyst team exported the Google Analytics logs for an ecommerce website into BigQuery and created a new table of
all the raw ecommerce visitor session data for you to explore. Using this data, you'll try to answer a few questions.
Question: Out of the total visitors who visited our website, what % made a purchase?
#standardSQL
WITH visitors AS(
SELECT
COUNT(DISTINCT fullVisitorId) AS total_visitors
FROM `data-to-insights.ecommerce.web_analytics`
),
purchasers AS(
SELECT
COUNT(DISTINCT fullVisitorId) AS total_purchasers
FROM `data-to-insights.ecommerce.web_analytics`
WHERE totals.transactions IS NOT NULL
)
SELECT
total_visitors,
total_purchasers,
total_purchasers / total_visitors AS conversion_rate
FROM visitors, purchasers
#standardSQL
SELECT
p.v2ProductName,
p.v2ProductCategory,
SUM(p.productQuantity) AS units_sold,
ROUND(SUM(p.localProductRevenue/1000000),2) AS revenue
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h,
UNNEST(h.product) AS p
GROUP BY 1, 2
ORDER BY revenue DESC
LIMIT 5;
#standardSQL
# visitors who bought on a return visit (could have bought on first visit as well)
WITH all_visitor_stats AS (
SELECT
fullvisitorid, # 741,721 unique visitors
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
SELECT
COUNT(DISTINCT fullvisitorid) AS total_visitors,
will_buy_on_return_visit
FROM all_visitor_stats
GROUP BY will_buy_on_return_visit
Your team decides to test whether these two fields are good inputs for your classification model: totals.bounces (whether the visitor left the website immediately) and totals.timeOnSite (how long the visitor was on the website).
Question: What are the risks of only using the above two fields?
Answer: Machine learning is only as good as the training data that is fed into it. If there isn't enough information for the model to determine
and learn the relationship between your input features and your label (in this case, whether the visitor bought in the future) then you will not
have an accurate model. While training a model on just these two fields is a start, you will see if they're good enough to produce an accurate
model.
#standardSQL
SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1)
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
ORDER BY time_on_site DESC
LIMIT 10;
Which fields are the model features? What is the label (correct answer)?
Question: Which two fields are known after a visitor's first session?
Answer: bounces and time_on_site are known after a visitor's first session.
Answer: will_buy_on_return_visit is not known after the first visit. Again, you're predicting for a subset of users who returned to your website and purchased. Since you don't know the future at prediction time, you cannot say with certainty whether a new visitor will come back and purchase. The value of building an ML model is to get the probability of a future purchase based on the data gleaned about their first session.
Question: Looking at the initial data results, do you think time_on_site and bounces will be a good indicator of whether the user will return
and purchase or not?
Answer: It's often too early to tell before training and evaluating the model, but at first glance out of the top 10 time_on_site, only 1
customer returned to buy, which isn't very promising. Let's see how well the model does.
Note: There are many additional model types used in Machine Learning (like Neural Networks and decision trees) available in libraries such as [TensorFlow](https://www.tensorflow.org/tutorials/). At the time of writing, BQML supports two model types: linear_reg (forecasting) and logistic_reg (classification).
Since you are bucketing visitors into "will buy in future" or "won't buy in future", use logistic_reg in a classification model.
Add this query in the Query editor to create a model and specify model options:
#standardSQL
CREATE OR REPLACE MODEL `ecommerce.classification_model`
OPTIONS
(
model_type='logistic_reg',
labels = ['will_buy_on_return_visit']
)
AS
SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430') # train on first 9 months
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
;
This relationship is visualized with a ROC (Receiver Operating Characteristic) curve like the one shown here, where you try to maximize
the area under the curve or AUC:
In BQML, roc_auc is simply a queryable field when evaluating your trained ML model.
Now that training is complete, you can evaluate how well the model performs with this query using ML.EVALUATE.
Add this query in the Query editor, then Run the query:
#standardSQL
SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'decent'
WHEN roc_auc > .6 THEN 'not great'
ELSE 'poor' END AS model_quality
FROM
ML.EVALUATE(MODEL ecommerce.classification_model, (
SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1
AND date BETWEEN '20170501' AND '20170630') # eval on 2 months
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
));
Improve model performance with Feature Engineering
As was hinted at earlier, there are many more features in the dataset that may help the model better understand the relationship between a
visitor's first session and the likelihood that they will purchase on a subsequent visit.
Let's add some new features and create a second machine learning model which will be called classification_model_2:
How far the visitor got in the checkout process on their first visit
Where the visitor came from (traffic source: organic search, referring site, etc.)
Device category (mobile, tablet, desktop)
Geographic information (country)
Create this second model by adding this query in the Query editor, then Run the query:
#standardSQL
CREATE OR REPLACE MODEL `ecommerce.classification_model_2`
OPTIONS
(model_type='logistic_reg', labels = ['will_buy_on_return_visit']) AS
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
# checkout progress reached on the first visit (hits.eCommerceAction.action_type)
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430' # train 9 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
);
Note: You are still training on the same first 9 months of data, even with this new model. It's important to have the same training dataset so
you can be certain a better model output is attributable to better input features and not new or different training data.
A key new feature that was added to the training dataset query is the maximum checkout progress each visitor reached in their session,
which is recorded in the field hits.eCommerceAction.action_type. If you search for that field in the field definitions you will see the field
mapping of 6 = Completed Purchase.
As an aside, the web analytics dataset has nested and repeated fields like ARRAYs, which need to be broken apart into separate rows in your dataset. This is accomplished by using the UNNEST() function, which you can see in the above query.
#standardSQL
SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'decent'
WHEN roc_auc > .6 THEN 'not great'
ELSE 'poor' END AS model_quality
FROM
ML.EVALUATE(MODEL ecommerce.classification_model_2, (
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
# checkout progress reached on the first visit (hits.eCommerceAction.action_type)
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20170501' AND '20170630' # eval 2 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
));
#standardSQL
SELECT
*
FROM
ml.PREDICT(MODEL `ecommerce.classification_model_2`,
(
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
SELECT
CONCAT(fullvisitorid, '-', CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
# checkout progress reached on the first visit (hits.eCommerceAction.action_type)
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE
# only predict for new visits
totals.newVisits = 1
AND date BETWEEN '20170701' AND '20170801' # test 1 month
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
ORDER BY
predicted_will_buy_on_return_visit DESC;
Results
Of the top 6% of first-time visitors (sorted in decreasing order of predicted probability), more than 6% make a purchase in a later visit.
These users represent nearly 50% of all first-time visitors who make a purchase in a later visit.
Targeting the top 6% of first-time visitors increases marketing ROI by 9x vs. targeting them all!
Additional information
Tip: add warm_start = true to your model options if you are retraining new data on an existing model for faster training times. Note that you
cannot change the feature columns (this would necessitate a new model).
roc_auc is just one of the performance metrics available during model evaluation. Also available are accuracy, precision, and recall.
Knowing which performance metric to rely on is highly dependent on what your overall objective or goal is