Practice Use Case
In this Chapter you will practice applying what you learned in the previous lessons in the context of a use
case working with customer data. Before you begin working on this lesson, read the following background
information about the Customers dataset and use case. Once you have completed the use case, you can
find the steps we took and the results produced with in the following pages. This example is meant to test
what you have learned thus far. When you feel comfortable with this use case example, we recommend
you challenge your wrangling skills by taking our Certification Exam.
Before getting started, download the following datasets:
• Use Case Data
Example
You work for a technical services company that sells three monthly subscription products:
• Silver (price: $9.99/month)
• Gold (price: $14.99/month)
• Platinum (price: $29.99/month)
The company occasionally offers promotional discounts – therefore, some product prices may be slightly
lower than those listed above. Your overall goal is to provide an analysis of sales activity by zip code.
In order to do this, you’ll need to join your Lab_customers dataset (where zip code information resides)
with sales data from your transactions datasets. Once you’ve joined the data, you’ll need to aggregate the
results.
Create a new flow
Add Following Datasets:
• lab_customers.csv
• lab_2010_transactions.csv
• lab_2011_transactions.csv
• lab_2012_transactions.csv
• lab_2013_transactions.csv
• lab_2014_transactions.csv
• lab_2015_transactions.csv
• zip_to_state_map.csv
Add recipe to Lab_Customers
Add a Recipe to the Lab_Customers dataset. Edit the Recipe to enter the Transformer Grid
Filter out faulty data
Remove the contacts who are not actual customers: We identify actual customers by the start_date
column.
Provide a default end_date value for active customers
For those current customers without an end date, set the default end date to January 01, 2050.
Fix the start_date data type mismatches
Notice that the start_date column contains in two different formats. You want all of the dates in this
column to conform to the format ‘yyyy/MM/dd’.
Lab_Transactions
Add a Recipe to the Lab_2010_transactions dataset. Edit that Recipe to enter the grid
Remove dummy data
Since this company sells online subscription products, the website sends a test transaction of $0.01 to
validate that a customer’s credit card is valid. Test transactions are recorded in the database and appear in
the transactions file. You need to remove these transactions so they don’t distort your analysis.
Clean up Discount Column
Remove the % from the Discount column, then convert the column to a decimal. Fill in the missing values
with 0.
Standardize product names
After 2011, the company renamed the subscription products. To maintain consistency between the
transactions files from each year, you need to replace the old product names with the new product names.
• Change ‘basic’ to ‘silver’
• Change ‘deluxe’ to ‘gold’
Determine the real price for each product
The real price for each product is a combination of the list price and the discount. Create a new column
that contains the actual charge for each transaction.
Union transactions datasets
Union 2010 – 2015 datasets. We want to apply the above transforms to all data, make sure your recipe step
is added to the beginning of your recipe, so that all the steps in your recipe are applied to all of the
unioned datasets.
• 2011_Transactions
• 2012_Transactions
• 2013_Transactions
• 2014_Transactions
• 2015_Transactions
Join with the Customers dataset
Join transaction data with the customer data, keep the following columns.
• Customer_id (2010_transactions)
• transaction_date (2010 Transactions)
• adjusted_price (2010 Transactions)
• product (2010 Transactions)
• address_state (Customers)
• address_zip (Customers)
• region (Customers)
• start_date (Customers)
• end_date (Customers)
Lookup state by zip
Scroll to the address_state column and notice that there are some blank records. Perform a lookup on the
address_zip column to zip_to_state_map.csv. This Lookup file contains a state mapping for each zip code.
Rename column3 to State and delete Address_state
Filter transactions based on date
Only keep transactions that have occurred during the course of a customer’s active subscription (i.e. the
transaction_date is before the end_date
Create a new table with aggregated results
Create a table that lists the count, sum, min and max of adjusted price, grouped by address zip.
Run the job at scale
Run job and view job results.
Solution
Solution Guide