
Week 2 Class Agenda

1. DAA Ch. 2 Lecture

2. Lab 2-3 Data Cleaning Exercise

[10 min break]

3. SQL Part 1 Lecture

4. SQL Part 1 Exercises


Ch. 2: Data Preparation
and Cleaning
Brandon Lock, Ph.D.
Baruch College, CUNY
Where we are now
1. Data Analytics in Accounting
2. Data Preparation and Cleaning
3. Performing the Test Plan
4. Visualization
5. Modern Accounting
6. Audit Analytics
7. Managerial Performance Indicators
8. Financial Statement Analytics
Learning Objectives
1. Why and how is data stored in relational databases?

2. Extraction, transformation, and loading (ETL) data techniques

3. Common data cleaning techniques


In the IMPACT cycle, we're going to look at Mastering the Data.

Exhibit 1-1 The IMPACT Cycle:
1. Identify the questions
2. Master the data
3. Perform test plan
4. Address and refine results
5. Communicate insights
6. Track outcomes


[Image: DALL-E 3 rendering of a SQL database at Baruch College (25th St and Lexington) in the style of a Michael Bay movie]
1. Why and how is data
stored in relational
databases?
Relational Databases versus Flat Files
Structured data is generally stored in either a relational database or a flat file.
• Structured data = quantitative, machine-readable data

Flat files
• A file containing a single table with all relevant data (e.g., CSV file)
• Pro: Simple storage format; ready to analyze
• Con: Data redundancy can lead to inefficient storage

Relational database
• Most common database type that stores data in tables that can be linked
(related) to one another
• Relational databases store data in a normalized format, which organizes data
into separate tables to reduce data redundancy
• Denormalizing = Process of merging tables in a relational database to create
a flat file for analysis
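Denormalizing can be sketched with a join: the normalized tables each store a fact once, and the join merges them into one flat result set for analysis. The table and column names below follow the chapter's Supplier example, but the specific values are illustrative assumptions, not data from a real system.

```python
# Sketch: denormalizing two related tables into a flat file for analysis.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Supplier (SupplierID INTEGER PRIMARY KEY, SupplierName TEXT)")
cur.execute("CREATE TABLE PurchaseOrder (PONo INTEGER PRIMARY KEY, "
            "SupplierID INTEGER REFERENCES Supplier(SupplierID))")
cur.executemany("INSERT INTO Supplier VALUES (?, ?)",
                [(1, "Northern Brewery Homebrew Supply"), (2, "Hops Direct LLC")])
cur.executemany("INSERT INTO PurchaseOrder VALUES (?, ?)",
                [(1787, 1), (1788, 2), (1789, 1)])

# Denormalize: a JOIN repeats each supplier's name on every matching PO row --
# exactly the redundancy that normalized storage avoids.
rows = cur.execute("""
    SELECT po.PONo, s.SupplierName
    FROM PurchaseOrder po
    JOIN Supplier s ON po.SupplierID = s.SupplierID
    ORDER BY po.PONo
""").fetchall()
print(rows)
```

Note how the supplier name for Northern Brewery appears twice in the flat output but only once in the Supplier table.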
Q: Why store business data in a
relational database versus other
file formats (e.g., Excel file)?
Benefits of Relational Databases
• Completeness: Ensures all relevant data required for a business process are together in the same dataset

• Reduces redundancy (improves efficiency): Reduces redundancy over flat files, mitigates errors by using multiple tables, and facilitates working with large heterogeneous datasets

• Business rules are enforced: Relational databases can be designed to aid in the enforcement of internal controls and business rules

• Aid integration and communication of business processes: Relational databases support business processes across an organization, facilitating interactions between business units who are all using consistent datasets.
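The "business rules are enforced" benefit can be made concrete with a foreign key constraint: the database itself rejects a purchase order that references a supplier that does not exist. This is a minimal sketch using SQLite (which requires `PRAGMA foreign_keys = ON` to enforce foreign keys); the table names echo the chapter's example and the data is made up.

```python
# Sketch: a relational database enforcing a business rule via a foreign key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE Supplier (SupplierID INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE PurchaseOrder (
    PONo INTEGER PRIMARY KEY,
    SupplierID INTEGER NOT NULL REFERENCES Supplier(SupplierID))""")
conn.execute("INSERT INTO Supplier VALUES (1)")
conn.execute("INSERT INTO PurchaseOrder VALUES (1787, 1)")  # valid: supplier 1 exists

# A PO pointing at a nonexistent supplier violates the rule and is rejected.
try:
    conn.execute("INSERT INTO PurchaseOrder VALUES (1788, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```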
Relational Database Table Attributes

Primary keys = Column that uniquely identifies each row in the table and cannot be missing. Every table has one primary key (Supplier ID).

Composite keys = Combination of two or more variables that can be combined to be a primary key.
• Could be (Supplier Name, Supplier Address) if there were no Supplier ID column

Foreign keys = Attributes that point to the primary key in another table.

Descriptive attributes = Include all other variables.

Supplier Table
Supplier ID | Supplier Name                    | Supplier Address      | Supplier Type
1           | Northern Brewery Homebrew Supply | 6021 Lyndale Ave S    | 1
2           | Hops Direct LLC                  | 686 Green Valley Road | 1
3           | The Home Brewery                 | 455 E. Township St.   | 1
4           | The Payroll Company              | 408 N. Walton Blvd    | 2
Examples of two tables, attributes, and data. Notice the Primary Key (PK)-Foreign Key (FK) relationship.

Purchase Order Table
PO No. | Date       | Created By | Approved By | Supplier ID (FK)
1787   | 11/1/2017  | 1001       | 1010        | 1
1788   | 11/1/2017  | 1005       | 1010        | 2
1789   | 11/8/2017  | 1002       | 1010        | 1
1790   | 11/15/2017 | 1005       | 1010        | 1

Supplier Table
Supplier ID (PK) | Supplier Name                    | Supplier Address      | Supplier Type
1                | Northern Brewery Homebrew Supply | 6021 Lyndale Ave S    | 1
2                | Hops Direct LLC                  | 686 Green Valley Road | 1
3                | The Home Brewery                 | 455 E. Township St.   | 1
4                | The Payroll Company              | 408 N. Walton Blvd    | 2
Data dictionaries define data attributes within a dataset.
For each attribute, we learn:
• What type of key it is.
• What data are required.
• What data can be stored in it.
• How much data is stored.

Supplier Table Data Dictionary
Attribute Name | PK or FK? | Required | Description                            | Data Type  | Default Value | Field Size | Notes
Supplier ID    | PK        | Y        | Unique Identifier for each Supplier    | Number     | n/a           | 10         |
Supplier Name  |           | N        | First and Last Name                    | Short Text | n/a           | 30         |
Supplier Type  | FK        | N        | Code for Different Supplier Categories | Number     | Null          | 10         | 1: Vendor; 2: Misc
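A data dictionary like this one maps almost directly onto a table definition. Below is a minimal sketch of how the Supplier dictionary could be implemented in SQLite; the `CHECK` constraint encoding the "1: Vendor, 2: Misc" note is one possible design choice, not something the dictionary itself mandates.

```python
# Sketch: turning the Supplier data dictionary into a table definition.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Supplier (
        SupplierID   INTEGER PRIMARY KEY,     -- PK: unique, required
        SupplierName TEXT,                    -- Short Text, optional
        SupplierType INTEGER DEFAULT NULL     -- FK code, defaults to Null
            CHECK (SupplierType IN (1, 2))    -- Notes: 1 = Vendor, 2 = Misc
    )
""")
conn.execute("INSERT INTO Supplier (SupplierName, SupplierType) "
             "VALUES ('Hops Direct LLC', 1)")
row = conn.execute("SELECT * FROM Supplier").fetchone()
print(row)
```

Inserting a row with an undefined supplier type code (say, 3) would raise an `IntegrityError`, which is the data dictionary's "Notes" column being enforced by the database.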
Modeling Relational Databases
Unified Modeling Language (UML) is a way to diagram the tables
within databases, including the primary key (PK) and foreign key
(FK) relationships

FGI_Product: Product_Code [PK], Product_Description, …
Sales_Subset: Sales_Order_ID [PK], Product_Code [FK], Customer_ID [FK]
Customer: Customer_ID [PK], Customer_Name
Q. How would you organize data on
Students, Courses, and Student Course
Performance in a database?
• Assume there are three tables: Students, Courses, and Course
Enrollment
• Student table = Persistent student characteristics
• Course table = Characteristics of courses offered each semester
• Course enrollment table = Students who are enrolled in courses each semester

• What would be the primary and foreign keys?

• Q: Given a student name, how would we build a student's transcript?
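One possible answer can be sketched directly in code: the Students and Courses tables each carry a primary key, and Course Enrollment references both (its primary key is a composite of the two foreign keys plus the semester). The student name, course ID, and grade below are invented for illustration.

```python
# Sketch of one answer: three tables, with a transcript built by joining them.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, StudentName TEXT);
    CREATE TABLE Courses (CourseID TEXT, Semester TEXT, CourseTitle TEXT,
                          PRIMARY KEY (CourseID, Semester));
    CREATE TABLE CourseEnrollment (
        StudentID INTEGER REFERENCES Students(StudentID),  -- FK to Students
        CourseID TEXT, Semester TEXT,                      -- FK to Courses
        Grade TEXT,
        PRIMARY KEY (StudentID, CourseID, Semester));      -- composite key
    INSERT INTO Students VALUES (1, 'Ada Lovelace');
    INSERT INTO Courses VALUES ('ACC9880', 'Fall 2024', 'Data Analytics in Accounting');
    INSERT INTO CourseEnrollment VALUES (1, 'ACC9880', 'Fall 2024', 'A');
""")

# Transcript: start from the student's name, join through the enrollment table.
transcript = conn.execute("""
    SELECT c.Semester, c.CourseID, c.CourseTitle, e.Grade
    FROM Students s
    JOIN CourseEnrollment e ON s.StudentID = e.StudentID
    JOIN Courses c ON e.CourseID = c.CourseID AND e.Semester = c.Semester
    WHERE s.StudentName = 'Ada Lovelace'
""").fetchall()
print(transcript)
```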
2. What does it mean to
extract, transform, and
load (ETL) data?
Extract, Transform, and Load (ETL) Data
EXTRACT
Step 1: Determine the purpose and scope of data
required
Step 2: Obtain the data
TRANSFORM
Step 3: Validate the data for completeness and integrity
Step 4: Clean the data
LOAD
Step 5: Load the data for data analysis
Step 1: Determine the purpose and
scope of data required
Ask a few questions before beginning the process:
• What is the purpose of the data request?
• What data do you need to perform your analysis?
• What risks exist in data integrity?
• What other information will impact your data analysis?
Step 2: Obtain the data
• How will data be requested and/or obtained?
• Do you have access to the data yourself, or do you need to request a
database administrator or the information systems department to
provide the data for you?
• If you need to request the data, is there a standard data request form
that you should use and from whom do you request the data?
• Where are the data located (e.g., financial or other related systems)?
• What specific data are needed (tables and fields)?
• What tools will be used to perform data analytic tests or procedures
and why?
Step 2: Obtain the data
There are a couple options:
• Obtain data through a data request to the IT department.
• Obtain data yourself.
Example: Typical Data Request Form

SECTION 1: REQUEST DETAILS
• Requestor Name:
• Requestor Contact Number:
• Requestor Email Address:
• Frequency (circle one): One-Off / Annually / Termly / Other: ___________
• Format you wish the data to be delivered in (circle one): Spreadsheet / Word Document / Text File / Other: ____________
• Please provide a description of the information needed (indicate which tables and which fields you require):
• Request Date:
• Required Date:
• What will the information be used for?
• Intended Audience:
• Customer (if not requestor):
Obtain the data yourself
If you have direct access to a data warehouse, you can use SQL and other
tools to pull the data yourself.
1. Identify the tables that contain the information you need. You can do
this by looking through the data dictionary or the relationship model.
2. Identify which attributes, specifically, hold the information you need in
each table.
3. Identify how those tables are related to each other.
4. Retrieve data subset using a SQL query in Excel or other software
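The final step above can be sketched as a single parameterized query: once you know the table, the attributes, and the filter you need, you pull only that subset rather than the whole table. The supplier data below is an invented example.

```python
# Sketch of step 4: retrieve only the needed subset with a SQL query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Supplier (SupplierID INTEGER PRIMARY KEY,
                           SupplierName TEXT, SupplierType INTEGER);
    INSERT INTO Supplier VALUES (1, 'Northern Brewery Homebrew Supply', 1),
                                (4, 'The Payroll Company', 2);
""")

# Only the attributes and rows the analysis needs -- not SELECT * on everything.
subset = conn.execute(
    "SELECT SupplierID, SupplierName FROM Supplier WHERE SupplierType = ?",
    (1,)).fetchall()
print(subset)
```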
Q. What are some common issues
with data that must be fixed before
analysis can take place?
Step 3: Validate the data for
completeness and integrity
1. Check that the number of records is as expected
2. Check that the data types for each variable (e.g., text, integer, floating
point number) are as expected
• Be aware of variable size and length limits (e.g., text, integer)
• Validate date/time variables are in a date-time format
3. Examine descriptive statistics (e.g., max, min, missing values) to
evaluate data integrity
Common Data Types:
https://www.w3schools.com/sql/sql_datatypes.asp ;
https://www.tutorialspoint.com/sqlite/sqlite_data_types.htm
Data Types
• In data analytics software, each data value is characterized by a data
type
• The data type defines the set of possible values, allowed operations, and
computational representation of the data.

Excel Data Types


• Numbers (including integer, decimal, currency, percentage)
• Dates (stored as a “serial number” representing days since Jan 1, 1900)
• Text data (also called “string” type)
• Logical data (TRUE, FALSE)
• Error Data Type (when a formula results in an error)
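The date "serial number" convention becomes visible when Excel dates are exported as raw numbers. A minimal conversion sketch: Excel's day 1 is Jan 1, 1900, but Excel also (incorrectly) treats 1900 as a leap year, so the usual conversion for dates after Feb 28, 1900 subtracts two days rather than one.

```python
# Sketch: converting an Excel date serial number to a calendar date.
from datetime import date, timedelta

def excel_serial_to_date(serial: int) -> date:
    # Valid for serials after Feb 28, 1900 (Excel's phantom Feb 29, 1900
    # means the offset is 2 days, not 1).
    return date(1900, 1, 1) + timedelta(days=serial - 2)

# Serial 43040 corresponds to 11/1/2017, the first PO date in the
# Purchase Order example.
print(excel_serial_to_date(43040))
```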
Step 4: Clean the data
Once you have valid data, there is still some work that needs to be done
to make sure it is consistent and ready for analysis:
1. Remove extraneous data (e.g., headings or subtotals)
2. Clean numbers (e.g., leading zeros, negative numbers)
3. Clean textual data (e.g., remove leading/trailing spaces, remove non-
printable characters, convert to numerical data when necessary)
4. Correct inconsistencies across data, in general
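Steps 2 and 3 above can be sketched in a few lines; the sample values (padded spaces, a non-printable NUL character, leading zeros) are invented examples of the kind of mess that appears in real extracts.

```python
# Sketch: cleaning messy text fields before converting them to numbers.
raw = ["  00042 ", "0198\x00", " 7 "]

cleaned = []
for value in raw:
    value = value.strip()                                    # drop leading/trailing spaces
    value = "".join(c for c in value if c.isprintable())     # drop non-printable chars
    cleaned.append(int(value))                               # to number (also removes
                                                             # leading zeros)
print(cleaned)
```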
Step 5: Load the data for data analysis
Finally, you can now import your data into the tool of your
choice and expect the functions to work properly.
Summary
Relational databases facilitate the storage of business data in a normalized format and have several efficiency and consistency advantages over flat files.

Relational databases have several important components, including primary keys, composite keys, foreign keys, and descriptive attributes.

Extract, Transform, and Load (ETL) describes the procedure to obtain, validate, clean, and load data for analyses.