
Week 2 Class Agenda

1. DAA Ch. 2 Lecture

2. Lab 2-3 Data Cleaning Exercise

[10 min break]

3. SQL Part 1 Lecture

4. SQL Part 1 Exercises


Ch. 2: Data Preparation
and Cleaning
Brandon Lock, Ph.D.
Baruch College, CUNY
Where we are now
1. Data Analytics in Accounting
2. Data Preparation and Cleaning
3. Performing the Test Plan
4. Visualization
5. Modern Accounting
6. Audit Analytics
7. Managerial Performance Indicators
8. Financial Statement Analytics
Learning Objectives
1. Why and how is data stored in relational databases?

2. Extraction, transformation, and loading (ETL) data techniques

3. Common data cleaning techniques


In the IMPACT cycle, we're going to look at Mastering the Data.

Exhibit 1-1 The IMPACT Cycle:
1. Identify the questions
2. Master the data
3. Perform test plan
4. Address and refine results
5. Communicate insights
6. Track outcomes


[Image: DALL-E 3 rendering of a SQL database at Baruch College (25th St and Lexington) in the style of a Michael Bay movie]
1. Why and how is data
stored in relational
databases?
Relational Databases versus Flat Files
Structured data is generally stored in either a relational database or a flat file.
• Structured data = quantitative, machine-readable data

Flat files
• A file containing a single table with all relevant data (e.g., CSV file)
• Pro: Simple storage format; ready to analyze
• Con: Data redundancy can lead to inefficient storage

Relational database
• Most common database type that stores data in tables that can be linked
(related) to one another
• Relational databases store data in a normalized format, which organizes data
into separate tables to reduce data redundancy
• Denormalizing = Process of merging tables in a relational database to create
a flat file for analysis
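Denormalizing can be sketched with a join: the normalized tables each store a fact once, and the join merges them into one flat result set for analysis. The table and column names below follow the chapter's Supplier example, but the specific values are illustrative assumptions, not data from a real system.

```python
# Sketch: denormalizing two related tables into a flat file for analysis.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Supplier (SupplierID INTEGER PRIMARY KEY, SupplierName TEXT)")
cur.execute("CREATE TABLE PurchaseOrder (PONo INTEGER PRIMARY KEY, "
            "SupplierID INTEGER REFERENCES Supplier(SupplierID))")
cur.executemany("INSERT INTO Supplier VALUES (?, ?)",
                [(1, "Northern Brewery Homebrew Supply"), (2, "Hops Direct LLC")])
cur.executemany("INSERT INTO PurchaseOrder VALUES (?, ?)",
                [(1787, 1), (1788, 2), (1789, 1)])

# Denormalize: a JOIN repeats each supplier's name on every matching PO row --
# exactly the redundancy that normalized storage avoids.
rows = cur.execute("""
    SELECT po.PONo, s.SupplierName
    FROM PurchaseOrder po
    JOIN Supplier s ON po.SupplierID = s.SupplierID
    ORDER BY po.PONo
""").fetchall()
print(rows)
```

Note how the supplier name for Northern Brewery appears twice in the flat output but only once in the Supplier table.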
Q: Why store business data in a
relational database versus other
file formats (e.g., Excel file)?
Benefits of Relational Databases
• Completeness: Ensures all relevant data required for a business process are together in the same dataset

• Reduces redundancy (improves efficiency): Reduces redundancy over flat files, mitigates errors by using multiple tables, and facilitates working with large heterogeneous datasets

• Business rules are enforced: Relational databases can be designed to aid in the enforcement of internal controls and business rules

• Aid integration and communication of business processes: Relational databases support business processes across an organization, facilitating interactions between business units who are all using consistent datasets.
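The "business rules are enforced" benefit can be made concrete with a foreign key constraint: the database itself rejects a purchase order that references a supplier that does not exist. This is a minimal sketch using SQLite (which requires `PRAGMA foreign_keys = ON` to enforce foreign keys); the table names echo the chapter's example and the data is made up.

```python
# Sketch: a relational database enforcing a business rule via a foreign key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE Supplier (SupplierID INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE PurchaseOrder (
    PONo INTEGER PRIMARY KEY,
    SupplierID INTEGER NOT NULL REFERENCES Supplier(SupplierID))""")
conn.execute("INSERT INTO Supplier VALUES (1)")
conn.execute("INSERT INTO PurchaseOrder VALUES (1787, 1)")  # valid: supplier 1 exists

# A PO pointing at a nonexistent supplier violates the rule and is rejected.
try:
    conn.execute("INSERT INTO PurchaseOrder VALUES (1788, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```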
Relational Database Table Attributes

Primary keys = Column that uniquely identifies each row in the table and cannot be missing. Every table has one primary key (Supplier ID).

Composite keys = Combination of two or more variables that can be combined to be a primary key.
• Could be (Supplier Name, Supplier Address) if there were no Supplier ID column

Foreign keys = Attributes that point to the primary key in another table.

Descriptive attributes = Include all other variables.

Supplier Table
Supplier ID | Supplier Name                    | Supplier Address      | Supplier Type
1           | Northern Brewery Homebrew Supply | 6021 Lyndale Ave S    | 1
2           | Hops Direct LLC                  | 686 Green Valley Road | 1
3           | The Home Brewery                 | 455 E. Township St.   | 1
4           | The Payroll Company              | 408 N. Walton Blvd    | 2
Examples of two tables, attributes, and data. Notice the Primary Key (PK)-Foreign Key (FK) relationship.

Purchase Order Table
PO No. | Date       | Created By | Approved By | Supplier ID (FK)
1787   | 11/1/2017  | 1001       | 1010        | 1
1788   | 11/1/2017  | 1005       | 1010        | 2
1789   | 11/8/2017  | 1002       | 1010        | 1
1790   | 11/15/2017 | 1005       | 1010        | 1

Supplier Table
Supplier ID (PK) | Supplier Name                    | Supplier Address      | Supplier Type
1                | Northern Brewery Homebrew Supply | 6021 Lyndale Ave S    | 1
2                | Hops Direct LLC                  | 686 Green Valley Road | 1
3                | The Home Brewery                 | 455 E. Township St.   | 1
4                | The Payroll Company              | 408 N. Walton Blvd    | 2
Data dictionaries define data attributes within a dataset.
For each attribute, we learn:
• What type of key it is.
• What data are required.
• What data can be stored in it.
• How much data is stored.

Supplier Table Data Dictionary
Attribute Name | PK or FK? | Required | Description                            | Data Type  | Default Value | Field Size | Notes
Supplier ID    | PK        | Y        | Unique Identifier for each Supplier    | Number     | n/a           | 10         |
Supplier Name  |           | N        | First and Last Name                    | Short Text | n/a           | 30         |
Supplier Type  | FK        | N        | Code for Different Supplier Categories | Number     | Null          | 10         | 1: Vendor; 2: Misc
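A data dictionary like this one maps almost directly onto a table definition. Below is a minimal sketch of how the Supplier dictionary could be implemented in SQLite; the `CHECK` constraint encoding the "1: Vendor, 2: Misc" note is one possible design choice, not something the dictionary itself mandates.

```python
# Sketch: turning the Supplier data dictionary into a table definition.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Supplier (
        SupplierID   INTEGER PRIMARY KEY,     -- PK: unique, required
        SupplierName TEXT,                    -- Short Text, optional
        SupplierType INTEGER DEFAULT NULL     -- FK code, defaults to Null
            CHECK (SupplierType IN (1, 2))    -- Notes: 1 = Vendor, 2 = Misc
    )
""")
conn.execute("INSERT INTO Supplier (SupplierName, SupplierType) "
             "VALUES ('Hops Direct LLC', 1)")
row = conn.execute("SELECT * FROM Supplier").fetchone()
print(row)
```

Inserting a row with an undefined supplier type code (say, 3) would raise an `IntegrityError`, which is the data dictionary's "Notes" column being enforced by the database.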
Modeling Relational Databases
Unified Modeling Language (UML) is a way to diagram the tables
within databases, including the primary key (PK) and foreign key
(FK) relationships

FGI_Product: Product_Code [PK], Product_Description, …
Sales_Subset: Sales_Order_ID [PK], Product_Code [FK], Customer_ID [FK]
Customer: Customer_ID [PK], Customer_Name
Q. How would you organize data on
Students, Courses, and Student Course
Performance in a database?
• Assume there are three tables: Students, Courses, and Course
Enrollment
• Student table = Persistent student characteristics
• Course table = Characteristics of courses offered each semester
• Course enrollment table = Students who are enrolled in courses each semester

• What would be the primary and foreign keys?

• Q: Given a student name, how would we build a student's transcript?
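One possible answer can be sketched directly in code: the Students and Courses tables each carry a primary key, and Course Enrollment references both (its primary key is a composite of the two foreign keys plus the semester). The student name, course ID, and grade below are invented for illustration.

```python
# Sketch of one answer: three tables, with a transcript built by joining them.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, StudentName TEXT);
    CREATE TABLE Courses (CourseID TEXT, Semester TEXT, CourseTitle TEXT,
                          PRIMARY KEY (CourseID, Semester));
    CREATE TABLE CourseEnrollment (
        StudentID INTEGER REFERENCES Students(StudentID),  -- FK to Students
        CourseID TEXT, Semester TEXT,                      -- FK to Courses
        Grade TEXT,
        PRIMARY KEY (StudentID, CourseID, Semester));      -- composite key
    INSERT INTO Students VALUES (1, 'Ada Lovelace');
    INSERT INTO Courses VALUES ('ACC9880', 'Fall 2024', 'Data Analytics in Accounting');
    INSERT INTO CourseEnrollment VALUES (1, 'ACC9880', 'Fall 2024', 'A');
""")

# Transcript: start from the student's name, join through the enrollment table.
transcript = conn.execute("""
    SELECT c.Semester, c.CourseID, c.CourseTitle, e.Grade
    FROM Students s
    JOIN CourseEnrollment e ON s.StudentID = e.StudentID
    JOIN Courses c ON e.CourseID = c.CourseID AND e.Semester = c.Semester
    WHERE s.StudentName = 'Ada Lovelace'
""").fetchall()
print(transcript)
```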
2. What does it mean to
extract, transform, and
load (ETL) data?
Extract, Transform, and Load (ETL) Data
EXTRACT
Step 1: Determine the purpose and scope of data
required
Step 2: Obtain the data
TRANSFORM
Step 3: Validate the data for completeness and integrity
Step 4: Clean the data
LOAD
Step 5: Load the data for data analysis
Step 1: Determine the purpose and
scope of data required
Ask a few questions before beginning the process:
• What is the purpose of the data request?
• What data do you need to perform your analysis?
• What risks exist in data integrity?
• What other information will impact your data analysis?
Step 2: Obtain the data
• How will data be requested and/or obtained?
• Do you have access to the data yourself, or do you need to request a
database administrator or the information systems department to
provide the data for you?
• If you need to request the data, is there a standard data request form
that you should use and from whom do you request the data?
• Where are the data located (e.g., financial or other related systems)?
• What specific data are needed (tables and fields)?
• What tools will be used to perform data analytic tests or procedures
and why?
Step 2: Obtain the data
There are a couple options:
• Obtain data through a data request to the IT department.
• Obtain data yourself.
Example: Typical Data Request Form

SECTION 1: REQUEST DETAILS
• Requestor Name:
• Requestor Contact Number:
• Requestor Email Address:
• Frequency (circle one): One-Off / Annually / Termly / Other: ___________
• Format you wish the data to be delivered in (circle one): Spreadsheet / Word Document / Text File / Other: ____________
• Please provide a description of the information needed (indicate which tables and which fields you require):
• Request Date:
• Required Date:
• What will the information be used for?
• Intended Audience:
• Customer (if not requestor):
Obtain the data yourself
If you have direct access to a data warehouse, you can use SQL and other
tools to pull the data yourself.
1. Identify the tables that contain the information you need. You can do
this by looking through the data dictionary or the relationship model.
2. Identify which attributes, specifically, hold the information you need in
each table.
3. Identify how those tables are related to each other.
4. Retrieve data subset using a SQL query in Excel or other software
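The final step above can be sketched as a single parameterized query: once you know the table, the attributes, and the filter you need, you pull only that subset rather than the whole table. The supplier data below is an invented example.

```python
# Sketch of step 4: retrieve only the needed subset with a SQL query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Supplier (SupplierID INTEGER PRIMARY KEY,
                           SupplierName TEXT, SupplierType INTEGER);
    INSERT INTO Supplier VALUES (1, 'Northern Brewery Homebrew Supply', 1),
                                (4, 'The Payroll Company', 2);
""")

# Only the attributes and rows the analysis needs -- not SELECT * on everything.
subset = conn.execute(
    "SELECT SupplierID, SupplierName FROM Supplier WHERE SupplierType = ?",
    (1,)).fetchall()
print(subset)
```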
Q. What are some common issues
with data that must be fixed before
analysis can take place?
Step 3: Validate the data for
completeness and integrity
1. Check that the number of records is as expected
2. Check that the data types for each variable (e.g., text, integer, floating
point number) are as expected
• Be aware of variable size and length limits (e.g., text, integer)
• Validate date/time variables are in a date-time format
3. Examine descriptive statistics (e.g., max, min, missing values) to
evaluate data integrity
Common Data Types:
https://www.w3schools.com/sql/sql_datatypes.asp ;
https://www.tutorialspoint.com/sqlite/sqlite_data_types.htm
Data Types
• In data analytics software, each data value is characterized by a data
type
• The data type defines the set of possible values, allowed operations, and
computational representation of the data.

Excel Data Types


• Numbers (including integer, decimal, currency, percentage)
• Dates (stored as a “serial number” representing days since Jan 1, 1900)
• Text data (also called “string” type)
• Logical data (TRUE, FALSE)
• Error Data Type (when a formula results in an error)
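The date "serial number" convention becomes visible when Excel dates are exported as raw numbers. A minimal conversion sketch: Excel's day 1 is Jan 1, 1900, but Excel also (incorrectly) treats 1900 as a leap year, so the usual conversion for dates after Feb 28, 1900 subtracts two days rather than one.

```python
# Sketch: converting an Excel date serial number to a calendar date.
from datetime import date, timedelta

def excel_serial_to_date(serial: int) -> date:
    # Valid for serials after Feb 28, 1900 (Excel's phantom Feb 29, 1900
    # means the offset is 2 days, not 1).
    return date(1900, 1, 1) + timedelta(days=serial - 2)

# Serial 43040 corresponds to 11/1/2017, the first PO date in the
# Purchase Order example.
print(excel_serial_to_date(43040))
```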
Step 4: Clean the data
Once you have valid data, there is still some work that needs to be done
to make sure it is consistent and ready for analysis:
1. Remove extraneous data (e.g., headings or subtotals)
2. Clean numbers (e.g., leading zeros, negative numbers)
3. Clean textual data (e.g., remove leading/trailing spaces, remove non-
printable characters, convert to numerical data when necessary)
4. Correct inconsistencies across data, in general
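Steps 2 and 3 above can be sketched in a few lines; the sample values (padded spaces, a non-printable NUL character, leading zeros) are invented examples of the kind of mess that appears in real extracts.

```python
# Sketch: cleaning messy text fields before converting them to numbers.
raw = ["  00042 ", "0198\x00", " 7 "]

cleaned = []
for value in raw:
    value = value.strip()                                    # drop leading/trailing spaces
    value = "".join(c for c in value if c.isprintable())     # drop non-printable chars
    cleaned.append(int(value))                               # to number (also removes
                                                             # leading zeros)
print(cleaned)
```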
Step 5: Load the data for data analysis
Finally, you can now import your data into the tool of your
choice and expect the functions to work properly.
Summary
Relational databases facilitate the storage of business data in a normalized format and have several efficiency and consistency advantages over flat files.

Relational databases have several important components, including primary keys, composite keys, foreign keys, and descriptive attributes.

Extract, Transform, and Load (ETL) describes the procedure to obtain, validate, clean, and load data for analyses.