2-Data Wrangling

The document discusses data wrangling, which is the process of cleaning and organizing complex data sets for analysis, and outlines its steps including discovery, structuring, cleaning, enriching, validating, and publishing. It also highlights the importance of feature engineering in transforming raw data into usable features for machine learning models, emphasizing techniques such as imputation and one-hot encoding. Additionally, various tools for data wrangling and feature engineering are mentioned, including Python libraries like Pandas and Numpy.

Uploaded by

Atiya Falak

12/2/2021

Big Data Analytics

Data Wrangling and Feature Engineering
Muhammad Affan Alim

What is Data Wrangling?

• Data wrangling is the process of cleaning and unifying messy and
complex data sets for easy access and analysis.

• As the volume of data and the number of data sources grow, it is
becoming increasingly essential to organize large amounts of
available data for analysis.

• This process typically includes manually converting and mapping data
from one raw form into another format to allow for more convenient
consumption and organization of the data.

What is Data Wrangling? (continued)

• Data wrangling—also called data cleaning, data remediation, or data
munging—refers to a variety of processes designed to transform raw
data into more readily used formats.
• The exact methods differ from project to project depending on the
data you’re leveraging and the goal you’re trying to achieve.

• Some examples of data wrangling include:
• Merging multiple data sources into a single dataset for analysis
• Identifying gaps in data (for example, empty cells in a
spreadsheet) and either filling or deleting them
• Deleting data that’s either unnecessary or irrelevant to the
project you’re working on
• Identifying extreme outliers in data and either explaining the
discrepancies or removing them so that analysis can take place
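
The first three examples above can be sketched with pandas. This is a minimal illustration, not a prescribed workflow: the two source tables, their column names, and the choice of the median as a fill value are all assumptions made for the sketch.

```python
import pandas as pd

# Two hypothetical sources that describe the same kind of records.
store_a = pd.DataFrame({"order_id": [1, 2], "amount": [20.0, None], "notes": ["", "gift"]})
store_b = pd.DataFrame({"order_id": [3, 4], "amount": [35.5, 12.0], "notes": ["", ""]})

# Merge multiple data sources into a single dataset.
orders = pd.concat([store_a, store_b], ignore_index=True)

# Identify gaps: count the missing values in each column.
missing_per_column = orders.isna().sum()

# Fill the gap with the column median (deleting the row is the alternative).
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Delete a column that is irrelevant to the analysis at hand.
orders = orders.drop(columns=["notes"])

print(missing_per_column)
print(orders)
```

Whether to fill or delete a gap depends on the project; the median is used here only because it is a common, outlier-resistant default.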

Data Wrangling Steps

• Each data project requires a unique approach to ensure its final
dataset is reliable and accessible.
• That being said, several processes typically inform the approach. These
are commonly referred to as data wrangling steps or activities.

1. Discovery
2. Structuring
3. Cleaning
4. Enriching
5. Validating
6. Publishing

1. Discovery
• Discovery refers to the process of familiarizing yourself with data so you can
conceptualize how you might use it. You can liken it to looking in your
refrigerator before cooking a meal to see what ingredients you have at your
disposal.

• During discovery, you may identify trends or patterns in the data, along
with obvious issues, such as missing or incomplete values that need to be
addressed. This is an important step, as it will inform every activity that
comes afterward.
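
A first look at a dataset during discovery might resemble the sketch below. The small table and its column names are invented for illustration; the pandas calls shown are one common way to inspect data, not the only one.

```python
import pandas as pd

# Hypothetical raw table with a few obvious issues.
df = pd.DataFrame({
    "customer": ["Ann", "Bob", None, "Dee"],
    "age": [34, None, 29, 41],
    "city": ["Lahore", "Karachi", "Karachi", None],
})

print(df.head())      # first rows, to see what the data looks like
print(df.dtypes)      # column types
print(df.describe())  # summary statistics for numeric columns

# Obvious issues: how many values are missing in each column?
missing = df.isna().sum()

# Patterns: how often does each category occur?
city_counts = df["city"].value_counts()

print(missing)
print(city_counts)
```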

2. Structuring
• Raw data is typically unusable as collected because it’s either
incomplete or misformatted for its intended application.
• Data structuring is the process of taking raw data and transforming it
to be more readily leveraged. The form your data takes will depend on
the analytical model you use to interpret it.

2. Structuring: example
• Structural errors are the strange naming conventions, typos, or
incorrect capitalization you notice when you measure or transfer data.
• These inconsistencies can cause mislabeled categories or classes. For
example, you may find “N/A” and “Not Applicable” both appear, but
they should be analyzed as the same category.
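
As a rough sketch of how such label variants can be unified with pandas, assuming the variants are known in advance (the sample values and the mapping are illustrative):

```python
import pandas as pd

# Hypothetical category column with inconsistent spellings.
status = pd.Series(["N/A", "Not Applicable", " n/a ", "Active", "active"])

# Normalize whitespace and capitalization first.
cleaned = status.str.strip().str.lower()

# Map known spelling variants to one canonical label.
cleaned = cleaned.replace({"n/a": "not applicable"})

categories = sorted(cleaned.unique())
print(categories)
```

After normalization only two categories remain, so "N/A" and "Not Applicable" are counted together.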

• For example, the “purchase date” column name variations across sources may include:
• purchaseDate
• transaction_date
• txDate
• prchsdt
• And the values themselves are likely not rationalized:
• 6-20-2018
• 06/20/2018
• 20-JUN-2018 08:03
• 20/06/18
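
One way to rationalize the names and values listed above is sketched below using only the Python standard library. The canonical name `purchase_date`, the list of formats, and the order in which they are tried are assumptions of this sketch; parsing the month abbreviation also assumes an English locale. Ambiguous values such as 01/02/2018 cannot be resolved by format alone, so the source's convention must be known.

```python
from datetime import datetime

# Column-name variants mapped to one canonical name (chosen for this sketch).
COLUMN_ALIASES = {
    "purchaseDate": "purchase_date",
    "transaction_date": "purchase_date",
    "txDate": "purchase_date",
    "prchsdt": "purchase_date",
}

# Formats are tried in order; the order matters for ambiguous values.
DATE_FORMATS = ["%m-%d-%Y", "%m/%d/%Y", "%d-%b-%Y %H:%M", "%d/%m/%y"]


def canonical_column(name):
    """Return the canonical column name, or the name unchanged if unknown."""
    return COLUMN_ALIASES.get(name, name)


def parse_purchase_date(text):
    """Try each known format in turn and return the first successful parse."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


raw_values = ["6-20-2018", "06/20/2018", "20-JUN-2018 08:03", "20/06/18"]
iso_dates = [parse_purchase_date(v).date().isoformat() for v in raw_values]
print(iso_dates)
```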

3. Cleaning
• Data cleaning is the process of removing inherent errors in data that
might distort your analysis or render it less valuable.
• Cleaning can come in different forms, including deleting empty cells or
rows, removing outliers, and standardizing inputs.
• The goal of data cleaning is to ensure there are no errors (or as few as
possible) that could influence your final analysis.
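
The three forms of cleaning named above can be sketched as follows. The data is invented, and the 1.5 × IQR rule used to flag outliers is one common convention rather than something the slides prescribe.

```python
import pandas as pd

# Hypothetical table with an empty row, inconsistent text, and one extreme value.
df = pd.DataFrame({
    "country": [" pakistan", "Pakistan", "PAKISTAN ", None, "Pakistan", "Pakistan"],
    "amount": [10.0, 12.0, 11.0, None, 13.0, 500.0],
})

# Delete rows in which every cell is empty.
df = df.dropna(how="all")

# Standardize text inputs: trim whitespace and unify capitalization.
df["country"] = df["country"].str.strip().str.title()

# Remove outliers that fall outside 1.5 * IQR of the quartiles.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[in_range].reset_index(drop=True)

print(clean)
```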

4. Enriching
• Once you understand your existing data and have transformed it into a
more usable state, you must determine whether you have all of the data
necessary for the project at hand.
• If not, you may choose to enrich or augment your data by incorporating
values from other datasets.
• For this reason, it’s important to understand what other data is available
for use.

• If you decide that enrichment is necessary, you need to repeat the steps
above for any new data.
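
Enrichment is often a join against a second dataset. The sketch below assumes a hypothetical lookup table of cities and provinces; a left join keeps every original row, and the `indicator` option of `DataFrame.merge` shows which rows could not be enriched.

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "city": ["Karachi", "Lahore", "Quetta"]})

# Hypothetical second dataset holding the extra attribute we want.
regions = pd.DataFrame({"city": ["Karachi", "Lahore"], "province": ["Sindh", "Punjab"]})

# A left join keeps every order; indicator=True records whether a match was found.
enriched = orders.merge(regions, on="city", how="left", indicator=True)

# Rows the lookup could not enrich still need attention.
unmatched = enriched.loc[enriched["_merge"] == "left_only", "city"].tolist()

print(enriched)
print(unmatched)
```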

5. Validating
• Data validation refers to the process of verifying that your data is both
consistent and of a high enough quality.
• During validation, you may discover issues you need to resolve or
conclude that your data is ready to be analyzed. Validation is typically
achieved through various automated processes and requires
programming.
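
A very small automated validation routine might look like the sketch below. The function name `validate`, the column names, and the three rules are illustrative assumptions; real projects define their own rules.

```python
import pandas as pd


def validate(df):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if df["order_id"].duplicated().any():
        problems.append("order_id contains duplicates")
    if df["amount"].isna().any():
        problems.append("amount has missing values")
    if (df["amount"] < 0).any():
        problems.append("amount has negative values")
    return problems


good = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 0.0, 5.5]})
bad = pd.DataFrame({"order_id": [1, 1, 3], "amount": [10.0, None, -2.0]})

print(validate(good))
print(validate(bad))
```

Returning a list of problems instead of raising on the first failure lets one run report every issue that needs to be resolved.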

5. Validating: tools
• Various data validation testing tools are available on the market. Some
of them are listed below:
• Datameer
• Talend
• Informatica
• QuerySurge
• ICEDQ
• Datagaps ETL Validator
• DbFit
• Data-Centric Testing

5. Validating: how to adopt data validation testing
• There are various approaches and techniques to accomplish data validation testing:
1. Data accuracy testing to ensure that the provided data is correct.
2. Data completeness testing to check whether the data is complete.
3. Data transformation testing to verify that the data passes through
transformations successfully.
4. Data quality testing to handle bad data.
5. Database comparison testing to compare the source DB and target DB.
6. End-to-end testing.
7. Data warehouse testing.
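
Completeness testing and database comparison testing can be sketched with an in-memory SQLite database from the Python standard library. The table names `source_orders` and `target_orders` and their contents are hypothetical; a real comparison would connect to the actual source and target systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (order_id INTEGER, amount REAL);
    CREATE TABLE target_orders (order_id INTEGER, amount REAL);
    INSERT INTO source_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO target_orders VALUES (1, 10.0), (2, 20.0);
""")

# Completeness: do both tables hold the same number of rows?
source_count = conn.execute("SELECT COUNT(*) FROM source_orders").fetchone()[0]
target_count = conn.execute("SELECT COUNT(*) FROM target_orders").fetchone()[0]

# Comparison: which source rows never arrived in the target?
missing_rows = conn.execute(
    "SELECT order_id, amount FROM source_orders "
    "EXCEPT SELECT order_id, amount FROM target_orders"
).fetchall()
conn.close()

print(source_count, target_count, missing_rows)
```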

6. Publishing
• Once your data has been validated, you can publish it. This involves
making it available to others within your organization for analysis. The
format you use to share the information—such as a written report or
electronic file—will depend on your data and the organization’s goals.
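
When the chosen format is an electronic file, publishing can be as simple as writing a CSV. In this sketch a temporary directory stands in for a shared location, and the file name `orders_clean.csv` is made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

validated = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]})

# A temporary directory stands in for a shared drive or data store.
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "orders_clean.csv"

# index=False keeps the pandas row index out of the published file.
validated.to_csv(out_path, index=False)

# Read the file back to confirm that consumers will see the same data.
round_trip = pd.read_csv(out_path)
print(round_trip)
```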

The Goals of Data Wrangling

• Reveal a "deeper intelligence" by gathering data from multiple
sources
• Put accurate, actionable data in the hands of business analysts
in a timely manner
• Reduce the time spent collecting and organizing unruly data before
it can be utilized
• Enable data scientists and analysts to focus on the analysis of data,
rather than the wrangling
• Drive better decision-making by senior leaders in an
organization

Data Wrangling Tools

Basic Data Munging Tools
• Excel Power Query / Spreadsheets: the most basic structuring tool
for manual wrangling.
• OpenRefine: a more sophisticated solution that may require some
programming skills.
• Google DataPrep: for exploration, cleaning, and preparation.
• Tabula: for extracting tables from PDF files.
• DataWrangler: for data cleaning and transformation.
• CSVKit: for converting and working with CSV data.

Data Wrangling in Python
1. NumPy (Numerical Python): the most basic package. It offers many
features for operations on n-dimensional arrays and matrices in Python.
The library vectorizes mathematical operations on the NumPy array
type, which speeds up execution.
2. Pandas: designed for fast and easy data analysis operations. Useful
for data structures with labeled axes. Explicit data alignment prevents
common errors that result from misaligned data coming in from
different sources.
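
The two properties described above, vectorization in NumPy and label alignment in pandas, can be seen in a short sketch. All values and labels are invented for illustration.

```python
import numpy as np
import pandas as pd

prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])

# Vectorized: one expression operates on whole arrays, with no Python loop.
totals = prices * quantities

# pandas aligns on labels, not positions, before it operates.
jan = pd.Series([100, 200], index=["north", "south"])
feb = pd.Series([50, 25], index=["south", "north"])  # same labels, different order
combined = jan + feb

print(totals)
print(combined)
```

Although `feb` lists its regions in the opposite order, each region is added to its own counterpart, which is the kind of misalignment error the alignment feature guards against.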

3. Matplotlib: Python visualization module. Good for line graphs, pie
charts, histograms, and other professional-grade figures.
4. Plotly: for interactive, publication-quality graphs. Excellent for line
plots, scatter plots, area charts, bar charts, error bars, box plots,
histograms, heatmaps, subplots, multiple-axis, polar graphs, and bubble
charts.
5. Theano: library for numerical computation similar to NumPy. This
library is designed to define, optimize, and evaluate mathematical
expressions involving multi-dimensional arrays efficiently.

What is Feature Engineering

• Feature engineering is the process of transforming raw data into
features that better represent the underlying problem to the
predictive models, resulting in improved model accuracy on
unseen data.

What is Feature Engineering: Introduction

• Basically, all machine learning algorithms use some input data
to create outputs. This input data comprises features, which are
usually in the form of structured columns. Algorithms require
features with specific characteristics to work properly.
This is where the need for feature engineering arises.

• I think feature engineering efforts mainly have two goals:
1. Preparing the proper input dataset, compatible with the
machine learning algorithm requirements.
2. Improving the performance of machine learning models.

According to a survey in Forbes, data scientists spend 80% of their
time on data preparation.

List of Techniques
1. Imputation
2. Handling Outliers
3. Binning
4. Log Transform
5. One-Hot Encoding
6. Grouping Operations
7. Feature Split
8. Scaling
9. Extracting Date
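
Several of the techniques listed above can be sketched together with pandas and NumPy. The table, the bin edges, and the choice of median imputation and min-max scaling are assumptions made for illustration; each technique has other common variants. The numbers in the comments refer to the list above.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing value, a skewed column, and a category.
df = pd.DataFrame({
    "income": [30000.0, None, 52000.0, 1200000.0],
    "city": ["Karachi", "Lahore", "Karachi", "Quetta"],
    "signup": ["2021-01-15", "2021-06-30", "2021-12-02", "2020-03-08"],
})

# 1. Imputation: fill the missing income with the median.
df["income"] = df["income"].fillna(df["income"].median())

# 4. Log transform: log1p compresses the long right tail.
df["income_log"] = np.log1p(df["income"])

# 3. Binning: bucket incomes into labelled ranges (edges are illustrative).
df["income_band"] = pd.cut(
    df["income"], bins=[0, 40000, 100000, np.inf], labels=["low", "mid", "high"]
)

# 8. Scaling: min-max scale the log income into the range [0, 1].
lo, hi = df["income_log"].min(), df["income_log"].max()
df["income_scaled"] = (df["income_log"] - lo) / (hi - lo)

# 9. Extracting date: derive year and month from the signup date.
signup = pd.to_datetime(df["signup"])
df["signup_year"] = signup.dt.year
df["signup_month"] = signup.dt.month

# 5. One-hot encoding: one indicator column per city.
df = pd.get_dummies(df, columns=["city"])

print(df)
```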