8/16/2021
Data and Types of Data
Data Science
• Multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract
knowledge and insight from structured and
unstructured data
• Central concept is gaining insight from data
• Machine learning uses data to extract knowledge
[Figure: data science pipeline. Stages shown: Data Collection; Database; Data Preprocessing (Data Cleaning and Cleansing, Feature Representation); Data Modeling (Machine Learning); Inference]
Data Collection
• Data manifests itself in many different forms
• Different forms of data require different ways to
collect them and different storage solutions
• Collecting data may involve sending out surveys,
running polls or conducting other experiments
• Data based on the way it is collected:
– Data that comes from surveys
• Usually textual form of data or mixed
– Data entered in a database as system entry
• E.g., student information entered into an academic automation
system
– Data in the form of signals (comes from sensors)
• Speech/Audio, Images and videos, Temperature readings,
Humidity, Seismic data, EEG (all bio-type signals) etc.
• The way the data is collected depends on the
objective of the task
Types of Data: Based on Organization
1. Unstructured data:
– Rawest form of data
– Example: Any type of files like texts, images, sounds or
videos etc.
– This type of data is stored in a repository of files
• E.g., well-organised directories on the computer hard drive
2. Structured data:
– Tabular data (rows and columns) whose structure is very
well defined
– Stored in spreadsheets or databases
• Spreadsheets [e.g., Comma Separated Value (CSV) format]
• Oracle
• DB2
• MySQL etc.
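As a small illustration of structured data, the sketch below reads CSV text into rows with typed columns using Python's standard `csv` module. The column names and values are hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical CSV content; the column names and values are made up for illustration.
CSV_TEXT = """roll_no,name,department,cgpa
101,Asha,SCEE,8.4
102,Ravi,SE,7.9
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one per row, with typed columns."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        rows.append({
            "roll_no": int(record["roll_no"]),   # every row has the same attributes
            "name": record["name"],
            "department": record["department"],
            "cgpa": float(record["cgpa"]),
        })
    return rows

rows = load_rows(CSV_TEXT)
print(rows[0]["name"], rows[0]["cgpa"])
```

Because every row has the same columns, each value can be converted to a fixed type without checking whether the field exists.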
3. Semi-Structured data:
– Anywhere between unstructured and structured data
– A consistent format is defined, but the structure is not
strict: parts of the data may be missing or of a
different type
– Example: Data in the form of XML and JSON
• Stored in document-oriented databases
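The sketch below illustrates what "no strict structure" can mean in practice for JSON: documents share a general format, but a field may be absent, null, or of a different type. The documents and the `normalise` helper are hypothetical.

```python
import json

# Hypothetical documents: same general format, but fields may be missing or differ in type.
JSON_TEXT = """
[
  {"name": "Asha", "email": "asha@example.com", "phones": ["111", "222"]},
  {"name": "Ravi", "phones": "333"},
  {"name": "Meena", "email": null}
]
"""

def normalise(doc):
    """Return a record with a fixed set of fields, tolerating gaps and type differences."""
    phones = doc.get("phones", [])
    if isinstance(phones, str):      # a single phone stored as a plain string
        phones = [phones]
    return {
        "name": doc.get("name"),
        "email": doc.get("email"),   # None when the field is absent or null
        "phones": phones,
    }

records = [normalise(d) for d in json.loads(JSON_TEXT)]
for r in records:
    print(r)
```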
Type of Data: Based on Variables (Value) found in Data
• Mainly in Structured Data:
1. Numerical data:
– Data represented as numbers
– Data in which information is measurable
– This type of data is called quantitative data, as it
describes a quantity
– Two types based on the values taken:
• Continuous valued data:
– Numbers do not have a logical end; any value within the range is possible
– Range lies within the natural limits of what we are measuring
– Example: Cost of the books, atmospheric temperature etc.
• Discrete valued data:
– Numbers have a logical end; the possible values can be counted
– There is a specific limit on the range of the values
– Example: number of members in a family, number of days in a
month, number of colours in a flag etc.
2. Categorical data:
– Data that is not a number. It can be a string of text or a
date
– It assigns an item or event to one of a few different
categories
– Example: Ethnicity, gender, eye colour, etc.
– This type of data is called qualitative data, as it
describes a quality
– Three types based on the values they hold:
• Ordinal values: Values that have a set order to them
– Example: Severity of an alarm as “Critical”, “Medium” and
“Low”; ranking in a running race as “First”, “Second”, “Third”
• Nominal values: Values that have no set order to them
– Example: Values for the variables “Marital Status”, “Country”,
“Eye Colour” etc.
• Binary values: Special type of categorical data
– Have only two values: “Yes” and “No” OR “True” and “False”
OR “1” and “0”
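One way to see the difference between the three kinds of categorical values is to look at how each might be turned into numbers for later processing. The sketch below is only an illustration under assumed category lists; the encodings shown (rank mapping, one-hot indicators, 0/1) are common choices, not something prescribed by these slides.

```python
# Ordinal: the order matters, so map each label to its rank.
SEVERITY_ORDER = ["Low", "Medium", "Critical"]
severity_rank = {label: rank for rank, label in enumerate(SEVERITY_ORDER)}

# Nominal: no order, so use one indicator (one-hot) column per category.
EYE_COLOURS = ["Black", "Blue", "Brown"]  # hypothetical category list

def one_hot(value, categories):
    """Return a list with 1 at the position of the matching category, 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]

# Binary: the two values map directly to 0 and 1.
def encode_binary(value):
    return {"No": 0, "Yes": 1}[value]

print(severity_rank["Critical"], one_hot("Blue", EYE_COLOURS), encode_binary("Yes"))
```

Ranks preserve the comparison "Critical" > "Medium" > "Low", whereas giving nominal values such numbers would suggest an order that does not exist.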
3. Time series data:
– A series of observations, each involving a time and some kind of value
– Example: Temperature at every hour
– It is clearly structured and numeric in nature
– Special case of numerical data
– This type of data is important because of IoT and
sensors
– Data from sensors is almost always time-series in
nature
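A minimal sketch of the hourly temperature example: each observation pairs a timestamp with a value, and the ordering in time makes operations such as hour-to-hour change meaningful. The readings are invented for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical hourly temperature readings in degrees Celsius.
start = datetime(2021, 8, 16, 0, 0)
temperatures = [18.2, 17.9, 17.5, 17.8]

# A time series pairs each value with the time at which it was observed.
series = [(start + timedelta(hours=i), t) for i, t in enumerate(temperatures)]

def hourly_change(series):
    """Difference between each reading and the previous one, rounded to 2 decimals."""
    return [round(curr[1] - prev[1], 2) for prev, curr in zip(series, series[1:])]

changes = hourly_change(series)
for (timestamp, value) in series:
    print(timestamp.isoformat(), value)
```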
Data, Types of Data and Data Collection using Sensors
Need for Data Preprocessing
Summary of Previous Class:
Types of Data: Based on Organization
1. Unstructured data:
2. Structured data:
– Tabular data (rows and columns) whose structure is very
well defined
– Each row is a finite ordered list (sequence) of elements,
where each element in a column belongs to an
attribute of a specific type
– Example: Spreadsheets [Comma Separated Value
(CSV) format]
3. Semi-structured data:
Type of Data: Based on Variables (Value) found in Data
• Mainly in Structured Data:
1. Numerical data:
– Two types based on the values taken:
• Continuous valued data:
• Discrete valued data:
2. Categorical data:
– Three types based on the values they hold:
• Ordinal values:
• Nominal values:
• Binary values:
3. Time series data:
Data Collection from Sensors
• Sensors are devices that respond to their surrounding
environment and convert physical
parameters into a signal (e.g., optical, electrical,
mechanical) suitable for processing
[Figure: Surrounding environment → Sensor → electrical, optical or mechanical signal]
• Example: a temperature sensor outputs an electrical
signal whose voltage or current can be used to
identify the temperature around it
• A sensor can be an electrical/mechanical component, a
module or a subsystem
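The temperature sensor example can be sketched as a simple conversion from output voltage to temperature. The linear characteristic below (0.5 V at 0 °C, rising by 0.01 V per degree) is an assumption made for illustration; a real sensor's datasheet defines its own relationship.

```python
# Assumed linear sensor characteristic (hypothetical values, not from a datasheet):
# 0.5 V output at 0 degrees Celsius, rising by 0.01 V per degree.
OFFSET_VOLTS = 0.5
VOLTS_PER_DEGREE = 0.01

def voltage_to_celsius(volts):
    """Convert a sensor output voltage to a temperature in degrees Celsius."""
    return (volts - OFFSET_VOLTS) / VOLTS_PER_DEGREE

# Convert a few sampled voltages into temperature readings.
sampled_volts = [0.50, 0.75, 0.68]
readings = [round(voltage_to_celsius(v), 1) for v in sampled_volts]
print(readings)
```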
Different Types of Sensors
• Acoustic, sound sensors (e.g., microphone)
• Visual sensors (e.g. cameras)
• Environmental sensors (e.g., temperature, humidity,
pressure etc.)
• Chemical sensors (e.g., diesel nitrogen oxide (NOx)
sensors to measure engine-out NOx gas concentration)
• Flow sensors (e.g., water flow sensors)
• Motion sensors (e.g., gyroscope)
• Proximity or presence sensors (e.g., Passive Infrared
(PIR))
• Biosensors (e.g., glucose monitor)
• And many more …
IIT Mandi Weather Station: Environmental Data
(Temperature, Humidity, Pressure etc.) Collection
[Figure: high-level overview. Labels: Environmental Sensor; Zigbee/802.15.4 network; SQL Database + PHP (web access), running inside a Raspberry Pi; WiFi/802.11b; IIT Mandi intranet]
Source: Dr. Siddhartha Sarma
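To make the storage step in this overview concrete, here is a minimal sketch of writing sensor readings into an SQL table. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in; the actual weather station's database engine, schema and PHP code are not given in the slides, so the table and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite database stands in for the SQL database in the overview.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE readings (
           recorded_at TEXT NOT NULL,
           temperature_c REAL,
           humidity_pct REAL,
           pressure_mb REAL
       )"""
)

def store_reading(conn, recorded_at, temperature_c, humidity_pct, pressure_mb):
    """Insert one sensor reading using bound parameters."""
    conn.execute(
        "INSERT INTO readings VALUES (?, ?, ?, ?)",
        (recorded_at, temperature_c, humidity_pct, pressure_mb),
    )
    conn.commit()

store_reading(conn, "2021-08-16T10:00:00", 21.4, 63.0, 842.5)
store_reading(conn, "2021-08-16T11:00:00", 22.1, None, 842.1)  # humidity not recorded

# Fetch the most recent reading, as a web page might do.
latest = conn.execute(
    "SELECT recorded_at, temperature_c FROM readings ORDER BY recorded_at DESC LIMIT 1"
).fetchone()
print(latest)
```

Storing the timestamp with every row keeps the readings usable as a time series.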
Landslide Monitoring System (LMS)
• The LMS relies on the Internet of Things (IoT) and low-cost
Micro-Electro-Mechanical Systems (MEMS) sensors
Source: Dr. Varun Dutt (SCEE) and Dr. Uday Kala (SE)
Components of LMS
• The LMS monitors a number of weather and soil
parameters via sensors on deployment location
[Figure: LMS components, with panels labelled:]
– GY-61 accelerometer sensor (with pin diagram of GY-61)
– YL-69 soil moisture sensor
– SIM 900A GSM module
– Force sensor
– Humidity sensor DHT 22
– Light sensor BH-1750
– Temperature and pressure sensor BMP-180
– Tipping rain gauge
Source: Dr. Varun Dutt (SCEE) and Dr. Uday Kala (SE)
Architecture and Features of LMS
• The LMS monitors a number of weather and soil
parameters via sensors on deployment location
Sensor measurement ranges:
– Temperature & Humidity: -40 °C to +80 °C & 0-100 %
– Barometric Pressure: 300-1100 mb
– Rainfall Intensity: in mm
– Light Intensity: 0-65535 lux
– Soil force: 0-100 N
– Soil moisture: 0-100 %
– Soil movement: ±2000°/sec rotational & ±16g gravitational acceleration
Source: Dr. Varun Dutt (SCEE) and Dr. Uday Kala (SE)
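The measurement ranges listed above can also serve as a basic sanity check on incoming readings: a value outside the range a sensor can physically report is likely an error. The sketch below assumes hypothetical field names for the readings; only the numeric limits come from the slide.

```python
# Measurement ranges as listed above; the dictionary keys are hypothetical names.
VALID_RANGES = {
    "temperature_c": (-40.0, 80.0),
    "humidity_pct": (0.0, 100.0),
    "pressure_mb": (300.0, 1100.0),
    "light_lux": (0.0, 65535.0),
    "soil_force_n": (0.0, 100.0),
    "soil_moisture_pct": (0.0, 100.0),
}

def out_of_range(reading):
    """Return the names of fields whose values fall outside the sensor's range."""
    bad = []
    for name, value in reading.items():
        low, high = VALID_RANGES[name]
        if not (low <= value <= high):
            bad.append(name)
    return bad

# Hypothetical reading with an impossible humidity value.
reading = {"temperature_c": 25.0, "humidity_pct": 120.0, "pressure_mb": 850.0}
print(out_of_range(reading))
```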
[Figure: architecture diagram of LMS]
The LMS will alert people via traffic lights, SMSs, or smart-apps on mobile
phones about the danger of impending landslides
Source: Dr. Varun Dutt (SCEE) and Dr. Uday Kala (SE)
Data Preprocessing
Need for Data Preprocessing
• Real-world data tend to be incomplete, noisy and
inconsistent due to their huge size and their likely
origin from multiple heterogeneous sources
• Preprocessing is important to clean the data
• Low-quality data will lead to low-quality analysis
results
• If users believe the data is of low quality (dirty),
they are unlikely to trust the results of any data
analytics applied to it
• Low-quality data can confuse analytic
procedures that use machine learning techniques,
resulting in unreliable output
• Incomplete, noisy and inconsistent data are common
properties of large real-world databases
Tuple (Record) in Structured Data
• A tuple (record) is a finite ordered list (sequence) of
elements, where each element belongs to an
attribute
[Figure: table with one row highlighted as a tuple (record)]
• Each row is a tuple
Incomplete Data
• Many tuples (records) have no recorded value for
several attributes
• Reasons for incomplete data:
– User forgot to fill in a field
– User chose not to fill out the field as it was not
considered important at the time of the entry
– Relevant data may not be recorded due to
malfunctioning of equipment
– Data might have been lost during transfer from the place
where it was recorded
– Data may not be recorded due to programming error
– Data might not be recorded due to technology
limitations like limited memory
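Whatever the cause, incomplete data usually shows up as empty or missing attribute values in the records. A minimal sketch of detecting them, using hypothetical records in which `None` marks a value that was never recorded:

```python
# Hypothetical records; None marks a value that was never recorded.
records = [
    {"roll_no": 101, "name": "Asha", "department": "SCEE", "phone": None},
    {"roll_no": 102, "name": "Ravi", "department": None, "phone": None},
    {"roll_no": 103, "name": "Meena", "department": "SE", "phone": "555"},
]

def is_missing(value):
    """Treat None and the empty string as 'no recorded value'."""
    return value is None or value == ""

def missing_counts(records):
    """Count, per attribute, how many records have no recorded value."""
    counts = {}
    for record in records:
        for attribute, value in record.items():
            counts.setdefault(attribute, 0)
            if is_missing(value):
                counts[attribute] += 1
    return counts

def incomplete(records):
    """Return the records that have at least one missing attribute."""
    return [r for r in records if any(is_missing(v) for v in r.values())]

print(missing_counts(records))
print(len(incomplete(records)), "of", len(records), "records are incomplete")
```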
Noisy Data
• Many tuples (records) have incorrect values for several
attributes
• Reasons for noisy data:
– There may be human or computer error occurring in
data entry
– The data collection instruments used may be faulty
– Error in data transmission
– There may be technology limitations such as limited
buffer size for coordinating synchronised data transfer
and consumption
Inconsistent Data
• Data containing discrepancies in the stored values for
some attributes
• Reasons for inconsistent data:
– It may result from inconsistencies in
• naming conventions or
– Example: “Dept_ID”, “Department_ID”
“Roll_No”, “Registration_No”
• data codes used (mismatch in writing values) or
– Example: For department – “SCEE”, “School of Computing and
EE”
• formats of input fields such as date
– Example: “dd-mm-yy”, “dd-mm-yyyy”, “mm/dd/yyyy”
– Inconsistency in naming conventions or formats of input
fields while integrating data from multiple sources
– Example: While integrating temperature records from
different locations, the naming conventions may differ
– Inconsistent data may be due to human or computer
error occurring in data entry
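Two of the inconsistencies above, attribute names and date formats, can be reconciled by mapping every variant onto one canonical form. The sketch below is one possible approach under stated assumptions: the canonical names and the choice of ISO 8601 as the target date format are illustrative, and it only works because the three listed date formats can be told apart by their separators and year length.

```python
from datetime import datetime

# Assumed mapping from the naming variants seen in the sources to one canonical name.
CANONICAL_NAMES = {
    "Dept_ID": "department_id",
    "Department_ID": "department_id",
}

# The date formats from the example: dd-mm-yyyy, dd-mm-yy and mm/dd/yyyy.
DATE_FORMATS = ["%d-%m-%Y", "%d-%m-%y", "%m/%d/%Y"]

def normalise_date(text):
    """Parse a date written in any of the known formats and return ISO 8601 text."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognised date format: {text!r}")

def normalise_record(record):
    """Rename attributes to their canonical names, leaving unknown names unchanged."""
    return {CANONICAL_NAMES.get(k, k): v for k, v in record.items()}

print(normalise_date("16-08-21"), normalise_record({"Dept_ID": 3}))
```

If two sources used the same separator with different day and month order, the format could not be inferred from the text alone and would have to be recorded per source.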