Understanding Data
Usman Baba
To use a computer to analyze data, you
need to both access a data set and interpret
that data set so that you can ask meaningful
questions about it.
This will enable you to transform raw data into actionable information.
Before beginning to work with data, it’s
important to understand where data comes
from.
There are a variety of processes for capturing events as data, each of which has its
own limitations and assumptions. Some of the modes of data collection include:
Sensors
Surveys
Record Keeping
Secondary data analysis
SENSORS
Devices or instruments that detect, measure, and collect data from the physical environment. These devices convert physical phenomena into digital signals or structured data that can be used for analysis.
The volume of data being collected by
sensors has increased dramatically in the
last decade.
Assuming these devices have been
properly calibrated, they offer a reliable
and consistent mechanism for data
collection.
Environmental Sensors: Measure physical conditions like temperature, humidity, and air quality. E.g. weather monitoring systems.
Image Sensors: Capture visual data. E.g. cameras for object detection or facial recognition.
Audio Sensors: Capture sound waves and convert them into digital signals. E.g. microphones for speech recognition or noise detection.
Motion and Position Sensors: Detect movement or position changes. E.g. accelerometers, gyroscopes (used in fitness trackers or autonomous vehicles).
Chemical Sensors: Detect the presence or concentration of specific chemicals. E.g. gas sensors for detecting pollutants or industrial emissions.
Biometric Sensors: Capture biological data such as heart rate or fingerprints. E.g. smartwatches and security systems.
Proximity Sensors: Measure the distance to an object or detect the presence of objects. E.g. LIDAR used in autonomous vehicles.
SURVEYS
Unlike sensors, which collect data passively or automatically, surveys rely on actively querying individuals or groups for their responses.
The data collected from surveys is often used in machine learning models for tasks that involve human opinions, preferences, behaviors, or demographics.
The biases inherent in survey responses
should be recognized and, when
possible, adjusted for in your analysis.
Sentiment Analysis: E.g. analyzing free-text survey responses to gauge public sentiment.
Customer Feedback Analysis: E.g. predicting customer satisfaction or churn based on survey responses.
Behavioral Analysis: E.g. understanding user preferences for personalized recommendations.
Market Research: E.g. using survey data to forecast product demand or identify market trends.
Health and Social Research: E.g. studying health patterns, societal behaviors, or mental well-being using survey data.
RECORD KEEPING
In many domains, organizations use both automatic and manual processes to keep track of their activities. For example, a hospital may track the length and result of every surgery it performs (and a governing body may require that hospital to report those results).
The reliability of such data will depend on the quality of the systems used to produce it. Scientific experiments also depend on diligent record keeping of results.
SECONDARY DATA ANALYSIS
Data can be compiled from existing knowledge artifacts or measurements, such as counting word occurrences in a historical text (computers can help with this!).
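Counting word occurrences is exactly the kind of task computers help with. A minimal sketch in Python (the file name is hypothetical):

```python
from collections import Counter
import re

# Count how often each word appears in a historical text.
# "historical_text.txt" is a hypothetical local file.
with open("historical_text.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(words)
print(counts.most_common(10))  # the ten most frequent words
```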
All of these methods of collecting data can lead to potential concerns and
biases...
When working with any data set, it is vital to consider where the data
came from (e.g., who recorded it, how, and why) to effectively and
meaningfully analyze it.
Computers’ abilities to record and persist data have led to an explosion of available
data values that can be analyzed. This includes:
personal biological measures (how many steps have I taken?)
social network structures (who are my friends?)
private information leaked from insecure websites and government agencies (what
are their Social Security numbers?).
In professional environments, you will likely be working with proprietary data
collected or managed by your organization. This might be anything from purchase
orders of fair-trade coffee to the results of medical research.
Luckily, there are also plenty of free, non-proprietary data sets that you can work
with.
Organizations will often make large amounts of data available to the public to
support experimental duplication, promote transparency, or just see what other
people can do with that data.
These data sets are great for building your skills and portfolio and are made
available in a variety of formats.
For example, data may be accessed as downloadable CSV spreadsheets or through
a web service API…
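For instance, a CSV file can be loaded with pandas, and a web service API queried with the requests library. A minimal sketch, where both URLs are placeholders rather than real endpoints:

```python
import pandas as pd
import requests

# Load a downloadable CSV data set into a table (placeholder URL).
df = pd.read_csv("https://example.com/data.csv")

# Query a web service API that returns JSON (placeholder URL).
response = requests.get("https://example.com/api/records", params={"year": 2024})
records = response.json()
```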
Government organizations produce a lot of data as part of their everyday activities,
and often make these data sets available in an effort to appear transparent and
accountable to the public.
You can currently find publicly available data from many countries that covers a
broad range of topics, though it can be influenced by the political situation
surrounding its gathering.
US government’s open data: https://www.data.gov/
Government of Canada open data: https://open.canada.ca/en/opendata
Open Government Data Platform India: https://data.gov.in/
City of Seattle open data portal: https://data.seattle.gov/
And so on
Journalism remains one of the most important contexts in which data is gathered
and analyzed.
Journalists do much of the legwork in producing data—searching existing artifacts,
questioning and surveying people, or otherwise revealing and connecting
previously hidden or ignored information.
News media usually publish the analyzed, summative information for consumption,
but they also may make the source data available for others to confirm and expand
on their work.
For example, the New York Times makes much of its historical data available
through a web service, while the data politics blog FiveThirtyEight makes all of the
data behind its articles available on GitHub.
Scientific studies are (in theory) well grounded and structured, providing
meaningful data when considered within their proper scope.
Since science needs to be disseminated and validated by others to be usable,
research is often made publicly available for others to study and critique.
Some scientific journals, such as the premier journal Nature, require authors to
make their data available for others to access and investigate (check out its list of
scientific data repositories!).
Nature: Recommended Data Repositories: https://www.nature.com/sdata/policies/repositories
To better integrate these services into people’s everyday lives, social media
companies make much of their data programmatically available for other
developers to access and use.
For example, it is possible to access live data from Twitter, which has been used for
a variety of interesting analyses.
Google also provides programmatic access to most of its many services (including
search and YouTube).
Twitter developer platform: https://developer.twitter.com/en/docs
Google APIs Explorer: https://developers.google.com/apis-explorer/
Online communities and their spaces are another great source of interesting
and varied data sets and analyses.
For example, Kaggle hosts a number of data sets as well as “challenges” to analyze
them.
Somewhat similarly, the UCI Machine Learning Repository maintains a collection of
data sets used in machine learning, drawn primarily from academic sources. And
there are many other online lists of data sources as well.
Kaggle: “the home of data science and machine learning”:
https://www.kaggle.com/
Socrata: data as a service platform: https://opendata.socrata.com/
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
/r/DataSets: https://www.reddit.com/r/datasets/
Once you acquire a data set, you will have to understand its structure and content
before (programmatically) investigating it.
Understanding the types of data you will encounter depends on your ability to
discern the level of measurement for a given piece of data, as well as the different
structures that are used to hold that data.
Data can be made up of a variety of types of values (represented by the concept of
“data type” ).
More generally, data values can also be discussed in terms of their level of
measurement—a way of classifying data values in terms of how they can be
measured and compared to other values.
Data (datum is the singular form of data) can be:
Structured data: targeted for computers to process. E.g. numeric, categorical, time-series data.
Unstructured data: targeted for humans to process/digest.
Semi-structured data: E.g. XML, HTML, log files, etc.
STRUCTURED DATA
Structured data is highly organized and resides in a fixed schema, such as databases or spreadsheets. It can be further divided based on its nature into categorical, numerical, and time-series data.
CATEGORICAL DATA
These represent the labels of multiple classes used to divide a variable into specific groups.
• Examples of categorical variables include race, sex, age group, and educational level.
• Although the latter two variables can also be considered in a numerical manner by using exact values for age and highest grade completed, for example, it is often more informative to categorize such variables into a relatively small number of ordered classes.
• Categorical data can be further categorized into nominal and ordinal data.
NOMINAL DATA
Categories with no inherent order. E.g. gender (male, female), colors (red, green, blue).
Nominal data contain simple codes assigned to objects as labels; these codes are not measurements. For example, the variable marital status can be generally categorized as (1) single, (2) married, and (3) divorced.
Nominal data can be represented with binomial values having two possible values (e.g., yes/no, true/false, good/bad) or multinomial values having three or more possible values (e.g., brown/green/blue, white/black/Latino/Asian, single/married/divorced).
Nominal data can be represented by strings (such as the name of a fruit), but also by numbers (e.g., “fruit type #1”, “fruit type #2”).
ORDINAL DATA
Categories with a meaningful order, but the intervals between values are not uniform.
These contain codes assigned to objects or events as labels that also represent the rank order among them. For example, the variable credit score can be generally categorized as (1) low, (2) medium, or (3) high.
Similar ordered relationships can be seen in variables such as age group (child, young, middle-aged, elderly), educational level (high school, bachelor’s, master’s), and customer satisfaction (low, medium, high).
Ordinal data establishes an order for nominal categories.
NUMERICAL DATA
Data that represents measurable quantities. These represent the numeric values of specific variables. Examples of numerically valued variables include age, number of children, total household income (in Naira), travel distance (in miles), and temperature (in degrees Fahrenheit).
Numeric values representing a variable can be integers (only whole numbers) or real (also fractional numbers).
Numeric data can also be called continuous data, implying that the variable contains continuous measures on a specific scale that allows insertion of interim values.
Unlike a discrete variable, which represents finite, countable data (e.g. number of students in a class, shoe size), a continuous variable represents scalable measurements, and it is possible for the data to contain an infinite number of fractional values (e.g. temperature (22.5°C), height (175.3 cm)).
Continuous vs. Discrete Data
Definition: Continuous data can take any value within a range, including fractions or decimals; discrete data can only take specific, separate values, often whole numbers.
Nature: Continuous data are measurable quantities; discrete data are countable quantities.
Range: Continuous data have infinite or highly granular values within a given range; discrete data have finite, distinct values.
Examples: Continuous: temperature (e.g., 23.5°C), height (e.g., 172.4 cm), weight (e.g., 68.25 kg). Discrete: number of students in a class (e.g., 25), shoe sizes (e.g., 7, 8, 9).
Intervals: Continuous values can exist between any two points (e.g., between 23.1°C and 23.2°C you could have 23.15°C); discrete values cannot exist between two discrete points (e.g., you can't have 2.5 students).
Graph representation: Continuous data are represented using smooth, continuous curves (e.g., a line graph); discrete data are represented using separate bars or points (e.g., a bar graph).
Measurements or counts: Continuous data are measurements (e.g., weight, time); discrete data are counts (e.g., number of cars, number of people).
To decide whether data is continuous or discrete, ask:
Can the data take fractional or decimal values? If yes, it’s continuous. If no, it’s discrete.
Is the data measured or counted? If measured, it’s continuous. If counted, it’s discrete.
Can it have infinite possible values within a range? If yes, it’s continuous. If no, it’s discrete.
Continuous Data: Used in regression problems (e.g.,
predicting house prices or temperature).
Discrete Data: Used in classification problems (e.g.,
identifying categories like "yes/no" or "low/medium/high").
Ordinal data in machine learning:
Classification Tasks: Used in predicting ordered categories. E.g. predicting customer satisfaction levels ("low," "medium," "high").
Risk Assessment: Example: Grading loan applicants by creditworthiness ("poor,"
"fair," "good").
Ranking Systems: Example: Ranking universities (e.g., "tier 1," "tier 2").
Healthcare: Example: Severity of a disease (e.g., "mild," "moderate," "severe").
Treat ordinal data as categorical, but respect the order:
Encode as integers (e.g., "low" = 1, "medium" = 2, "high" = 3).
Use techniques like ordinal encoding or target encoding.
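A minimal sketch of integer-encoding an ordinal feature with pandas (the satisfaction column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"satisfaction": ["low", "high", "medium", "low"]})

# Map each ordered category to an integer that preserves the ranking.
order = {"low": 1, "medium": 2, "high": 3}
df["satisfaction_encoded"] = df["satisfaction"].map(order)
```

scikit-learn's OrdinalEncoder achieves the same result and lets you state the category order explicitly.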
Nominal data in machine learning:
Categorization Problems: Example: Classifying products into types (e.g., "electronics," "clothing," "furniture").
Customer Segmentation: Example: Grouping customers by region ("North,"
"South").
Medical Diagnosis: Example: Classifying blood groups (e.g., "A," "B," "O").
Natural Language Processing: Example: Classifying text into topics (e.g., "sports,"
"politics," "entertainment").
Use one-hot encoding or label encoding to convert nominal data to numerical
form:
One-hot encoding: Creates binary columns for each category.
Label encoding: Assigns arbitrary integers to categories (not ideal if model assumes
order).
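A minimal sketch of both encodings with pandas (the region column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "North", "East"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["region"], prefix="region")

# Label encoding: an arbitrary integer per category
# (implies an order the data does not have).
df["region_label"] = df["region"].astype("category").cat.codes
```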
TIME-SERIES DATA
Data collected over time, typically at regular intervals. Examples: stock prices, daily temperatures.
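In pandas, a time series is typically a column of values indexed by timestamps at a fixed frequency. A minimal sketch with made-up prices:

```python
import pandas as pd

# Three days of (made-up) closing prices at a daily frequency.
prices = pd.Series(
    [101.2, 102.5, 101.9],
    index=pd.date_range("2025-01-01", periods=3, freq="D"),
)
```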
UNSTRUCTURED DATA
Unstructured data does not follow a predefined format or schema. It can be further categorized by its format and processing requirements:
Text Data: Includes natural language text that requires processing. E.g. social media posts, email content, reviews.
Image Data: Includes static visual data. E.g. photographs, medical X-rays, satellite images.
Audio Data: Sound recordings that often require speech-to-text or waveform analysis. E.g. podcasts, voice commands, music tracks.
Video Data: Moving visual data. E.g. YouTube videos, surveillance footage.
Sensor Data: Raw data from IoT devices, typically unstructured until processed. E.g. accelerometer readings, gyroscope data, etc.
SEMI-STRUCTURED DATA
Partially organized with tags or markers but lacking a rigid schema. It can be categorized as follows:
Markup-Based Data: Data stored in formats with embedded tags or metadata.
E.g.
XML: <name>John Doe</name>
HTML: <p>This is a paragraph.</p>
JSON Data: Data stored in JavaScript Object Notation format, often used in APIs.
E.g. {"name": "John", "age": 30, "city": "New York"}
Log Data: Machine-generated logs with identifiable patterns. E.g.:
Server logs ([INFO] 2025-02-26 Server started).
Application logs.
Key-Value Data: Data stored as key-value pairs. E.g. Redis databases and
Configuration files (key: value).
Graph Data: Represents entities as nodes and relationships as edges, typically
used in graph databases. E.g. Social network data (e.g., connections on LinkedIn)
and Transportation networks.
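Because semi-structured formats carry their own tags, they can be parsed directly into structured objects. A minimal sketch parsing the JSON example above with Python's standard library:

```python
import json

raw = '{"name": "John", "age": 30, "city": "New York"}'

# json.loads turns the tagged text into a dictionary,
# after which individual fields are directly addressable.
record = json.loads(raw)
print(record["name"], record["age"])  # John 30
```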
The first thing you will need to do upon encountering a data set (whether
one you found online or one that was provided by your organization) is
to understand the meaning of the data.
This requires understanding the domain you are working in, as well as
the specific data schema you are working with.
In practice, most data sets are structured as tables of information, with
individual data values arranged into rows and columns.
These tables are similar to how data may be recorded in a spreadsheet
(using a program such as Microsoft Excel).
In a table, each row represents a record or observation: an instance of a
single thing being measured (e.g., a person, a sports match).
Each column represents a feature: a particular property or aspect of the
thing being measured (e.g., the person’s height or weight, the scores in a
sports game). Each data value can be referred to as a cell in the table.
   Name   Height  Weight
1  Ada    64      135
2  Bob    74      156
3  Chris  69      139
4  Dike   69      144
5  Emma   71      152

Fig. 1 A table of people’s weight and height. Rows represent observations, while columns represent features.
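The table in Fig. 1 maps directly onto a pandas DataFrame: rows are observations, columns are features, and each value is a cell. A minimal sketch:

```python
import pandas as pd

# Fig. 1 as a DataFrame: each row is one observation (a person),
# each column is one feature (a property of that person).
df = pd.DataFrame({
    "Name": ["Ada", "Bob", "Chris", "Dike", "Emma"],
    "Height": [64, 74, 69, 69, 71],
    "Weight": [135, 156, 139, 144, 152],
})

cell = df.loc[1, "Weight"]  # a single cell: Bob's weight (156)
```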
The first step toward being able to understand a data set is to research and
understand the data’s problem domain.
The problem domain is the set of topics that are relevant to the problem—that is,
the context for that data.
Working with data requires domain knowledge: you need to have a basic level of
understanding of that problem domain to do any sensible analysis of that data.
This includes understanding the significance and purpose of any features (so
you’re not doing math on contextless numbers), the range of expected values for a
feature (to detect outliers and other errors), and some of the details that may not be
explicit in the data set.
Once you have a general understanding of the context for a data set, you can begin
interpreting the data set itself. You will need to focus on understanding the data
schema (e.g., what is represented by the rows and columns), as well as the specific
context for those values.
“What metadata is available for the data set?”
“Who created the data set? Where does it come from?”
“What features does the data set have?” Go through each column and check:
1. What “real-world” aspect does each column attempt to capture?
2. For continuous data: what units are the values in?
3. For categorical data: what different categories are represented, and what do those
mean?
4. What is the possible range of values?
Ordinal and nominal data are types of categorical data and fall under
structured data.
Nominal: Purely categorical, unordered. Often encoded as integers
for machine learning (e.g., "Male" = 1, "Female" = 2).
Ordinal: Categorical but ordered. Can be encoded as integers (e.g.,
"Low" = 1, "Medium" = 2, "High" = 3) while preserving order.
Numerical (Discrete and Continuous): Quantitative values that can
be measured or counted.
Unstructured and semi-structured data often require preprocessing
to convert their content into structured formats.
For example: Converting text into categorical labels (e.g., sentiment
analysis: "positive" or "negative").
Extracting numerical features from images (e.g., average pixel
intensity).
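A minimal sketch of the image case using Pillow and NumPy (the file name is hypothetical):

```python
import numpy as np
from PIL import Image

# Convert an image (hypothetical file) to grayscale, then reduce it
# to a single structured feature: its mean pixel intensity.
pixels = np.asarray(Image.open("photo.jpg").convert("L"))
avg_intensity = float(pixels.mean())
```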
By categorizing data this way, you can better understand its nature,
preprocessing needs, and how to use it in ML tasks.