Understanding Data
Usman Baba
To use a computer to analyze data, you
need to both access a data set and interpret
that data set so that you can ask meaningful
questions about it.
This will enable you to transform raw data into actionable information.
Before beginning to work with data, it’s
important to understand where data comes
from.
There are a variety of processes for capturing events as data, each of which has its
own limitations and assumptions. Some of the modes of data collection include:
Sensors
Surveys
Record Keeping
Secondary data analysis
SENSORS
Devices or instruments that detect, measure, and collect data from the physical environment. These devices convert physical phenomena into digital signals or structured data that can be used for analysis.
The volume of data being collected by
sensors has increased dramatically in the
last decade.
Assuming these devices have been
properly calibrated, they offer a reliable
and consistent mechanism for data
collection.
Environmental Sensors: Measure physical conditions like temperature, humidity, and air quality. E.g. weather monitoring systems.
Image Sensors: Capture visual data. E.g. cameras for object detection or facial recognition.
Audio Sensors: Capture sound waves and convert them into digital signals. E.g. microphones for speech recognition or noise detection.
Motion and Position Sensors: Detect movement or position changes. E.g. accelerometers, gyroscopes (used in fitness trackers or autonomous vehicles).
Chemical Sensors: Detect the presence or concentration of specific chemicals. E.g. gas sensors for detecting pollutants or industrial emissions.
Biometric Sensors: Capture biological data such as heart rate or fingerprints. E.g. smartwatches and security systems.
Proximity Sensors: Measure the distance to an object or detect the presence of objects. E.g. LIDAR used in autonomous vehicles.
SURVEYS
Unlike sensors, which collect data passively or automatically, surveys rely on actively querying individuals or groups for their responses.
The data collected from surveys is often used in machine learning models for tasks that involve human opinions, preferences, behaviors, or demographics.
The biases inherent in survey responses
should be recognized and, when
possible, adjusted for in your analysis.
Sentiment Analysis: E.g. analyzing free-text survey responses to gauge public sentiment.
Customer Feedback Analysis: E.g. predicting customer satisfaction or churn based on survey responses.
Behavioral Analysis: E.g. understanding user preferences for personalized recommendations.
Market Research: E.g. using survey data to forecast product demand or identify market trends.
Health and Social Research: E.g. studying health patterns, societal behaviors, or mental well-being using survey data.
RECORD KEEPING
In many domains, organizations use both automatic and manual processes to keep track of their activities. For example, a hospital may track the length and result of every surgery it performs (and a governing body may require that hospital to report those results).
The reliability of such data will depend on the quality of the systems used to produce it. Scientific experiments also depend on diligent record keeping of results.
SECONDARY DATA ANALYSIS
Data can be compiled from existing knowledge artifacts or measurements, such as counting word occurrences in a historical text (computers can help with this!).
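Counting word occurrences is exactly the kind of task computers help with. A minimal sketch in Python (the file name is hypothetical):

```python
from collections import Counter
import re

# Count how often each word appears in a historical text.
# "historical_text.txt" is a hypothetical local file.
with open("historical_text.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(words)
print(counts.most_common(10))  # the ten most frequent words
```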
All of these methods of collecting data can lead to potential concerns and
biases...
When working with any data set, it is vital to consider where the data
came from (e.g., who recorded it, how, and why) to effectively and
meaningfully analyze it.
Computers’ abilities to record and persist data have led to an explosion of available
data values that can be analyzed. This includes:
personal biological measures (how many steps have I taken?)
social network structures (who are my friends?)
private information leaked from insecure websites and government agencies (what
are their Social Security numbers?).
In professional environments, you will likely be working with proprietary data
collected or managed by your organization. This might be anything from purchase
orders of fair-trade coffee to the results of medical research.
Luckily, there are also plenty of free, non-proprietary data sets that you can work
with.
Organizations will often make large amounts of data available to the public to
support experimental duplication, promote transparency, or just see what other
people can do with that data.
These data sets are great for building your skills and portfolio and are made
available in a variety of formats.
For example, data may be accessed as downloadable CSV spreadsheets or through
a web service API…
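For instance, a CSV file can be loaded with pandas, and a web service API queried with the requests library. A minimal sketch, where both URLs are placeholders rather than real endpoints:

```python
import pandas as pd
import requests

# Load a downloadable CSV data set into a table (placeholder URL).
df = pd.read_csv("https://example.com/data.csv")

# Query a web service API that returns JSON (placeholder URL).
response = requests.get("https://example.com/api/records", params={"year": 2024})
records = response.json()
```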
Government organizations produce a lot of data as part of their everyday activities,
and often make these data sets available in an effort to appear transparent and
accountable to the public.
You can currently find publicly available data from many countries that covers a
broad range of topics, though it can be influenced by the political situation
surrounding its gathering.
US government’s open data: https://www.data.gov/
Government of Canada open data: https://open.canada.ca/en/opendata
Open Government Data Platform India: https://data.gov.in/
City of Seattle open data portal: https://data.seattle.gov/
And so on
Journalism remains one of the most important contexts in which data is gathered
and analyzed.
Journalists do much of the legwork in producing data—searching existing artifacts,
questioning and surveying people, or otherwise revealing and connecting
previously hidden or ignored information.
News media usually publish the analyzed, summative information for consumption,
but they also may make the source data available for others to confirm and expand
on their work.
For example, the New York Times makes much of its historical data available
through a web service, while the data politics blog FiveThirtyEight makes all of the
data behind its articles available on GitHub.
Scientific studies are (in theory) well grounded and structured, providing
meaningful data when considered within their proper scope.
Since science needs to be disseminated and validated by others to be usable,
research is often made publicly available for others to study and critique.
Some scientific journals, such as the premier journal Nature, require authors to
make their data available for others to access and investigate (check out its list of
scientific data repositories!).
Nature: Recommended Data Repositories: https://www.nature.com/sdata/policies/repositories
To better integrate these services into people’s everyday lives, social media
companies make much of their data programmatically available for other
developers to access and use.
For example, it is possible to access live data from Twitter, which has been used for
a variety of interesting analyses.
Google also provides programmatic access to most of its many services (including
search and YouTube).
Twitter developer platform: https://developer.twitter.com/en/docs
Google APIs Explorer: https://developers.google.com/apis-explorer/
Online communities and their spaces are another great source of interesting
and varied data sets and analyses.
For example, Kaggle hosts a number of data sets as well as “challenges” to analyze
them.
Somewhat similarly, the UCI Machine Learning Repository maintains a collection of
data sets used in machine learning, drawn primarily from academic sources. And
there are many other online lists of data sources as well.
Kaggle: “the home of data science and machine learning”:
https://www.kaggle.com/
Socrata: data as a service platform: https://opendata.socrata.com/
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
/r/DataSets: https://www.reddit.com/r/datasets/
Once you acquire a data set, you will have to understand its structure and content
before (programmatically) investigating it.
Understanding the types of data you will encounter depends on your ability to
discern the level of measurement for a given piece of data, as well as the different
structures that are used to hold that data.
Data can be made up of a variety of types of values (represented by the concept of
“data type” ).
More generally, data values can also be discussed in terms of their level of
measurement—a way of classifying data values in terms of how they can be
measured and compared to other values.
Data (datum is the singular form of data) can be:
Structured data: targeted for computers to process. E.g. numeric, categorical, time-series data.
Unstructured data: targeted for humans to process/digest.
Semi-structured data: E.g. XML, HTML, log files, etc.
STRUCTURED DATA
Structured data is highly organized and resides in a fixed schema, such as databases or spreadsheets. It can be further divided based on its nature into categorical, numerical, and time-series data.
CATEGORICAL DATA
These represent the labels of multiple classes used to divide a variable into specific groups.
• Examples of categorical variables include race, sex, age group, and educational level.
• Although the latter two variables can also be considered in a numerical manner by using exact values for age and highest grade completed, for example, it is often more informative to categorize such variables into a relatively small number of ordered classes.
• Categorical data can be further categorized into nominal and ordinal data.
NOMINAL DATA
Categories with no inherent order. E.g. gender (male, female), colors (red, green, blue).
Nominal data contain simple codes assigned to objects as labels; these codes are not measurements. For example, the variable marital status can be generally categorized as (1) single, (2) married, and (3) divorced.
Nominal data can be represented with binomial values having two possible values (e.g., yes/no, true/false, good/bad) or multinomial values having three or more possible values (e.g., brown/green/blue, white/black/Latino/Asian, single/married/divorced).
Nominal data can be represented by strings (such as the name of a fruit), but also by numbers (e.g., “fruit type #1”, “fruit type #2”).
ORDINAL DATA
Categories with a meaningful order, but the intervals between values are not uniform.
These contain codes assigned to objects or events as labels that also represent the rank order among them. For example, the variable credit score can be generally categorized as (1) low, (2) medium, or (3) high.
Similar ordered relationships can be seen in variables such as age group (child, young, middle-aged, elderly), educational level (high school, bachelor’s, master’s), and customer satisfaction (low, medium, high).
Ordinal data establishes an order for nominal categories.
NUMERICAL DATA
Data that represents measurable quantities. These represent the numeric values of specific variables. Examples of numerically valued variables include age, number of children, total household income (in Naira), travel distance (in miles), and temperature (in degrees Fahrenheit).
Numeric values representing a variable can be integers (only whole numbers) or real (also fractional numbers).
Numeric data can also be called continuous data, implying that the variable contains continuous measures on a specific scale that allows insertion of interim values.
Unlike a discrete variable, which represents finite, countable data (e.g. number of students in a class, shoe size), a continuous variable represents scalable measurements, and it is possible for the data to contain an infinite number of fractional values (e.g. temperature (22.5°C), height (175.3 cm)).
Continuous vs. Discrete Data
Definition: Continuous data can take any value within a range, including fractions or decimals; discrete data can only take specific, separate values, often whole numbers.
Nature: Continuous data are measurable quantities; discrete data are countable quantities.
Range: Continuous data have infinite or highly granular values within a given range; discrete data have finite, distinct values.
Examples: Continuous: temperature (e.g., 23.5°C), height (e.g., 172.4 cm), weight (e.g., 68.25 kg). Discrete: number of students in a class (e.g., 25), shoe sizes (e.g., 7, 8, 9).
Intervals: Continuous values can exist between any two points (e.g., between 23.1°C and 23.2°C you could have 23.15°C); discrete values cannot exist between two discrete points (e.g., you can't have 2.5 students).
Graph representation: Continuous data are represented using smooth, continuous curves (e.g., a line graph); discrete data are represented using separate bars or points (e.g., a bar graph).
Measurements or counts: Continuous data are measurements (e.g., weight, time); discrete data are counts (e.g., number of cars, number of people).
To decide whether data is continuous or discrete, ask:
Can the data take fractional or decimal values? If yes, it’s continuous. If no, it’s discrete.
Is the data measured or counted? If measured, it’s continuous. If counted, it’s discrete.
Can it have infinite possible values within a range? If yes, it’s continuous. If no, it’s discrete.
Continuous Data: Used in regression problems (e.g.,
predicting house prices or temperature).
Discrete Data: Used in classification problems (e.g.,
identifying categories like "yes/no" or "low/medium/high").
Ordinal data in machine learning:
Classification Tasks: Used in predicting ordered categories. E.g. predicting customer satisfaction levels ("low," "medium," "high").
Risk Assessment: Example: Grading loan applicants by creditworthiness ("poor,"
"fair," "good").
Ranking Systems: Example: Ranking universities (e.g., "tier 1," "tier 2").
Healthcare: Example: Severity of a disease (e.g., "mild," "moderate," "severe").
Treat ordinal data as categorical, but respect the order:
Encode as integers (e.g., "low" = 1, "medium" = 2, "high" = 3).
Use techniques like ordinal encoding or target encoding.
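A minimal sketch of integer-encoding an ordinal feature with pandas (the satisfaction column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"satisfaction": ["low", "high", "medium", "low"]})

# Map each ordered category to an integer that preserves the ranking.
order = {"low": 1, "medium": 2, "high": 3}
df["satisfaction_encoded"] = df["satisfaction"].map(order)
```

scikit-learn's OrdinalEncoder achieves the same result and lets you state the category order explicitly.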
Nominal data in machine learning:
Categorization Problems: Example: Classifying products into types (e.g., "electronics," "clothing," "furniture").
Customer Segmentation: Example: Grouping customers by region ("North,"
"South").
Medical Diagnosis: Example: Classifying blood groups (e.g., "A," "B," "O").
Natural Language Processing: Example: Classifying text into topics (e.g., "sports,"
"politics," "entertainment").
Use one-hot encoding or label encoding to convert nominal data to numerical
form:
One-hot encoding: Creates binary columns for each category.
Label encoding: Assigns arbitrary integers to categories (not ideal if model assumes
order).
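A minimal sketch of both encodings with pandas (the region column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "North", "East"]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["region"], prefix="region")

# Label encoding: an arbitrary integer per category
# (implies an order the data does not have).
df["region_label"] = df["region"].astype("category").cat.codes
```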
TIME-SERIES DATA
Data collected over time, typically at regular intervals. Examples: stock prices, daily temperatures.
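In pandas, a time series is typically a column of values indexed by timestamps at a fixed frequency. A minimal sketch with made-up prices:

```python
import pandas as pd

# Three days of (made-up) closing prices at a daily frequency.
prices = pd.Series(
    [101.2, 102.5, 101.9],
    index=pd.date_range("2025-01-01", periods=3, freq="D"),
)
```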
UNSTRUCTURED DATA
Unstructured data does not follow a predefined format or schema. It can be further categorized by its format and processing requirements:
Text Data: Includes natural language text that requires processing. E.g. social media posts, email content, reviews.
Image Data: Includes static visual data. E.g. photographs, medical X-rays, satellite images.
Audio Data: Sound recordings that often require speech-to-text or waveform analysis. E.g. podcasts, voice commands, music tracks.
Video Data: Moving visual data. E.g. YouTube videos, surveillance footage.
Sensor Data: Raw data from IoT devices, typically unstructured until processed. E.g. accelerometer readings, gyroscope data, etc.
SEMI-STRUCTURED DATA
Partially organized with tags or markers but lacking a rigid schema. It can be categorized as follows:
Markup-Based Data: Data stored in formats with embedded tags or metadata.
E.g.
XML: <name>John Doe</name>
HTML: <p>This is a paragraph.</p>
JSON Data: Data stored in JavaScript Object Notation format, often used in APIs.
E.g. {"name": "John", "age": 30, "city": "New York"}
Log Data: Machine-generated logs with identifiable patterns. E.g.:
Server logs ([INFO] 2025-02-26 Server started).
Application logs.
Key-Value Data: Data stored as key-value pairs. E.g. Redis databases and
Configuration files (key: value).
Graph Data: Represents entities as nodes and relationships as edges, typically
used in graph databases. E.g. Social network data (e.g., connections on LinkedIn)
and Transportation networks.
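Because semi-structured formats carry their own tags, they can be parsed directly into structured objects. A minimal sketch parsing the JSON example above with Python's standard library:

```python
import json

raw = '{"name": "John", "age": 30, "city": "New York"}'

# json.loads turns the tagged text into a dictionary,
# after which individual fields are directly addressable.
record = json.loads(raw)
print(record["name"], record["age"])  # John 30
```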
The first thing you will need to do upon encountering a data set (whether
one you found online or one that was provided by your organization) is
to understand the meaning of the data.
This requires understanding the domain you are working in, as well as
the specific data schema you are working with.
In practice, most data sets are structured as tables of information, with
individual data values arranged into rows and columns.
These tables are similar to how data may be recorded in a spreadsheet
(using a program such as Microsoft Excel).
In a table, each row represents a record or observation: an instance of a
single thing being measured (e.g., a person, a sports match).
Each column represents a feature: a particular property or aspect of the
thing being measured (e.g., the person’s height or weight, the scores in a
sports game). Each data value can be referred to as a cell in the table.
   Name   Height  Weight
1  Ada    64      135
2  Bob    74      156
3  Chris  69      139
4  Dike   69      144
5  Emma   71      152

Fig. 1 A table of people’s weight and height. Rows represent observations, while columns represent features.
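The table in Fig. 1 maps directly onto a pandas DataFrame: rows are observations, columns are features, and each value is a cell. A minimal sketch:

```python
import pandas as pd

# Fig. 1 as a DataFrame: each row is one observation (a person),
# each column is one feature (a property of that person).
df = pd.DataFrame({
    "Name": ["Ada", "Bob", "Chris", "Dike", "Emma"],
    "Height": [64, 74, 69, 69, 71],
    "Weight": [135, 156, 139, 144, 152],
})

cell = df.loc[1, "Weight"]  # a single cell: Bob's weight (156)
```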
The first step toward being able to understand a data set is to research and
understand the data’s problem domain.
The problem domain is the set of topics that are relevant to the problem—that is,
the context for that data.
Working with data requires domain knowledge: you need to have a basic level of
understanding of that problem domain to do any sensible analysis of that data.
This includes understanding the significance and purpose of any features (so
you’re not doing math on contextless numbers), the range of expected values for a
feature (to detect outliers and other errors), and some of the details that may not be
explicit in the data set.
Once you have a general understanding of the context for a data set, you can begin
interpreting the data set itself. You will need to focus on understanding the data
schema (e.g., what is represented by the rows and columns), as well as the specific
context for those values.
“What metadata is available for the data set?”
“Who created the data set? Where does it come from?”
“What features does the data set have?” Go through each column and check:
1. What “real-world” aspect does each column attempt to capture?
2. For continuous data: what units are the values in?
3. For categorical data: what different categories are represented, and what do those
mean?
4. What is the possible range of values?
Ordinal and nominal data are types of categorical data and fall under
structured data.
Nominal: Purely categorical, unordered. Often encoded as integers
for machine learning (e.g., "Male" = 1, "Female" = 2).
Ordinal: Categorical but ordered. Can be encoded as integers (e.g.,
"Low" = 1, "Medium" = 2, "High" = 3) while preserving order.
Numerical (Discrete and Continuous): Quantitative values that can
be measured or counted.
Unstructured and semi-structured data often require preprocessing
to convert their content into structured formats.
For example: Converting text into categorical labels (e.g., sentiment
analysis: "positive" or "negative").
Extracting numerical features from images (e.g., average pixel
intensity).
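A minimal sketch of the image case using Pillow and NumPy (the file name is hypothetical):

```python
import numpy as np
from PIL import Image

# Convert an image (hypothetical file) to grayscale, then reduce it
# to a single structured feature: its mean pixel intensity.
pixels = np.asarray(Image.open("photo.jpg").convert("L"))
avg_intensity = float(pixels.mean())
```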
By categorizing data this way, you can better understand its nature,
preprocessing needs, and how to use it in ML tasks.