DATA SCIENCE
•Data science is an interdisciplinary field that uses statistical, mathematical, and computational techniques to extract insights and knowledge from data. It combines elements of statistics, machine learning, data mining, and programming to analyze and interpret complex data.
Data and Information
Difference between data and information: Data refers to raw facts such as class marks, income levels, height, and weight, whereas information is processed data. If the median mark or the average height is computed from the raw data, that result then becomes information.
Steps taken to analyze data
1. Data Collection – Gathering raw data from various sources such as surveys, databases, financial reports, or government agencies (e.g., Uganda Bureau of Statistics, Bank of Uganda).
2. Data Cleaning – Removing errors, missing values, and inconsistencies to ensure accuracy and reliability.
3. Data Exploration (Exploratory Data Analysis) – Summarizing data using descriptive statistics, visualizations, and basic correlations.
4. Data Transformation – Structuring and converting data into a suitable format for analysis (e.g., standardizing through natural logarithms).
5. Data Modeling & Statistical Analysis – Applying techniques like regression analysis, machine learning, or hypothesis testing to identify trends and relationships.
6. Interpretation & Insights – Drawing conclusions, making forecasts, and formulating recommendations based on the results.
7. Reporting & Visualization – Presenting findings through tables, charts, and dashboards to communicate results effectively.
8. Making Recommendations – Suggesting strategies and taking actions based on findings.
9. Supporting Decision-Making – Backing decisions with evidence.
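Steps 1–5 above can be sketched in a few lines of plain Python. This is a minimal illustration with hypothetical income figures, not a full pipeline:

```python
import math
import statistics

# Step 1: collect raw data (hypothetical monthly incomes in UGX; None = missing)
raw_incomes = [450_000, 520_000, None, 610_000, 480_000, None, 1_200_000]

# Step 2: clean - drop missing values
incomes = [x for x in raw_incomes if x is not None]

# Step 3: explore - basic descriptive statistics
mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# Step 4: transform - standardize via natural logarithms
log_incomes = [math.log(x) for x in incomes]

# Step 5: a first analytical check - a mean well above the median hints at skew
print(len(incomes), mean_income, median_income)
```

In practice each step would use richer tooling (e.g., pandas for cleaning and exploration), but the logic is the same.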
Data Types
Types of data: Data can be broken down into two broad categories, either qualitative or quantitative, based on whether it describes characteristics or measures numerical values.
Qualitative Data, also known as categorical data, represents descriptive information that cannot be measured numerically. It includes characteristics, attributes, and labels. This type of data is further divided into nominal data and ordinal data.
Nominal Data (Unordered Categories)
•Nominal data consists of categories that do not have a specific order or ranking. The categories are mutually exclusive, meaning each observation belongs to only one category. However, there is no logical sequence or hierarchy among them.
•Examples:
•Gender: Male, Female, Other
•Nationality: Ugandan, Kenyan, Tanzanian
•Types of Businesses: Sole proprietorship, Partnership, Corporation
•Political Parties: NRM, FDC, DP, NUP
•Since there is no natural ranking in nominal data, you cannot perform order comparisons such as greater than or less than. You can only count and compare frequencies using methods like percentages or chi-square tests.
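Counting and comparing frequencies is all that is valid for nominal data. A minimal sketch with hypothetical survey responses:

```python
from collections import Counter

# Hypothetical nominal responses: no order, mutually exclusive categories
parties = ["NRM", "FDC", "NRM", "NUP", "DP", "NRM", "NUP", "NUP"]

counts = Counter(parties)                           # frequency table
total = len(parties)
percentages = {p: 100 * n / total for p, n in counts.items()}

print(counts["NRM"], percentages["NUP"])
```

Note that asking whether "NRM > NUP" would be meaningless; only the counts and their percentages carry information.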
Ordinal Data (Ordered Categories)
Ordinal data consists of categories that have a meaningful order or
ranking. However, the differences between the categories are not
necessarily equal. This means that while you can compare which
category is higher or lower, you cannot measure the exact difference
between them.
• Examples:
•Education Levels: Primary, Secondary, Tertiary
•Socioeconomic Status: Low, Middle, High
•Satisfaction Levels: Very Dissatisfied, Dissatisfied, Neutral, Satisfied,
Very Satisfied
•Credit Ratings: Poor, Fair, Good, Excellent
•Since ordinal data has an inherent order, you can perform
comparisons (e.g., higher vs. lower) and use statistical techniques like
median, percentiles, and non-parametric tests (e.g., the Mann-Whitney U
test and the Kruskal-Wallis test). However, you cannot assume equal
intervals between the categories.
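Because only the order is meaningful, ordinal categories are typically mapped to ranks and summarized with the median rather than the mean. A minimal sketch with hypothetical satisfaction responses:

```python
import statistics

# Ordinal categories mapped to ranks (the order matters, the intervals do not)
levels = ["Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"]
rank = {name: i for i, name in enumerate(levels, start=1)}

# Hypothetical survey responses
responses = ["Satisfied", "Neutral", "Very Satisfied", "Satisfied", "Dissatisfied"]
ranks = sorted(rank[r] for r in responses)

# The median rank is a valid summary; the mean would wrongly assume equal intervals
median_rank = statistics.median(ranks)
print(levels[int(median_rank) - 1])
```

The Mann-Whitney U and Kruskal-Wallis tests mentioned above work on exactly these kinds of ranks (both are available as `scipy.stats.mannwhitneyu` and `scipy.stats.kruskal`).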
Quantitative Data
Quantitative Data, on the other hand, consists of numerical values that can be measured and analyzed mathematically. It is used for calculations, forecasting, and statistical modeling. For instance, Uganda's GDP, inflation rate, and population are all quantitative data. This type of data is further divided into discrete data and continuous data.
Continuous data consists of measurable numerical values that can take any value within a given range. It differs from discrete data, which consists of whole numbers, because continuous data can have decimal or fractional values. Continuous data is mainly categorized into interval data and ratio data based on whether it has a true zero point.
Interval Data (No True Zero)
Interval data consists of numerical values where the difference between
values is meaningful, but there is no true zero (zero does not mean
"nothing"). This means that while you can perform addition and
subtraction, you cannot calculate meaningful ratios (e.g., twice as much).
Examples:
•Temperature in Celsius or Fahrenheit (0°C does not mean "no
temperature").
•Time on a clock (12:00 PM is not "twice as much time" as 6:00 AM).
•Since interval data has equal intervals, you can perform addition,
subtraction, and statistical analysis (mean, standard deviation,
correlation), but ratios (multiplication and division) do not make sense
because there is no absolute zero.
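The temperature example can be checked directly: the same two readings give a consistent difference but an inconsistent "ratio" once the unit changes, which is why ratios are meaningless for interval data:

```python
# Interval data: differences are meaningful, ratios are not.
c1, c2 = 10.0, 20.0                          # two Celsius readings
f1, f2 = c1 * 9 / 5 + 32, c2 * 9 / 5 + 32    # the same readings in Fahrenheit

ratio_c = c2 / c1        # 2.0 -- looks like "twice as hot"
ratio_f = f2 / f1        # 68/50 = 1.36 -- the "ratio" depends on the scale

diff_c = c2 - c1         # 10 degrees Celsius
diff_f = f2 - f1         # 18 degrees Fahrenheit -- differences convert consistently
print(ratio_c, ratio_f, diff_c, diff_f)
```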
Ratio Data (Has a True Zero)
Ratio data is like interval data, but it has a true zero, meaning zero represents the total absence of a quantity. This allows for all mathematical operations, including meaningful ratios (multiplication and division).
Examples:
•Income and Salary (0 UGX means "no income").
•Height and Weight (0 kg means "no weight").
•Age (0 years means no time has elapsed since birth).
•Interest Rates and Exchange Rates (0% means "no growth or decline").
•Since ratio data has equal intervals and a true zero, it is the most informative type of data and allows for all statistical operations, including calculating ratios, percentages, and growth rates.
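With a true zero, growth rates and ratios are well defined. A minimal sketch with hypothetical GDP figures:

```python
# Ratio data supports meaningful ratios and growth rates.
# The GDP figures below are hypothetical, for illustration only.
gdp_2022 = 45.0   # billion USD
gdp_2023 = 49.5

growth_rate = (gdp_2023 - gdp_2022) / gdp_2022 * 100   # percentage change
ratio = gdp_2023 / gdp_2022                            # valid because 0 means "no GDP"
print(growth_rate, ratio)
```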
Discrete data consists of countable, whole numbers that cannot take fractional or
decimal values. It represents quantities that are finite and distinct, often obtained
through counting rather than measurement. Discrete data is mainly divided into two
types:
Count Data (Simple Counts)
Count data represents the number of occurrences of an event or the number of objects
in a category. It is always non-negative (0, 1, 2, 3, ...) and is often used in frequency
analysis.
Examples:
•Number of students in a class (e.g., 45 students).
•Number of vehicles on a road (e.g., 120 cars).
•Number of foreign direct investment (FDI) projects in Uganda (e.g., 15 projects).
Since count data is whole numbers only, statistical methods like Poisson distribution
and frequency tables are commonly used for analysis.
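The Poisson distribution mentioned above assigns a probability to each possible count k given an average rate. A minimal sketch with a hypothetical arrival rate:

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson distribution with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Hypothetical: FDI projects arrive at an average rate of 2 per quarter
lam = 2.0
p_zero = poisson_pmf(0, lam)                            # chance of no projects
p_at_most_2 = sum(poisson_pmf(k, lam) for k in range(3))  # P(X <= 2)
print(round(p_zero, 4), round(p_at_most_2, 4))
```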
Binary Data (Only Two Possible Values)
Binary data consists of only two possible values, typically
represented as 0 and 1 (or "Yes/No", "Success/Failure",
"True/False"). It is commonly used in decision-making, surveys, and
probability models.
•Examples:
•Employment status (0 = Unemployed, 1 = Employed).
•Loan approval (0 = Denied, 1 = Approved).
•Product defect status (0 = No defect, 1 = Defective).
Binary data is often analyzed using logistic regression, probability
models, and classification algorithms in data science and economics.
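The proportion of 1s, and the odds and log-odds built from it, are the basic quantities that logistic regression models. A minimal sketch with hypothetical employment data:

```python
import math

# Hypothetical binary outcomes: 1 = employed, 0 = unemployed
status = [1, 0, 1, 1, 0, 1, 1, 1]

p = sum(status) / len(status)    # employment rate (proportion of 1s)
odds = p / (1 - p)               # odds of employment
log_odds = math.log(odds)        # the quantity logistic regression models linearly

print(p, odds, round(log_odds, 3))
```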
Data classification based on structure
Structured Data is highly organized and
follows a predefined format, making it easy
to store in databases and spreadsheets.
For example, Uganda Revenue Authority’s
tax records, government trade balance
databases, and banking transaction logs
are structured because they are
systematically recorded and can be easily
queried using tools like SQL or Excel.
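Because structured data follows a predefined format, it can be queried with SQL. A minimal sketch using Python's built-in sqlite3 module, with hypothetical tax records (the table and figures are made up for illustration):

```python
import sqlite3

# An in-memory table of hypothetical tax records
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tax_records (tin TEXT, year INTEGER, amount_ugx INTEGER)")
conn.executemany(
    "INSERT INTO tax_records VALUES (?, ?, ?)",
    [("T001", 2023, 5_000_000), ("T002", 2023, 12_000_000), ("T001", 2024, 6_500_000)],
)

# A predefined schema makes aggregation trivial: total tax paid per taxpayer
rows = conn.execute(
    "SELECT tin, SUM(amount_ugx) FROM tax_records GROUP BY tin ORDER BY tin"
).fetchall()
print(rows)
```

The same query would be impossible to write directly against unstructured data such as news articles or tweets, which first require text-processing techniques.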
Unstructured Data
• Unstructured Data lacks a predefined format and is more
challenging to analyze directly. It includes text, images,
videos, and social media posts. For instance, news articles
discussing Uganda’s economic policies or tweets about
exchange rate fluctuations are unstructured because they
do not follow a fixed structure and require advanced
techniques.
•When it comes to data modeling, it is important to note the following types of
variables.
Independent Variable
•The independent variable is the factor that is manipulated, controlled, or categorized
by the researcher to observe its effect on another variable. It is presumed to be the
cause in a cause-and-effect relationship.
•Characteristics:
•It is the variable that is changed or controlled in an experiment.
•It influences or predicts changes in the dependent variable.
•It is also referred to as the explanatory variable, predictor variable, or input variable.
•Example: In a study on how education level affects income, education level (e.g., high
school, bachelor's, master's) is the independent variable because it is expected to
influence income.
Dependent Variable
The dependent variable is the factor that is observed and measured to
assess the impact of changes in the independent variable. It is
considered the effect in a cause-and-effect relationship.
Characteristics:
•It is the outcome or response variable in an experiment.
•It depends on or is influenced by changes in the independent variable.
•It is also called the response variable or outcome variable.
Example: In the same study on education and income, income level is
the dependent variable because it is expected to change based on
education level.
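The education-and-income example can be made concrete with a least-squares fit: the independent variable (here coded as years of education) predicts the dependent variable (income). The figures below are hypothetical:

```python
import statistics

# Hypothetical data: years of education (independent) vs monthly income (dependent)
education = [11, 13, 16, 18, 21]       # x: predictor / explanatory variable
income = [400, 550, 800, 950, 1300]    # y: outcome / response (UGX thousands)

# Least-squares slope: expected change in income per extra year of education
x_bar, y_bar = statistics.mean(education), statistics.mean(income)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(education, income)) \
    / sum((x - x_bar) ** 2 for x in education)
intercept = y_bar - slope * x_bar
print(round(slope, 2), round(intercept, 2))
```

A positive slope is consistent with education influencing income; reversing the roles of x and y would generally give a different line, which is why identifying the independent and dependent variables matters.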
Data Analysis
Data analysis is conducted at three
primary levels: univariate, bivariate,
and multivariate. Understanding these
levels, as well as the types of data
involved, is crucial because different
data types determine the appropriate
statistical tests to be performed at each
level.
Univariate Level
At the univariate level, a single variable
is analyzed at a given point in time,
and relevant statistical measures or
parameters are computed. For
continuous variables, these statistics
are categorized into measures of
central tendency and measures of
dispersion.
Measures of central tendency
Measures of central tendency describe the central
location of a dataset and include the mean, median,
and mode. The mean (or average) is calculated by
summing all observations and dividing by the total
number of observations. It provides a rough estimate
of a typical value within the dataset. The median
represents the middle value when data points are
arranged in ascending or descending order,
effectively dividing the dataset into two equal halves.
The mode is the value that appears most frequently
in the dataset.
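All three measures are available in Python's standard statistics module. A minimal sketch with hypothetical class marks:

```python
import statistics

marks = [55, 70, 70, 62, 88, 70, 45]   # hypothetical class marks

mean = statistics.mean(marks)      # sum of observations / number of observations
median = statistics.median(marks)  # middle value of the sorted data
mode = statistics.mode(marks)      # most frequent value

print(round(mean, 2), median, mode)
```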
Measures of dispersion
Measures of dispersion, on the other
hand, assess the extent to which individual
observations deviate from the central
value. These include the range, variance,
and standard deviation. The range is the
difference between the highest and lowest
values, while variance is the average
squared deviation of each observation
from the mean. Variance is always non-
negative and provides a measure of data
spread. The standard deviation, which is
the square root of the variance, indicates
the average distance of each data point
from the mean. A smaller standard
deviation suggests that values are closer
to the mean, making the mean a more
reliable measure of central tendency.
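The three measures of dispersion can be computed the same way; note that the variance below is the population variance, i.e. the average squared deviation described above:

```python
import statistics

marks = [55, 70, 62, 88, 45]   # hypothetical dataset (mean = 64)

data_range = max(marks) - min(marks)    # highest value minus lowest value
variance = statistics.pvariance(marks)  # average squared deviation from the mean
std_dev = statistics.pstdev(marks)      # square root of the variance

print(data_range, variance, round(std_dev, 3))
```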
Understanding these measures is fundamental for selecting
appropriate analytical approaches, as they help determine the
nature and distribution of data. For example, data may follow
a normal distribution, where the mean provides an accurate
representation of the dataset, or it may be skewed, in which
case alternative measures such as the median are more
informative.
To illustrate, consider a company with three employees
earning UGX 100,000, UGX 150,000, and UGX 110,000. The
mean salary is UGX 120,000, which fairly represents the
employees' earnings, suggesting a normal distribution.
However, in another company where employees earn UGX
100,000, UGX 200,000, and UGX 3,000,000, the mean salary
is UGX 1,100,000. This figure is misleading since it is heavily
influenced by the extreme salary of one employee. Such data
would be considered skewed or non-normal.
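The two companies above can be checked in a few lines (same figures as in the text):

```python
import statistics

company_a = [100_000, 150_000, 110_000]    # roughly symmetric salaries
company_b = [100_000, 200_000, 3_000_000]  # one extreme salary

mean_a, median_a = statistics.mean(company_a), statistics.median(company_a)
mean_b, median_b = statistics.mean(company_b), statistics.median(company_b)

# In company B the mean is pulled far above what a typical employee earns,
# while the median still reflects the typical salary.
print(mean_a, median_a, mean_b, median_b)
```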
To formally test for normality, one can use measures such as
skewness or more direct tests like the Shapiro-Wilk test.
Skewness provides a basic indication of whether a dataset is
symmetric or skewed, while Shapiro-Wilk is a more rigorous
method. These tests are essential because they influence
methodological choices. For instance, if we want to measure
the correlation between age and work experience in a dataset
with skewed variables, we would use Spearman’s rank
correlation (a non-parametric method) instead of Pearson’s
correlation, which assumes normality.
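As a basic check, moment skewness (the average cubed z-score) is zero for symmetric data and positive when the tail is to the right, as in the salary example. A minimal hand-rolled sketch:

```python
import statistics

def skewness(data):
    """Moment skewness: average cubed z-score (0 for symmetric data)."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)

symmetric = [1, 2, 3, 4, 5]
skewed = [100_000, 200_000, 3_000_000]   # the salary data from the text

print(round(skewness(symmetric), 3), round(skewness(skewed), 3))
```

For the formal procedures mentioned above, `scipy.stats.shapiro` implements the Shapiro-Wilk test and `scipy.stats.spearmanr` computes Spearman's rank correlation.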
Bivariate Level
At the bivariate level, the analysis examines relationships
between two variables. The statistical approach used depends
on the nature of these variables.
∙ When both variables are continuous, correlation analysis
is employed to measure the strength and direction of their
relationship.
∙ When both variables are categorical, the chi-square test
is used to determine if they are significantly associated.
∙ When one variable is continuous and the other is
categorical, a one-way analysis of variance (One-Way
ANOVA) is applied to compare means across different
groups.
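For the categorical-categorical case, the chi-square statistic compares observed counts with the counts expected under independence. A minimal sketch for a hypothetical 2x2 table:

```python
# Hypothetical 2x2 contingency table:
# rows = gender, columns = employment status (observed counts)
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells
chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi_sq += (obs - expected) ** 2 / expected

print(round(chi_sq, 3))
```

In practice `scipy.stats.chi2_contingency` computes the same statistic along with its p-value; similarly, `scipy.stats.pearsonr` covers the continuous-continuous case and `scipy.stats.f_oneway` the one-way ANOVA case.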
Multivariate Level
Multivariate analysis involves examining relationships among multiple variables simultaneously. It
is the most advanced level of data analysis and is used to understand how a set of independent
variables influence one or more dependent variables.
Common multivariate techniques include:
∙ Linear Regression, which models the relationship between a dependent variable and one or
more independent variables. The dependent variable in this case must be continuous.
∙ Logistic Regression, which is used for binary outcomes.
∙ Cox Regression, commonly applied in survival analysis, where the dependent variable is the
time until an event occurs.
∙ Vector Auto Regression (VAR), used in time-series analysis to capture relationships among
multiple variables over time.
∙ Co-integration Analysis, which examines long-term relationships between time-series
variables.
Each of these methods is selected based on the specific research question and the nature of the
data involved. Understanding the appropriate analytical level and corresponding statistical tests
ensures accurate and meaningful insights from data.