DATA SCIENCE
•Data science is an interdisciplinary field that uses statistical, mathematical, and computational techniques to extract insights and knowledge from data. It combines elements of statistics, machine learning, data mining, and programming to analyze and interpret complex data.
Data and Information
Difference between data and information: Data refers to raw facts such as class marks, income levels, height, and weight, whereas information is processed data. If the median mark or the average height is computed from the raw data, that result then becomes information.
Steps taken to analyze data
1. Data Collection – Gathering raw data from various sources such as surveys, databases, financial reports, or government agencies (e.g., Uganda Bureau of Statistics, Bank of Uganda).
2. Data Cleaning – Removing errors, missing values, and inconsistencies to ensure accuracy and reliability.
3. Data Exploration (Exploratory Data Analysis) – Summarizing data using descriptive statistics, visualizations, and basic correlations.
4. Data Transformation – Structuring and converting data into a suitable format for analysis (e.g., standardizing through natural logarithms).
5. Data Modeling & Statistical Analysis – Applying techniques like regression analysis, machine learning, or hypothesis testing to identify trends and relationships.
6. Interpretation & Insights – Drawing conclusions, making forecasts, and formulating recommendations based on the results.
7. Reporting & Visualization – Presenting findings through tables, charts, and dashboards to communicate results effectively.
8. Making Recommendations – Suggesting strategies and taking actions based on findings.
9. Supporting Decision-Making – Backing decisions with evidence.
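Steps 1–5 above can be sketched in a few lines of plain Python. This is a minimal illustration with hypothetical income figures, not a full pipeline:

```python
import math
import statistics

# Step 1: collect raw data (hypothetical monthly incomes in UGX; None = missing)
raw_incomes = [450_000, 520_000, None, 610_000, 480_000, None, 1_200_000]

# Step 2: clean - drop missing values
incomes = [x for x in raw_incomes if x is not None]

# Step 3: explore - basic descriptive statistics
mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# Step 4: transform - standardize via natural logarithms
log_incomes = [math.log(x) for x in incomes]

# Step 5: a first analytical check - a mean well above the median hints at skew
print(len(incomes), mean_income, median_income)
```

In practice each step would use richer tooling (e.g., pandas for cleaning and exploration), but the logic is the same.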
Data Types
Types of data: Data can be broken down into two broad categories, either qualitative or quantitative, based on whether it describes characteristics or measures numerical values.
Qualitative Data, also known as categorical data, represents descriptive information that cannot be measured numerically. It includes characteristics, attributes, and labels. This type of data is further divided into nominal data and ordinal data.
Nominal Data (Unordered Categories)
•Nominal data consists of categories that do not have a specific order or ranking. The categories are mutually exclusive, meaning each observation belongs to only one category. However, there is no logical sequence or hierarchy among them.
•Examples:
•Gender: Male, Female, Other
•Nationality: Ugandan, Kenyan, Tanzanian
•Types of Businesses: Sole proprietorship, Partnership, Corporation
•Political Parties: NRM, FDC, DP, NUP
•Since there is no natural ranking in nominal data, you cannot perform order comparisons such as greater than or less than. You can only count and compare frequencies using methods like percentages or chi-square tests.
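Counting and comparing frequencies is all that is valid for nominal data. A minimal sketch with hypothetical survey responses:

```python
from collections import Counter

# Hypothetical nominal responses: no order, mutually exclusive categories
parties = ["NRM", "FDC", "NRM", "NUP", "DP", "NRM", "NUP", "NUP"]

counts = Counter(parties)                           # frequency table
total = len(parties)
percentages = {p: 100 * n / total for p, n in counts.items()}

print(counts["NRM"], percentages["NUP"])
```

Note that asking whether "NRM > NUP" would be meaningless; only the counts and their percentages carry information.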
Ordinal Data (Ordered Categories)
Ordinal data consists of categories that have a meaningful order or
ranking. However, the differences between the categories are not
necessarily equal. This means that while you can compare which
category is higher or lower, you cannot measure the exact difference
between them.
• Examples:
•Education Levels: Primary, Secondary, Tertiary
•Socioeconomic Status: Low, Middle, High
•Satisfaction Levels: Very Dissatisfied, Dissatisfied, Neutral, Satisfied,
Very Satisfied
•Credit Ratings: Poor, Fair, Good, Excellent
•Since ordinal data has an inherent order, you can perform
comparisons (e.g., higher vs. lower) and use statistical techniques like
median, percentiles, and non-parametric tests (e.g., the Mann-Whitney U
test and the Kruskal-Wallis test). However, you cannot assume equal
intervals between the categories.
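Because only the order is meaningful, ordinal categories are typically mapped to ranks and summarized with the median rather than the mean. A minimal sketch with hypothetical satisfaction responses:

```python
import statistics

# Ordinal categories mapped to ranks (the order matters, the intervals do not)
levels = ["Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"]
rank = {name: i for i, name in enumerate(levels, start=1)}

# Hypothetical survey responses
responses = ["Satisfied", "Neutral", "Very Satisfied", "Satisfied", "Dissatisfied"]
ranks = sorted(rank[r] for r in responses)

# The median rank is a valid summary; the mean would wrongly assume equal intervals
median_rank = statistics.median(ranks)
print(levels[int(median_rank) - 1])
```

The Mann-Whitney U and Kruskal-Wallis tests mentioned above work on exactly these kinds of ranks (both are available as `scipy.stats.mannwhitneyu` and `scipy.stats.kruskal`).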
Quantitative Data
Quantitative Data, on the other hand, consists of numerical values that can be measured and analyzed mathematically. It is used for calculations, forecasting, and statistical modeling. For instance, Uganda's GDP, inflation rate, and population are all quantitative data. This type of data is further divided into discrete data and continuous data.
Continuous data consists of measurable numerical values that can take any value within a given range. It differs from discrete data, which consists of whole numbers, because continuous data can have decimal or fractional values. Continuous data is mainly categorized into interval data and ratio data based on whether it has a true zero point.
Interval Data (No True Zero)
Interval data consists of numerical values where the difference between
values is meaningful, but there is no true zero (zero does not mean
"nothing"). This means that while you can perform addition and
subtraction, you cannot calculate meaningful ratios (e.g., twice as much).
Examples:
•Temperature in Celsius or Fahrenheit (0°C does not mean "no
temperature").
•Time on a clock (12:00 PM is not "twice as much time" as 6:00 AM).
•Since interval data has equal intervals, you can perform addition,
subtraction, and statistical analysis (mean, standard deviation,
correlation), but ratios (multiplication and division) do not make sense
because there is no absolute zero.
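The temperature example can be checked directly: the same two readings give a consistent difference but an inconsistent "ratio" once the unit changes, which is why ratios are meaningless for interval data:

```python
# Interval data: differences are meaningful, ratios are not.
c1, c2 = 10.0, 20.0                          # two Celsius readings
f1, f2 = c1 * 9 / 5 + 32, c2 * 9 / 5 + 32    # the same readings in Fahrenheit

ratio_c = c2 / c1        # 2.0 -- looks like "twice as hot"
ratio_f = f2 / f1        # 68/50 = 1.36 -- the "ratio" depends on the scale

diff_c = c2 - c1         # 10 degrees Celsius
diff_f = f2 - f1         # 18 degrees Fahrenheit -- differences convert consistently
print(ratio_c, ratio_f, diff_c, diff_f)
```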
Ratio Data (Has a True Zero)
Ratio data is like interval data, but it has a true zero, meaning zero represents the total absence of a quantity. This allows for all mathematical operations, including meaningful ratios (multiplication and division).
Examples:
•Income and Salary (0 UGX means "no income").
•Height and Weight (0 kg means "no weight").
•Age (0 years means no time has elapsed since birth).
•Interest Rates and Exchange Rates (0% means "no growth or decline").
•Since ratio data has equal intervals and a true zero, it is the most informative type of data and allows for all statistical operations, including calculating ratios, percentages, and growth rates.
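With a true zero, growth rates and ratios are well defined. A minimal sketch with hypothetical GDP figures:

```python
# Ratio data supports meaningful ratios and growth rates.
# The GDP figures below are hypothetical, for illustration only.
gdp_2022 = 45.0   # billion USD
gdp_2023 = 49.5

growth_rate = (gdp_2023 - gdp_2022) / gdp_2022 * 100   # percentage change
ratio = gdp_2023 / gdp_2022                            # valid because 0 means "no GDP"
print(growth_rate, ratio)
```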
Discrete data consists of countable, whole numbers that cannot take fractional or
decimal values. It represents quantities that are finite and distinct, often obtained
through counting rather than measurement. Discrete data is mainly divided into two
types:
Count Data (Simple Counts)
Count data represents the number of occurrences of an event or the number of objects
in a category. It is always non-negative (0, 1, 2, 3, ...) and is often used in frequency
analysis.
Examples:
•Number of students in a class (e.g., 45 students).
•Number of vehicles on a road (e.g., 120 cars).
•Number of foreign direct investment (FDI) projects in Uganda (e.g., 15 projects).
Since count data is whole numbers only, statistical methods like Poisson distribution
and frequency tables are commonly used for analysis.
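The Poisson distribution mentioned above assigns a probability to each possible count k given an average rate. A minimal sketch with a hypothetical arrival rate:

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson distribution with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Hypothetical: FDI projects arrive at an average rate of 2 per quarter
lam = 2.0
p_zero = poisson_pmf(0, lam)                            # chance of no projects
p_at_most_2 = sum(poisson_pmf(k, lam) for k in range(3))  # P(X <= 2)
print(round(p_zero, 4), round(p_at_most_2, 4))
```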
Binary Data (Only Two Possible Values)
Binary data consists of only two possible values, typically
represented as 0 and 1 (or "Yes/No", "Success/Failure",
"True/False"). It is commonly used in decision-making, surveys, and
probability models.
•Examples:
•Employment status (0 = Unemployed, 1 = Employed).
•Loan approval (0 = Denied, 1 = Approved).
•Product defect status (0 = No defect, 1 = Defective).
Binary data is often analyzed using logistic regression, probability
models, and classification algorithms in data science and economics.
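The proportion of 1s, and the odds and log-odds built from it, are the basic quantities that logistic regression models. A minimal sketch with hypothetical employment data:

```python
import math

# Hypothetical binary outcomes: 1 = employed, 0 = unemployed
status = [1, 0, 1, 1, 0, 1, 1, 1]

p = sum(status) / len(status)    # employment rate (proportion of 1s)
odds = p / (1 - p)               # odds of employment
log_odds = math.log(odds)        # the quantity logistic regression models linearly

print(p, odds, round(log_odds, 3))
```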
Data classification based on structure
Structured Data is highly organized and
follows a predefined format, making it easy
to store in databases and spreadsheets.
For example, Uganda Revenue Authority’s
tax records, government trade balance
databases, and banking transaction logs
are structured because they are
systematically recorded and can be easily
queried using tools like SQL or Excel.
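Because structured data follows a predefined format, it can be queried with SQL. A minimal sketch using Python's built-in sqlite3 module, with hypothetical tax records (the table and figures are made up for illustration):

```python
import sqlite3

# An in-memory table of hypothetical tax records
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tax_records (tin TEXT, year INTEGER, amount_ugx INTEGER)")
conn.executemany(
    "INSERT INTO tax_records VALUES (?, ?, ?)",
    [("T001", 2023, 5_000_000), ("T002", 2023, 12_000_000), ("T001", 2024, 6_500_000)],
)

# A predefined schema makes aggregation trivial: total tax paid per taxpayer
rows = conn.execute(
    "SELECT tin, SUM(amount_ugx) FROM tax_records GROUP BY tin ORDER BY tin"
).fetchall()
print(rows)
```

The same query would be impossible to write directly against unstructured data such as news articles or tweets, which first require text-processing techniques.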
Unstructured Data
• Unstructured Data lacks a predefined format and is more
challenging to analyze directly. It includes text, images,
videos, and social media posts. For instance, news articles
discussing Uganda’s economic policies or tweets about
exchange rate fluctuations are unstructured because they
do not follow a fixed structure and require advanced
techniques.
•When it comes to data modeling, it is important to note the following types of
variables.
Independent Variable
•The independent variable is the factor that is manipulated, controlled, or categorized
by the researcher to observe its effect on another variable. It is presumed to be the
cause in a cause-and-effect relationship.
•Characteristics:
•It is the variable that is changed or controlled in an experiment.
•It influences or predicts changes in the dependent variable.
•It is also referred to as the explanatory variable, predictor variable, or input variable.
•Example: In a study on how education level affects income, education level (e.g., high
school, bachelor's, master's) is the independent variable because it is expected to
influence income.
Dependent Variable
The dependent variable is the factor that is observed and measured to
assess the impact of changes in the independent variable. It is
considered the effect in a cause-and-effect relationship.
Characteristics:
•It is the outcome or response variable in an experiment.
•It depends on or is influenced by changes in the independent variable.
•It is also called the response variable or outcome variable.
Example: In the same study on education and income, income level is
the dependent variable because it is expected to change based on
education level.
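The education-and-income example can be made concrete with a least-squares fit: the independent variable (here coded as years of education) predicts the dependent variable (income). The figures below are hypothetical:

```python
import statistics

# Hypothetical data: years of education (independent) vs monthly income (dependent)
education = [11, 13, 16, 18, 21]       # x: predictor / explanatory variable
income = [400, 550, 800, 950, 1300]    # y: outcome / response (UGX thousands)

# Least-squares slope: expected change in income per extra year of education
x_bar, y_bar = statistics.mean(education), statistics.mean(income)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(education, income)) \
    / sum((x - x_bar) ** 2 for x in education)
intercept = y_bar - slope * x_bar
print(round(slope, 2), round(intercept, 2))
```

A positive slope is consistent with education influencing income; reversing the roles of x and y would generally give a different line, which is why identifying the independent and dependent variables matters.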
Data Analysis
Data analysis is conducted at three
primary levels: univariate, bivariate,
and multivariate. Understanding these
levels, as well as the types of data
involved, is crucial because different
data types determine the appropriate
statistical tests to be performed at each
level.
Univariate Level
At the univariate level, a single variable
is analyzed at a given point in time,
and relevant statistical measures or
parameters are computed. For
continuous variables, these statistics
are categorized into measures of
central tendency and measures of
dispersion.
Measures of central tendency
Measures of central tendency describe the central
location of a dataset and include the mean, median,
and mode. The mean (or average) is calculated by
summing all observations and dividing by the total
number of observations. It provides a rough estimate
of a typical value within the dataset. The median
represents the middle value when data points are
arranged in ascending or descending order,
effectively dividing the dataset into two equal halves.
The mode is the value that appears most frequently
in the dataset.
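All three measures are available in Python's standard statistics module. A minimal sketch with hypothetical class marks:

```python
import statistics

marks = [55, 70, 70, 62, 88, 70, 45]   # hypothetical class marks

mean = statistics.mean(marks)      # sum of observations / number of observations
median = statistics.median(marks)  # middle value of the sorted data
mode = statistics.mode(marks)      # most frequent value

print(round(mean, 2), median, mode)
```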
Measures of dispersion
Measures of dispersion, on the other
hand, assess the extent to which individual
observations deviate from the central
value. These include the range, variance,
and standard deviation. The range is the
difference between the highest and lowest
values, while variance is the average
squared deviation of each observation
from the mean. Variance is always non-
negative and provides a measure of data
spread. The standard deviation, which is
the square root of the variance, indicates
the average distance of each data point
from the mean. A smaller standard
deviation suggests that values are closer
to the mean, making the mean a more
reliable measure of central tendency.
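The three measures of dispersion can be computed the same way; note that the variance below is the population variance, i.e. the average squared deviation described above:

```python
import statistics

marks = [55, 70, 62, 88, 45]   # hypothetical dataset (mean = 64)

data_range = max(marks) - min(marks)    # highest value minus lowest value
variance = statistics.pvariance(marks)  # average squared deviation from the mean
std_dev = statistics.pstdev(marks)      # square root of the variance

print(data_range, variance, round(std_dev, 3))
```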
Understanding these measures is fundamental for selecting
appropriate analytical approaches, as they help determine the
nature and distribution of data. For example, data may follow
a normal distribution, where the mean provides an accurate
representation of the dataset, or it may be skewed, in which
case alternative measures such as the median are more
informative.
To illustrate, consider a company with three employees
earning UGX 100,000, UGX 150,000, and UGX 110,000. The
mean salary is UGX 120,000, which fairly represents the
employees' earnings, suggesting a normal distribution.
However, in another company where employees earn UGX
100,000, UGX 200,000, and UGX 3,000,000, the mean salary
is UGX 1,100,000. This figure is misleading since it is heavily
influenced by the extreme salary of one employee. Such data
would be considered skewed or non-normal.
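The two companies above can be checked in a few lines (same figures as in the text):

```python
import statistics

company_a = [100_000, 150_000, 110_000]    # roughly symmetric salaries
company_b = [100_000, 200_000, 3_000_000]  # one extreme salary

mean_a, median_a = statistics.mean(company_a), statistics.median(company_a)
mean_b, median_b = statistics.mean(company_b), statistics.median(company_b)

# In company B the mean is pulled far above what a typical employee earns,
# while the median still reflects the typical salary.
print(mean_a, median_a, mean_b, median_b)
```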
To formally test for normality, one can use measures such as
skewness or more direct tests like the Shapiro-Wilk test.
Skewness provides a basic indication of whether a dataset is
symmetric or skewed, while Shapiro-Wilk is a more rigorous
method. These tests are essential because they influence
methodological choices. For instance, if we want to measure
the correlation between age and work experience in a dataset
with skewed variables, we would use Spearman’s rank
correlation (a non-parametric method) instead of Pearson’s
correlation, which assumes normality.
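As a basic check, moment skewness (the average cubed z-score) is zero for symmetric data and positive when the tail is to the right, as in the salary example. A minimal hand-rolled sketch:

```python
import statistics

def skewness(data):
    """Moment skewness: average cubed z-score (0 for symmetric data)."""
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)

symmetric = [1, 2, 3, 4, 5]
skewed = [100_000, 200_000, 3_000_000]   # the salary data from the text

print(round(skewness(symmetric), 3), round(skewness(skewed), 3))
```

For the formal procedures mentioned above, `scipy.stats.shapiro` implements the Shapiro-Wilk test and `scipy.stats.spearmanr` computes Spearman's rank correlation.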
Bivariate Level
At the bivariate level, the analysis examines relationships
between two variables. The statistical approach used depends
on the nature of these variables.
∙ When both variables are continuous, correlation analysis
is employed to measure the strength and direction of their
relationship.
∙ When both variables are categorical, the chi-square test
is used to determine if they are significantly associated.
∙ When one variable is continuous and the other is
categorical, a one-way analysis of variance (One-Way
ANOVA) is applied to compare means across different
groups.
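For the categorical-categorical case, the chi-square statistic compares observed counts with the counts expected under independence. A minimal sketch for a hypothetical 2x2 table:

```python
# Hypothetical 2x2 contingency table:
# rows = gender, columns = employment status (observed counts)
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells
chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi_sq += (obs - expected) ** 2 / expected

print(round(chi_sq, 3))
```

In practice `scipy.stats.chi2_contingency` computes the same statistic along with its p-value; similarly, `scipy.stats.pearsonr` covers the continuous-continuous case and `scipy.stats.f_oneway` the one-way ANOVA case.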
Multivariate Level
Multivariate analysis involves examining relationships among multiple variables simultaneously. It
is the most advanced level of data analysis and is used to understand how a set of independent
variables influence one or more dependent variables.
Common multivariate techniques include:
∙ Linear Regression, which models the relationship between a dependent variable and one or
more independent variables. The dependent variable in this case must be continuous.
∙ Logistic Regression, which is used for binary outcomes.
∙ Cox Regression, commonly applied in survival analysis, where the dependent variable is the
time until an event occurs.
∙ Vector Auto Regression (VAR), used in time-series analysis to capture relationships among
multiple variables over time.
∙ Co-integration Analysis, which examines long-term relationships between time-series
variables.
Each of these methods is selected based on the specific research question and the nature of the
data involved. Understanding the appropriate analytical level and corresponding statistical tests
ensures accurate and meaningful insights from data.