Data Analysis I:

Introduction to Data Science with R

Innocent Ndoh Mbue, PhD


(Assoc. Prof. in Eco-informatics)

Email: dndoh2009@gmail.com
Tel: 677540384/653754070

CM    TD    TP    TPE    Credit hours    Credits
24    9     8     4      45              3
Course Goals and Objectives
"The goal is where we want to be. The objectives are the steps
needed to get there."
Goals
This course is designed to give students the necessary skills to analyze
data from research projects.

This course provides the participants with a practical application of R.

Students will review several statistical techniques, gain an
understanding of when and why to use these various techniques as well
as how to apply them with confidence, interpret their output, and
graphically display the results.
10/17/2023 Prof. Dr. Ndoh Mbue 2
Course Outline
Unit I: An overview of data science
Unit II: Scales of Measurement
Unit III: Introduction to R
Unit IV: Data analysis in R
• Descriptive analytics with R: mean, mode, median, standard deviation, …
• Exploratory analytics with R
• Predictive analytics with R
• Types of learning:
- Linear regression: simple linear regression
- Implementation in R: lm(), predict(), plotting and fitting the
regression line
- Multiple linear regression: introduction, comparison with simple linear
regression
Unit V: ANOVA and applications in R

Course Requirements
• Access to a computer with an internet connection

• Ability and permissions to download files & install software

• Basic knowledge of English (the course is delivered in English)

ASSESSMENT COMPONENTS FOR COURSE GRADE

• Attendance + assignments: 2/3 of grade

• Effective class participation: 1/3 of grade

• You are considered absent if you are 30 minutes late

• With three absences, you lose 30% of the final CA mark

• Final written exam: 70% of grade

• Please respect deadlines!

Lecture I

An overview of data science

Motivating Questions
● What is data science?
● Who is a data scientist?
● What do data scientists do?
● What types of data scientist are there?
● Why do we need them?

What is Data Science (DS)?
• An area that manages, manipulates, extracts, and interprets knowledge
from tremendous amounts of data
• A multidisciplinary field of study that combines programming skills,
domain expertise and knowledge of statistics and mathematics to extract
useful insights and knowledge from data
• The goal is to address the challenges in big data

…Data Science ?
…solving problems with data…

…which step is most challenging?

Concentration in Data Science

• Mathematics and Applied Mathematics
• Applied Statistics/Data Analysis
• Solid Programming Skills (R, Python, Julia, AWS, SQL)
• Data Mining
• Database Storage and Management
• Machine Learning and Discovery

…Data Science

Data Scientists

They combine a wide range of skills and modern technologies to
analyze data collected from sensors, customers, smartphones, the web,
and other sources.

What makes a good data scientist?

…Data Scientists
• Data Scientist
– The Sexiest Job of the 21st Century
• They find stories and extract knowledge. They are not reporters.

• Data scientists are the key to realizing the opportunities
presented by big data. They bring structure to it, find
compelling patterns in it, and advise executives on the
implications for products, processes, and decisions.

What do Data Scientists do?

• National Security
• Cyber Security
• Business Analytics
• Engineering
• Healthcare
• And more ….

What does a good data scientist look like?

Data Science Project Life Cycle

What is data analysis?

…using data to discover useful information…

Divisions of Data Analysis

Data analysis is divided into:
• Descriptive statistics
• Exploratory data analysis (EDA)
• Confirmatory data analysis (CDA)

General approach to data analysis

Key requirements for becoming a Data Analyst:

1. Be well-versed in programming languages and tools (XML, JavaScript, or ETL
frameworks) and databases (SQL, SQLite, Db2, etc.), and have extensive
knowledge of reporting packages (Business Objects).
2. Be able to collect, organize, analyze and disseminate big data efficiently.
3. Have substantial technical knowledge in fields like database design,
data mining, and segmentation techniques.
4. Have a sound knowledge of statistical packages for analyzing massive datasets,
such as R, Python, SAS, Excel, and SPSS.

What is machine learning?

…creating and using models that learn from data…

Learning from data
a) Regression

b) Classification in Data Mining

Input data is provided to the model along with the known output (the class label).


c) Clustering

What is machine learning?

Machine learning workflow

Training phase, test phase, evaluation phase
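
The three phases named on this slide can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not a prescribed workflow: the built-in mtcars data set, the 70/30 split, the seed, and the choice of a simple linear regression of mpg on wt are all arbitrary choices made for the example.

```r
# Hypothetical illustration using the built-in mtcars data set
set.seed(42)                                   # make the random split reproducible
n         <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))  # 70% of rows for training (arbitrary)
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

# Training phase: fit a simple linear regression of fuel economy on weight
fit <- lm(mpg ~ wt, data = train)

# Test phase: predict mpg for cars the model has not seen
pred <- predict(fit, newdata = test)

# Evaluation phase: root mean squared error on the test set
rmse <- sqrt(mean((test$mpg - pred)^2))
rmse
```

A lower RMSE on the test set suggests the model generalizes better to unseen data.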


Data science lifecycle
A data science lifecycle is the iterative set of data science steps required
to deliver a project or analysis. There is no one-size-fits-all lifecycle for data science
projects, so you need to determine the one that best fits your business
requirements. Each step in the lifecycle should be performed carefully.

The main phases of data science life cycle


The algorithms and methods that data scientists use to sort data into categories
include the following, among others:
• Decision trees
• Naïve Bayes classifiers
• Support vector machines
• K-nearest neighbor
• Logistic regression
• Neural networks

Review Questions
1. What is data science?
2. How is data science different from data analysis?
3. Distinguish between supervised and unsupervised
learning algorithms.
4. What do you understand by:
a. Data
b. Machine learning
c. Big data
d. Data mining?

Lecture II

Scales of Measurement

What is Measurement?
• The assignment of numerals to objects or events according to rules.

• Numerals are labels that have no inherent meaning, for example zip
codes or automobile license plates.

• Numbers are numerals that have quantitative meaning and can be
analyzed, for example, age.

■ The rules for assigning labels to properties of variables are the most
important components of measurement, because poor rules produce
meaningless outcomes.

■ Concepts often cannot be measured directly, e.g., "intelligence," so what is
usually measured are indicators of constructs, such as speed, logic, verbal
skill, etc.
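
The distinction between numerals and numbers can be sketched in R. The values below are invented for illustration; the point is only that labels are best stored as text (or factors), while quantities are stored as numeric values.

```r
# Hypothetical values, for illustration only
zip_code <- c("00237", "00233", "00237")   # numerals: labels stored as text
age      <- c(21, 34, 28)                  # numbers: quantities

class(zip_code)      # "character"
mean(age)            # 27.66667, a meaningful average
table(zip_code)      # counting labels is meaningful; averaging them is not
```

Storing the zip codes as text also keeps R from treating them as quantities by accident.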
Levels of Measurement
■ Four levels of measurement have been identified:

• Nominal
• Ordinal
• Interval
• Ratio

These levels differ in how closely they approach the structure of the number
system we use.

• Understanding the level of measurement of variables used in research is
important because the level of measurement determines the types of statistical
analyses that can be conducted.

• The conclusions that can be drawn from research depend on the statistical
analysis used.
Possible data types and levels of measure.

***The type of data you have dictates the type of analysis you will perform.

Nominal Scale

■ In nominal measurement, all observations in one category are alike on some
property and differ from the members of the other categories on that property (e.g.,
sex, marital status).

■ No ordering of categories exists. We cannot say one category is better or worse, or
more or less than another.

Nominal: numbers used as names

■ Basic Empirical Operations
• Determination of equality
■ Permissible Statistics
• Number of cases
• Mode
• Contingency correlation
■ Examples
• Numbers on football jerseys
• Assignment of type or model numbers to classes
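
The permissible operations listed above can be sketched in R with factors. The sample below is invented for illustration, and the mode is computed by hand because base R has no built-in function for the statistical mode. The last line shows only a contingency table of two nominal variables, not a contingency coefficient.

```r
# Hypothetical sample, for illustration only
blood_group <- factor(c("A", "O", "B", "O", "AB", "O", "A"))
sex         <- factor(c("F", "M", "F", "M", "F", "F", "M"))

# Determination of equality: the only comparison a nominal scale supports
blood_group[1] == blood_group[7]          # TRUE (both "A")

# Number of cases per category
counts <- table(blood_group)
counts

# Mode: the most frequent category
mode_group <- names(counts)[which.max(counts)]
mode_group                                # "O"

# Contingency table of two nominal variables
table(sex, blood_group)
```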
Ordinal Scales: This scale has the characteristic of the nominal scale in that
different numbers mean different things, but also has the characteristic of "greater
or lesser". It measures a variable in terms of magnitude, or rank.

Examples:
• socioeconomic class
• grades
• preferences

• Ordinal scales tell us relative order, but give us no information about the
size of the differences between the categories.

• For example, runners in the 100 meter dash finish 1st, 2nd, 3rd, etc. Is the
number of seconds between 1st and 2nd place the same as that between 2nd
and 3rd place? Not necessarily.
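
The runner example can be made concrete with a short R sketch. The finishing times are invented for illustration; rank() gives the ordinal information, while diff() on the sorted times shows the gaps that the ranks alone do not reveal.

```r
# Hypothetical finishing times in seconds, for illustration only
times <- c(runner_a = 9.9, runner_b = 10.4, runner_c = 10.0)

# Ordinal information: finishing positions (1 = fastest)
places <- rank(times)
places

# Gaps between consecutive finishers, which the ranks do not carry
gaps <- diff(sort(times))
gaps
```

Here the gap between 2nd and 3rd place is larger than the gap between 1st and 2nd, although the ranks are evenly spaced.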

… Ordinal

To convert ordinal features into numerical values, integer encoding (label
encoding) can be used. Here, each value is assigned an integer
label that preserves the order, such as:

low=0, average=1, medium=2, high=3, very high=4.
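
One possible way to produce this encoding in R is an ordered factor. This is a sketch with invented observations; the level order is taken from the mapping above, and 1 is subtracted because R's integer codes start at 1 rather than 0.

```r
# Hypothetical observations, for illustration only
rating <- c("high", "low", "very high", "medium", "average", "low")

# Declare the order explicitly; otherwise R sorts the levels alphabetically
lvls       <- c("low", "average", "medium", "high", "very high")
rating_ord <- factor(rating, levels = lvls, ordered = TRUE)

# R's integer codes start at 1, so subtract 1 to get 0-based labels
codes <- as.integer(rating_ord) - 1
codes                          # 3 0 4 2 1 0

rating_ord < "high"            # order comparisons work on ordered factors
```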

… Ordinal Scale: Rank order data

• Most questionnaires use Likert-type items. For example, we may ask teachers
about their job satisfaction.
• Asking whether a teacher is very satisfied, satisfied, neutral, dissatisfied, or very
dissatisfied is using an ordinal scale of measurement.
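
A Likert-type item like this one could be summarized in R as follows. The responses are invented for illustration. The median category is taken from the integer codes of the ordered factor, which works cleanly here because the number of responses is odd.

```r
# Hypothetical responses from seven teachers, for illustration only
lvls <- c("very dissatisfied", "dissatisfied", "neutral",
          "satisfied", "very satisfied")
answers <- c("satisfied", "neutral", "very satisfied", "satisfied",
             "dissatisfied", "satisfied", "neutral")
satisfaction <- factor(answers, levels = lvls, ordered = TRUE)

# Frequencies per response category
table(satisfaction)

# Median category, taken from the integer codes (odd number of responses)
median_code     <- median(as.integer(satisfaction))
median_category <- lvls[median_code]
median_category                # "satisfied"
```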

Interval Scales
• This scale has the properties of the nominal and ordinal scales, and in
addition the distances between consecutive values are equal.
Temperature is the example that is usually given to illustrate an
interval scale.

• Distance between attributes has meaning. For example, for
temperature (in Fahrenheit), the distance from 30 to 40 is the same as the distance
from 70 to 80.

• Interval scales do not have a true zero: 0 degrees does not mean the
absence of heat (although it might feel like it).

If a change from 1 to 2 has the same strength as a change from 4 to 5, then we call it an
interval-level measurement (if not, then it is just an ordinal qualitative
measurement).
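
A short R sketch shows why differences are meaningful on an interval scale while ratios are not. The helper f_to_c is a name introduced here for illustration; it applies the standard Fahrenheit to Celsius conversion.

```r
# Convert Fahrenheit to Celsius
f_to_c <- function(f) (f - 32) * 5 / 9

# Equal differences stay equal when the unit changes
f_to_c(40) - f_to_c(30)        # 5.555556
f_to_c(80) - f_to_c(70)        # 5.555556

# Ratios do not survive the change of unit, because the zero point is arbitrary
80 / 40                        # 2
f_to_c(80) / f_to_c(40)        # 6
```

Since the ratio changes with the unit, 80 degrees cannot be called "twice as hot" as 40 degrees.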
Ratio Scales
• Ratio scales have all of the characteristics of the nominal, ordinal and
interval scales. In addition, however, ratio scales have a true zero.
• There are true ratios. One can use all mathematical operations on this scale.
• Examples: weight, height, time, distance
• 10 miles is twice as long as 5 miles. 0 miles is no distance.
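
By contrast with temperature, ratios on a ratio scale do not depend on the unit. A small R sketch, where miles_to_km is a helper name introduced here for illustration:

```r
# Convert miles to kilometres (1 mile = 1.609344 km)
miles_to_km <- function(mi) mi * 1.609344

10 / 5                                  # 2
miles_to_km(10) / miles_to_km(5)        # 2, the ratio is unit-free
miles_to_km(0)                          # 0, zero distance in any unit
```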
• In our descriptions of data in this course, we will assume that we are using
ratio scales most of the time. The statistical methods used with such data are
called PARAMETRIC STATISTICS.

• However, there will be times when all we have to work with are ordinal
scales. When we use these scales, our data will be rank ordered. The methods
used with such data are called NONPARAMETRIC STATISTICS.

Types of Variables
• A variable is a characteristic that changes or varies over time and/or for
different individuals or objects under consideration.
• Variables are the quantities measured in a sample. They may be classified as:

• Qualitative (Categorical)
– Nominal, e.g. gender, blood group, hair color
– Ordinal (ranked), e.g. mild, moderate or severe weather; income category
• Quantitative (Numerical)
– Discrete, e.g. no. of level II students in 2022; age
– Continuous, e.g. pH of a sample; elevation
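
These variable types map roughly onto R's data classes. The data frame below is a sketch with invented records and hypothetical column names: an unordered factor for a nominal variable, an ordered factor for an ordinal one, integers for a discrete count, and doubles for a continuous measurement.

```r
# Hypothetical records, for illustration only
samples <- data.frame(
  blood_group = factor(c("A", "O", "B")),                  # nominal
  severity    = factor(c("mild", "severe", "moderate"),
                       levels = c("mild", "moderate", "severe"),
                       ordered = TRUE),                    # ordinal
  n_students  = c(120L, 95L, 143L),                        # discrete
  ph          = c(6.8, 7.2, 5.9)                           # continuous
)

str(samples)                                   # class of each column
sapply(samples, function(col) class(col)[1])   # first class name per column
```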
Variables
• Variables can be further classified as:
– Dependent/Response. Variable of primary interest (e.g. blood pressure in
an antihypertensive drug trial). Not controlled by the experimenter.

– Independent/Predictor
• Called a Factor when controlled by the experimenter. It is often nominal
(e.g. treatment).
• Called a Covariate when not controlled.

• If the value of a variable cannot be predicted in advance, then the variable is
referred to as a random variable.
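
The roles of response, factor and covariate can be sketched with an R model formula. The data below are simulated, not real measurements: the variable names, the sample size, and the assumed treatment effect of -10 are all invented for the example.

```r
# Simulated trial data, for illustration only (not real measurements)
set.seed(1)
n <- 40
trial <- data.frame(
  treatment = factor(rep(c("placebo", "drug"), each = n / 2),
                     levels = c("placebo", "drug")),  # factor: set by experimenter
  age       = round(runif(n, min = 30, max = 70))     # covariate: only observed
)

# Response: blood pressure, generated with an assumed treatment effect of -10
trial$bp <- 120 + 0.5 * trial$age -
  10 * (trial$treatment == "drug") + rnorm(n, sd = 5)

# Response on the left of ~, predictors on the right
fit <- lm(bp ~ treatment + age, data = trial)
coef(fit)
```

The random noise added by rnorm() is what makes bp a random variable: its value cannot be predicted exactly in advance.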

Revision questions
1. Identify the type of data (nominal, ordinal, interval or ratio)
represented by each of the following. Confirm your answers by giving
your own examples.
– Blood group
– Temperature (Celsius)
– Ethnic group
– Job satisfaction index (1-5)
– Time taken to go to school from your house
– Number of industrial accidents in a given factory
– Number of cases of each reportable disease reported by a health
worker

Next class

Unit III: Introduction to R

NB:
1. Everyone MUST come with a PC
2. Make sure you load your phones with enough data for downloads
