Data Analysis I:
Introduction to Data Science with R
        Innocent Ndoh Mbue, PhD
      (Assoc. Prof. in Eco-informatics)
           Email: dndoh2009@gmail.com
            Tel: 677540384/653754070
CM    TD    TP    TPE    Credit hours    Credits
24    9     8     4      45              3
                  Course Goals and Objectives
 “The goal is where we want to be. The objectives are the steps
 needed to get there.”
Goals
This course is designed to give students the necessary skills to analyze
data from research projects.
This course provides the participants with a practical application of R.
Students will review several statistical techniques, gain an
understanding of when and why to use these various techniques as well
as how to apply them with confidence, interpret their output, and
graphically display the results.
                              10/17/2023           Prof. Dr. Ndoh Mbue   2
  Course Outline
Unit I: An overview of data science
Unit II: Scales of Measurement
Unit III: Introduction to R
Unit IV: Data analysis in R
     Descriptive analytics with R
          Mean, Mode, Median, Standard deviation, …
     Exploratory Analytics with R
     Predictive Analytics with R
           Types of Learning:
- Linear Regression - Simple Linear Regression
- Implementation in R - functions: lm(), predict() - plotting and fitting the
  regression line.
- Multiple Linear Regression - Introduction - comparison with simple linear
  regression
Unit V: ANOVA and applications in R
Course Requirements
• Access to a computer with an internet connection
• Ability and permissions to download files and install software
• Basic knowledge of English (the course is delivered in English)
ASSESSMENT COMPONENTS FOR COURSE GRADE
               Attendance + Assignments: 2/3 of grade.
               Effective class participation: 1/3 of grade.
               You are considered absent if you arrive 30 minutes late.
               After three absences, you lose 30% of the final CA mark.
               Final Written Exam = 70% of grade!
               Please respect deadlines!
                 Lecture I
            An overview of data science
            Motivating Questions
                 ● What is Data Science?
                 ● Who is a Data Scientist?
                 ● What do Data Scientists do?
                 ● What types of Data Scientist are there?
                 ● Why do we need them?
              What is Data Science (DS)?
• An area that manages, manipulates, extracts, and interprets knowledge
  from tremendous amounts of data.
• A multidisciplinary field of study that combines programming skills,
  domain expertise and knowledge of statistics and mathematics to extract
  useful insights and knowledge from data.
• The goal is to address the challenges in big data.
                                   …Data Science ?
      …solving problems with data…
…which step is most challenging?
            Concentration in Data Science
•   Mathematics and Applied Mathematics
•   Applied Statistics/Data Analysis
•   Solid Programming Skills (R, Python, Julia, AWS, SQL)
•   Data Mining
•   Database Storage and Management
•   Machine Learning and discovery
            …Data Science
                        Data Scientists
They combine a wide range of skills and modern technologies to
analyze data collected from sensors, customers, smartphones, the web,
and other sources.
What makes a good data scientist?
                     …Data Scientists
 • Data Scientist
    – The Sexiest Job of the 21st Century
 • They find stories, extract knowledge. They are not reporters
• Data scientists are the key to realizing the opportunities
  presented by big data. They bring structure to it, find
  compelling patterns in it, and advise executives on the
  implications for products, processes, and decisions
                     What do Data Scientists do?
            •   National Security
            •   Cyber Security
            •   Business Analytics
            •   Engineering
            •   Healthcare
            •   And more ….
What does a good data scientist look like?
Data Science Project Life Cycle
            What is data analysis?
…using data to discover useful information…
                         Divisions of Data Analysis
                             Data Analysis
Descriptive Statistics        Exploratory data analysis (EDA)   Confirmatory data analysis (CDA)
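
As a small illustration of the descriptive statistics branch, the sketch below computes the mean, median, mode and standard deviation of a sample. It is written in Python purely for illustration (the course itself works in R), and the `ages` values are made up.

```python
import statistics

# Hypothetical sample: ages of eight survey respondents (made-up values).
ages = [21, 23, 23, 25, 27, 30, 31, 36]

mean_age = statistics.mean(ages)      # arithmetic average
median_age = statistics.median(ages)  # middle value of the sorted data
mode_age = statistics.mode(ages)      # most frequent value
sd_age = statistics.stdev(ages)       # sample standard deviation (n - 1)

print(mean_age, median_age, mode_age, round(sd_age, 2))
```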
            General approach to data analysis
 Key requirements for becoming a Data Analyst:
1.   Be well-versed in programming languages (XML, JavaScript, or ETL
     frameworks) and databases (SQL, SQLite, Db2, etc.), and have extensive
     knowledge of reporting packages (Business Objects).
2.   Be able to collect, organize, analyze and disseminate Big Data efficiently.
3.   Have substantial technical knowledge in fields like database design,
     data mining, and segmentation techniques.
4.   Have a sound knowledge of statistical packages for analyzing massive datasets
     such as R, Python, SAS, Excel, and SPSS, …
            What is machine learning?
…creating and using models that learn from data…
   Learning from data
       a) Regression
b) Classification in Data Mining
                 Input data is provided to the model along with the known output
                 (supervised learning).
c) Clustering
What is machine learning?
Machine learning workflow
 training phase, test phase, evaluation phase
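
The three phases named above can be sketched with a simple linear regression fitted by least squares. This is only an illustration in Python (the course itself uses R); the data, the `fit_line` helper and the fixed 6/2 split are assumptions made for the example.

```python
# Hypothetical data: x = hours studied, y = exam score (made-up values).
data = [(1, 52), (2, 55), (3, 61), (4, 64), (5, 70), (6, 73), (7, 79), (8, 82)]

# Hold out the last two observations; a real project would split at random.
train, test = data[:6], data[6:]


def fit_line(points):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Training phase: estimate the model from the training data only.
intercept, slope = fit_line(train)

# Test phase: predict the held-out observations.
predictions = [intercept + slope * x for x, _ in test]

# Evaluation phase: mean absolute error on the test set.
mae = sum(abs(p - y) for p, (_, y) in zip(predictions, test)) / len(test)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, test MAE={mae:.2f}")
```

Evaluating on observations the model never saw during training is what makes the error estimate honest.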
Data science lifecycle
A data science lifecycle is defined as the iterative set of data science steps required
to deliver a project or analysis. There is no one-size-fits-all lifecycle that defines
data science projects. Hence you need to determine the one that best fits your business
requirements. Each step in the lifecycle should be performed carefully.
The main phases of the data science life cycle
The algorithms and methods that data scientists use to filter data into categories
include the following, among others:
• Decision trees
• Naïve Bayes classifiers
• Support vector machines
• K-nearest neighbor
• Logistic regression
• Neural network
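
To make one of these methods concrete, here is a minimal k-nearest-neighbor sketch. It is written in Python for illustration only (the course itself uses R); the training points, the labels and the `knn_predict` helper are hypothetical.

```python
from collections import Counter
import math

# Hypothetical labelled training data: (height_cm, weight_kg) -> category.
training = [
    ((150, 45), "small"), ((155, 50), "small"), ((160, 52), "small"),
    ((175, 75), "large"), ((180, 82), "large"), ((185, 90), "large"),
]


def knn_predict(point, labelled, k=3):
    """Return the majority label among the k nearest training points."""
    by_distance = sorted(labelled, key=lambda item: math.dist(point, item[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


print(knn_predict((158, 49), training))
print(knn_predict((178, 80), training))
```

A new observation is assigned the category that is most common among its k closest labelled neighbors, which is why labelled data is needed.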
            Review Questions
1. What is data science?
2. How is data science different from data analysis?
3. Distinguish between supervised and unsupervised
   learning algorithms
4. What do you understand by:
   1. Data
   2. Machine learning
   3. Big data
   4. Data mining?
                Lecture II
            Scales of Measurement
                        What is Measurement?
 • The assignment of numerals to objects or events according to rules.
     – Numerals are labels that have no inherent meaning, for example zip
       codes or automobile license plates.
     – Numbers are numerals that have quantitative meaning and can be
       analyzed, for example, age.
 ■ The rules for assigning labels to properties of variables are the most
  important components of measurement, because the result of poor rules is
  meaningless outcomes.
 ■ Concepts often cannot be measured directly, e.g., “intelligence,” so what is
  usually measured are indicators of constructs, such as speed, logic, verbal
  skill, etc.
                      Levels of Measurement
■ Four levels of measurement have been identified:
               Nominal
               Ordinal
               Interval
               Ratio
These levels differ in how closely they approach the structure of the number
system we use.
• Understanding the level of measurement of variables used in research is
  important because the level of measurement determines the types of statistical
  analyses that can be conducted.
• The conclusions that can be drawn from research depend on the statistical
  analysis used.
                                Possible data types and levels of measure.
***The type of data you have dictates the type of analysis you will perform.
                           Nominal Scale
■ In nominal measurement, all observations in one category are alike on some
property and differ from the members in the other category on that property (e.g.,
sex, marital status).
■ No ordering of categories exists. We cannot say one category is better or worse, or
more or less than another.
              Nominal — Numbers used as Names
              ■ Basic Empirical Operations
                      • Determination of equality
              ■ Permissible Statistics
                  • Number of cases
                  • Mode
                  • Contingency correlation
              ■ Examples
                  • Numbers on football jerseys
                  • Assignment of type or model numbers to classes
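
The permissible statistics listed above (number of cases and mode) can be sketched as follows. The example is in Python for illustration only (the course itself uses R), and the `blood_groups` values are made up.

```python
from collections import Counter

# Hypothetical nominal data: blood groups of ten patients (made-up values).
blood_groups = ["O", "A", "O", "B", "AB", "O", "A", "O", "B", "A"]

counts = Counter(blood_groups)            # number of cases per category
mode_group = counts.most_common(1)[0][0]  # most frequent category

print(dict(counts))
print("Mode:", mode_group)
```

A mean or median of these labels would be meaningless, because the categories have no order and no numeric value.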
Ordinal Scales: This scale has the characteristic of the nominal scale in that
different numbers mean different things, but also has the characteristic of "greater
or lesser". It measures a variable in terms of magnitude, or rank.
                      Examples:
                      • socioeconomic class
                      • grades
                      • preferences
• Ordinal scales tell us relative order, but give us no information regarding
  differences between the categories.
• For example, runners in the 100 meter dash finish 1st, 2nd, 3rd etc. Is the
  number of seconds between 1st and 2nd place the same as that between 2nd
  and 3rd place? Not necessarily.
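
The runners example can be sketched in a few lines. This is an illustrative Python sketch (the course itself uses R); the names and finishing times are made up.

```python
# Hypothetical finishing times in seconds for a 100 m race (made-up values).
times = {"Ada": 10.1, "Ben": 10.2, "Cho": 10.9}

# Rank the runners: 1 = fastest. The ranks are ordinal data.
ordered = sorted(times, key=times.get)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

# Gaps between consecutive finishers, which the ranks alone do not reveal.
gaps = [round(times[b] - times[a], 1) for a, b in zip(ordered, ordered[1:])]

print(ranks)
print(gaps)
```

The ranks are evenly spaced (1, 2, 3) even though the underlying time gaps are not, which is exactly the information an ordinal scale discards.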
                       … Ordinal
        To convert these features into numerical values, integer encoding
        (label encoding) can be used. Here, each value is assigned an integer
        label such as:
        low=0, average=1, medium=2, high=3, very high=4.
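
A minimal sketch of this encoding, using the mapping given above. It is written in Python for illustration only (the course itself uses R), and the `responses` list is made up.

```python
# Mapping taken from the text above: category -> integer label.
levels = {"low": 0, "average": 1, "medium": 2, "high": 3, "very high": 4}

# Hypothetical ordinal observations (made-up values).
responses = ["high", "low", "medium", "very high", "average", "low"]

# Replace each category by its integer label, preserving the order of levels.
encoded = [levels[r] for r in responses]
print(encoded)
```

The integers preserve the order of the categories, but the differences between them should not be read as equal distances.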
                                       … Ordinal Scale: Rank order data
• Most questionnaires use Likert-type items. For example, we may ask teachers
  about their job satisfaction.
• Asking whether a teacher is very satisfied, satisfied, neutral, dissatisfied, or very
  dissatisfied is using an ordinal scale of measurement.
                               Interval Scales
     • This scale has the properties of the nominal and ordinal scales, but here
       the intervals between consecutive values are equal.
       Temperature is the example that is usually given to illustrate an
       interval scale.
     • The distance between attributes has meaning; for example, for
       temperature (in Fahrenheit) the distance from 30 to 40 is the same as
       the distance from 70 to 80.
     * Interval scales do not have a true zero. 0 degrees does not mean the
       absence of heat (although it might feel like it).
If a change from 1 to 2 has the same strength as a change from 4 to 5, then we
would call it an interval-level measurement (if not, then it is just an ordinal
qualitative measurement).
                          Ratio Scales
• Ratio scales have all of the characteristics of the nominal, ordinal and
  interval scales. In addition, however, ratio scales have a true zero.
• There are true ratios. One can use all mathematical operations on this scale.
• Examples: weight, height, time, distance
* 10 miles is twice as long as 5 miles. 0 miles is no distance.
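
The difference a true zero makes can be checked numerically: a ratio of distances survives a change of unit, while a ratio of temperatures on an interval scale does not. The sketch below is in Python for illustration only (the course itself uses R); the helper function names are hypothetical.

```python
def miles_to_km(miles):
    """Convert miles to kilometres (1 mile = 1.609344 km)."""
    return miles * 1.609344


def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9


# Ratio scale: 10 miles vs 5 miles stays "twice as long" in kilometres.
distance_ratio_miles = 10 / 5
distance_ratio_km = miles_to_km(10) / miles_to_km(5)

# Interval scale: 80 F vs 40 F looks like "twice", but not in Celsius.
temp_ratio_f = 80 / 40
temp_ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)

print(distance_ratio_miles, distance_ratio_km)
print(temp_ratio_f, round(temp_ratio_c, 2))
```

Because the zero of the Fahrenheit and Celsius scales is arbitrary, statements such as "twice as hot" depend on the unit chosen and are therefore not meaningful.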
• In our descriptions of data in this course, we will assume that we are using
  ratio scales most of the time. The methods we apply to such data are called
  PARAMETRIC STATISTICS.
• However, there will be times when all we have to work with are ordinal
  scales. When we use these scales, our data will be rank ordered. The methods
  we apply to such data are called NONPARAMETRIC STATISTICS.
                                  Types of Variables
• A variable is a characteristic that changes or varies over time and/or for
  different individuals or objects under consideration.
• Variables are the quantities measured in a sample. They may be classified as:
    Qualitative (Categorical)
       – Nominal, e.g. gender, blood group, hair color
       – Ordinal (ranked), e.g. mild, moderate or severe weather
    Quantitative (Numerical)
       – Discrete, e.g. no. of level II students in 2022
       – Continuous, e.g. pH of a sample, elevation, age, income category
                              Variables
• Variables can be further classified as:
   – Dependent/Response. Variable of primary interest (e.g. blood pressure in
     an antihypertensive drug trial). Not controlled by the experimenter.
    – Independent/Predictor
       • called a Factor when controlled by experimenter. It is often nominal
         (e.g. treatment)
       • Covariate when not controlled.
• If the value of a variable cannot be predicted in advance, then the variable is
  referred to as a random variable.
              Revision questions
1. Identify the type of data (nominal, ordinal, interval or ratio)
represented by each of the following. Confirm your answers by giving
your own examples.
     – Blood group
     – Temperature (Celsius)
     – Ethnic group
     – Job satisfaction index (1-5)
     – Time taken to go to school from your house
     – Number of industrial accidents in a given factory
     – Number of cases of each reportable disease reported by a health
       worker
             Next class
             Unit III: Introduction to R
       NB:
       1. Everyone MUST come with a PC
       2. Make sure you load your phones with enough data for downloads