Lecture 2
Data types, and data in R
Agenda
• Types of data
• Working with data in RStudio
• Intro to data exploration
Part 1
Types of data
Types of Data for Analytics
Data
• Categorical (qualitative)
> Nominal: named categories
> Ordinal: categories with an implied order
• Numerical (quantitative)
> Discrete: only particular numbers
> Continuous: any numeric value
Qualitative
Qualitative data (also called categorical or attribute data)
Can be separated into different categories that are
distinguished by some nonnumeric characteristic.
Example: genders (male/female) of professional
athletes, states of a country, etc.
Quantitative
Quantitative data
Numbers representing counts or
measurements
Example: profitability of a company,
weather measurements such as temperature, time, etc.
Exercise 1
Qualitative or Quantitative?
• Colors of automobiles in a dealer’s showroom.
• Number of seats in movie theaters.
• Classification of patients based on nursing care
needed (complete, partial, or self care)
• Lengths of newborn cats of a certain species.
• Number of complaint letters received by an airline
per month.
Quantitative data
Working with Quantitative data
Quantitative data can further be distinguished between
discrete and continuous types.
Discrete data
Discrete
Data result when the number of possible values is
either finite or ‘countable’ (0, 1, 2, 3, . . .).
Example: the number of students in the class, the number
of outcomes of rolling 2 dice
Continuous data
Continuous
Numerical data that result from infinitely many possible
values on a continuous scale
that covers a range of values without gaps.
Example: height, weight, time, etc.
Exercise 2
Discrete or continuous?
• Number of cartons of milk manufactured each
day.
• Temperatures of airplane interiors at a given
airport.
• Incomes of college students on work study
programs.
• Number of cars parked in a parking lot.
• Weights of newborn calves.
• Number of tomatoes on each plant in a field.
Qualitative data
Working with Qualitative data
Qualitative data can be distinguished
between nominal and ordinal types.
Nominal data
Characterized by data that consist of names, labels, or
categories only. The data cannot be arranged in an
ordering scheme (such as low to high); each label/category
is simply different from the others.
Ex: Country/State/City, Male/Female, Yes/No, etc.
Can you convert Quantitative data to Qualitative data?
Ordinal data
Involves data that may be arranged in some order, but
differences between data values either cannot be
determined or are meaningless.
Ex: Course grades, medals (Gold/Silver/Bronze)
Exercise 3
Nominal or Ordinal?
• Horsepower of motorcycle engines.
• Ratings of newscasts in Houston (poor, fair, good,
excellent)
• Temperature of automatic popcorn poppers
• Time required for drivers to complete a course
• Marital status of respondents to a survey of
savings accounts.
• Organizational hierarchy – Analyst, Manager,
Director, CEO
Part 2
Working with data in RStudio
Hello World
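A minimal sketch of a first R command, typed into the RStudio console or run from a script. The variable name `greeting` is only illustrative.

```r
# Store a string in a variable and print it to the console
greeting <- "Hello World"
print(greeting)
```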
R Data types
• Character
> Strings
> Ex: “Survived” or “3.14”
• Numeric
> Integer or double (R has no separate float type)
> Ex: 3.14 (double), 3L (integer)
• Complex
> Ex: 3+14i
• Factor
> Factor is a class for categorical variables
> Factors have different levels (categories)
> Ex: Survived has two levels – “Survived” and “Not Survived”
> Factors can have numeric levels too – Ex: Survived – “0” for Not
Survived and “1” for Survived
• Logical
> TRUE/FALSE
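One way to check these types is the base R function `class()`. The sketch below uses illustrative variable names and values that are not from the lecture.

```r
# class() reports the type of an object
chr_type  <- class("Survived")   # "character"
chr_type2 <- class("3.14")       # "character": a quoted number is still a string
dbl_type  <- class(3.14)         # "numeric" (stored as a double)
int_type  <- class(3L)           # "integer": the L suffix marks an integer
cpx_type  <- class(3+14i)        # "complex"
lgl_type  <- class(TRUE)         # "logical"

print(c(chr_type, chr_type2, dbl_type, int_type, cpx_type, lgl_type))
```

Note that the logical constants are written in capitals (`TRUE`, `FALSE`); `True` is not recognized by R.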
Data Structures
• Vector
• List
• Factor
• Matrix
• Data frame
R is Vectorized
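"Vectorized" means that most operations apply to every element of a vector at once, without an explicit loop. A small sketch with made-up values:

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

# Arithmetic works element by element
doubled <- x * 2      # 2 4 6 8
added   <- x + y      # 11 22 33 44

# Comparisons are vectorized too and return a logical vector
above_two <- x > 2    # FALSE FALSE TRUE TRUE

# Summary functions take the whole vector
total <- sum(x)       # 10

print(doubled)
print(added)
print(above_two)
print(total)
```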
Vector
• The most basic R object is a vector
• A vector can only contain objects of the
same data type
• Empty vectors can be created with the
vector() function
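As a rough illustration, `vector()` takes a mode (the data type) and a length; the variable names below are hypothetical.

```r
# vector(mode, length) creates a vector of one type, filled with default values
empty_num <- vector("numeric", length = 3)     # 0 0 0
empty_chr <- vector("character", length = 2)   # "" ""
empty_lgl <- vector("logical", length = 0)     # logical(0), a truly empty vector

print(empty_num)
print(length(empty_lgl))   # 0
print(class(empty_chr))    # "character"
```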
Vectors
The c() function can be used to create
vectors of objects
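A short sketch of `c()` in use, with made-up values and names:

```r
heights <- c(1.62, 1.75, 1.80)        # numeric vector
cities  <- c("Austin", "Houston")     # character vector
flags   <- c(TRUE, FALSE, TRUE)       # logical vector

# Elements can be named, then accessed by position or by name
scores <- c(math = 90, art = 85)
print(scores[1])        # the element named math
print(scores["art"])    # the element named art

# c() can also append to an existing vector
heights <- c(heights, 1.68)
print(length(heights))  # 4
```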
Exercise - 4
Spend 5 minutes to create vectors with the following
information:
• Bob’s age – 14
• Smith’s age – 24
• Matt’s age – 17
• Liam’s age – 19
List
• List is a special type of vector
> Can contain elements of different classes (either basic class or compound
class)
> Each element of list can have a name
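A minimal sketch of a list holding elements of different classes; the element names and values are illustrative.

```r
# A list can mix a string, a number, a vector, and a logical value
person <- list(name = "Bob", age = 14, scores = c(90, 85), enrolled = TRUE)

print(person$name)          # access by name with $
print(person[["scores"]])   # access by name with [[ ]]
print(names(person))        # "name" "age" "scores" "enrolled"

# str() gives a compact overview of each element and its type
str(person)
```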
Factor
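A sketch of two common ways to build a factor with base R's `factor()`: one with numeric codes and labels, and one ordered factor for ordinal data. The data values are invented for illustration.

```r
# Numeric codes (0/1) mapped to descriptive labels
survived <- factor(c(0, 1, 1, 0, 1),
                   levels = c(0, 1),
                   labels = c("Not Survived", "Survived"))
print(levels(survived))   # "Not Survived" "Survived"
print(table(survived))    # counts per level

# An ordered factor records the ranking of ordinal categories
medals <- factor(c("Gold", "Bronze", "Silver", "Gold"),
                 levels = c("Bronze", "Silver", "Gold"),
                 ordered = TRUE)
print(medals[2] < medals[3])   # TRUE: Bronze ranks below Silver
```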
Matrix
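A matrix is a two-dimensional structure whose elements all share one type. A minimal sketch using base R's `matrix()`:

```r
# By default a matrix is filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)
print(dim(m))     # 2 3
print(m[2, 3])    # row 2, column 3 -> 6

# byrow = TRUE fills row by row instead
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
print(by_row[1, ])   # 1 2 3

# t() transposes rows and columns
print(dim(t(m)))     # 3 2
```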
Data frames
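A data frame is a table in which each column is a vector and different columns may have different types. The sketch below uses an invented `students` table.

```r
students <- data.frame(
  name     = c("Ann", "Ben", "Cara"),
  age      = c(21, 19, 23),
  enrolled = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep names as plain character strings
)

str(students)              # column names and types
print(nrow(students))      # 3 rows
print(ncol(students))      # 3 columns
print(students$age)        # one column as a vector

# Keep only the rows that satisfy a condition
older <- students[students$age > 20, ]
print(older)
```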
Missing values
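R represents a missing value as `NA`. A small sketch of detecting and handling missing values, with made-up numbers:

```r
x <- c(4, NA, 7, NA, 10)

missing_flags <- is.na(x)        # FALSE TRUE FALSE TRUE FALSE
n_missing     <- sum(missing_flags)

# Most summaries return NA unless missing values are removed explicitly
mean_with_na    <- mean(x)                 # NA
mean_without_na <- mean(x, na.rm = TRUE)   # mean of 4, 7, 10

print(missing_flags)
print(n_missing)
print(mean_with_na)
print(mean_without_na)
```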
Coercion
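Because a vector holds a single type, R silently converts mixed inputs to the most general type (implicit coercion). A sketch with illustrative values:

```r
# Number + string -> everything becomes character
mixed_chr <- c(1.5, "a")
print(class(mixed_chr))    # "character"

# Number + logical -> TRUE becomes 1, FALSE becomes 0
mixed_num <- c(1.5, TRUE)
print(mixed_num)           # 1.5 1.0

# Logical values are also coerced to numbers in arithmetic
n_true <- sum(c(TRUE, FALSE, TRUE))
print(n_true)              # 2
```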
Coercion – explicit coercion
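Explicit coercion uses the `as.*()` family of base R functions. The values below are illustrative; `suppressWarnings()` is used only to keep the output quiet where R would otherwise warn that NAs were introduced by coercion.

```r
num_from_chr <- as.numeric("3.14")     # 3.14
int_from_dbl <- as.integer(3.9)        # 3 (the fractional part is dropped)
chr_from_num <- as.character(42)       # "42"

# Text that is not a number cannot be converted and becomes NA
bad <- suppressWarnings(as.numeric("abc"))

# Converting a factor directly returns its internal codes, not its labels
f       <- factor(c("10", "20"))
codes   <- as.numeric(f)                  # 1 2
values  <- as.numeric(as.character(f))    # 10 20

print(num_from_chr)
print(int_from_dbl)
print(chr_from_num)
print(bad)
print(codes)
print(values)
```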
Reading/Writing Data
• Many file formats can be imported into R.
• In this course we will deal only with csv and Excel files,
and primarily with csv.
• To read data, first set the working directory (with setwd())
to the folder where the data sits.
• For csv
> {variable name} <- read.csv("filename.csv")
• For Excel files
> install.packages("xlsx")
> library(xlsx)
> {variable name} <- read.xlsx("filename.xls", sheetIndex = 1)
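A minimal round trip with base R's `write.csv()` and `read.csv()`. The `sales` table and the file name are invented, and the file is written to a temporary folder so the sketch runs without setting a working directory.

```r
# A small example table (hypothetical data)
sales <- data.frame(region = c("North", "South"), units = c(120, 95))

# Write it to a csv file in a temporary folder
csv_path <- file.path(tempdir(), "sales.csv")
write.csv(sales, csv_path, row.names = FALSE)   # row.names = FALSE avoids an extra index column

# Read it back into a new data frame
sales_back <- read.csv(csv_path)
str(sales_back)
```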
Data wrangling with dplyr
• The dplyr package contains five
key data manipulation functions, also
called verbs:
> select() - returns a subset of the columns,
> filter() - returns a subset of the rows,
> arrange() - reorders the rows according to one or more
variables,
> mutate() - adds new columns computed from existing data,
> summarise() - reduces each group to a single row by calculating
aggregate measures.
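The five verbs can be sketched on a small invented `staff` table. This assumes the dplyr package is already installed (for example via `install.packages("dplyr")`); `group_by()` is an additional dplyr function used here to define the groups that `summarise()` reduces.

```r
library(dplyr)

# Hypothetical example data
staff <- data.frame(
  name   = c("Ann", "Ben", "Cara", "Dev"),
  dept   = c("Sales", "Sales", "IT", "IT"),
  salary = c(50, 60, 70, 80)
)

cols_only  <- select(staff, name, salary)        # keep two columns
high_paid  <- filter(staff, salary > 55)         # keep rows matching a condition
sorted     <- arrange(staff, desc(salary))       # sort rows, highest salary first
with_bonus <- mutate(staff, bonus = salary * 0.1)  # add a computed column

# One row per department with its average salary
dept_summary <- staff %>%
  group_by(dept) %>%
  summarise(avg_salary = mean(salary))

print(dept_summary)
```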
R for Business analytics
• Advantages
> Designed for Statistical Analysis
– Many built-in functions
> Large number of libraries
> Mature open source project
• Disadvantages
> Overhead (Does not scale well to very
large data)
• Use R as a “sandbox” to play with a
sample
Part 3
Data exploration (Intro)
Data exploration?
• Data exploration involves activities that increase
understanding of the data
• No quality data, no quality predictive results!
> Quality decisions must be based on quality data
– e.g., duplicate or missing data may cause incorrect or even
misleading statistics (garbage in, garbage out, or GIGO)
> A data warehouse needs consistent integration of quality
data
Data reduction
[Figure: sample dataset (csv or Excel file), with the variables (headers in the Excel file) and the samples labeled]
The fundamental data problem
[Figure: programs, databases, temporary databases, and interfaces passing data between one another, annotated with four problems: incomplete data, inaccurate data, inconsistent data, and unobtainable data]
Data exploration
• Data in the real world might have issues such as:
> Missing or incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
– e.g., occupation=“ ”
> Noisy: containing errors or outliers
– e.g., Salary=“-10”
> Inconsistent: containing discrepancies in codes or names
– e.g., Age=“42” Birthday=“03/07/1997”
– e.g., rating was “1, 2, 3”, now rating is “A, B, C”
– e.g., duplicate records
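These issues can be checked with a few base R calls. The sketch below uses an invented `people` table containing a blank occupation, a missing occupation, a negative salary, and a duplicated row.

```r
people <- data.frame(
  id         = c(1, 2, 3, 3, 4),
  occupation = c("nurse", "", "clerk", "clerk", NA),
  salary     = c(52000, 48000, -10, -10, 61000),
  stringsAsFactors = FALSE
)

# Missing: either NA or an empty string
missing_occ <- is.na(people$occupation) | people$occupation == ""
n_missing   <- sum(missing_occ)

# Noisy: a salary cannot be negative
n_negative <- sum(people$salary < 0)

# Duplicates: rows identical to an earlier row
n_duplicates <- sum(duplicated(people))
deduplicated <- people[!duplicated(people), ]

print(n_missing)
print(n_negative)
print(n_duplicates)
print(nrow(deduplicated))
```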
Common exploration tools
• Drawing plots
• Using visualization tools (e.g., Tableau, Cognos)
• Programming in RStudio / Python
• Rattle package in RStudio
Thank You for your attention