ML Module 1 Notes
Module-1
Chapter 1:
• Introduction: Need for Machine Learning,
• Machine Learning Explained,
• Machine Learning in Relation to other Fields, Types of Machine Learning,
• Challenges of Machine Learning,
• Machine Learning Process,
• Machine Learning Applications.
Chapter 2:
• Understanding Data – 1: Introduction,
• Big Data Analysis Framework,
• Descriptive Statistics,
• Univariate Data Analysis and Visualization.
Chapter-1, 2 (2.1-2.5)
Introduction to Machine Learning
“Computers are able to see, hear and learn”.
1.1 NEED FOR MACHINE LEARNING
Business organizations use huge amounts of data for their daily activities.
Earlier, the full potential of this data was not utilized, for two reasons.
One reason was that the data was scattered across different archive systems and organizations were not able to integrate these sources fully.
The second was the lack of awareness about software tools that could help discover the useful information in the data.
Not anymore! Business organizations have now started to use the latest technology, machine learning, for this purpose.
Machine learning has become so popular because of three reasons:
1. High volume of available data to manage:
Big companies such as Facebook, Twitter, and YouTube generate huge amounts of data that grow at a phenomenal rate. It is estimated that the data approximately doubles every year.
2. Reduced cost of storage:
The cost of storage has reduced, and hardware cost has also dropped. Therefore, it is easier now to capture, process, store, distribute, and transmit digital information.
3. Availability of complex algorithms:
Many complex algorithms are now available. Especially with the advent of deep learning, many algorithms are available for machine learning.
Before starting the machine learning journey, let us establish these terms - data, information,
knowledge, intelligence, and wisdom. A knowledge pyramid is shown in Figure 1.1.
Data:
All facts are data. Data can be numbers or text that can be processed by a computer. Today,
organizations are accumulating vast and growing amounts of data with data sources such as flat
files, databases, or data warehouses in different storage formats.
Information:
Processed data is called information. This includes patterns, associations, or relationships
among data.
For example, sales data can be analyzed to extract information such as which product is the fastest selling.
Knowledge:
Condensed information is called knowledge.
For example, the historical patterns and future trends obtained in the above sales data can be called
knowledge.
• Unless knowledge is extracted, data is of no use.
• Similarly, knowledge is not useful unless it is put into action.
Intelligence:
Intelligence is the applied knowledge for actions. An
actionable form of knowledge is called intelligence.
Wisdom:
The ultimate objective of knowledge pyramid is wisdom
that represents the maturity of mind that is, so far,
exhibited only by humans.
Here comes the need for machine learning.
The objective of machine learning is to process this archival data for organizations:
• To take better decisions to design new products,
• To improve the business processes,
• and to develop effective support systems.
1.2 MACHINE LEARNING EXPLAINED
Machine learning is an important sub-branch of Artificial Intelligence (AI).
“Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.”
In conventional programming,
• after understanding the problem,
• a detailed design of the program such as a flowchart or an algorithm
needs to be created
• and converted into programs using a suitable programming language.
This idea of developing intelligent systems by using logic and reasoning by
converting an expert’s knowledge into a set of rules and programs is called
an expert system.
Example:
An expert system like MYCIN was designed for medical diagnosis after
converting the expert knowledge of many doctors into a system.
Disadvantages:
• This approach did not progress much as programs lacked real
intelligence.
• The above approach was impractical in many domains as programs still
depended on human expertise and hence did not truly exhibit
intelligence.
Then, machine learning came in the form of data-driven systems.
• In the data-driven approach, data is used as input to develop intelligent models.
• The models can then be used to make predictions on new inputs.
• Thus, the aim of machine learning is to learn a model or set of rules from the given dataset automatically, so that it can predict unknown data correctly.
Figure 1.2: (a) A Learning System for Humans (b) A Learning System for Machine
Learning
Area in sq ft (x)    Price (y)
500                  5 lacs
700                  7 lacs
1000                 10 lacs
In statistical learning(the process of finding patterns in data to understand how things relate to
each other), the relationship between the input x and output y is modeled as a
function in the form y = f(x).
Where:
f is the learning function that maps the input x to output y. In machine learning,
this is simply called mapping of input to output.
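To make this concrete, here is a minimal Python sketch (not from the textbook) that learns the mapping y = f(x) from the area–price table above with a least-squares straight-line fit; the 800 sq ft query is an illustrative assumption:

import numpy as np

# Area in sq ft (input x) and price in lacs (output y) from the table above
x = np.array([500, 700, 1000])
y = np.array([5, 7, 10])

# Learn a linear mapping y = f(x) = m*x + c from the data
m, c = np.polyfit(x, y, deg=1)
print(f"learned mapping: price = {m:.4f} * area + {c:.4f}")

# Use the learned function to predict the price of an unseen 800 sq ft house
print(f"predicted price for 800 sq ft: {m * 800 + c:.2f} lacs")

For this table the learned line is price = 0.01 × area, so the unseen 800 sq ft input is predicted at 8 lacs.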
The learning program summarizes the raw data in a model.
A model is an explicit description of patterns within the data in the form
of:
1. Mathematical equation
2. Relational diagrams like trees/graphs
3. Logical if/else rules, or
4. Groupings called clusters
In systems, experience is gathered by the following steps:
1. Collection of data
2. Once data is gathered, abstract concepts are formed out of that data. Abstraction is used to generate concepts. This is equivalent to humans' idea of objects; for example, we have some idea of what an elephant looks like.
3. Generalization converts the abstraction into an actionable
form of intelligence.
4. The course correction is done by taking evaluation
measures. Evaluation checks the thoroughness of the models.
1.3 MACHINE LEARNING IN RELATION TO OTHER FIELDS
Machine learning uses the concepts of Artificial Intelligence, Data Science, and Statistics
primarily. It is the resultant of combined ideas of diverse fields.
1.3.1 Machine Learning and Artificial Intelligence
1.3.2 Machine Learning, Data Science, Data Mining, and Data Analytics
1.3.3 Machine Learning and Statistics
• Machine learning is an important branch of AI, which is a much broader
subject. The aim of AI is to develop intelligent agents.
• An agent can be a robot or any autonomous system.
• Machine learning is the subbranch of AI, whose aim is to extract the
patterns for prediction.
• Deep learning is a subbranch of machine learning. In deep learning, the
models are constructed using neural network technology.
• Neural networks are based on the human neuron models. Many neurons
form a network connected with the activation functions that trigger further
neurons to perform tasks.
1.3.2 Machine Learning, Data Science, Data Mining, and Data Analytics
Machine learning starts with data. Therefore, data science and machine learning are
interlinked.
Machine learning is a branch of data science. Data science deals with gathering of data for
analysis. It is a broad field that includes:
1. Big data
2. Data mining
3. Data analytics
4. Pattern Recognition
Big Data
• Data science is concerned with the collection of data.
• Big data is a field of data science that deals with data’s following
characteristics:
1. Volume: Huge amount of data is generated by big companies like
Facebook, Twitter, YouTube.
2. Variety: Data is available in variety of forms like images, videos,
and in different formats.
3. Velocity: It refers to the speed at which the data is generated and
processed.
Data Mining
• Data mining’s original genesis (birth) is in business.
• Just as mining the earth yields precious resources, it is often believed that digging into data unearths hidden information.
• Nowadays, many consider data mining and machine learning to be the same.
• There is little difference between these fields, except that data mining aims to extract the hidden patterns that are present in the data, whereas machine learning aims to use those patterns for prediction.
Data Analytics
• Another branch of data science is data analytics.
• It aims to extract useful knowledge from crude data.
• There are different types of analytics.
Among them Predictive data analytics is used for making predictions. Machine
learning is closely related to this branch of analytics and shares almost all
algorithms.
Pattern Recognition
• It is an engineering field.
• Pattern recognition is a data analysis method that uses machine learning
algorithms to automatically recognize patterns and regularities in data.
• It uses machine learning algorithms to extract the features for pattern analysis and
pattern classification.
• One can view pattern recognition as a specific application of machine learning.
1.3.3 Machine Learning and Statistics
• Statistics is mathematics intensive, and its models are often complicated equations that involve many assumptions.
• Statistical methods are developed in relation to the data being
analysed.
• It has strong theoretical foundations and interpretations that
require a strong statistical knowledge.
• Machine learning, comparatively, makes fewer assumptions and requires less statistical knowledge.
• But, it often requires interaction with various tools to automate
the process of learning.
1.4 TYPES OF MACHINE LEARNING
There are four types of machine learning as shown in Figure 1.5.
Before discussing the types of learning, it is necessary to discuss about data.
Labelled and Unlabelled Data
• Data is a raw fact. Normally, data is represented in the form of a table.
• Data also can be referred to as a data point, sample, or an example.
• Each row of the table represents a data point. Features are attributes or characteristics of
an object.
• Normally, the columns of the table are attributes.
• Out of all attributes, one attribute is important and is called a label.
• Label is the feature that we aim to predict.
• Thus, there are two types of data – labelled and unlabelled.
Labelled Data
To illustrate labelled data, let us take an example dataset called the Iris flower dataset or Fisher's Iris dataset. The dataset has 150 samples of Iris flowers – 50 from each of three classes – with four attributes: the length and width of the sepals and petals. The target variable is called class. There are three classes – Iris setosa, Iris virginica, and Iris versicolor.
The partial data of the Iris dataset is shown in Table 1.1.
Note: In unlabelled data, there are no labels in the dataset.
In supervised learning, the output (like cat/dog) is already known; the task is to learn the mapping from input to output.
1.4.1 Supervised Learning
Supervised algorithms use labelled dataset. As the name suggests, there is a supervisor or
teacher component in supervised learning.
A supervisor provides labelled data with which the model is constructed and then tested.
1. During the first stage, the teacher communicates the information to the student that the student
is supposed to master.
The student receives the information and understands it.
During this stage, the teacher has no knowledge of whether the information is grasped by the
student.
2. This leads to the second stage of learning. The teacher then asks the student a set of questions to
find out how much information has been grasped by the student.
Based on these questions, the student is tested, and the teacher informs the student about his
assessment.
This kind of learning is typically called supervised learning.
1. Classification
• Classification is a supervised learning method. The input attributes of the
classification algorithms are called independent variables.
• The target attribute is called label or dependent variable. The relationship between
the input and target variable is represented in the form of a structure which is called
a classification model.
• So, the focus of classification is to predict the ‘label’ that is in a discrete form (a value
from the set of finite values).
• An example is shown in Figure 1.7 where a classification algorithm takes a set of
labelled data images such as dogs and cats to construct a model that can later be used
to classify an unknown test image data.
In classification, learning takes place in two stages.
1. During the first stage, called the training stage, the learning algorithm takes a labelled dataset and starts learning. Once the training samples are processed, the model is generated.
2. During the second stage, the constructed model is used to predict the labels of unseen test samples.
Some of the key algorithms of classification are:
• Decision Tree
• Random Forest
• Support Vector Machines
• Naïve Bayes
• Artificial Neural Network and Deep Learning networks
like CNN
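As a concrete illustration, here is a minimal sketch of the two-stage process using one of the algorithms above (a decision tree) on the Iris dataset; scikit-learn and the 70/30 split are illustrative assumptions, not prescribed by these notes:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labelled dataset: 150 Iris samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# Stage 1 (training): learn a model from the labelled data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Stage 2 (testing): predict the labels of unseen samples and evaluate
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))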
Regression Models
• The regression model takes input x and generates a model in the
form of a fitted line of the form y = f(x).
• In Figure 1.8, linear regression takes the training set and tries to
fit it with a
line – product sales = 0.66 × Week + 0.54.
• Here, 0.66 and 0.54 are regression coefficients that are learnt from the data.
1.4.2 Unsupervised Learning
• As the name suggests, there is no supervisor or teacher component.
• In the absence of a supervisor or teacher, self-instruction
is the most common kind of learning process.
• This process of self-instruction is based on the concept of
trial and error.
• Here, the program is supplied with objects, but no labels
are defined.
• The algorithm itself observes the examples and recognizes
patterns based on the principles of grouping.
1. Clustering
Some of the key clustering algorithms are:
• k-means algorithm
• Hierarchical algorithms
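A minimal k-means sketch (assuming scikit-learn; the toy points and k = 2 are illustrative assumptions):

import numpy as np
from sklearn.cluster import KMeans

# Unlabelled 2D points: no labels are supplied, only the objects themselves
X = np.array([[1, 1], [1.5, 2], [1, 0.5],   # one natural group
              [8, 8], [8.5, 9], [9, 8]])    # another natural group

# k-means groups the points into k clusters based on similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignments:", kmeans.labels_)
print("cluster centres:\n", kmeans.cluster_centers_)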
2. Dimensionality Reduction
• Dimensionality reduction algorithms are
examples of unsupervised algorithms.
• It takes higher-dimensional data as input and outputs the data in a lower dimension, retaining as much useful information as possible.
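A small sketch of one common dimensionality reduction algorithm, PCA (assuming scikit-learn; the random 4-dimensional data is only for illustration):

import numpy as np
from sklearn.decomposition import PCA

# Some 4-dimensional input data (think of the four Iris measurements)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Reduce 4 dimensions to 2, keeping the directions of largest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("reduced shape:", X_2d.shape)                       # (100, 2)
print("variance explained:", pca.explained_variance_ratio_)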
The differences between supervised and
unsupervised learning are listed in the following
Table 1.2.
Table 1.2: Differences between Supervised and Unsupervised
Learning
1.4.3 Semi-supervised Learning
• There are circumstances where the dataset has a huge collection of unlabelled data and only some labelled data.
• Semi-supervised learning uses both: the labelled data helps assign pseudo-labels to the unlabelled data, and the combined dataset is then used for learning.
1.4.4 Reinforcement Learning
• Reinforcement learning mimics human beings.
• Just as human beings use ears and eyes to perceive the world and take actions, reinforcement learning allows an agent to interact with the environment to get rewards.
• When the rewards are higher, the behavior gets reinforced and learning becomes possible.
In this grid game, the gray tile indicates the danger, black
is a block, and the tile with diagonal lines is the goal.
• To solve this sort of problem, there is no data.
• The agent interacts with the environment to get
experience.
• In the above case, the agent tries to create a model
by simulating many paths and finding rewarding
paths.
• This experience helps in constructing a model.
Summary:
• Compared to supervised learning, there is no
supervisor or labelled dataset.
• Many sequential decisions need to be taken to reach
the final decision.
• Therefore, reinforcement algorithms are reward-based, goal-oriented algorithms.
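The grid game can be sketched as a tiny tabular Q-learning program. This is a toy version: the 3x3 layout, reward values, and learning constants below are illustrative assumptions, not taken from the figure:

import random

# 3x3 grid: S = start, D = danger (reward -10), B = block (cannot be entered),
# G = goal (reward +10). Every other move costs -1 to encourage short paths.
GRID = ["S.D",
        ".B.",
        "..G"]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    # Moves off the grid or into the block leave the agent where it is
    if not (0 <= nr < 3 and 0 <= nc < 3) or GRID[nr][nc] == "B":
        nr, nc = r, c
    if GRID[nr][nc] == "G":
        return (nr, nc), 10, True
    if GRID[nr][nc] == "D":
        return (nr, nc), -10, True
    return (nr, nc), -1, False

# Tabular Q-learning: learn from trial-and-error interaction, not from a dataset
Q = {(r, c): {a: 0.0 for a in ACTIONS} for r in range(3) for c in range(3)}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(500):
    state, done = (0, 0), False
    while not done:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore
        if random.random() < epsilon:
            action = random.choice(list(ACTIONS))
        else:
            action = max(Q[state], key=Q[state].get)
        nxt, reward, done = step(state, action)
        # Reinforce: move Q towards reward + discounted best future value
        Q[state][action] += alpha * (reward + gamma * max(Q[nxt].values()) - Q[state][action])
        state = nxt

# After many simulated paths, the rewarding first move has the highest Q-value
print("best first move from start:", max(Q[(0, 0)], key=Q[(0, 0)].get))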
1.5 CHALLENGES OF MACHINE LEARNING
1. Problems that can be dealt with by machine learning:
Computers are better than humans in performing
tasks like computation.
For example, while calculating the square root of
large numbers, an average human may blink but
computers can display the result in seconds.
However, humans are better than computers in
many aspects like recognition.
Can a model for this test data be multiplication, that is, y = x1 × x2? Well, it is true!
But it is equally true that y may be y = x1 ÷ x2, or y = x1 ^ x2 (x1 raised to the power x2).
So, there are three functions that fit the data.
This means that the problem is ill-posed.
To solve this problem, one needs more examples to check the model.
Puzzles and games that do not have sufficient specification may become ill-posed problems, and scientific computation has many ill-posed problems.
2. Huge data:
This is a primary requirement of machine learning. Availability of quality data is a challenge. Quality data means the data should be large and should not have problems such as missing or incorrect values.
3. High computation power:
Machine learning tasks have become complex and hence time complexity has increased; this can be handled only with high computing power.
4. Complexity of the algorithms:
The selection of algorithms, describing the
algorithms, application of algorithms to solve
machine learning task, and comparison of
algorithms have become necessary for machine learning practitioners and data scientists now.
5. Bias/Variance:
• Bias is the error due to overly simple assumptions in the model, while variance is the error due to the model's sensitivity to the particular training data.
• This leads to a problem called the bias/variance tradeoff.
• A model that fits the training data correctly but fails for test data, and in general lacks generalization, is said to be overfitting.
• The reverse problem is called underfitting, where the model fails even on the training data and so cannot generalize well either.
• Overfitting and underfitting are great challenges for machine learning algorithms.
1.6 MACHINE LEARNING PROCESS
1. Understanding the business –
This step involves understanding the objectives and
requirements of the business organization.
2. Understanding the data –
This step involves collecting the initial data and becoming familiar with it.
3. Preparation of data –
This step involves producing the final dataset by cleaning the raw data, since missing or incorrect data leads to inaccurate results. This is a perennial problem (a problem that keeps recurring) for machine learning.
4. Modelling –
This step involves applying machine learning algorithms to the prepared data to construct the model.
5. Evaluate –
• This step involves the evaluation of the data mining
results using statistical analysis and visualization
methods.
• The performance of the classifier is determined by
evaluating the accuracy of the classifier.
• The process of classification is a fuzzy issue.
For example, classification of emails requires extensive domain knowledge and domain experts.
• Hence, the performance of the classifier is very crucial.
6. Deployment –
This step involves the deployment of results of
the data mining algorithm to improve the
existing process or for a new situation.
1.7 MACHINE LEARNING APPLICATIONS
1. Sentiment analysis –
This is an application of natural language processing (NLP) where the words of documents are mapped to sentiments such as happy, sad, and angry, which are captured effectively by emoticons.
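A toy lexicon-based sketch of the idea in Python (the word lists are invented for illustration; real sentiment analysis uses trained NLP models):

# Tiny hand-made lexicon mapping words to sentiments
LEXICON = {"love": "happy", "great": "happy", "enjoy": "happy",
           "hate": "angry", "terrible": "angry",
           "miss": "sad", "unfortunately": "sad"}

def sentiment(document: str) -> str:
    # Count how many words vote for each sentiment and pick the majority
    votes = {}
    for word in document.lower().split():
        label = LEXICON.get(word.strip(".,!?"))
        if label:
            votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get) if votes else "neutral"

print(sentiment("I love this product, it is great!"))  # happy
print(sentiment("I hate the terrible battery life"))   # angry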
Chapter 2
Understanding Data
2.1 What is data?
• All facts are data.
• Data can be directly human interpretable (such as
numbers or texts) or diffused data such as images or
video that can be interpreted only by a computer.
• Today, business organizations are accumulating vast and
growing amounts of data of the order of gigabytes, tera
bytes, exabytes.
• A kilobyte (KB) is 1024 bytes, one megabyte (MB) is approximately 1000 KB, one gigabyte (GB) is approximately 1,000,000 KB, 1000 gigabytes is one terabyte (TB), and 1,000,000 terabytes is one exabyte (EB).
Data is available in different data sources like flat
files, databases, or data warehouses.
It can either be an :
• operational data or
• non-operational data.
Elements of big data
Small data: Data whose volume is less and can
be stored and processed by a small-scale
computer is called ‘small data’.
Big data: Data whose volume is much larger than that of 'small data' and which is characterized as follows:
1. Volume:Small traditional data is measured in
terms of gigabytes (GB) and terabytes (TB), but
Big Data is measured in terms of petabytes (PB)
and exabytes (EB). One exabyte is 1 million
terabytes.
2. Velocity — The availability of IoT devices and Internet connectivity ensures that data arrives at a faster rate.
3. Variety — The variety of Big Data includes:
● Form — There are many forms of data. Data types range from text,
graph, audio, video, to maps.
There can be composite data too, where one media can have many other
sources of data.
Ex: a video can have an audio song.
● Function — These are data from various sources like human
conversations, transaction records, and old archive data.
● Source of data — There are many sources of data.
Broadly, the data source can be classified as open/public data, social
media data, and multimodal data.
4. Veracity of data —Deals with aspects like conformity to
the facts, truthfulness, believability, and confidence in
data.
There may be many sources of error such as technical
errors, typographical errors, and human errors.
5. Validity — Validity is the accuracy of the data for
taking decisions or for any other goals that are needed by the
given problem.
6. Value — Value is the characteristic of big data that
indicates the value of the information that is extracted from
the data and its influence on the decisions that are taken
based on it.
2.1.1 Types of Data
1. Structured
2. Unstructured
3. Semi structured
1. Structured Data
In structured data, data is stored in an organized manner such as a
database where it is available in the form of a table.
The data can also be retrieved in an organized manner using tools
like SQL.
• Record data
• Data matrix
• Graph data
• Ordered data: i) Temporal data, ii) Sequence data, iii) Spatial data
● Record Data—A dataset is a collection of measurements taken from a
process.
The measurements can be arranged in the form of a matrix.
Rows in the matrix represent objects and can be called entities, cases, or records.
The columns of the dataset are called attributes, features, or fields. The table
is filled with observed data.
1. Flat Files
❏ These are the simplest and most commonly available data sources.
❏ Flat files are files where data is stored in plain ASCII or EBCDIC (Extended Binary Coded Decimal Interchange Code) format.
❏ Minor changes of data in flat files affect the results of the data mining algorithms.
❏ Hence, flat files are suitable only for storing small datasets and are not desirable if the dataset becomes larger.
Some of the popular spreadsheet formats are listed below:
● CSV files – CSV stands for comma-separated value files where the
values are separated by commas. The first row may have attributes and
the rest of the rows represent the data.
● TSV files – TSV stands for Tab separated values files where values
are separated by Tab.
There are many tools like Google Sheets and Microsoft Excel to process these files.
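A minimal pandas sketch for reading such files (the file names students.csv and students.tsv are hypothetical):

import pandas as pd

# Read a comma-separated file; the first row holds the attribute names
df = pd.read_csv("students.csv")

# TSV files use the same reader with a tab separator
df_tsv = pd.read_csv("students.tsv", sep="\t")

print(df.head())  # first few data rows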
2. Database System
• It normally consists of database files and a database
management system (DBMS).
• Database files contain original data and metadata. A
relational database consists of sets of tables.
• The tables have rows and columns. The columns represent
the attributes and rows represent tuples.
• A tuple corresponds to either an object or a relationship
between objects.
• A user can access and manipulate the data in the database
using SQL.
● Temporal data — This data represents sequences of values or events obtained over a period (for example, hourly, weekly, or yearly) or over a repeated time span.
● Spatial data — Vector format can be used to store maps, as maps use basic geometric primitives like points, lines, polygons, and so forth.
3. World Wide Web (WWW)
It provides a diverse, worldwide online information source. The objective of data mining algorithms is to mine interesting patterns of information present on the WWW.
4. XML (eXtensible Markup Language)
It is a data format that is both human and machine interpretable and can be used to represent data that needs to be shared across platforms.
5. Data Stream
It is dynamic data, which flows in and out of the observing environment. Typical characteristics of a data stream are huge volume of data, dynamic nature, fixed-order movement, and real-time constraints.
● RSS (Really Simple Syndication) — a technology that allows you to subscribe to content from multiple sources and then have it all delivered as it is published.
● JSON (JavaScript Object Notation) — another useful data interchange format that is often used for many machine learning algorithms.
Some of the commonly encountered problems in data ('dirty data') are:
● Incomplete data
● Outlier data
● Data with inconsistent values
● Inaccurate data
● Data with missing values
● Duplicate data
Data transformation
• Data transformation routines perform operations like normalization to improve the performance of the data mining algorithms.
• It is necessary to transform data so that it can be processed. This can be considered as a
preliminary stage of data conditioning.
Normalization is one such technique. In normalization, the attribute values are scaled to fit in
a range (say 0–1) to improve the performance of the data mining algorithm. Often, in neural
networks, these techniques are used. Some of the normalization procedures used are:
1. Min-Max
2. z-Score
Min-Max Procedure
It is a normalization technique where each variable V is normalized by
its difference with the minimum value divided by the range to a new
range, say 0–1.
Often, neural networks require this kind of normalization. The formula to implement this normalization is given as:

v' = ((v − min) / (max − min)) × (new_max − new_min) + new_min

Here, max − min is the range; min and max are the minimum and maximum of the given data; new_max and new_min are the maximum and minimum of the target range, say 1 and 0.
Categorical or Qualitative Data:
The categorical data can be divided into two types. They are nominal type and ordinal
type.
● Nominal Data – In Table 2.2, patient ID is nominal data. Nominal data are
symbols and cannot be processed like a number.
For Ex: the average of a patient ID does not make any statistical sense.
Nominal data type provides only information but has no ordering among data.
Only operations like (=, ≠) are meaningful for these data.
For Ex: the patient ID can be checked for equality and nothing else.
● Ordinal Data – Ordinal data provides information about the order among the values, for example, values recorded as low, medium, or high.
Certainly, low is less than medium and medium is less than high, irrespective of the actual values.
Numeric or Quantitative Data
It can be divided into two categories. They are interval type and ratio type.
Interval Data
1. It is measured in the form of numbers. Ex: Measuring temperature using thermometers.
2. It has rank and order. Ex: While measuring temperature, 1 degree is always lower than 3 degrees.
3. It is equidistant, that is, it has equally spaced intervals. Ex: The difference between 1 degree Celsius and 2 degrees Celsius is the same as the difference between 4 degrees and 5 degrees.
4. It does not have a meaningful zero.
5. Interval data can be negative. Ex: −12 degrees Celsius.
None of these examples has a meaningful zero.
Ratio Data
1. It is measured in the form of numbers. Ex: Distance can be measured using a measuring device.
2. It has rank and order. Ex: While measuring distances, 2 km is always less than 5 km.
3. It is equidistant, that is, it has equally spaced intervals. Ex: The difference between 1 km and 2 km is the same as the difference between 4 km and 5 km.
4. It has a meaningful zero. Ex: We can say that you have travelled zero km today (or not travelled).
5. Ratio data can never be negative. Ex: There is nothing like −5 km in distance.
All of these examples have a meaningful zero.
Another way of classifying the data is to classify it as:
Discrete Data
Discrete data consists of distinct, separate values that can be counted.
It is typically represented as whole numbers and does not have decimal or fractional values.
Example: The number of students in a class (e.g., 25 students), the number of cars in a parking lot, or employee
ID numbers.
Continuous Data
Continuous data can take any value within a given range and includes decimal or fractional values.
It is measured rather than counted and can be infinitely divided into smaller parts.
Example: A person's height (e.g., 170.5 cm), weight (e.g., 65.2 kg), or temperature (e.g., 36.7°C).
Third way of classifying the data is based on the number of variables used in the dataset. Based on that, the
data can be classified as univariate data, bivariate data, and multivariate data. This is shown in Figure 2.2.
Univariate data:
The dataset has only one variable. A variable is also called a category.
Bivariate data:
Indicates that the number of variables used is two.
Multivariate data: Uses three or more variables.
Univariate data:
Univariate data consists of one variable. It is used to describe and analyze a single
characteristic or attribute of a dataset.
Visualization Methods:
● Histograms
● Box plots
● Bar charts
Bivariate Data
Bivariate data consists of two variables and examines the relationship between them. It helps
determine whether there is a correlation or association between the variables.
Visualization Methods:
● Scatter plots
● Line graphs
Multivariate Data
Multivariate data consists of three or more variables. It is used to
analyze the relationships among multiple factors simultaneously.
2.5 UNIVARIATE DATA ANALYSIS AND
VISUALIZATION
• Univariate analysis is the simplest form of statistical
analysis.
• As the name indicates, the dataset has only one variable.
• A variable can also be called a category.
• Univariate analysis does not deal with causes or relationships.
The aim of univariate analysis is to describe data and
find patterns.
• Univariate data description involves finding the
frequency distributions, central tendency measures,
dispersion or variation, and shape of the data.
2.5.1 Data Visualization
• To understand data, graph visualization is a must.
• Data visualization helps to understand data. It helps to present information and data
to customers.
• Some of the graphs that are used in univariate data analysis are bar charts, histograms,
frequency polygons and pie charts.
• Advantages of graphs include presentation of data, summarization of data, description of data, exploration of data, and comparison of data.
• Let us consider some forms of graphs now:
Bar chart
A Bar chart (or Bar graph) is used to display the frequency distribution for variables.
• Bar charts are used to illustrate discrete data. The charts can also help to explain the
counts of nominal data. It also helps in comparing the frequency of different groups.
• The bar chart for students' marks {45, 60, 60, 80, 85} with Student ID = {1, 2, 3, 4, 5}
is shown below in Figure 2.3.
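A minimal matplotlib sketch of this bar chart (matplotlib assumed to be installed):

import matplotlib.pyplot as plt

student_ids = [1, 2, 3, 4, 5]
marks = [45, 60, 60, 80, 85]

# One bar per student; the bar height shows that student's marks
plt.bar(student_ids, marks)
plt.xlabel("Student ID")
plt.ylabel("Marks")
plt.title("Bar chart of students' marks")
plt.show()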
Histogram
• It plays an important role in data mining for showing
frequency distributions.
• The histogram for students' marks {45, 60, 60, 80, 85} in
the group range of 0–25, 26–50, 51–75, 76–100 is given
below in Figure 2.5. One can visually inspect from
Figure 2.5 that the number of students in the range 76–
100 is 2.
• Histogram conveys useful information like nature of data
and its mode(value that appears most frequently).
• Mode indicates the peak of dataset.
• In other words, histograms can be used as charts to show frequency, the skewness (lack of symmetry) present in the data, and shape.
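A matching matplotlib sketch of the histogram, using the same marks and group ranges:

import matplotlib.pyplot as plt

marks = [45, 60, 60, 80, 85]

# Group the marks into the ranges 0-25, 26-50, 51-75, 76-100 and count each bin
plt.hist(marks, bins=[0, 25, 50, 75, 100], edgecolor="black")
plt.xlabel("Marks range")
plt.ylabel("Number of students")
plt.title("Histogram of students' marks")
plt.show()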
Dot Plots
These are similar to bar charts. They are less clustered as
compared to bar charts, as they illustrate the bars only with
single points.
The dot plot of English marks for five students with ID as {1,
2, 3, 4, 5} and marks {45, 60, 60, 80, 85} is given in Figure
2.6. The advantage is that by visual inspection one can find out
who got more marks.
2.5.2 Central Tendency
• One cannot remember all the data. Therefore, a
condensation or summary of the data is necessary.
• This makes the data analysis easy and simple. One
such summary is called central tendency.
• Thus, central tendency can explain the characteristics
of data and that further helps in comparison.
• Mass data have tendency to concentrate at certain
values, normally in the central location. It is called
measure of central tendency (or averages).
• This represents the first order of measures. Popular
measures are mean, median and mode.
1. Mean – Arithmetic average (or mean) is a measure of
central tendency that represents the ‘center’ of the dataset.
It can be found by adding all the data and dividing the sum
by the number of observations.
Mathematically, the average of all the values in the sample (population) is denoted as x̄ (x bar).
Let x1, x2, …, xN be a set of 'N' values or observations; then the arithmetic mean is given as:

x̄ = (x1 + x2 + … + xN) / N
• Weighted mean – Weighted Mean is an average computed
by giving different weights to some of the individual values.
If all the weights are equal, then the weighted mean is the same as the arithmetic mean. Hence, different weightage can be given to different items.
2. Median – The middle value in the distribution is called
median. If the total number of items in the distribution is odd,
then the middle value is called median.
If the numbers are even, then the average value of two items in
the centre is the median.
It can be observed that the median is the value where xi is divided
into two equal halves, with half of the values being lower than the
median and half higher than the median. A median class is that
class where (N/2)th item is present.
In the continuous (grouped) case, the median is given by the formula:

Median = L + ((N/2 − cf) / f) × h

where L is the lower boundary of the median class, cf is the cumulative frequency of the class preceding the median class, f is the frequency of the median class, and h is the class width.
3. Mode – Mode is the value that occurs most frequently in the dataset.
In other words, the value that has the highest frequency is called
mode.
Mode is only for discrete data and is not applicable for
continuous data as there are no repeated values in continuous
data.
Normally, the dataset is classified as unimodal, bimodal, and
trimodal with modes 1, 2, and 3, respectively.
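A quick check of all three measures with Python's standard statistics module, using the marks {45, 60, 60, 80, 85} from the earlier charts:

import statistics

marks = [45, 60, 60, 80, 85]

print("mean:",   statistics.mean(marks))    # (45+60+60+80+85)/5 = 66
print("median:", statistics.median(marks))  # middle of the sorted list = 60
print("mode:",   statistics.mode(marks))    # most frequent value = 60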
2.5.3 Dispersion
The spread out of a set of data around the central tendency (mean, median or
mode) is called dispersion.
Dispersion is represented by various ways such as range, variance, standard
deviation, and standard error.
These are second order measures. The most common measures of the dispersion
data are listed below:
Range – Range is the difference between the maximum and minimum of values of
the given list of data.
Standard Deviation – The mean does not convey much more than a middle point.
For example, the following datasets {10, 20, 30} and {10, 50, 0} both have a mean
of 20. The difference between these two sets is the spread of data.
Standard deviation is the average distance from the mean of the dataset to each point. The formula for sample standard deviation is given by:

s = √( Σ (xi − x̄)² / (N − 1) )
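A small sketch contrasting the two datasets above with Python's statistics module:

import statistics

a = [10, 20, 30]
b = [10, 50, 0]

# Both datasets have the same mean but a very different spread
print("means:", statistics.mean(a), statistics.mean(b))            # 20 20
print("range of b:", max(b) - min(b))                              # 50
print("sample stdevs:", statistics.stdev(a), statistics.stdev(b))  # 10.0 vs ~26.5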
Quartiles and Inter Quartile Range (IQR)
Quartiles divide a dataset into four equal parts. The three
quartiles are:
● Q₁ (First Quartile) – 25% of the data is below this value.
● Q₂ (Second Quartile / Median) – 50% of the data is below
this value.
● Q₃ (Third Quartile) – 75% of the data is below this value.
● Inter Quartile Range (IQR) measures the spread of the
middle 50% of the data. It is calculated as:
IQR=Q3−Q1
Example 2.4: For patients’ age list {12, 14, 19, 22, 24, 26, 28,
31, 34}, find the IQR.
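A sketch of this example using numpy; note that numpy's default percentile interpolation happens to agree with the usual hand method here, but other quartile conventions can give slightly different values:

import numpy as np

ages = [12, 14, 19, 22, 24, 26, 28, 31, 34]

q1, q3 = np.percentile(ages, [25, 75])
print("Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)  # Q1 = 19.0, Q3 = 28.0, IQR = 9.0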
Five-point Summary and Box Plots
The five-point summary of a dataset consists of the minimum, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the maximum. A box plot displays these five numbers graphically.
1. Positive Skew (Right-Skewed Distribution)
● The tail is longer on the right side.
● Most data points are concentrated on the left.
● Example: Income distribution (a few people earn extremely high salaries, creating a right tail).
2. Negative Skew (Left-Skewed Distribution)
● The tail is longer on the left side.
● Most data points are concentrated on the right.
● Example: Scores on an easy exam (most students score high, and a few low scores create a left tail).
2. Kurtosis
Kurtosis indicates the peakedness of the data.
If the data has a high peak, it indicates higher kurtosis, and vice versa.
High Kurtosis: The data has heavy tails, meaning there are more extreme outliers.
Low Kurtosis: The data has light tails, meaning fewer extreme values.
Some of the other useful measures for finding the shape of
the univariate dataset are :
• Mean absolute deviation (MAD) and
• Coefficient of variation (CV)
Mean Absolute Deviation (MAD)
MAD is another dispersion measure and is robust to outliers.
Here, the absolute deviation between the data and the mean is taken. Thus, the mean absolute deviation is given as:

MAD = (1/N) Σ |xi − x̄|
Coefficient of Variation (CV)
Coefficient of variation is used to compare datasets with different units. CV is the ratio of the standard deviation to the mean (CV = s / x̄), and %CV is the coefficient of variation expressed as a percentage.
The ideal way to check the shape of the dataset is a stem and leaf plot.
A stem and leaf plot is a display that helps us to know the shape and distribution of the data.
In this method, each value is split into a 'stem' and a 'leaf'. The last digit is usually the leaf, and the digits to the left of the leaf form the stem. For example, the mark 45 is divided into stem 4 and leaf 5 in Figure 2.9.
It can be seen from Figure 2.9 that the first column is the stem
and the second column is the leaf.
For the given English marks, two students with 60 marks are
shown in the stem and leaf plot as stem-6 with 2 leaves with 0.
Q-Q plot
A Q-Q plot can be used to assess the shape of the dataset.
The Q-Q plot is a 2D scatter plot of univariate data against theoretical normal distribution data, or of two datasets – the quantiles of the first dataset against the quantiles of the second.
The normal Q-Q plot for marks x = [13 11 2 3 4 8 9] is given
below in Figure 2.10.
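A sketch of this normal Q-Q plot, assuming scipy and matplotlib are available:

import matplotlib.pyplot as plt
from scipy import stats

marks = [13, 11, 2, 3, 4, 8, 9]

# Sample quantiles against theoretical normal quantiles;
# points close to the straight line suggest roughly normal data
stats.probplot(marks, dist="norm", plot=plt)
plt.title("Normal Q-Q plot of marks")
plt.show()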