
MACHINE LEARNING (BCS602)

Module1 notes

Figure: Tesla car
“Machine learning is a set of techniques to make computers better at doing things that humans (traditionally) can do better than machines.”

Module-1
Chapter1:
• Introduction: Need for Machine Learning,
• Machine Learning Explained,
• Machine Learning in Relation to other Fields, Types of Machine Learning,
• Challenges of Machine Learning,
• Machine Learning Process,
• Machine Learning Applications.
Chapter2:
• Understanding Data – 1: Introduction,
• Big Data Analysis Framework,
• Descriptive Statistics,
• Univariate Data Analysis and Visualization.
Chapter-1, 2 (2.1-2.5)

Introduction to Machine Learning
“Computers are able to see, hear and learn”.
1.1 NEED FOR MACHINE LEARNING
Business organizations use huge amounts of data for their daily activities.
Earlier, the full potential of this data was not utilized for two reasons.
One reason was that data was scattered across different archive systems and organizations were not able to integrate these sources fully.
The second was the lack of awareness about software tools that could help discover useful information from the data.
Not anymore! Business organizations have now started to use the latest technology, machine learning, for this purpose.
Machine learning has become so popular because of three reasons:
1. High volume of available data to manage:
Big companies such as Facebook, Twitter, and YouTube generate huge amount of data that
grows at a phenomenal rate. It is estimated that the data approximately gets doubled every
year.
2. The second reason is that the cost of storage has reduced.
The hardware cost has also dropped. Therefore, it is easier now to capture, process, store,
distribute, and transmit the digital information.
3. Third reason for popularity of machine learning is the availability of complex algorithms
now. Especially with the advent of deep learning, many algorithms are available for machine
learning.

Before starting the machine learning journey, let us establish these terms - data, information,
knowledge, intelligence, and wisdom. A knowledge pyramid is shown in Figure 1.1.

Data:
All facts are data. Data can be numbers or text that can be processed by a computer. Today,
organizations are accumulating vast and growing amounts of data with data sources such as flat
files, databases, or data warehouses in different storage formats.
Information:
Processed data is called information. This includes patterns, associations, or relationships
among data.
For example, sales data can be analyzed to extract information such as which is the fastest-selling product.
Knowledge:
Condensed information is called knowledge.
For example, the historical patterns and future trends obtained in the above sales data can be called
knowledge.
• Unless knowledge is extracted, data is of no use.
• Similarly, knowledge is not useful unless it is put into action.
Intelligence:
Intelligence is the applied knowledge for actions. An
actionable form of knowledge is called intelligence.

Wisdom:
The ultimate objective of the knowledge pyramid is wisdom, which represents a maturity of mind that is, so far, exhibited only by humans.
Here comes the need for machine learning.
The objective of machine learning is to process these
archival data for organizations :
• To take better decisions to design new products,
• To improve the business processes,
• and to develop effective support systems.
1.2 MACHINE LEARNING EXPLAINED
Machine learning is an important sub-branch of Artificial Intelligence (AI).
“Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.”
In conventional programming,
• after understanding the problem,
• a detailed design of the program such as a flowchart or an algorithm
needs to be created
• and converted into programs using a suitable programming language.

This approach could be difficult for many real-world problems such as puzzles, games, and complex image recognition applications.

Initially, artificial intelligence aimed to:
• understand these problems and
• develop general-purpose rules manually.
• These rules were then formulated into logic and implemented in a program to create intelligent systems.

This idea of developing intelligent systems by using logic and reasoning by
converting an expert’s knowledge into a set of rules and programs is called
an expert system.
Example:
An expert system like MYCIN was designed for medical diagnosis after
converting the expert knowledge of many doctors into a system.
Disadvantages:
• This approach did not progress much as programs lacked real
intelligence.
• The above approach was impractical in many domains as programs still
depended on human expertise and hence did not truly exhibit
intelligence.
Then, machine learning emerged in the form of data-driven systems.
• In the data-driven approach, data is used as an input to develop intelligent models.
• The models can then be used to make predictions on new inputs.
• Thus, the aim of machine learning is to learn a model or set of rules
from the given dataset automatically so that it can predict the unknown
data correctly.
Figure 1.2: (a) A Learning System for Humans (b) A Learning System for Machine
Learning
Area in sq ft (x)    Price (y)
500                  5 lacs
700                  7 lacs
1000                 10 lacs

As humans take decisions based on experience, computers make models based on the patterns extracted from the input data and then use these models for prediction and to take decisions.
For computers, the learnt model is equivalent to human experience.

In statistical learning (the process of finding patterns in data to understand how things relate to each other), the relationship between the input x and the output y is modeled as a function of the form y = f(x).
Here, f is the learning function that maps the input x to the output y. In machine learning, this is simply called mapping of input to output.
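A minimal sketch in Python (using NumPy; the numbers come from the small area/price table above) of learning such a mapping f and using it for prediction:

import numpy as np

# Area in sq ft (x) and price in lacs (y), taken from the table above
x = np.array([500, 700, 1000])
y = np.array([5, 7, 10])

# Fit a straight line y = f(x) = m*x + c to the data
m, c = np.polyfit(x, y, deg=1)

# Use the learnt mapping f to predict the price of an unseen 800 sq ft house
print(m * 800 + c)   # roughly 8 lacs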

The learning program summarizes the raw data in a model.
A model is an explicit description of patterns within the data in the form
of:
1. Mathematical equation
2. Relational diagrams like trees/graphs
3. Logical if/else rules, or
4. Groupings called clusters

• In summary, a model can be a formula, procedure or representation that can generate decisions from data.
• The difference between a pattern and a model is that a pattern is local and applicable only to certain attributes, whereas a model is global and fits the entire dataset.
• For example, a model can be helpful to examine whether a given email is spam or not.
• NOTE: The point is that the model is generated automatically from the given data.
Another definition of Machine Learning

• “A computer program is said to learn from experience E, with respect to task T


and some performance measure P, if its performance on T measured by P
improves with experience E”.
For example: The task T could be detecting an object in an image. The machine
can gain the knowledge of object using training dataset of thousands of images.
This is called experience E.
• So, the focus is to use this experience E for this task of object detection T.
• The ability of the system to detect the object is measured by performance
measures like precision and recall.
• Based on the performance measures, course correction can be done to improve the
performance of the system.

In systems, experience is gathered by the following steps:
1. Collection of data
2. Once data is gathered, abstract concepts are formed out of
that data. Abstraction is used to generate concepts. This is
equivalent to humans’ idea of objects, for example, we have
some idea about how an elephant looks like.
3. Generalization converts the abstraction into an actionable
form of intelligence.
4. The course correction is done by taking evaluation
measures. Evaluation checks the thoroughness of the models.

1.3 MACHINE LEARNING IN RELATION TO OTHER FIELDS
Machine learning uses the concepts of Artificial Intelligence, Data Science, and Statistics
primarily. It is the result of combining ideas from these diverse fields.
1.3.1 Machine Learning and Artificial Intelligence
1.3.2 Machine Learning, Data Science, Data Mining, and Data Analytics
1.3.3 Machine Learning and Statistics

1.3.1 Machine Learning and Artificial Intelligence

• Machine learning is an important branch of AI, which is a much broader
subject. The aim of AI is to develop intelligent agents.
• An agent can be a robot, or any autonomous system.
• Machine learning is the subbranch of AI, whose aim is to extract the
patterns for prediction.
• Deep learning is a subbranch of machine learning. In deep learning, the
models are constructed using neural network technology.
• Neural networks are based on the human neuron models. Many neurons
form a network connected with the activation functions that trigger further
neurons to perform tasks.
1.3.2 Machine Learning, Data Science, Data Mining, and Data Analytics
Machine learning starts with data. Therefore, data science and machine learning are
interlinked.
Machine learning is a branch of data science. Data science deals with gathering of data for
analysis. It is a broad field that includes:
1. Big data
2. Data mining
3. Data analytics
4. Pattern Recognition

Big Data
• Data science concerns about collection of data.
• Big data is a field of data science that deals with data’s following
characteristics:
1. Volume: Huge amount of data is generated by big companies like
Facebook, Twitter, YouTube.
2. Variety: Data is available in a variety of forms like images, videos,
and in different formats.
3. Velocity: It refers to the speed at which the data is generated and
processed.

• Big data is used by many machine learning algorithms for


applications such as language translation and image
recognition.
• Big data influences the growth of subjects like Deep learning.
• Deep learning is a branch of machine learning that deals with
constructing models using neural networks.

Data Mining
• Data mining has its origins in business.
• Just as mining the earth yields precious resources, it is often believed that digging into data unearths hidden information.
• Nowadays, many consider that data mining and machine learning are the same.
• There is no difference between these fields except that data mining aims to extract the hidden patterns present in the data, whereas machine learning aims to use those patterns for prediction.
Data Analytics
• Another branch of data science is data analytics.
• It aims to extract useful knowledge from crude data.
• There are different types of analytics.
Among them Predictive data analytics is used for making predictions. Machine
learning is closely related to this branch of analytics and shares almost all
algorithms.
Pattern Recognition
• It is an engineering field.
• Pattern recognition is a data analysis method that uses machine learning
algorithms to automatically recognize patterns and regularities in data.
• It uses machine learning algorithms to extract the features for pattern analysis and
pattern classification.
• One can view pattern recognition as a specific application of machine learning.

1.3.3 Machine Learning and Statistics

• Statistics is a branch of mathematics that has a solid theoretical


foundation regarding statistical learning.
• Like machine learning (ML), it can learn from data. But the difference between statistics and ML is that statistical methods look for regularity in data, called patterns.
• Statistics requires knowledge of the statistical procedures and the
guidance of a good statistician.

• It is mathematics intensive and models are often complicated
equations and involve many assumptions.
• Statistical methods are developed in relation to the data being
analysed.
• It has strong theoretical foundations and interpretations that
require a strong statistical knowledge.
• Machine learning, comparatively, has fewer assumptions and requires less statistical knowledge.
• But, it often requires interaction with various tools to automate
the process of learning.

1.4 TYPES OF MACHINE LEARNING
There are four types of machine learning as shown in Figure 1.5.

Before discussing the types of learning, it is necessary to discuss about data.
Labelled and Unlabelled Data
• Data is a raw fact. Normally, data is represented in the form of a table.
• Data also can be referred to as a data point, sample, or an example.
• Each row of the table represents a data point. Features are attributes or characteristics of
an object.
• Normally, the columns of the table are attributes.
• Out of all attributes, one attribute is important and is called a label.
• Label is the feature that we aim to predict.
• Thus, there are two types of data – labelled and unlabelled.
Labelled Data
To illustrate labelled data, let us take one example dataset called the Iris flower dataset or Fisher's Iris dataset. The dataset has 150 samples of Iris flowers, with four attributes: the length and width of sepals and petals. The target variable is called class. There are three classes – Iris setosa, Iris virginica, and Iris versicolor (50 samples of each).
The partial data of Iris dataset is shown in Table 1.1

Figure 1.6: (a) Labelled Dataset (b) Unlabelled Dataset
Note: In unlabelled data, there are no labels in the dataset.
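A minimal sketch (assuming scikit-learn, which ships the Iris dataset) showing what labelled data looks like in code:

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data[:3])       # four attributes per sample: sepal/petal length and width
print(iris.target[:3])     # the label (class) of each sample: 0, 1 or 2
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']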
In this type of learning, the output is already known (like cat/dog); the input only has to be mapped to that output.

1.4.1 Supervised Learning
Supervised algorithms use a labelled dataset. As the name suggests, there is a supervisor or teacher component in supervised learning.
A supervisor provides labelled data so that a model can be constructed and then evaluated using test data.

In supervised learning algorithms, learning takes place in two stages.

1. During the first stage, the teacher communicates the information to the student that the student
is supposed to master.
The student receives the information and understands it.

During this stage, the teacher has no knowledge of whether the information is grasped by the
student.

2. This leads to the second stage of learning. The teacher then asks the student a set of questions to
find out how much information has been grasped by the student.
Based on these questions, the student is tested, and the teacher informs the student about his
assessment.
This kind of learning is typically called supervised learning.

Supervised learning has two methods: 1. Classification and 2. Regression.

1. Classification
• Classification is a supervised learning method. The input attributes of the
classification algorithms are called independent variables.
• The target attribute is called label or dependent variable. The relationship between
the input and target variable is represented in the form of a structure which is called
a classification model.
• So, the focus of classification is to predict the ‘label’ that is in a discrete form (a value
from the set of finite values).
• An example is shown in Figure 1.7 where a classification algorithm takes a set of
labelled data images such as dogs and cats to construct a model that can later be used
to classify an unknown test image data.

In classification, learning takes place in two stages.
1. During the first stage, called the training stage, the learning algorithm takes a labelled dataset and starts learning. After the training-set samples are processed, the model is generated.
2. In the second stage, the constructed model is tested with a test or unknown sample, which is assigned a label.
This is the classification process.

The classification learning algorithm learns with the collection of


labelled data and constructs the model. Then, a test case is
selected, and the model assigns a label.
One of the examples of classification is – Image recognition,
which includes classification of diseases like cancer,
classification of plants, etc.

Some of the key algorithms of classification are:
• Decision Tree
• Random Forest
• Support Vector Machines
• Naïve Bayes
• Artificial Neural Network and Deep Learning networks
like CNN
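A minimal sketch (assuming scikit-learn) of the two stages described earlier: train a decision tree on the labelled Iris data, then let the model assign labels to unseen test samples and measure its accuracy:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier().fit(X_train, y_train)   # stage 1: training on labelled data
y_pred = model.predict(X_test)                            # stage 2: assign labels to test samples
print(accuracy_score(y_test, y_pred))                     # performance measure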

Regression Models

• Regression is another supervised learning task that involves predicting continuous or


numerical values based on input features.
• In regression, the goal is to build a model that can estimate the relationship between
independent variables and the dependent variable.
• For instance, predicting housing prices based on factors like location, square footage, and
number of bedrooms, or forecasting sales revenue based on historical data and market
trends.

• The regression model takes input x and generates a model in the
form of a fitted line of the form y = f(x).

• Here, x is the independent variable that may be one or more


attributes and y is the dependent variable.

• In Figure 1.8, linear regression takes the training set and tries to
fit it with a
line – product sales = 0.66 × Week + 0.54.

• Here, 0.66 and 0.54 are regression coefficients that are learnt from the data.

• The advantage of this model is that prediction for product sales


(y) can be made for unknown week data (x).

• For example, the prediction for the unknown eighth week can be made by substituting x = 8 in the regression formula to get y.
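A quick check in Python of that substitution, using the coefficients quoted from Figure 1.8:

# Fitted line from Figure 1.8: product sales = 0.66 * Week + 0.54
def product_sales(week):
    return 0.66 * week + 0.54

print(product_sales(8))   # predicted product sales for the eighth week: 5.82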
1.4.2 Unsupervised Learning

• As the name suggests, there are no supervisor or teacher
components.
• In the absence of a supervisor or teacher, self-instruction
is the most common kind of learning process.
• This process of self-instruction is based on the concept of
trial and error.
• Here, the program is supplied with objects, but no labels
are defined.
• The algorithm itself observes the examples and recognizes
patterns based on the principles of grouping.

• Grouping is done in ways that similar objects form the


same group.
Examples of unsupervised algorithms are:
1. Cluster analysis
2. Dimensionality reduction algorithms
1.Cluster Analysis
• Cluster analysis is an example of unsupervised
learning.
• It aims to group objects into disjoint clusters or
groups.
• Cluster analysis clusters objects based on their attributes.
• All the data objects of a partition are similar in some aspect and vary significantly from the data objects in the other partitions.
Some examples of clustering processes are:
1. segmentation of a region of interest in an image,
2. detection of abnormal growth in a medical image, and
3. determining clusters of signatures in a gene database.
An example of a clustering scheme is shown in Figure 1.9, where the clustering algorithm takes a set of dog and cat images and groups them into two clusters – dogs and cats.
It can be observed that the samples belonging to a cluster are similar, while samples differ radically across clusters.

Some of the key clustering algorithms are:
• k-means algorithm
• Hierarchical algorithms
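A minimal sketch (assuming scikit-learn) of k-means grouping unlabelled points into two clusters:

import numpy as np
from sklearn.cluster import KMeans

# Unlabelled 2-D points: two loose groups, no labels supplied
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [10, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment discovered from the data alone
print(kmeans.cluster_centers_)   # the centre of each group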
2. Dimensionality Reduction
• Dimensionality reduction algorithms are examples of unsupervised algorithms.
• They take higher-dimensional data as input and output the data in a lower dimension by taking advantage of the variance of the data.
• It is the task of reducing the dataset to a few features without losing generality.
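A minimal sketch (assuming scikit-learn) of Principal Component Analysis, a common dimensionality reduction algorithm, projecting 3-dimensional points down to 1 dimension:

import numpy as np
from sklearn.decomposition import PCA

# 3-dimensional points that mostly vary along a single direction
X = np.array([[2.0, 2.1, 0.1],
              [4.0, 3.9, 0.2],
              [6.0, 6.2, 0.0],
              [8.0, 7.8, 0.1]])

pca = PCA(n_components=1)             # keep only the direction of largest variance
X_reduced = pca.fit_transform(X)      # 4 samples x 1 feature instead of 4 x 3
print(X_reduced.shape)                # (4, 1)
print(pca.explained_variance_ratio_)  # fraction of the variance retained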

The differences between supervised and
unsupervised learning are listed in the following
Table 1.2.
Table 1.2: Differences between Supervised and Unsupervised
Learning

1.4.3 Semi-supervised Learning
• There are circumstances where the dataset has a huge collection of unlabelled data and only some labelled data.
• Labelling is a costly process and difficult for humans to perform.
• Semi-supervised algorithms use the unlabelled data by assigning a pseudo-label to it.
• Then, the labelled and pseudo-labelled datasets can be combined.
1.4.4 Reinforcement Learning

Reinforcement Learning
• Reinforcement learning mimics human beings.
• Like human beings use ears and eyes to perceive the world and take actions, reinforcement learning allows the agent to interact with the environment to get rewards.
• The agent can be a human, animal, robot, or any independent program.
• The rewards enable the agent to gain experience.
• The agent aims to maximize the reward.
• The reward can be positive or negative (punishment).
• When the rewards are more, the behavior gets reinforced and learning becomes possible.
In this grid game, the gray tile indicates the danger, black is a block, and the tile with diagonal lines is the goal.
The aim is to start, say from the bottom-left grid, and use the actions left, right, top and bottom to reach the goal state.

• To solve this sort of problem, there is no data.
• The agent interacts with the environment to get experience.
• In the above case, the agent tries to create a model by simulating many paths and finding rewarding paths.
• This experience helps in constructing a model.

Summary:
• Compared to supervised learning, there is no supervisor or labelled dataset.
• Many sequential decisions need to be made to reach the final decision.
• Therefore, reinforcement algorithms are reward-based, goal-oriented algorithms.
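A minimal, self-contained sketch in Python (not from the textbook; it uses a simplified 1-D row of tiles rather than the 2-D grid of the figure) of how an agent can learn from rewards by trial and error, using a basic Q-learning update:

import random

n_states, actions = 5, [-1, +1]          # tiles 0..4; move left (-1) or right (+1)
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount factor, exploration rate

for episode in range(200):
    s = 0                                # every episode starts at the leftmost tile
    while s != n_states - 1:             # the goal is the rightmost tile
        # explore occasionally, otherwise act greedily on the current reward estimates
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0     # reward only when the goal is reached
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# After enough episodes, the greedy action in every non-goal state is 'move right' (+1)
print([max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)])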

1.5 CHALLENGES OF MACHINE LEARNING
Problems that can be dealt with by Machine Learning:
Computers are better than humans in performing
tasks like computation.
For example, while calculating the square root of
large numbers, an average human may blink but
computers can display the result in seconds.

Computers can play games like chess, GO, and


even beat professional players of that game.

However, humans are better than computers in
many aspects like recognition.

But, deep learning systems challenge human beings


in this aspect as well. Machines can recognize
human faces in a second.

Still, there are tasks where humans are better as


machine learning systems still require quality data
for model construction.

The quality of a learning system depends on the quality of the data. This is a challenge.
Some of the challenges are listed below:
1. ILL-POSED PROBLEMS – PROBLEMS
WHOSE SPECIFICATIONS ARE NOT
CLEAR :
Machine learning can deal with the ‘well-posed’
problems where specifications are complete and
available. Computers cannot solve ‘ill-posed’
problems.
Consider one simple example (shown in Table 1.3):

Can a model for this data be multiplication, that is, y = x1 × x2? Well! It is true!
But it is equally true that y may be y = x1 ÷ x2, or y = x1^x2 (x1 raised to the power x2).
So, there are three functions that fit the data.
This means that the problem is ill-posed.
To solve this problem, one needs more examples to check the model.
Puzzles and games that do not have sufficient specification may become ill-posed problems, and scientific computation has many ill-posed problems.

2. Huge data :
This is a primary requirement of machine learning. The availability of quality data is a challenge. Quality data means the data should be large and should not have problems such as missing or incorrect values.

3. High computation power :


With the availability of Big Data, the computational resource
requirement has also increased.

Systems with Graphics Processing Unit (GPU) or even Tensor


Processing Unit (TPU) (a chip designed to speed up machine learning
workloads) are required to execute machine learning algorithms.

Also, machine learning tasks have become complex and hence time
complexity has increased, and that can be solved only with high
computing power.

4. Complexity of the algorithms:
The selection of algorithms, describing the
algorithms, application of algorithms to solve
machine learning task, and comparison of
algorithms have become necessary for machine
learning or data scientists now.

Algorithms have become a big topic of


discussion and it is a challenge for machine
learning professionals to design, select, and
evaluate optimal algorithms.

5. Bias/Variance:
• Bias is the error due to overly simple assumptions in the model, while variance is the error due to the model's sensitivity to the particular training data.
• This leads to a problem called the bias/variance tradeoff.
• A model that fits the training data correctly but fails for test data, i.e., one that lacks generalization, is called an overfitted model.
• The reverse problem is called underfitting, where the model fails even on the training data and therefore cannot generalize well either.
• Overfitting and underfitting are great challenges for machine learning algorithms.

1.6 MACHINE LEARNING PROCESS

1. Understanding the business –
This step involves understanding the objectives and
requirements of the business organization.

Generally, a single data mining algorithm is enough for


giving the solution.

This step also involves the formulation of the problem


statement for the data mining process.

2. Understanding the data –


It involves the steps like data collection, study of the
characteristics of the data, formulation of hypothesis,
and matching of patterns to the selected hypothesis.

3. Preparation of data –
• This step involves producing the final dataset

by cleaning the raw data and preparation of


data for the data mining process.
• The missing values may cause problems

during both training and testing phases.


• Missing data forces classifiers to produce

inaccurate results.
• This is a perennial problem (problem that

happens repeatedly) for the classification


models.
• Hence, suitable strategies should be adopted

to handle the missing data.


4. Modelling –
This step plays a role in the application of data mining
algorithm for the data to obtain a model or pattern.

5. Evaluate –
• This step involves the evaluation of the data mining
results using statistical analysis and visualization
methods.
• The performance of the classifier is determined by
evaluating the accuracy of the classifier.
• The process of classification is a fuzzy issue.
For example:Classification of emails requires
extensive domain knowledge and requires domain
experts.
• Hence, the performance of the Classifier is very crucial.

6. Deployment –
This step involves the deployment of results of
the data mining algorithm to improve the
existing process or for a new situation.

1.7 MACHINE LEARNING APPLICATIONS
1. Sentiment analysis –
This is an application of natural language processing (NLP)
where the words of documents are converted to sentiments
like happy, sad, and angry which are captured by emoticons
effectively.

For movie reviews or product reviews, five stars or one star


are automatically attached using sentiment analysis programs.
2. Recommendation systems –
These are systems that make personalized purchases possible.
For example:
• Amazon recommends related books to users, or books bought by people who have the same taste as you, and

• Netflix suggests shows or related movies of your taste.


The recommendation systems are based on machine learning.
3. Voice assistants –

• Products like Amazon Alexa, Microsoft Cortana,


Apple Siri, and Google Assistant are all examples of
voice assistants.

• They take speech commands and perform tasks.

• These chatbots are the result of machine learning


technologies.

4. Technologies like Google Maps and those used by Uber are all examples of machine learning, which help locate and navigate the shortest paths to reduce travel time.

Chapter 2
Understanding Data
2.1 What is data?
• All facts are data.
• Data can be directly human interpretable (such as
numbers or texts) or diffused data such as images or
video that can be interpreted only by a computer.
• Today, business organizations are accumulating vast and growing amounts of data, of the order of gigabytes, terabytes, and exabytes.
• A kilobyte (KB) is 1024 bytes, one megabyte (MB) is approximately 1000 KB, one gigabyte is approximately 1,000,000 KB, 1000 gigabytes is one terabyte, and 1,000,000 terabytes is one exabyte.
Data is available in different data sources like flat
files, databases, or data warehouses.
It can either be an :
• operational data or
• non-operational data.

Operational data is the one that is encountered in


normal business procedures and processes.
For example, daily sales data is operational data.

Non-operational data is the kind of data that is used


for decision making.
• Data by itself is meaningless. It has to be
processed to generate any information.
• A string of bytes is meaningless. Only when a
label is attached like height of students of a
class, the data becomes meaningful.
• Processed data is called information that
includes patterns, associations, or
relationships among data.
• For example, sales data can be analyzed to extract information like which product sold the most in the last quarter of the year.

Elements of big data
Small data: Data whose volume is less and can
be stored and processed by a small-scale
computer is called ‘small data’.
Big data:is a larger data whose volume is much
larger than ‘small data’ and is characterized as
follows:
1. Volume:Small traditional data is measured in
terms of gigabytes (GB) and terabytes (TB), but
Big Data is measured in terms of petabytes (PB)
and exabytes (EB). One exabyte is 1 million
terabytes.
2. Velocity — The availability of IoT devices and Internet power ensures
that the data is arriving at a faster rate.
3. Variety — The variety of Big Data includes:
● Form — There are many forms of data. Data types range from text,
graph, audio, video, to maps.
There can be composite data too, where one media can have many other
sources of data.
Ex: a video can have an audio song.
● Function — These are data from various sources like human
conversations, transaction records, and old archive data.
● Source of data — There are many sources of data.
Broadly, the data source can be classified as open/public data, social
media data, and multimodal data.

4. Veracity of data —Deals with aspects like conformity to
the facts, truthfulness, believability, and confidence in
data.
There may be many sources of error such as technical
errors, typographical errors, and human errors.
5. Validity — Validity is the accuracy of the data for
taking decisions or for any other goals that are needed by the
given problem.
6. Value — Value is the characteristic of big data that
indicates the value of the information that is extracted from
the data and its influence on the decisions that are taken
based on it.
2.1.1 Types of Data
1. Structured
2. Unstructured
3. Semi structured
1. Structured Data
In structured data, data is stored in an organized manner such as a
database where it is available in the form of a table.
The data can also be retrieved in an organized manner using tools
like SQL.
• Record data
• Data matrix
• Graph data
• Ordered data:- i)Temporal data
ii)Sequence data
iii)Spatial data
● Record Data—A dataset is a collection of measurements taken from a
process.
The measurements can be arranged in the form of a matrix.
Rows in the matrix represent an object and can be called entities, cases, or
records.
The columns of the dataset are called attributes, features, or fields. The table
is filled with observed data.

● Data Matrix—It is a variation of the record type because it consists of


numeric attributes.
The standard matrix operations can be applied on these data.

● Graph Data —It involves the relationships among objects.


For example, a web page can refer to another web page.
This can be modeled as a graph. The nodes are web pages and the hyperlink is an
edge that connects the nodes.



• Ordered Data — Ordered data objects involve attributes that have
an implicit order among them.
The examples of ordered data are:
1. Temporal data – It is data whose attributes are associated with time.
For example, customer purchasing patterns during festival time are temporal data.
2. Sequence data – It is like sequential data but does not have timestamps. This data involves a sequence of words or letters.
For example, DNA data is a sequence of four characters: A T G C (adenine (A), thymine (T), guanine (G), and cytosine (C)).
3. Spatial data – It has attributes such as positions or areas.
For example, maps are spatial data where the points are related by location.
2. Unstructured Data
Unstructured data includes video, image, and audio.
• It also includes textual documents, programs,
and blog data.
• It is estimated that 80% of the data are
unstructured data.
3. Semi-Structured Data
Semi-structured data are partially structured and
partially unstructured.
These include data like XML/JSON data and hierarchical data.
2.1.2 Data Storage and Representation
• Once the dataset is assembled, it must be stored in a structure that is
suitable for data analysis.
• The goal of data storage management is to make data available for
analysis.
• There are different approaches to organize and manage data in
storage files and systems from flat file to data warehouses.
• Some of them are listed below:
1. Flat files
2. Database systems
3.WWW
4. XML
5. Data stream
6. RSS(Really Simple Syndication)
7. JSON(Java Script Object Notation)



1. Flat Files

❏ These are the simplest and most commonly available data source.
❏ These flat files are the files where data is stored in plain ASCII or
EBCDIC( Extended Binary Coded Decimal Interchange Code) format.
❏ Minor changes of data in flat files affect the results of the data
mining algorithms.
❏ Hence, flat file is suitable only for storing small dataset and not
desirable if the dataset becomes larger.
Some of the popular spreadsheet formats are listed below:
● CSV files – CSV stands for comma-separated value files where the
values are separated by commas. The first row may have attributes and
the rest of the rows represent the data.
● TSV files – TSV stands for Tab separated values files where values
are separated by Tab.
There are many tools like Google Sheets and Microsoft Excel to process these files.
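A minimal sketch (assuming pandas and a hypothetical comma-separated file named sales.csv whose first row holds the attribute names):

import pandas as pd

df = pd.read_csv("sales.csv")    # for a TSV file: pd.read_csv("sales.tsv", sep="\t")
print(df.head())                 # the first few data rows
print(df.columns.tolist())       # attribute names taken from the first row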
2. Database System
• It normally consists of database files and a database
management system (DBMS).
• Database files contain original data and metadata. A
relational database consists of sets of tables.
• The tables have rows and columns. The columns represent
the attributes and rows represent tuples.
• A tuple corresponds to either an object or a relationship
between objects.
• A user can access and manipulate the data in the database
using SQL.



Different types of databases are listed below:

1. A transactional database is a collection of transactional records. Each record is


a transaction. A transaction may have a time stamp, identifier, and a set of items,
which may have links to other tables.
2. Time-series database stores time-related information like log files where data is
associated with a time stamp.

This data represents the sequences of data, which represent values or events
obtained over a period (for example, hourly, weekly, or yearly) or repeated time
span.

3. Spatial databases contain spatial information in a raster or vector format

Raster formats are either bitmaps or pixel maps.

For example, images can be stored as raster data.

Vector format can be used to store maps as maps use basic geometric primitives
like points, lines, polygons, and so forth.



3. World Wide Web (WWW)

It provides a diverse, worldwide online information source. The objective of data mining algorithms is to mine
interesting patterns of information present in WWW.

4. XML (eXtensible Markup Language)

It is both human and machine interpretable data format that can be used to represent data that needs to be
shared across platforms.

5. Data Stream

It is dynamic data, which flows in and out of the observing environment. Typical characteristics of data stream
are huge volume of data, dynamic, fixed order movement, and real-time constraints.

6. RSS (Really Simple Syndication)

It is a format for sharing instant feeds across services.

-a technology that allows you to subscribe to content from multiple sources and then have it all delivered as it's
published.

7. JSON (JavaScript Object Notation)

It is another useful data interchange format that is often used for many machine learning algorithms.



2.2 BIG DATA ANALYTICS AND TYPES OF
ANALYTICS

• The primary aim of data analysis is to assist


business organizations to take decisions.
For example, a business organization may want to know which is its fastest-selling product in order to plan its marketing activities.

• Data analysis is an activity that takes the


data and generates useful information and
insights for assisting the organizations.



• Data analysis and data analytics are terms that are
used interchangeably to refer to the same concept.

• However, there is a subtle difference.

• Data analytics is a general term and data analysis


is a part of it.

• Data analytics refers to the process of data


collection, preprocessing and analysis.

• It deals with the complete cycle of data management.



There are four types of data analytics:
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics
Descriptive Analytics
• It is about describing the main features of the data.
• After data collection is done, descriptive analytics deals
with the collected data and quantifies it.
• It is often stated that analytics is essentially statistics.
• There are two aspects of statistics – Descriptive and
Inference.
• Descriptive analytics only focuses on the description
part of the data and not the inference
part(conclusion/assumption).
Diagnostic Analytics
• It deals with the question – ‘Why?’.
• This is also known as causal analysis, as it aims to find out
the cause and effect of the events.
For ex: If a product is not selling, diagnostic analytics aims to
find out the reason.
Predictive Analytics
• It deals with the future. It deals with the question – ‘What will happen in
future given this data?’.
• This involves the application of algorithms to identify the patterns to
predict the future.
Prescriptive Analytics
• It is about finding the best course of action for the business
organizations.
• It helps the organizations to plan better for the future and to
mitigate(reduce) the risks that are involved.



2.3 BIG DATA ANALYSIS FRAMEWORK
• For performing data analytics, many frameworks are proposed.
• Big data framework is a layered architecture. Such an architecture has
many advantages such as genericness.
A 4-layer architecture has the following layers:
1. Data connection layer
2. Data management layer
3. Data analytics layer
4. Presentation layer
Data Connection Layer
• It has data ingestion mechanisms and data connectors. Data ingestion
means taking raw data and importing it into appropriate data structures.
• It performs the tasks of the ETL process.
• It means extract, transform and load operations



Data Management Layer:
It performs preprocessing of data. The purpose of this
layer is to allow parallel execution of queries, and read,
write and data management tasks.

Data Analytics Layer: It has many functionalities such as statistical tests and machine learning algorithms for understanding the data and constructing machine learning models.
This layer implements many model validation mechanisms too.



Presentation Layer: It has mechanisms such as dashboards, and
applications that display the results of analytical engines and machine
learning algorithms.
Thus, the Big Data processing cycle involves data management that
consists of the following steps.
1. Data collection
2. Data preprocessing
3. Applications of machine learning algorithm
4. Interpretation of results and visualization of machine learning
algorithm
This is an iterative process and is carried out on a permanent basis to
ensure that data is suitable for data mining.



2.3.1 Data Collection
The first task of gathering datasets is the collection of data.
It is often estimated that most of the time is spent for collection of good
quality data.
A good quality data yields a better result. It is often difficult to characterize
a ‘Good data’. ‘Good data’ is one that has the following properties:
1. Timeliness – The data should be relevant and not stale or obsolete data.
2. Relevancy – The data should be relevant and ready for the machine
learning or data mining algorithms. All the necessary information
should be available and there should be no bias in the data.
3. Knowledge about the data – The data should be understandable and
interpretable, and should be self-sufficient for the required application as
desired by the domain knowledge engineer.



Broadly, the data source can be classified as open/public data, social media
data, and multimodal data.
1. Open or public data source – It is a data source that does not have any
stringent copyright rules or restrictions. Its data can be primarily used for
many purposes. Government census data are good examples of open data:
○ Digital libraries that have a huge amount of text data as well as
document images
○ Scientific domains with a huge collection of experimental data like
genomic data and biological data
○ Healthcare systems that use extensive databases like patient
databases, health insurance data, doctors’ information, and
bioinformatics information
2. Social media – It is the data that is generated by various social media
platforms like Twitter, Facebook, YouTube, and Instagram. An
enormous amount of data is generated by these platforms.
3. Multimodal data – It includes data that involves many modes such as
text, video, audio, and mixed types.
2.3.2 Data Preprocessing
In the real world, the available data is ‘dirty’. By this word ‘dirty’, it means:

● Incomplete data
● Outlier data
● Data with inconsistent values
● Inaccurate data
● Data with missing values
● Duplicate data

❏ Data preprocessing improves the quality of the data mining techniques.


❏ The raw data must be preprocessed to give accurate results.
❏ The process of detection and removal of errors in data is called data cleaning.
❏ Data wrangling means making the data processable for machine learning
algorithms.
❏ Some of the data errors include human errors such as typographical errors or
incorrect measurement and structural errors like improper data formats.



Data errors can also arise from omission and duplication of
attributes.
Consider, for example, the following patient Table 2.1.
The ‘bad’ or ‘dirty’ data can be observed in this table.



● It can be observed that data like Salary = ‘ ’ is incomplete
data.
● The DoB of patients, John, Andre, and Raju, is the
missing data.
● The age of David is recorded as ‘5’ but his DoB indicates
it is 10/10/1980. This is called inconsistent data.
● Salary for John is –1500. It cannot be less than ‘0’. It is an
instance of noisy data.
● Outliers are data that exhibit the characteristics that are
different from other data and have very unusual values.
● The age of Raju cannot be 136. It might be a
typographical error.
● The process of removing all these errors is called data cleaning.
Analysis of Missing Data
The primary data cleaning process is missing data analysis.
Data cleaning routines attempt to fill up the missing values, smoothen the
noise while identifying the outliers and correct the inconsistencies of the
data.
This enables data mining to avoid overfitting of the models.
The procedures that are given below can solve the problem of missing
data:
1. Ignore the tuple – A tuple with missing data, especially the class
label, is ignored. Disadv:This method is not effective when the
percentage of the missing values increases.
2. Fill in the values manually – Here, the domain expert can analyse the
data tables and carry out the analysis and fill in the values manually.
Disadv: This is time consuming and may not be feasible for larger
sets.
3. A global constant can be used to fill in the missing attributes. The missing values may be labelled 'Unknown' or 'Infinity'. But some data mining results may give spurious (not real) results by analysing these labels.

4. Use the attribute mean for all samples belonging to the


same class – Here, the average value replaces the missing
values of all tuples that fall in this group.

5. Use the most possible value to fill in the missing value


– The most probable value can be obtained from other
methods like classification and decision tree prediction.
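A minimal sketch (assuming pandas) of two of the strategies above: ignoring tuples with missing data and filling missing values with the attribute mean:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Age":    [25, np.nan, 47, 31],
                   "Salary": [1500, 2000, np.nan, 1800]})

dropped = df.dropna()                            # strategy 1: ignore tuples with missing data
filled = df.fillna(df.mean(numeric_only=True))   # strategy 4: use the attribute mean
print(filled)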



Removal of Noisy or Outlier Data
• Noise is a random error or variance in a measured value. It can be
removed by using binning.
• Binning is a method where the given data values are sorted and
distributed into equal frequency bins.
• The bins are also called as buckets.
• The binning method then uses the neighbor values to smooth the
noisy data.
Some of the techniques commonly used are:
1. Smoothing by means: where the mean of the bin replaces the values of the bin.
2. Smoothing by bin medians: where the bin median replaces the bin
values.



3. Smoothing by bin boundaries: where the bin
value is replaced by the closest bin boundary.

The maximum and minimum values are called


bin boundaries.
Binning methods may be used as a discretization
technique.
Example 2.1 illustrates this principle.



Example 2.1: Consider the following set: S = {12, 14, 19, 22, 24, 26, 28,
31, 32}. Apply various binning techniques and show the result.
Solution: By equal-frequency bin method, the data should be distributed
across bins.
Let us assume the bins of size 3, then the above data is distributed across
the bins as shown below:
Bin 1 : 12, 14, 19
Bin 2 : 22, 24, 26
Bin 3 : 28, 31, 32
Smoothing by means: the bins are replaced by the bin means. This
method results in:
Bin 1 : 15, 15, 15
Bin 2 : 24, 24, 24
Bin 3 : 30.3, 30.3, 30.3
Smoothing by bin boundaries: the bins' values would be like:
Bin 1 : 12, 12, 19
Bin 2 : 22, 22, 26
Bin 3 : 28, 32, 32
As per the method, the minimum and maximum values of the bin
are determined, and it serves as bin boundary and does not
change.
Rest of the values are transformed to the nearest value. It can be
observed in Bin 1, the middle value 14 is compared with the
boundary values 12 and 19 and changed to the closest value, that
is 12.
This process is repeated for all bins.
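A small sketch in Python reproducing Example 2.1 (equal-frequency bins of size 3, then smoothing by means and by bin boundaries):

S = [12, 14, 19, 22, 24, 26, 28, 31, 32]
bins = [S[i:i + 3] for i in range(0, len(S), 3)]   # equal-frequency bins of size 3

# Smoothing by means: every value in a bin is replaced by the bin mean
by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the nearest bin boundary
by_bounds = [[min(b) if (v - min(b)) <= (max(b) - v) else max(b) for v in b] for b in bins]

print(by_means)    # [[15.0, 15.0, 15.0], [24.0, 24.0, 24.0], [30.3, 30.3, 30.3]]
print(by_bounds)   # [[12, 12, 19], [22, 22, 26], [28, 32, 32]]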



Data Integration and Data Transformations
Data integration
• Involves routines that merge data from multiple sources into a single data source.
• So, this may lead to redundant data.
• The main goal of data integration is to detect and remove redundancies that arise
from integration.

Data transformation
• Routines perform operations like normalization to improve the performance of the
data mining algorithms.
• It is necessary to transform data so that it can be processed. This can be considered as a
preliminary stage of data conditioning.

Normalization is one such technique. In normalization, the attribute values are scaled to fit in
a range (say 0–1) to improve the performance of the data mining algorithm. Often, in neural
networks, these techniques are used. Some of the normalization procedures used are:

1. Min-Max
2. z-Score
Min-Max Procedure
It is a normalization technique where each variable V is normalized by
its difference with the minimum value divided by the range to a new
range, say 0–1.
Often, neural networks require this kind of normalization. The formula to implement this normalization is given as:

    v' = ((v - min) / (max - min)) * (new_max - new_min) + new_min

Here, max - min is the range; min and max are the minimum and maximum of the given data, and new_min and new_max are the minimum and maximum of the target range, say 0 and 1.
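A minimal sketch of Min-Max normalization to the target range [0, 1]:

values = [10, 20, 35, 50]
lo, hi = min(values), max(values)
new_min, new_max = 0.0, 1.0

normalized = [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]
print(normalized)   # [0.0, 0.25, 0.625, 1.0]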

What is the use of z-scores?
• z-scores are used for outlier detection.
• If the data value z-score function is either less than -3 or
greater than +3, then it is possibly an outlier.
• A positive z-score means the value is above the mean,
while a negative z-score means it is below the mean.
Uses of Z-Scores
• Outlier Detection – Values with z-scores beyond ±3 are
often considered outliers.
The major disadvantage of the z-score function is that it is extremely sensitive to outliers, as it depends on the mean.
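A minimal sketch (the z-score of a value v is (v - mean) / standard deviation; values with |z| > 3 are flagged as possible outliers):

import statistics

values = [10, 12, 11, 13, 12, 11, 10, 13, 12, 11, 10, 12, 13, 11, 95]   # 95 looks unusual
mu = statistics.mean(values)
sigma = statistics.stdev(values)

z_scores = [(v - mu) / sigma for v in values]
outliers = [v for v, z in zip(values, z_scores) if abs(z) > 3]
print(outliers)   # [95] is flagged as a possible outlier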



Data Reduction
• Data reduction reduces data size but produces the same
results.
• There are different ways in which data reduction can be
carried out such as data aggregation, feature selection,
and dimensionality reduction.



2.4 DESCRIPTIVE STATISTICS
• Descriptive statistics is a branch of statistics
that does dataset summarization.
• It is used to summarize and describe data.
• In other words, descriptive statistics do not
bother too much about machine learning
algorithms and its functioning.
• Data visualization is a branch of study that is
useful for investigating the given data.
Mainly, the plots are useful to explain and
present data to customers.



• Descriptive analytics and data visualization
techniques help to understand the nature of the
data, which further helps to determine the kinds
of machine learning or data mining tasks that
can be applied to the data.

• This step is often known as Exploratory Data


Analysis (EDA).

• The focus of EDA is to understand the given data


and to prepare it for machine learning algorithms.
EDA includes descriptive statistics and data
visualization.



Dataset and Data Types

• A dataset can be assumed to be a collection of data


objects.
• The data objects may be records, points, vectors,
patterns, events, cases, samples, or observations.
• These records contain many attributes. An attribute can
be defined as the property or characteristics of an
object.
For example, consider the following database shown in
sample Table 2.2.
Every attribute should be associated with a value. This
process is called measurement.

The type of attribute determines the data type, often referred to as the measurement scale type. The data types, shown in Figure 2.1, are broadly categorical (nominal, ordinal) and numeric (interval, ratio).
Categorical or Qualitative Data:
The categorical data can be divided into two types. They are nominal type and ordinal
type.

● Nominal Data – In Table 2.2, patient ID is nominal data. Nominal data are
symbols and cannot be processed like a number.

For Ex: the average of a patient ID does not make any statistical sense.

Nominal data type provides only information but has no ordering among data.
Only operations like (=, ≠) are meaningful for these data.

For Ex: the patient ID can be checked for equality and nothing else.

● Ordinal Data – It provides enough information and has natural order.


Has a meaningful order but no fixed numerical difference (e.g., Education Level:
High School < Bachelor’s < Master’s).

For Ex: Fever = {Low, Medium, High} is an ordinal data.

Certainly, low is less than medium and medium is less than high, irrespective of
the value.
Numeric or Quantitative Data
It can be divided into two categories. They are interval type and ratio type.

● Interval Data – Interval data is a type of numeric (quantitative) data


where the difference between values is meaningful, but zero is arbitrary
and does not represent the absence of a quantity.
For ex: there is a meaningful difference between 30 degrees and 40 degrees.
Operations: Addition and subtraction are meaningful, but ratios are not.
● Ratio Data – For ratio data, both differences and ratios are meaningful, because the scale has a true zero.
The difference between ratio and interval data is the position of zero in the scale.
For ex: in the Centigrade–Fahrenheit conversion, the zeroes of the two scales do not match (zero is arbitrary), so temperature is interval data; quantities such as distance or weight, which have a true zero, are ratio data.

Properties of interval data:
1. It is measured in the form of numbers. Ex: measuring temperature using thermometers.
2. It has rank and order. Ex: while measuring temperature, 1 degree is always lower than 3 degrees.
3. It is equidistant, i.e., it has equally spaced intervals. Ex: the difference between 1 degree Celsius and 2 degrees Celsius is the same as the difference between 4 degrees and 5 degrees.
4. It does not have a meaningful zero.
5. Interval data can be negative. Ex: -12 degrees Celsius.

140
None of these examples has a meaningful zero.

Properties of ratio data:
1. It is measured in the form of numbers. Ex: distance can be measured using a measuring device.
2. It has rank and order. Ex: while measuring distances, 2 km is always less than 5 km.
3. It is equidistant, i.e., it has equally spaced intervals. Ex: the difference between 1 km and 2 km is the same as the difference between 4 km and 5 km.
4. It has a meaningful zero. Ex: we can say that you have travelled zero km today (i.e., not travelled).
5. Ratio data can never be negative. Ex: there is nothing like -5 km in distance.

All of these examples have a meaningful zero.
Another way of classifying the data is to classify it as:

1. Discrete value data


2. Continuous data

Discrete Data
Discrete data consists of distinct, separate values that can be counted.

It is typically represented as whole numbers and does not have decimal or fractional values.

Example: The number of students in a class (e.g., 25 students), the number of cars in a parking lot, or employee
ID numbers.

Continuous Data
Continuous data can take any value within a given range and includes decimal or fractional values.

It is measured rather than counted and can be infinitely divided into smaller parts.

Example: A person's height (e.g., 170.5 cm), weight (e.g., 65.2 kg), or temperature (e.g., 36.7°C).

A third way of classifying the data is based on the number of variables used in the dataset. Based on that, the data can be classified as univariate data, bivariate data, and multivariate data. This is shown in Figure 2.2.

Univariate data:
The dataset has only one variable. A variable is also called a category.
Bivariate data:
The number of variables used is two.
Multivariate data:
Uses three or more variables.
Univariate data:
Univariate data consists of one variable. It is used to describe and analyze a single
characteristic or attribute of a dataset.
Examples of Univariate Data:

● The heights of students in a classroom.


● The daily temperature in a city.
● The test scores of students in a subject.

Here is an example of a univariate dataset, which consists of only one variable:

Example: Heights of 10 students (in cm):


160, 165, 170, 175, 168, 172, 158, 180, 177, 169
Since it involves only one variable, it is univariate data.
Visualization Methods:

● Histograms
● Box plots
● Bar charts
Bivariate Data
Bivariate data consists of two variables and examines the relationship between them. It helps
determine whether there is a correlation or association between the variables.

Examples of Bivariate Data:

● The relationship between a student's study hours and exam scores.


● The correlation between temperature and ice cream sales.
● The link between weight and height of individuals.

Visualization Methods:

● Scatter plots
● Line graphs

An example of a bivariate dataset would pair each student's study hours with the corresponding exam score – two related variables.
Multivariate Data
Multivariate data consists of three or more variables. It is used to
analyze the relationships among multiple factors simultaneously.

Example of a Multivariate Dataset: Student Performance Analysis


This dataset includes three variables: Study Hours, Sleep Hours, and
Exam Scores.

2.5 UNIVARIATE DATA ANALYSIS AND
VISUALIZATION
• Univariate analysis is the simplest form of statistical
analysis.
• As the name indicates, the dataset has only one variable.
• A variable can also be called a category.
• Univariate analysis does not deal with causes or relationships; its aim is to describe the data and find patterns.
• Univariate data description involves finding the
frequency distributions, central tendency measures,
dispersion or variation, and shape of the data.

2.5.1 Data Visualization
• To understand data, graph visualization is a must.
• Data visualization helps to understand data. It helps to present information and data to customers.
• Some of the graphs used in univariate data analysis are bar charts, histograms, frequency polygons and pie charts.
• Advantages of graphs: presentation of data, summarization of data, description of data, exploration of data, and comparison of data.
• Let us consider some forms of graphs now:

Bar chart

A Bar chart (or Bar graph) is used to display the frequency distribution for variables.

• Bar charts are used to illustrate discrete data. The charts can also help to explain the
counts of nominal data. It also helps in comparing the frequency of different groups.
• The bar chart for students' marks {45, 60, 60, 80, 85} with Student ID = {1, 2, 3, 4, 5}
is shown below in Figure 2.3.
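A matplotlib sketch that draws a chart of this kind (the figure itself is not reproduced in these notes; the values come from the example above):

```python
import matplotlib.pyplot as plt

student_ids = [1, 2, 3, 4, 5]
marks = [45, 60, 60, 80, 85]

plt.bar(student_ids, marks)          # one bar per student, height = marks
plt.xlabel("Student ID")
plt.ylabel("Marks")
plt.title("Bar chart of students' marks")
plt.show()
```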

Histogram
• It plays an important role in data mining for showing
frequency distributions.
• The histogram for students' marks {45, 60, 60, 80, 85} in
the group range of 0–25, 26–50, 51–75, 76–100 is given
below in Figure 2.5. One can visually inspect from
Figure 2.5 that the number of students in the range 76–
100 is 2.

• A histogram conveys useful information such as the nature of the data and its mode (the value that appears most frequently).
• The mode indicates the peak of the dataset.
• In other words, histograms can be used as charts to show the frequency, the skewness (lack of symmetry) present in the data, and the shape.
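A matplotlib sketch for the histogram described above, using the same group ranges; the exact bin edges are an assumption of this sketch:

```python
import matplotlib.pyplot as plt

marks = [45, 60, 60, 80, 85]
bins = [0, 25, 50, 75, 100]              # group ranges 0-25, 26-50, 51-75, 76-100

plt.hist(marks, bins=bins, edgecolor="black")
plt.xlabel("Marks range")
plt.ylabel("Number of students")
plt.title("Histogram of students' marks")
plt.show()                               # the top bar (75-100) contains 2 students
```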
Dot Plots
These are similar to bar charts. They are less cluttered than bar charts, as they illustrate each bar with only a single point.

The dot plot of English marks for five students with ID as {1,
2, 3, 4, 5} and marks {45, 60, 60, 80, 85} is given in Figure
2.6. The advantage is that by visual inspection one can find out
who got more marks.

2.5.2 Central Tendency
• One cannot remember all the data. Therefore, a
condensation or summary of the data is necessary.
• This makes the data analysis easy and simple. One
such summary is called central tendency.
• Thus, central tendency can explain the characteristics
of data and that further helps in comparison.
• Large masses of data tend to concentrate around certain values, normally in the central location. Measures of this concentration are called measures of central tendency (or averages).
• This represents the first order of measures. Popular
measures are mean, median and mode.

1. Mean – Arithmetic average (or mean) is a measure of
central tendency that represents the ‘center’ of the dataset.
It can be found by adding all the data and dividing the sum
by the number of observations.
Mathematically, the average of all the values in the sample (population) is denoted as x̄ (x bar).
Let x1, x2, …, xN be a set of 'N' values or observations; then the arithmetic mean is given as:

x̄ = (x1 + x2 + … + xN) / N = (1/N) Σ xi
• Weighted mean – A weighted mean is an average computed by giving different weights to individual values. If all the weights are equal, then the weighted mean is the same as the arithmetic mean. Hence, different weightage can be given to different items: weighted mean = Σ(wi · xi) / Σ wi.
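A short NumPy sketch of the arithmetic and weighted mean; the scores and weights below are invented for illustration:

```python
import numpy as np

marks = [45, 60, 60, 80, 85]
print(np.mean(marks))                        # arithmetic mean = 66.0

values  = [80, 90, 70]                       # e.g. test, assignment, project scores
weights = [0.5, 0.3, 0.2]                    # weightage given to each item
print(np.average(values, weights=weights))   # weighted mean = 81.0
```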

2. Median – The middle value in the distribution is called
median. If the total number of items in the distribution is odd,
then the middle value is called median.
If the numbers are even, then the average value of two items in
the centre is the median.
It can be observed that the median is the value where xi is divided
into two equal halves, with half of the values being lower than the
median and half higher than the median. A median class is that
class where (N/2)th item is present.
In the continuous (grouped) case, the median is given by the formula:

Median = L + ((N/2 - cf) / f) × h

where L is the lower boundary of the median class, cf is the cumulative frequency of the class preceding the median class, f is the frequency of the median class, and h is the class width.
3. Mode – Mode is the value that occurs most frequently in the dataset.
In other words, the value that has the highest frequency is called the mode.
Mode is only for discrete data and is not applicable for continuous data, as continuous data rarely contain repeated values.
Normally, the dataset is classified as unimodal, bimodal, and
trimodal with modes 1, 2, and 3, respectively.
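A quick sketch using Python's statistics module; the even-length list is added to show the two-middle-values case for the median:

```python
import statistics

marks = [45, 60, 60, 80, 85]
print(statistics.median(marks))   # 60 -- middle value of the sorted odd-length list
print(statistics.mode(marks))     # 60 -- the most frequent value (unimodal data)

even = [10, 20, 30, 40]
print(statistics.median(even))    # 25.0 -- average of the two middle values
```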

2.5.3 Dispersion
The spread of a set of data around the central tendency (mean, median or mode) is called dispersion.
Dispersion is represented by various ways such as range, variance, standard
deviation, and standard error.
These are second order measures. The most common measures of the dispersion
data are listed below:
Range – Range is the difference between the maximum and minimum of values of
the given list of data.
Standard Deviation – The mean does not convey much more than a middle point.
For example, the following datasets {10, 20, 30} and {10, 50, 0} both have a mean
of 20. The difference between these two sets is the spread of data.

Standard deviation is the average distance from the mean of the dataset to each point. The formula for the sample standard deviation is given by:

s = sqrt( Σ (xi - x̄)² / (N - 1) )
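A NumPy sketch contrasting the two example datasets from the text (ddof=1 gives the sample standard deviation):

```python
import numpy as np

a = [10, 20, 30]
b = [10, 50, 0]

print(max(a) - min(a), max(b) - min(b))       # ranges: 20 and 50
print(np.mean(a), np.mean(b))                 # both means are 20.0
print(np.std(a, ddof=1), np.std(b, ddof=1))   # sample std: 10.0 vs about 26.5 -- very different spread
```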
Quartiles and Inter Quartile Range (IQR)
Quartiles divide a dataset into four equal parts. The three
quartiles are:
● Q₁ (First Quartile) – 25% of the data is below this value.
● Q₂ (Second Quartile / Median) – 50% of the data is below
this value.
● Q₃ (Third Quartile) – 75% of the data is below this value.
● Inter Quartile Range (IQR) measures the spread of the
middle 50% of the data. It is calculated as:
IQR=Q3−Q1

Example 2.4: For patients’ age list {12, 14, 19, 22, 24, 26, 28,
31, 34}, find the IQR.
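A NumPy sketch for Example 2.4; note that textbooks and libraries use slightly different quartile conventions, so a hand-computed IQR may differ from the output below:

```python
import numpy as np

ages = [12, 14, 19, 22, 24, 26, 28, 31, 34]
q1, q2, q3 = np.percentile(ages, [25, 50, 75])
print(q1, q2, q3)   # the three quartiles
print(q3 - q1)      # IQR = Q3 - Q1
```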

Five-point Summary and Box Plots

• The median, quartiles Q1 and Q3, and minimum and maximum


written in the order < Minimum, Q1, Median, Q3, Maximum > is
known as five-point summary.
• Box plots can be used to illustrate data distributions and summary
of data.
• It is the popular way for plotting five-number summaries. A Box
plot is also known as a Box and whisker plot.
• The box contains bulk of the data. These data are between first and
third quartiles.
• The line inside the box indicates location – usually the median of the data.
• If the median is not equidistant from Q1 and Q3, then the data is skewed. The whiskers that project from the ends of the box indicate the spread of the tails and the maximum and minimum of the data values.
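A sketch of the five-point summary and box plot, reusing the patients' ages from Example 2.4:

```python
import numpy as np
import matplotlib.pyplot as plt

ages = [12, 14, 19, 22, 24, 26, 28, 31, 34]
print(np.percentile(ages, [0, 25, 50, 75, 100]))   # <minimum, Q1, median, Q3, maximum>

plt.boxplot(ages)
plt.title("Box plot of patients' ages")
plt.show()
```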
2.5.4 Shape
• The shape of a dataset refers to the overall appearance of its
distribution when plotted on a graph.
• It helps in understanding how data values are spread, whether the
distribution is symmetric or asymmetric, and where most data points
are concentrated.
• Two important measures that describe the shape of a dataset are
skewness and kurtosis.
1. Skewness
● Skewness measures the asymmetry of a dataset's distribution.
● If a distribution is perfectly symmetric, it has a skewness of zero (e.g.,
a normal distribution).
● Positive Skew (Right-Skewed): The right tail is longer, meaning
there are more high-value outliers. Here, mean > median > mode.
● Negative Skew (Left-Skewed): The left tail is longer, meaning there
are more low-value outliers. Here, mean < median < mode.

1. Positive Skew (Right-Skewed Distribution)
● The tail is longer on the right side.
● Most data points are concentrated on the left.
● Example: Income distribution (a few people earn extremely high salaries, creating a
right tail).

2. Negative Skew (Left-Skewed Distribution)


● The tail is longer on the left side.
● Most data points are concentrated on the right.
● Example: Test scores (if most students score high but a few score very low, creating a
left tail).

2. Kurtosis
Kurtosis indicates the peakedness of the data.
If the data has a high peak, it indicates higher kurtosis, and vice versa.
High Kurtosis: The data has heavy tails, meaning there are more extreme outliers.

Low Kurtosis: The data has light tails, meaning fewer extreme values.

Let x1, x2, …, xN be a set of 'N' values or observations. Then, kurtosis is measured using the formula given below:

Kurtosis = ( (1/N) Σ (xi - x̄)⁴ ) / s⁴

where x̄ is the mean and s is the standard deviation; a normal distribution has a kurtosis of about 3 (or 0, if 3 is subtracted to give the 'excess' kurtosis).
Some of the other useful measures for finding the shape of the univariate dataset are:
• Mean absolute deviation (MAD) and
• Coefficient of variation (CV)
Mean Absolute Deviation (MAD)
MAD is another dispersion measure and is robust to outliers.
Here, the absolute deviation between the data and the mean is taken. Thus, the mean absolute deviation is given as:

MAD = (1/N) Σ |xi - x̄|
Coefficient of Variation (CV)
Coefficient of variation is used to compare datasets with different units.
CV is the ratio of standard deviation and mean, and %CV is the percentage of coefficient of variation.
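A NumPy sketch computing both measures for the same illustrative marks:

```python
import numpy as np

marks = np.array([45, 60, 60, 80, 85], dtype=float)

mad = np.mean(np.abs(marks - marks.mean()))    # mean absolute deviation
cv  = marks.std(ddof=1) / marks.mean() * 100   # %CV = (sample std / mean) * 100
print(mad, cv)
```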

2.5.5 Special Univariate Plots

The ideal way to check the shape of the dataset is a stem and leaf plot.
A stem and leaf plot is a display that helps us to know the shape and distribution of the data.
In this method, each value is split into a 'stem' and a 'leaf'. The last digit is usually the leaf, and the digits to the left of the leaf form the stem. For example, a mark of 45 is split into stem 4 and leaf 5 in Figure 2.9.

It can be seen from Figure 2.9 that the first column is the stem
and the second column is the leaf.
For the given English marks, the two students with 60 marks appear in the stem and leaf plot as stem 6 with two leaves of 0.
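A tiny Python sketch that prints such a plot for the English marks used above:

```python
from collections import defaultdict

marks = [45, 60, 60, 80, 85]
stems = defaultdict(list)
for m in sorted(marks):
    stems[m // 10].append(m % 10)     # tens digit is the stem, units digit is the leaf

for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
# Output:
# 4 | 5
# 6 | 0 0
# 8 | 0 5
```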

Q-Q plot
A Q-Q plot can be used to assess the shape of the dataset.
A Q-Q plot is a 2D scatter plot of univariate data against a theoretical normal distribution, or of two datasets – the quantiles of the first dataset against the quantiles of the second.
The normal Q-Q plot for marks x = [13 11 2 3 4 8 9] is given below in Figure 2.10.
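A SciPy/matplotlib sketch that produces a plot like Figure 2.10 for the same marks:

```python
import matplotlib.pyplot as plt
from scipy import stats

x = [13, 11, 2, 3, 4, 8, 9]
stats.probplot(x, dist="norm", plot=plt)   # sample quantiles vs theoretical normal quantiles
plt.title("Normal Q-Q plot of marks")
plt.show()
```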

