UNIT IV
INTRODUCTION TO ML
4.1 INTRODUCTION TO MACHINE LEARNING
• Machine learning is a growing technology which enables computers to learn automatically from past data. Machine learning uses various algorithms for building mathematical models and making predictions using historical data or information. Currently, it is being used for various tasks such as image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender systems and many more.
• This section gives you an introduction to machine learning along with a wide range of machine learning techniques such as Supervised, Unsupervised and Reinforcement learning. You will learn about regression and classification models, clustering methods, hidden Markov models and various sequential models.
What is Machine Learning?
• In the real world, we are surrounded by humans who can learn everything from their experiences with their learning capability, and we have computers or machines which work on our instructions. But can a machine also learn from experiences or past data like a human does? So here comes the role of Machine Learning.
Fig. 4.1: A machine can also learn from past data with the help of machine learning
• Machine Learning is said to be a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experiences on its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can define it in a summarized way as:
> Machine learning enables a machine to automatically learn from data, improve performance from experiences and predict things without being explicitly programmed.
> With the help of sample historical data, which is known as training data, machine learning algorithms build a mathematical model that helps in making predictions or decisions without being explicitly programmed. Machine learning brings computer science and statistics together for creating predictive models. Machine learning constructs or uses algorithms that learn from historical data. The more information we provide, the higher the performance will be.
> A machine has the ability to learn if it can improve its performance by gaining more data.
How Does Machine Learning Work?
• A Machine Learning system learns from historical data, builds prediction models and, whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends upon the amount of data, as a huge amount of data helps to build a better model which predicts the output more accurately.
• Suppose we have a complex problem where we need to perform some predictions. Instead of writing code for it, we just feed the data to generic algorithms; the machine builds the logic as per the data and predicts the output.
Fig. 4.2: Machine learning pipeline (learn from dataset, build logical model, predict output)
(Q. Explain the machine learning pipeline in detail with a diagram.)
Features of Machine Learning:
• Machine learning uses data to detect various patterns in a given dataset.
• It can learn from past data and improve automatically.
• It is a data-driven technology.
• Machine learning is much similar to data mining as it also deals with a huge amount of data.
Need for Machine Learning
• The need for machine learning is increasing day by day. The reason behind this need is that machine learning is capable of doing tasks that are too complex for a person to implement directly. As humans, we have some limitations: we cannot access and process a huge amount of data manually, so for this we need computer systems, and here machine learning comes in to make things easy for us.
• We can train machine learning algorithms by providing them a huge amount of data and letting them explore the data, construct the models and predict the required output automatically. The performance of a machine learning algorithm depends on the amount of data, and it can be determined by the cost function. With the help of machine learning, we can save both time and money.
• The importance of machine learning can be easily understood by its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestion by Facebook, etc. Various top companies such as Netflix and Amazon have built machine learning models that use a vast amount of data to analyze user interest and recommend products accordingly.
Following are some key points which show the importance of Machine Learning:
> Rapid increment in the production of data.
> Solving complex problems, which are difficult for a human.
> Decision making in various sectors including finance.
> Finding hidden patterns and extracting useful information from data.
Classification of Machine Learning
Fig. 4.3: Types of machine learning (Supervised, Unsupervised, Reinforcement)
• At a broad level, machine learning can be classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
(Q. Explain supervised, unsupervised and reinforcement learning with examples.)
(i) Supervised Learning
• Supervised learning is a type of machine learning method in which we provide sample labelled data to the machine learning system in order to train it, and on that basis it predicts the output.
• The system creates a model using labelled data to understand the datasets and learn about each data point. Once the training and processing are done, we test the model by providing sample data to check whether it predicts the exact output or not.
• The goal of supervised learning is to map input data to the output data. Supervised learning is based on supervision, and it is the same as when a student learns things under the supervision of a teacher. An example of supervised learning is spam filtering.
• Supervised learning can be grouped further into two categories of algorithms:
> Classification
> Regression
(ii) Unsupervised Learning
• Unsupervised learning is a learning method in which a machine learns without any supervision.
• The training is provided to the machine with a set of data that has not been labelled, classified or categorized, and the algorithm needs to act on that data without any supervision. The goal of unsupervised learning is to restructure the input data into new features or a group of objects with similar patterns.
• In unsupervised learning we do not have a predetermined result. The machine tries to find useful insights from the huge amount of data. It can be further classified into two categories of algorithms:
> Clustering
> Association
(iii) Reinforcement Learning
• Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and gets a penalty for each wrong action. The agent learns automatically from these feedbacks and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to get the most reward points and hence improve its performance.
• A robotic dog, which automatically learns the movement of its arms, is an example of reinforcement learning.
4.1.1 History of ML
• Some 40-50 years ago, machine learning was science fiction, but today it is part of our daily life. Machine learning is making our day-to-day life easy, from self-driving cars to the Amazon virtual assistant "Alexa". However, the idea behind machine learning is quite old and has a long history. Below some milestones in the history of machine learning are given:
Fig. 4.4: Timeline of machine learning (1940s to 2010s)
The First "AI" Winter:
• The duration of 1974 to 1980 was a tough time for AI and ML researchers, and this duration was called the AI winter.
• In this duration, the failure of machine translation occurred and people reduced their interest in AI, which led to reduced funding by the governments to the researchers.
Machine Learning from Theory to Reality:
• The first neural network applied to a real-world problem was an adaptive filter used to remove echoes over phone lines.
• 1985: Terry Sejnowski and Charles Rosenberg invented the neural network NETtalk, which was able to teach itself how to correctly pronounce 20,000 words in one week.
• 1997: IBM's Deep Blue intelligent computer won the chess game against chess expert Garry Kasparov and became the first computer which had beaten a human chess expert.
Machine Learning in the 21st Century:
• 2012: Google created a deep neural network which learned to recognize humans and cats in YouTube videos.
• 2014: The chatbot "Eugene Goostman" cleared the Turing test. It was the first chatbot that convinced a third of the human judges that it was not a machine. DeepFace was a deep neural network created by Facebook, and they claimed that it could recognize a person with the same precision as a human can.
• 2016: AlphaGo beat the world's number two player, Lee Sedol, at the game of Go. In 2017 it beat the number one player of this game, Ke Jie.
• 2017: Alphabet's Jigsaw team built an intelligent system that was able to learn online trolling. It used to read millions of comments on different websites to learn to stop online trolling.
Machine Learning at Present:
• Now machine learning has achieved great advancement in its research and it is present everywhere around us, such as in self-driving cars, Amazon Alexa, chatbots, recommender systems and many more.
• Modern machine learning models can be used for making various predictions, including weather prediction, disease prediction, stock market analysis, etc.
• Before learning machine learning, you must have basic knowledge of the following so that you can easily understand the concepts of machine learning:
> Fundamental knowledge of probability and linear algebra.
> Knowledge of calculus, especially derivatives of single-variable and multivariate functions.
4.1.2 Examples of Machine Learning
• Machine learning is growing rapidly day by day. We are using machine learning in our daily life even without knowing it. Following are some of the most trending real-world applications of machine learning:
Fig. 4.5: Real-world applications of machine learning
1. Image Recognition:
• Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is the automatic friend tagging suggestion:
• Facebook provides us a feature of auto friend tagging suggestion. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm.
• It is based on the Facebook project named "Deep Face", which is responsible for face recognition and person identification in the picture.
2. Speech Recognition:
• While using Google, we get an option of "Search by voice"; it comes under speech recognition and it is a popular application of machine learning.
• Speech recognition is a process of converting voice instructions into text, and it is also known as "Speech to text" or "Computer speech recognition". At present, machine learning algorithms are widely used in various applications of speech recognition. Google assistant, Siri, Cortana and Alexa use speech recognition technology to follow voice instructions.
3. Traffic Prediction:
• If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.
• It predicts the traffic conditions, such as whether traffic is clear, slow-moving or heavily congested, with the help of two ways:
> Real-time location of the vehicle from the Google Maps app and sensors.
> Average time taken on past days at the same time.
• Everyone who is using Google Maps is helping this app to become better. It takes information from the user and sends it back to its database to improve the performance.
4. Product Recommendations:
• Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for product recommendation to the user. Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.
• Google understands the user interest using various machine learning algorithms and suggests products as per customer interest.
• Similarly, when we use Netflix, we find some recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.
5. Self-Driving Cars:
• One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, a popular car manufacturing company, is working on self-driving cars. It is using an unsupervised learning method to train the car models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
• Whenever we receive a new email, it is filtered automatically as important, normal or spam. We always receive important mail in our inbox with the important symbol and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:
> Content filter
> Header filter
> General blacklists filter
> Rules-based filters
> Permission filters
• Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree and Naive Bayes classifier are used for email spam filtering and malware detection.
7. Virtual Personal Assistant:
• We have various virtual personal assistants such as Google assistant, Alexa, Cortana and Siri. As the name suggests, they help us in finding information using our voice instructions. These assistants can help us in various ways just by our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc.
• These virtual assistants use machine learning algorithms as an important part.
• These assistants record our voice instructions, send them to the server on a cloud, decode them using ML algorithms and act accordingly.
8. Online Fraud Detection:
• Machine learning is making our online transactions safe and secure by detecting fraud transactions. Whenever we perform some online transaction, there may be various ways that a fraudulent transaction can take place, such as fake accounts, fake IDs and stealing money in the middle of a transaction. So to detect this, a Feed Forward Neural network helps us by checking whether it is a genuine transaction or a fraud transaction.
• For each genuine transaction, the output is converted into some hash values and these values become the input for the next round. For each genuine transaction, there is a specific pattern which gets changed for a fraud transaction; hence the network detects it and makes our online transactions more secure.
9. Stock Market Trading:
• Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in shares, so for this, machine learning's long short term memory neural network is used for the prediction of stock market trends.
10. Medical Diagnosis:
• In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain.
• It helps in finding brain tumors and other brain-related diseases easily.
11. Automatic Language Translation:
• Nowadays, if we visit a new place and we are not aware of the language, it is not a problem at all, as here too machine learning helps us by converting the text into our known languages. Google's GNMT (Google Neural Machine Translation) provides this feature, which is a neural machine learning system that translates the text into our familiar language, and it is called automatic translation.
• The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is used with image recognition and translates the text from one language to another language.
4.2 TYPES OF MACHINE LEARNING
• Machine learning is a subset of AI, which enables the machine to automatically learn from data, improve performance from past experiences and make predictions. Machine learning contains a set of algorithms that work on a huge amount of data. Data is fed to these algorithms to train them, and on the basis of training, they build the model and perform a specific task.
Fig. 4.6: Types of machine learning
• These ML algorithms help to solve different business problems like Regression, Classification, Forecasting, Clustering, Association, etc.
• Based on the methods and way of learning, machine learning is divided into mainly four types, which are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
1. Supervised Machine Learning
• As its name suggests, supervised machine learning is based on supervision. It means that in the supervised learning technique, we train the machines using the "labelled" dataset and, based on the training, the machine predicts the output. Here, labelled data means that some inputs are already mapped to the output.
• Let us understand it with an example. Suppose we have an input dataset of cat and dog images. First, we train the machine to understand the images, such as the shape and size of the tail of the cat and the dog, the shape of the eyes, the colour, the height, etc. After training, we input a picture of a cat and ask the machine to identify the object and predict the output. The trained machine checks all these features of the object and puts it in the Cat category.
Categories of Supervised Machine Learning
• Supervised machine learning can be classified into two types of problems, which are given below:
(i) Classification
(ii) Regression
(i) Classification
• Classification algorithms are used to solve classification problems in which the output variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc. The classification algorithms predict the categories present in the dataset. Some real-world examples of classification algorithms are spam detection, email filtering, etc.
• Some popular classification algorithms are given below:
> Random Forest Algorithm
> Decision Tree Algorithm
> Logistic Regression Algorithm
> Support Vector Machine Algorithm
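The following is a minimal sketch (not from the textbook) of how one of the listed classifiers could be trained with scikit-learn; the iris toy dataset and the choice of Logistic Regression are illustrative assumptions.
# Minimal classification sketch (assumes scikit-learn is installed)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

x, y = load_iris(return_X_y=True)                 # small labelled toy dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=200)            # any classifier from the list could be used here
clf.fit(x_train, y_train)                         # learn from the labelled training data
print("Test accuracy:", clf.score(x_test, y_test))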
(ii) Regression
• Regression algorithms are used to solve regression problems in which there is a relationship between input and output variables. These are used to predict continuous output variables, such as market trends, weather prediction, etc.
• Some popular regression algorithms are given below:
> Simple Linear Regression Algorithm
> Multivariate Regression Algorithm
> Decision Tree Algorithm
> Lasso Regression
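A minimal sketch (not from the textbook) of simple linear regression with scikit-learn; the numbers are made up for illustration.
# Minimal regression sketch (the data points are invented)
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])   # single input feature
y = np.array([3, 5, 7, 9, 11])            # continuous target, here y = 2x + 1

reg = LinearRegression().fit(x, y)
print(reg.predict(np.array([[6]])))       # expected to be close to 13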
Advantages and Disadvantages of Supervised Learning
Advantages:
• Since supervised learning works with the labelled dataset, we can have an exact idea about the classes of objects.
• These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
• These algorithms are not able to solve complex tasks.
• It may predict the wrong output if the test data is different from the training data.
• It requires lots of computational time to train the algorithm.
Applications of Supervised Learning
• Image Segmentation: Supervised learning algorithms are used in image segmentation. In this process, image classification is performed on different image data with pre-defined labels.
Medical Diagnosis: Supervised algorithms are also used in the medical field for diagnosis purposes. It is done by
using Medical images and past labelled data with labels for disease conditions, With such a process, the machine can
‘identify a disease for the new patients.
Fraud Detection: Supervised Learning classification algorithms are used for identifying fraud transactions, fraud
‘Customers, etc. It is done by using historic data to identify the patterns that can lead to possible fraud.
Spam Detection : In spam detection & filtering, classification algorithms are used. These algorithms classify an email
‘2S spam or not spam. The spam emails are sent to the spam folder.
Speech Recognition: Supervised leaming algorithms are also used in speech recognition. The algorithm is trained
with voice data and various identifications can be done using the same, such as voice-activated passwords, voice
commands, ete.
2. Unsupervised Machine Learning
• Unsupervised learning is different from the supervised learning technique; as its name suggests, there is no need for supervision. It means that in unsupervised machine learning, the machine is trained using the unlabelled dataset and the machine predicts the output without any supervision.
• In unsupervised learning, the models are trained with data that is neither classified nor labelled, and the model acts on that data without any supervision.
• The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset according to the similarities, patterns and differences. Machines are instructed to find the hidden patterns in the input dataset.
• Let's take an example to understand it more precisely. Suppose there is a basket of fruit images and we input it into the machine learning model. The images are totally unknown to the model, and the task of the machine is to find the patterns and categories of the objects.
• So the machine will discover its patterns and differences, such as colour difference and shape difference, and predict the output when it is tested with the test dataset.
Categories of Unsupervised Machine Learning
* Unsupervised Learning can be further classified into two types, which are given below.
(i) Clustering
(ii) Association
(i) Clustering
The clustering technique is used when we want to find the inherent groups from the data. It is a way to group the
objects into a cluster such that the objects with the most similarities remain in one group and have fewer or no
similarities with the objects of other groups. An example of the clustering algorithm is grouping the customers by their
Purchasing behaviour.
Some of the popular clustering algorithms are given below:
> K-Means Clustering algorithm
> Mean-shift algorithm
> DBSCAN Algorithm
> Principal Component Analysis
> Independent Component Analysis
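A minimal sketch (not from the textbook) of the first algorithm in the list above, K-Means, applied with scikit-learn; the points are invented and form two obvious groups.
# Minimal clustering sketch (illustrative points)
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # cluster index assigned to each point
print(kmeans.cluster_centers_)   # centre of each discovered group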
(ii) Association
• Association rule learning is an unsupervised learning technique which finds interesting relations among variables within a large dataset. The main aim of this learning algorithm is to find the dependency of one data item on another data item and map those variables accordingly so that it can generate maximum profit. This algorithm is mainly applied in market basket analysis, web usage mining, continuous production, etc. (A worked support/confidence example is sketched below.)
• Some popular algorithms of association rule learning are the Apriori algorithm, Eclat and the FP-growth algorithm.
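A small hand-worked sketch (not from the textbook) of the support and confidence measures that algorithms such as Apriori build on; the transactions and the rule {milk} -> {bread} are invented for illustration.
# Support and confidence for one candidate rule, computed by hand
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

rule_from, rule_to = {"milk"}, {"bread"}
conf = support(rule_from | rule_to) / support(rule_from)
print("support:", support(rule_from | rule_to))   # 2/4 = 0.5
print("confidence:", conf)                        # 0.5 / 0.75 = 0.667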
Advantages and Disadvantages of Unsupervised Learning
Advantages:
• These algorithms can be used for complicated tasks compared to the supervised ones, because they work on the unlabelled dataset.
• Unsupervised algorithms are preferable for various tasks, as getting the unlabelled dataset is easier compared to the labelled dataset.
Disadvantages:
• The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithms are not trained with the exact output beforehand.
• Working with unsupervised learning is more difficult, as it works with an unlabelled dataset that does not map to the output.
Applications of Unsupervised Learning
• Network Analysis: Unsupervised learning is used for identifying plagiarism and copyright in document network analysis of text data for scholarly articles.
• Recommendation Systems: Recommendation systems widely use unsupervised learning techniques for building recommendation applications for different web applications and e-commerce websites.
• Anomaly Detection: Anomaly detection is a popular application of unsupervised learning, which can identify unusual data points within the dataset. It is used to discover fraudulent transactions.
• Singular Value Decomposition: Singular Value Decomposition or SVD is used to extract particular information from the database. For example, extracting information of each user located at a particular location. A small sketch follows.
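A minimal sketch (not from the textbook) of SVD with NumPy; the user-by-item rating matrix is invented for illustration.
# SVD sketch on a tiny made-up user x item rating matrix
import numpy as np

ratings = np.array([[5, 4, 0],
                    [4, 5, 0],
                    [0, 1, 5]], dtype=float)

u, s, vt = np.linalg.svd(ratings, full_matrices=False)
print(s)                     # singular values: strength of each latent factor
rank1 = s[0] * np.outer(u[:, 0], vt[0, :])
print(np.round(rank1, 1))    # best rank-1 approximation of the ratings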
3. Semi-Supervised Learning
• Semi-supervised learning is a type of machine learning algorithm that lies between supervised and unsupervised machine learning. It represents the intermediate ground between supervised (with labelled training data) and unsupervised (with no labelled training data) algorithms and uses a combination of labelled and unlabelled datasets during the training period.
• Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates on data that consists of a few labels, it mostly consists of unlabelled data. Labels are costly, but for corporate purposes a few labels may be available. It is different from supervised and unsupervised learning, as they are based on the presence or absence of labels.
• To overcome the drawbacks of supervised and unsupervised learning algorithms, the concept of semi-supervised learning was introduced. The main aim of semi-supervised learning is to effectively use all the available data, rather than only the labelled data as in supervised learning. Initially, similar data is clustered with an unsupervised learning algorithm, and this further helps to label the unlabelled data. This is done because labelled data is comparatively more expensive to acquire than unlabelled data.
• We can imagine these algorithms with an example. Supervised learning is where a student is under the supervision of an instructor at home and college. Further, if that student is self-analysing the same concept without any help from the instructor, it comes under unsupervised learning. Under semi-supervised learning, the student has to revise himself after analysing the same concept under the guidance of an instructor at college.
Advantages and Disadvantages of Semi-Supervised Learning
Advantages:
+ Itis simple and easy to understand the algorithm,
+ Itis highly efficient.
+ Itis used to solve drawbacks of Supervised and Unsupervised Learning algorithms
Disadvantages:
‘¢ Iterations results may not be stable,
* We cannot apply these algorithms to network-level data,
* Accuracy is low.
4. Reinforcement Learning
• Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by hit and trial, taking actions, learning from experiences and improving its performance. The agent gets rewarded for each good action and gets punished for each bad action; hence the goal of the reinforcement learning agent is to maximize the rewards.
• In reinforcement learning, there is no labelled data as in supervised learning, and agents learn from their experiences only.
• The reinforcement learning process is similar to that of a human being; for example, a child learns various things from experiences in his day-to-day life. An example of reinforcement learning is playing a game, where the game is the environment, the moves of the agent at each step define states and the goal of the agent is to get a high score. The agent receives feedback in terms of punishment and rewards.
• Due to its way of working, reinforcement learning is employed in different fields such as game theory, operations research, information theory and multi-agent systems.
• A reinforcement learning problem can be formalized using a Markov Decision Process (MDP). In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state. (A tiny Q-learning sketch for such a process is given below.)
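A minimal, hedged sketch (not from the textbook) of tabular Q-learning on an invented 5-state corridor MDP; all states, rewards and hyperparameters are illustrative assumptions.
# Tiny Q-learning sketch: states 0..4, actions 0 = left, 1 = right, reward 1 at state 4
import numpy as np

n_states, n_actions = 5, 2
q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.2
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != 4:                                   # episode ends at the goal state
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < epsilon else int(q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == 4 else 0.0             # environment responds with a reward
        q[s, a] += alpha * (r + gamma * q[s_next].max() - q[s, a])
        s = s_next

print(np.round(q, 2))   # moving right should end up with the higher value in every state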
Categories of Reinforcement Learning
• Reinforcement learning is categorized mainly into two types of methods/algorithms:
(i) Positive Reinforcement Learning: Positive reinforcement learning specifies increasing the tendency that the required behaviour would occur again by adding something. It enhances the strength of the behaviour of the agent and positively impacts it.
(ii) Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite to positive RL. It increases the tendency that the specific behaviour would occur again by avoiding the negative condition.
Real-world Use Cases of Reinforcement Learning
Video Games:
• RL algorithms are very popular in gaming applications. They are used to gain super-human performance. Some popular games that use RL algorithms are AlphaGo and AlphaGo Zero.
Resource Management:
• The "Resource Management with Deep Reinforcement Learning" paper showed how to use RL in computers to automatically learn and schedule resources to wait for different jobs in order to minimize average job slowdown.
Robotics:
• RL is widely used in robotics applications. Robots are used in the industrial and manufacturing area, and these robots are made more powerful with reinforcement learning. Different industries have their vision of building intelligent robots using AI and machine learning technology.
Text Mining:
• Text mining, one of the great applications of NLP, is now being implemented with the help of reinforcement learning by the Salesforce company.
Advantages and Disadvantages of Reinforcement Learning
Advantages:
• It helps in solving complex real-world problems which are difficult to solve by general techniques.
• The learning model of RL is similar to the learning of human beings; hence very accurate results can be found.
• It helps in achieving long term results.
Disadvantages:
• RL algorithms are not preferred for simple problems.
• RL algorithms require huge data and computation.
• Too much reinforcement learning can lead to an overload of states which can weaken the results.
• The curse of dimensionality limits reinforcement learning for real physical systems.
mo od the life cycle of
ject The main purpos
me ead pr
= " = - coin steps \
ata preparation . 400? expen
ona rs . 2 cmoerin’ °
SH pnalyse Date
> Train the model
> Test the model
> Deployment
oO
Gathering
0) Deployment data
@
a Data
model
™ Machine jeaing
Train gy
model ona
wrangling
© data
@rats
OF AL MLSE Al a maty aay) INTRODUCTION TO ML
step includes the below tasks
> Identity various data sources
> Collect data
> Integrate the data obtained from different sources
By performing the above task we get a coherent set of data, also called as a dataset. It will be used in further steps.
Data Preparation
After collecting the data, we need to prepare it for further steps. Data preparation is a step where we put our data into
a suitable place and prepare it to use in our machine learning training,
In this step. first, we put all data together and then randomize the ordering of data
This step can be further divided into two processes:
(i) Data Exploration:
> It is used to understand the nature of data that we have to work with. We need to understand the characteristics,
format and quality of data
> A better understanding of data leads to an effective outcome. In this, we find Correlations, general trends and
outliers,
(ii) Data Pre-Processing:
= Now the next step is pre-processing of data for its analysis.
Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a useable format. It is the process of cleaning
the data, selecting the variable to use and transforming the data in a proper format to make it more suitable for
analysis in the next step. It is one of the most important steps of the complete process. Cleaning of data is required to
address the quality issues
It is not necessary that data we have collected is always of our use as some of the data may not be useful. In real
worid applications, collected data may have various issues, including:
> Missing Values
> Duplicate data
> Invalid data
> Noise
So, we use various filtering techniques to clean the data
It is mandatory to detect and remove the above issues because it can negatively affect the quality of the outcome,
Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves:
> Selection of analytical techniques
> Building models
> Review the result
The aim of this step is to build a machine learning mode! to analyze the data using various analytical techniques and
review the outcome. It starts with the determination of the type of the problems, where we select the machine learning
techniques such as Classification, Regression, Cluster analysis, Association, ete, then build the model using prepared
data and evaluate the model.
Hence, in this step, we take the data and use machine learning algorithms to build the model.
5. Train Model
• The next step is to train the model. In this step, we train our model to improve its performance for a better outcome of the problem.
• We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can understand the various patterns, rules and features. (A short sketch of the train/test/deployment steps follows.)
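A minimal, hedged sketch (not from the textbook) of the train, test and deployment steps described above; the iris dataset, the decision tree model and the joblib file name are illustrative assumptions.
# Train, test and save a model (illustrative of the life-cycle steps)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import joblib

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier().fit(x_train, y_train)   # "Train model" step
print("Test accuracy:", model.score(x_test, y_test))     # "Test model" step
joblib.dump(model, "model.joblib")                        # saved artefact used at "Deployment"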
6. Test Model
• Once our machine learning model has been trained on a given dataset, we test the model. In this step, we check the accuracy of our model by providing a test dataset to it.
• Testing the model determines the percentage accuracy of the model as per the requirement of the project or problem.
7. Deployment
• The last step of the machine learning life cycle is deployment, where we deploy the model in the real-world system.
• If the prepared model is producing an accurate result as per our requirement with acceptable speed, then we deploy the model in the real system. But before deploying the project, we will check whether it is improving its performance using the available data or not. The deployment phase is similar to making the final report for a project.
4.4 AI AND ML
• Artificial intelligence and machine learning are parts of computer science that are correlated with each other. These two technologies are the most trending technologies, which are used for creating intelligent systems.
• Although these are two related technologies and sometimes people use them as synonyms for each other, both are still different terms in various cases.
• On a broad level, we can differentiate AI and ML as follows: AI is a bigger concept, to create intelligent machines that can simulate human thinking capability and behaviour, whereas machine learning is an application or subset of AI that allows machines to learn from data without being explicitly programmed.
• Machine learning is about extracting knowledge from the data. It can be defined as:
> Machine learning is a subfield of artificial intelligence, which enables machines to learn from past data or experiences without being explicitly programmed.
• Machine learning enables a computer system to make predictions or take decisions using historical data without being explicitly programmed. Machine learning uses a massive amount of structured and semi-structured data so that a machine learning model can generate accurate results or give predictions based on that data.
• Machine learning works on algorithms which learn on their own using historical data. It works only for specific domains; for example, if we are creating a machine learning model to detect pictures of dogs, it will only give results for dog images, but if we provide new data like a cat image, then it will become unresponsive. Machine learning is being used in various places, such as online recommender systems, Google search algorithms, email spam filters, Facebook auto friend tagging suggestions, etc.
• It can be divided into three types:
1. Supervised learning
2. Reinforcement learning
3. Unsupervised learning
(Q. Distinguish between artificial intelligence and machine learning.)
Table 4.1: Key Differences Between Artificial Intelligence (AI) and Machine Learning (ML)
Artificial Intelligence | Machine Learning
1. Artificial intelligence is a technology which enables a machine to simulate human behaviour. | Machine learning is a subset of AI which allows a machine to automatically learn from past data without being programmed explicitly.
2. The goal of AI is to make a smart computer system, like humans, to solve complex problems. | The goal of ML is to allow machines to learn from data so that they can give accurate output.
3. In AI, we make intelligent systems to perform any task like a human. | In ML, we teach machines with data to perform a particular task and give an accurate result.
4. Machine learning and deep learning are the two main subsets of AI. | Deep learning is a main subset of machine learning.
5. AI is working to create an intelligent system which can perform various complex tasks. | Machine learning is working to create machines that can perform only those specific tasks for which they are trained.
6. An AI system is concerned about maximizing the chances of success. | Machine learning is mainly concerned about accuracy and patterns.
7. The main applications of AI are Siri, customer support using chatbots, Expert Systems, online game playing, intelligent humanoid robots, etc. | The main applications of machine learning are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.
8. On the basis of capabilities, AI can be divided into three types, which are Weak AI, General AI and Strong AI. | Machine learning can also be divided into mainly three types, which are Supervised learning, Unsupervised learning and Reinforcement learning.
9. It includes learning, reasoning and self-correction. | It includes learning and self-correction when introduced with new data.
10. AI completely deals with structured, semi-structured and unstructured data. | Machine learning deals with structured and semi-structured data.
4.5 DATASET FOR ML
• The key to success in the field of machine learning, or to becoming a great data scientist, is to practice with different types of datasets. But discovering a suitable dataset for each kind of machine learning project is a difficult task. So, in this section, we will provide details of the sources from where you can easily get a dataset according to your project.
• Before knowing the sources of machine learning datasets, let's discuss what a dataset is.
What is a Dataset?
• A dataset is a collection of data in which data is arranged in some order. A dataset can contain any data, from a series of an array to a database table.
• A tabular dataset can be understood as a database table or matrix, where each column corresponds to a particular variable and each row corresponds to a record of the dataset. The most supported file type for a tabular dataset is the "Comma Separated File", or CSV.
Types of Data in Datasets:
(i) Numerical Data: Such as house price, temperature, etc.
(ii) Categorical Data: Such as Yes/No, True/False, Blue/Green, etc.
(iii) Ordinal Data: These data are similar to categorical data, but can be measured on the basis of comparison.
Note: A real-world dataset is of huge size, which is difficult to manage and process at the initial level. Therefore, to practice machine learning algorithms, we can use any dummy dataset.
Need of Dataset
• To work with machine learning projects, we need a huge amount of data, because without the data one cannot train ML/AI models. Collecting and preparing the dataset is one of the most crucial parts while creating an ML/AI project.
• The technology applied behind any ML project cannot work properly if the dataset is not well prepared and pre-processed.
• During the development of an ML project, the developers completely rely on the datasets. In building ML applications, datasets are divided into two parts:
> Training dataset
> Test dataset
Fig. 4.10: Splitting of the dataset (training set, validation set and test set; the final performance estimate is made on the predictive model)
Note: The datasets are of large size, so to download these datasets, you must have a fast internet connection on your computer.
Popular Sources for Machine Learning Datasets
• Below is the list of dataset sources which are freely available for the public to work on:
1. Kaggle Datasets
• Kaggle is one of the best sources for providing datasets for data scientists and machine learners. It allows users to find, download and publish datasets in an easy way. It also provides the opportunity to work with other machine learning engineers and solve difficult data-science-related tasks.
• Kaggle provides high-quality datasets in different formats that we can easily find and download.
• The link for the Kaggle datasets is https://www.kaggle.com/datasets.
2. UCI Machine Learning Repository
• The UCI Machine Learning repository is one of the great sources of machine learning datasets. This repository contains databases, domain theories and data generators that are widely used by the machine learning community for the analysis of ML algorithms.
• Since the year 1987, it has been widely used by students, professors and researchers as a primary source of machine learning datasets.
• It classifies the datasets as per the problems and tasks of machine learning, such as Regression, Classification, Clustering, etc. It also contains some of the popular datasets such as the Iris dataset, Car Evaluation dataset, Poker Hand dataset, etc.
• The link for the UCI machine learning repository is https://archive.ics.uci.edu/ml/index.php.
3. Datasets via AWS
• We can search, download, access and share the datasets that are publicly available via AWS resources. These datasets can be accessed through AWS resources but are provided and maintained by different government organizations, researchers, businesses or individuals.
• Anyone can analyze and build various services using shared data via AWS resources. The shared dataset on the cloud helps users to spend more time on data analysis rather than on data acquisition.
• This source provides various types of datasets, with examples and ways to use them. It also provides a search box using which we can search for the required dataset. Anyone can add any dataset or example to the Registry of Open Data on AWS.
• The link for the resource is https://registry.opendata.aws/.
4. Google's Dataset Search Engine
• The Google dataset search engine is a search engine launched by Google on September 5, 2018. This source helps researchers to get online datasets that are freely available for use.
• The link for the Google dataset search engine is https://toolbox.google.com/datasetsearch.
5. Microsoft Datasets
• Microsoft has launched the "Microsoft Research Open Data" repository with a collection of free datasets in various areas such as natural language processing, computer vision and domain-specific sciences.
• Using this resource, we can download the datasets to use on the current device, or we can also directly use them on the cloud-based services.
• The link to download or use the dataset from this resource is https://msropendata.com/.
6. Awesome Public Dataset Collection
* Awesome public dataset collection provides high-quality datasets that are arranged in @ well-organized manner within
a list according to topics such as Agriculture, Biology, Climate, Complex networks, etc: Most of the datasets are
available free, but some may not, so it is better to.check the license before downloading the dataset
+ The link to download the dataset from Awesome public dataset collection is
https://github, com/awesomedata/awesome-public-datasets.
7. Government Datasets
• There are different sources to get government-related data. Various countries publish government data for public use, collected by them from different departments.
• The goal of providing these datasets is to increase the transparency of government work among the people and to use the data in an innovative approach. Below are some sources of government datasets:
> Indian Government dataset
> US Government Dataset
> Northern Ireland Public Sector Datasets
y= European Union Open Data Portal
8. Computer Vision Datasets
• Visual data provides a large number of great datasets that are specific to computer vision, such as image classification, video classification, image segmentation, etc. Therefore, if you want to build a project on deep learning or image processing, then you can refer to this source.
• The link for downloading the datasets from this source is https://www.visualdata.io/.
‘etc, Therefore, if you wantvy and real-world datasets
rovides both #
veral dataset APL
predefined funct
ources.
Teaming enthusiasts. This source P
ackage and using 9°"
be baded using 5°
any
ions such as
But these
great source for machine
from sklear.datasets Ps
tained
= Seikitslearn is 2
file from external 5
These datasets can be ol
sckitleam can
ete, rather than importing
the toy dataset available OF
win XD)
y), Yoad iis
Joad_boston(ireture XI
ot suitable for real-world projects
wurce is hetps://sci idatasets/index htm! }
datasets are
mntoad datasets from this $01 iKit-learnorg/stable/
+ The link 10 dow
fe learning model. It is
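A minimal sketch (not from the textbook) showing how one of the toy datasets mentioned above can be loaded; it assumes scikit-learn is installed.
# Loading a scikit-learn toy dataset
from sklearn.datasets import load_iris

x, y = load_iris(return_X_y=True)   # features and labels as NumPy arrays
print(x.shape, y.shape)             # (150, 4) (150,)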
4.6 DATA PRE-PROCESSING
• Data pre-processing is the process of preparing the raw data and making it suitable for a machine learning model. It is the first and crucial step while creating a machine learning model.
• When creating a machine learning project, it is not always the case that we come across clean and formatted data. And while doing any operation with data, it is mandatory to clean it and put it in a formatted way. So for this, we use the data pre-processing task.
Why Do We Need Data Pre-processing?
• Real-world data generally contains noise and missing values, and may be in an unusable format which cannot be directly used for machine learning models. Data pre-processing is a required task for cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the machine learning model.
• It involves the below steps:
> Getting the dataset
> Importing libraries
> Importing datasets
> Finding missing data
machine learing model which alo increases the acuraty and efficiency ofa machine learning modelmy INTRODUCTION TO ML f
> Encoding Categorical Data
> Splitting dataset into training and test set
> Feature scaling
1, Get the Dataset .
• To create a machine learning model, the first thing we require is a dataset, as a machine learning model completely works on data. The collected data for a particular problem in a proper format is known as the dataset.
• Datasets may be of different formats for different purposes; for example, if we want to create a machine learning model for a business purpose, then the dataset will be different from the dataset required for a liver patient. So each dataset is different from another dataset. To use the dataset in our code, we usually put it into a CSV file. However, sometimes we may also need to use an HTML or xlsx file.
What is a CSV File?
• CSV stands for "Comma-Separated Values" files; it is a file format which allows us to save tabular data, such as spreadsheets. It is useful for huge datasets and we can use these datasets in programs.
• Here we will use a demo dataset for data pre-processing and for practice. It can be downloaded from https://www.superdatascience.com/pages/machine-learning. For real-world problems, we can download datasets online from various sources such as https://www.kaggle.com/uciml/datasets, https://archive.ics.uci.edu/ml/index.php, etc.
• We can also create our own dataset by gathering data using various APIs with Python and put that data into a CSV file.
2. Importing Libraries
• In order to perform data pre-processing using Python, we need to import some predefined Python libraries. These libraries are used to perform specific jobs. There are three specific libraries that we will use for data pre-processing, which are:
(a) Numpy: The Numpy Python library is used for including any type of mathematical operation in the code. It is the fundamental package for scientific calculation in Python. It also supports large, multi-dimensional arrays and matrices. In Python, we can import it as:
> import numpy as nm: Here we have used nm, which is a short name for Numpy, and it will be used in the whole program.
(b) Matplotlib: The second library is matplotlib, which is a Python 2D plotting library, and with this library we need to import the sub-library pyplot. This library is used to plot any type of chart in Python. It will be imported as below:
> import matplotlib.pyplot as mtp: Here we have used mtp as a short name for this library.
(c) Pandas: The last library is the Pandas library, which is one of the most famous Python libraries and is used for importing and managing the datasets. It is an open-source data manipulation and analysis library. It will be imported as below; here, we have used pd as a short name for this library.
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
3. Importing the Datasets
• Now we need to import the datasets which we have collected for our machine learning project. But before importing a dataset, we need to set the current directory as the working directory. To set a working directory in Spyder IDE, save your Python file in the directory which contains the dataset and select that directory in the File explorer.
• To read the dataset file, the read_csv() function of the pandas library is used, as sketched below. It reads a CSV file and allows various operations to be performed on it.
Extracting Dependent and Independent Variables:
• In machine learning, it is important to distinguish the matrix of features (independent variables) and the dependent variable from the dataset. In our dataset, there are three independent variables, Country, Age and Salary, and one dependent variable, Purchased.
Extracting Independent Variable:
• To extract the independent variables, we will use the iloc[] method of the Pandas library. It is used to extract the required rows and columns from the dataset.
x= data_set.iloc[:, :-1].values
• In the above code, the first colon (:) is used to take all the rows and the second part selects the columns. Here we have used :-1 because we do not want to take the last column, as it contains the dependent variable. By doing this, we will get the matrix of features.
• By executing the above code, we will get the output as:
array([['India', 38.0, 68000.0],
       ['France', 43.0, 45000.0],
       ['Germany', 30.0, 54000.0],
       ['France', 48.0, 65000.0],
       ['Germany', 40.0, nan],
       ['India', 35.0, 58000.0],
       ['Germany', nan, 53000.0],
       ['France', 49.0, 79000.0],
       ['India', 50.0, 88000.0],
       ['France', 37.0, 77000.0]], dtype=object)
• As we can see in the above output, there are only the three independent variables.
Extracting Dependent Variable:
• To extract the dependent variable, we will again use the Pandas .iloc[] method.
y= data_set.iloc[:, 3].values
• Here we have taken all the rows with the last column only. It will give the array of the dependent variable.
• By executing the above code, we will get the output as:
Output:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)
Note: If you are using the Python language for machine learning, then this extraction is mandatory, but for the R language it is not required.
4. Handling Missing Data:
• The next step of data pre-processing is to handle missing data in the dataset. If our dataset contains some missing data, then it may create a huge problem for our machine learning model. Hence it is necessary to handle the missing values present in the dataset.
Ways to Handle Missing Data:
• There are mainly two ways to handle missing data, which are:
> By Deleting the Particular Row: The first way is commonly used to deal with null values. In this way, we just delete the specific row or column which consists of null values. But this way is not so efficient, and removing data may lead to a loss of information which will not give an accurate output.
> By Calculating the Mean: In this way, we calculate the mean of the column which contains the missing values and put it in place of each missing value. This strategy is useful for features which have numeric data, such as age or salary. A sketch is given below.
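The combined code at the end of this section performs this step with the older sklearn Imputer class; the sketch below uses SimpleImputer, the equivalent in current scikit-learn versions, and operates on the x array extracted earlier (columns 1 and 2 are Age and Salary).
# Replacing missing Age/Salary values with the column mean (SimpleImputer sketch)
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x[:, 1:3])           # fit on the numeric columns of x
x[:, 1:3] = imputer.transform(x[:, 1:3])   # write the imputed values back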
• After replacing the missing values with the column means, the dataset becomes:
array([['India', 38.0, 68000.0],
       ['France', 43.0, 45000.0],
       ['Germany', 30.0, 54000.0],
       ['France', 48.0, 65000.0],
       ['Germany', 40.0, 65222.22222222222],
       ['India', 35.0, 58000.0],
       ['Germany', 41.111111111111114, 53000.0],
       ['France', 49.0, 79000.0],
       ['India', 50.0, 88000.0],
       ['France', 37.0, 77000.0]], dtype=object)
• As we can see in the above output, the missing values have been replaced with the means of the rest of the column values.
5. Encoding Categorical Data:
• Categorical data is data which has some categories; in our dataset, there are two categorical variables, Country and Purchased.
• Since the machine learning model completely works on mathematics and numbers, if our dataset has a categorical variable, then it may create trouble while building the model. So it is necessary to encode these categorical variables into numbers.
For Country Variable:
• Firstly, we will convert the country variable into numbers. To do this, we will use the LabelEncoder() class from the preprocessing library.
#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
Output:
array([[2, 38.0, 68000.0],
       [0, 43.0, 45000.0],
       [1, 30.0, 54000.0],
       [0, 48.0, 65000.0],
       [1, 40.0, 65222.22222222222],
       [2, 35.0, 58000.0],
       [1, 41.111111111111114, 53000.0],
       [0, 49.0, 79000.0],
       [2, 50.0, 88000.0],
       [0, 37.0, 77000.0]], dtype=object)
Explanation:
• In the above code, we have imported the LabelEncoder class of the sklearn library, which has successfully encoded the country variable into the digits 0, 1 and 2.
• But the country variable only contains names; if the model sees these digits, it may assume that some ordering or correlation exists between the countries, which may produce wrong output. So to remove this issue, we will use dummy encoding.
Dummy Variables:
• Dummy variables are variables which have values 0 or 1. The value 1 indicates the presence of that category in a particular column, while the rest of the columns become 0. With dummy encoding, we will have a number of columns equal to the number of categories.
• In our dataset, we have 3 categories, so it will produce three columns having 0 and 1 values. For dummy encoding, we will use the OneHotEncoder class of the preprocessing library.
#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables
onehot_encoder= OneHotEncoder(categorical_features= [0])
x= onehot_encoder.fit_transform(x).toarray()
Output:
array([[0.0, 0.0, 1.0, 38.0, 68000.0],
       [1.0, 0.0, 0.0, 43.0, 45000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [1.0, 0.0, 0.0, 48.0, 65000.0],
       [0.0, 1.0, 0.0, 40.0, 65222.22222222222],
       [0.0, 0.0, 1.0, 35.0, 58000.0],
       [0.0, 1.0, 0.0, 41.111111111111114, 53000.0],
       [1.0, 0.0, 0.0, 49.0, 79000.0],
       [0.0, 0.0, 1.0, 50.0, 88000.0],
       [1.0, 0.0, 0.0, 37.0, 77000.0]])
• As we can see in the above output, the country variable is encoded into 0 and 1 values and divided into three columns. It can be seen more clearly in the variable explorer section, by clicking on the x option.
For Purchased Variable:
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)
• For the second categorical variable, we will only use the labelencoder object of the LabelEncoder class. Here we are not using the OneHotEncoder class because the Purchased variable has only two categories, yes or no, which are automatically encoded into 0 and 1.
Output:
Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
6. Splitting the Dataset into the Training Set and Test Set
• In machine learning data pre-processing, we divide our dataset into a training set and a test set. This is one of the crucial steps of data pre-processing, as by doing this we can enhance the performance of our machine learning model.
• Suppose we have given training to our machine learning model on one dataset and we test it with a completely different dataset. Then it will create difficulties for our model to understand the correlations between the data.
• As we can see, the age and salary column values are not on the same scale. Many machine learning models are based on Euclidean distance, and if we do not scale the variables, then it will cause issues in our machine learning model.
• Euclidean distance is given as:
Fig. 4.12: Euclidean distance between A and B = sqrt((x2 - x1)^2 + (y2 - y1)^2)
• If we compute any two values from age and salary, then the salary values will dominate the age values and it will produce an incorrect result. So to remove this issue, we need to perform feature scaling for machine learning.
• There are two ways to perform feature scaling in machine learning:
(i) Standardization: x_new = (x - mean(x)) / standard deviation of x
(ii) Normalization: x_new = (x - min(x)) / (max(x) - min(x))
• Here, we will use the standardization method for our dataset.
• For feature scaling, we will import the StandardScaler class of the sklearn preprocessing library as:
from sklearn.preprocessing import StandardScaler
• Now, we will create the object of the StandardScaler class for the independent variables or features. And then we will fit and transform the training dataset.
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
• For the test dataset, we will directly apply the transform() function instead of fit_transform(), because the fitting is already done on the training set.
x_test= st_x.transform(x_test)
Output:
• By executing the above lines of code, we will get the scaled values for x_train and x_test.
• As we can see in the output, all the variables are scaled to values between -1 and 1.
Note: Here, we have not scaled the dependent variable because it has only the two values 0 and 1. But if a variable has a wider range of values, then we will also need to scale that variable.
Combining all the steps: In the end, we can combine all the steps together:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('Dataset.csv')

#Extracting Independent Variable
x= data_set.iloc[:, :-1].values

#Extracting Dependent Variable
y= data_set.iloc[:, 3].values

#handling missing data (Replacing missing data with the mean value)
from sklearn.preprocessing import Imputer
imputer= Imputer(missing_values='NaN', strategy='mean', axis=0)

#Fitting imputer object to the independent variables x
imputer= imputer.fit(x[:, 1:3])

#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

#Encoding for dummy variables
onehot_encoder= OneHotEncoder(categorical_features= [0])
x= onehot_encoder.fit_transform(x).toarray()

#encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
In the above code, we have included all the data pre-processing steps together. But there are some steps or lines of code which are not necessary for all machine learning models. So we can exclude them from our code to make it reusable for all models.
4.7 POSITIVE AND NEGATIVE CLASS
Confusion Matrix:
• The confusion matrix, precision, recall and F1 score give a better intuition of prediction results compared to accuracy alone. To understand the concepts, we will limit this discussion to binary classification only.
What is a Confusion Matrix?
• It is a matrix of size 2x2 for binary classification, with actual values on one axis and predicted values on the other.
(Q. Difference between positive class and negative class.)
Fig. 4.13: Confusion Matrix
                    | Actual Negative | Actual Positive
Predicted Negative  | True negative   | False negative
Predicted Positive  | False positive  | True positive
Let's understand the confusing terms in the confusion matrix, true positive, true negative, false negative and false positive, with an example.
Example:
• A machine learning model is trained to predict tumors in patients. The test dataset consists of 100 people.
Fig. 4.14: Confusion Matrix for Tumor Detection
                    | Actual Negative | Actual Positive
Predicted Negative  | 60              | 8
Predicted Positive  | 22              | 10
• True Positive (TP): The model correctly predicts the positive class (prediction and actual are both positive). In the above example, 10 people who have tumors are predicted positively by the model.
• True Negative (TN): The model correctly predicts the negative class (prediction and actual are both negative). In the above example, 60 people who don't have tumors are predicted negatively by the model.
• False Positive (FP): The model gives a wrong prediction of the negative class (predicted-positive, actual-negative). In the above example, 22 people are predicted as positive for having a tumor, although they don't have a tumor. FP is also called a TYPE I error.
8. Explain any three popular unsupervised machine learning algorithms.
9. Enlist the various advantages and disadvantages of unsupervised learning algorithms.
10. Explain the working of association rule mining with suitable example.
11. Enlist the various advantages and disadvantages of association rule mining.
12. Explain any three popular semi-supervised machine learning algorithms.
13. Enlist the various advantages and disadvantages of semi-supervised learning algorithms.
14. What is reinforcement learning? Explain the working of reinforcement learning with suitable example.
15. Enlist the various advantages and disadvantages of reinforcement learning.
16. Draw and explain the life cycle of machine learning.
17. Differentiate between AI and ML.
18. What are the different types of data?
19. Explain the need of training and testing the datasets for ML models.
20. Enlist the popular sources of ML datasets.
21. What is the need of data pre-processing? Describe the various steps involved in data pre-processing.
22. What is confusion matrix? Define the following terms:
    a. True positives
    b. False positives
    c. True negatives
    d. False negatives
    e. Precision
    f. Recall
23. What is cross validation? Describe the various techniques to perform cross validation.
24. Differentiate between cross validation and test-train split.
25. Describe the limitations of cross validation.
If we train our model very well and its training accuracy is very high, but we then provide a new dataset to it, its performance will decrease. So we always try to make a machine learning model which performs well with the training set and also with the test dataset. Here, we can define these datasets as:

Fig.: Splitting the dataset into Training set and Test set
• Training Set: A subset of the dataset used to train the machine learning model, for which we already know the output.
• Test Set: A subset of the dataset used to test the machine learning model; by using the test set, the model predicts the output.
• For splitting the dataset, we will use the below lines of code:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)
Explanation:
• In the above code, the first line is used for splitting arrays of the dataset into random train and test subsets.
• In the second line, we have used four variables for our output, which are:
> x_train: features for the training data
> x_test: features for the testing data
> y_train: dependent variable for the training data
> y_test: dependent variable for the testing data
• In the train_test_split() function, we have passed four parameters, in which the first two are the arrays of data, and test_size specifies the size of the test set. The test_size may be 0.5, 0.3 or 0.2, which gives the dividing ratio of the training and testing sets.
• The last parameter, random_state, sets a seed for the random generator so that you always get the same result; the most commonly used value for it is 42.
Output:
By executing the above code, we will get four different variables, which can be seen under the variable explorer section.
As we can see in the above image, the x and y variables are divided into four different variables with corresponding values.
7. Feature Scaling
• Feature scaling is the final step of data pre-processing in machine learning. It is a technique to standardize the independent variables of the dataset in a specific range. In feature scaling, we put our variables in the same range and on the same scale so that no single variable dominates the others.
• False Negative (FN): Model wrongly predicts the negative class (predicted negative, actual positive). In the above example, 8 people who have tumors are predicted as negative. FN is also called a TYPE II error.
• With the help of these four values, we can calculate the True Positive Rate (TPR), False Negative Rate (FNR), True Negative Rate (TNR) and False Positive Rate (FPR):

TPR = TP / (Total Actual Positive) = TP / (TP + FN)
FNR = FN / (Total Actual Positive) = FN / (TP + FN)
TNR = TN / (Total Actual Negative) = TN / (TN + FP)
FPR = FP / (Total Actual Negative) = FP / (TN + FP)

• Even if the data is imbalanced, we can figure out whether our model is working well or not. For that, the values of TPR and TNR should be high, and FPR and FNR should be as low as possible.
• With the help of TP, TN, FN and FP, other performance metrics can be calculated.
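As an illustration (not from the running example), these quantities can be computed directly with scikit-learn; the y_actual and y_predicted arrays below are hypothetical stand-ins for a model's output on a test set.

# Illustrative sketch: confusion matrix and the four rates for a binary classifier.
from sklearn.metrics import confusion_matrix

y_actual    = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]   # hypothetical ground truth (1 = positive)
y_predicted = [0, 1, 0, 1, 0, 1, 0, 1, 0, 0]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()

tpr = tp / (tp + fn)   # True Positive Rate (sensitivity / recall)
fnr = fn / (tp + fn)   # False Negative Rate
tnr = tn / (tn + fp)   # True Negative Rate (specificity)
fpr = fp / (tn + fp)   # False Positive Rate

print(f"TPR={tpr:.2f}  FNR={fnr:.2f}  TNR={tnr:.2f}  FPR={fpr:.2f}")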
Precision, Recall
• Both precision and recall are crucial for information retrieval, where the positive class matters the most as compared to the negative. Why?
• While searching for something on the web, the model does not care about something irrelevant and not retrieved (this is the true negative case). Therefore, only TP, FP and FN are used in precision and recall.
Precision
• Out of all the positive predictions, what percentage is truly positive?

Precision = TP / (TP + FP)

• The precision value lies between 0 and 1.
Recall
• Out of the total actual positives, what percentage is predicted positive? It is the same as TPR (True Positive Rate).

Recall = TP / (TP + FN)
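A minimal sketch of computing both metrics with scikit-learn follows; the arrays are again hypothetical examples rather than output of an actual model.

# Illustrative sketch: precision and recall for hypothetical binary predictions.
from sklearn.metrics import precision_score, recall_score, f1_score

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

print("Precision:", precision_score(y_actual, y_predicted))  # TP / (TP + FP)
print("Recall   :", recall_score(y_actual, y_predicted))     # TP / (TP + FN)
print("F1 score :", f1_score(y_actual, y_predicted))         # harmonic mean of precision and recall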
How are precision and recall useful? Let's see through examples
Example 1: Credit card fraud detection
Confusion Matrix for Credit Card Fraud Detection
• We do not want to miss any fraud transactions. Therefore, we want the False Negatives to be as low as possible. In such situations, we can compromise with low precision, but recall should be high. Similarly, in medical applications, we don't want to miss any patient. Therefore, we focus on having a high recall.
• So far, we have discussed when recall is more important than precision. But when is precision more important than recall?
Example 2: Spam detection
Confusion Matrix for Spam Detection
• In the detection of spam mail, it is okay if some spam mail remains undetected (false negative), but what if an important mail is wrongly predicted as spam (false positive)? In such situations, precision is more important, so we want the False Positives to be as low as possible.
• The basic steps of cross-validation are:
> Reserve a subset of the dataset as a validation set.
> Provide the training to the model using the training dataset.
> Now, evaluate model performance using the validation set. If the model performs well with the validation set, perform the further steps, else check for the issues.
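The reserve-train-evaluate loop above can be sketched as follows; this is only an illustration, assuming a scikit-learn classifier and synthetic data rather than anything from the running example.

# Illustrative sketch of the basic validation-set procedure described above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x, y = make_classification(n_samples=200, random_state=0)   # synthetic data for the sketch

# 1. Reserve a subset of the dataset as a validation set.
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.25, random_state=0)

# 2. Train the model on the training dataset.
model = LogisticRegression().fit(x_train, y_train)

# 3. Evaluate model performance using the validation set.
print("Validation accuracy:", model.score(x_val, y_val))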
Methods used for Cross-Validation
• There are some common methods that are used for cross-validation. These methods are given below:
1. Validation Set Approach
2. Leave-P-Out Cross-Validation
3. Leave-One-Out Cross-Validation
4. K-Fold Cross-Validation
5. Stratified K-Fold Cross-Validation
1. Validation Set Approach
• We divide our input dataset into a training set and a test or validation set in the validation set approach. Both the subsets are given 50% of the dataset.
• One of the big disadvantages of this approach is that we are using only 50% of the dataset to train our model, so the model may miss important information of the dataset. It also tends to give an under-fitted model.
2. Leave-P-Out Cross-Validation
• In this approach, p data points are left out of the training data. It means, if there are total n data points in the original input dataset, then n-p data points will be used as the training dataset and the p data points as the validation set. This complete process is repeated for all the samples, and the average error is calculated to know the effectiveness of the model.
• There is a disadvantage of this technique: it can be computationally difficult for large p.
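scikit-learn provides LeavePOut for this scheme; the sketch below is purely illustrative and uses a tiny array so that the number of train/validation combinations stays small.

# Illustrative sketch: Leave-P-Out cross-validation with p = 2 on a tiny dataset.
import numpy as np
from sklearn.model_selection import LeavePOut

x = np.array([[1], [2], [3], [4]])   # 4 data points -> C(4, 2) = 6 train/validation splits
lpo = LeavePOut(p=2)

for train_index, test_index in lpo.split(x):
    print("train:", train_index, "validation:", test_index)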
3. Leave One Out Cross-Validation
• This method is similar to leave-p-out cross-validation, but instead of p, we need to leave 1 data point out of the training data. It means, in this approach, for each learning set, only one data point is reserved and the rest of the dataset is used to train the model. This process repeats for each data point. Hence for n samples, we get n different training sets and n different test sets. It has the following features:
> In this approach, the bias is minimum as all the data points are used.
> The process is executed n times; hence the execution time is high.
> This approach leads to high variation in testing the effectiveness of the model, as we iteratively check against one data point.
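A corresponding sketch with scikit-learn's LeaveOneOut, again on synthetic data purely for illustration:

# Illustrative sketch: Leave-One-Out cross-validation (n splits for n samples).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

x, y = make_classification(n_samples=30, random_state=0)    # small synthetic dataset

scores = cross_val_score(LogisticRegression(), x, y, cv=LeaveOneOut())
print("Number of splits:", len(scores))     # equals the number of samples
print("Mean accuracy   :", scores.mean())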
4. K-Fold Cross-Validation
• The k-fold cross-validation approach divides the input dataset into k groups of samples of equal sizes. These samples are called folds. For each learning set, the prediction function uses k-1 folds and the rest of the folds are used for the test set. This approach is a very popular CV approach because it is easy to understand and the output is less biased than with other methods.
• The steps for k-fold cross-validation are:
> Split the input dataset into k groups.
> For each group:
    Take one group as the reserve or test dataset.
    Use the remaining groups as the training dataset.
    Fit the model on the training set and evaluate the performance of the model using the test set.
• Let's take an example of 5-fold cross-validation. So, the dataset is grouped into 5 folds. On the 1st iteration, the first fold is reserved for testing the model and the rest are used to train the model. On the 2nd iteration, the second fold is used to test the model and the rest are used to train the model. This process continues until each fold has been used as the test fold.
Consider the below diagram:
Fig. 4.27: K-Fold Cross-Validation
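A short sketch of 5-fold cross-validation with scikit-learn follows; the estimator and the synthetic dataset are only placeholders for whatever model is actually being evaluated.

# Illustrative sketch: 5-fold cross-validation, as in the diagram above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

x, y = make_classification(n_samples=100, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)          # 5 folds of equal size
scores = cross_val_score(LogisticRegression(), x, y, cv=kf)   # one score per fold

print("Fold scores:", scores)
print("Mean score :", scores.mean())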
5. Stratified K-Fold Cross-Validation
• This technique is similar to k-fold cross-validation with some little changes. This approach works on the stratification concept: it is a process of rearranging the data to ensure that each fold or group is a good representative of the complete dataset. To deal with bias and variance, it is one of the best approaches.
• It can be understood with an example of housing prices, such that the prices of some houses can be much higher than those of other houses. To tackle such situations, a stratified k-fold cross-validation technique is useful.
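The stratified variant can be sketched the same way; here StratifiedKFold preserves the class proportions in every fold, and the imbalanced synthetic data is used only for illustration.

# Illustrative sketch: stratified 5-fold cross-validation on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic dataset where only about 10% of the samples belong to the positive class.
x, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # keeps the class ratio in each fold
scores = cross_val_score(LogisticRegression(), x, y, cv=skf)

print("Fold scores:", scores)
print("Mean score :", scores.mean())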
Holdout Method
• This method is the simplest cross-validation technique among all. In this method, we need to remove a subset of the training data and use it to get prediction results by training on the rest part of the dataset.
• The error that occurs in this process tells how well our model will perform with the unknown dataset. Although this approach is simple to perform, it still faces the issue of high variance, and it also produces misleading results sometimes.
Comparison of Cross-Validation to Train/Test Split in Machine Learning
• Train/Test Split: The input data is divided into two parts, that are the training set and the test set, in a ratio such as 70:30 or 80:20. It provides a high variance, which is one of its biggest disadvantages.
> Training Data: The training data is used to train the model, and the dependent variable is known.
> Test Data: The test data is used to make predictions from the model that is already trained on the training data. It has the same features as the training data but is not a part of it.
• Cross-Validation Dataset: It is used to overcome the disadvantage of the train/test split by splitting the dataset into groups of train/test splits and averaging the results. It can be used if we want to optimize our model, which has been trained on the training dataset, for the best performance. It is more efficient as compared to the train/test split, as every observation is used for both training and testing.
Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given below:
• For ideal conditions, it provides the optimum output. But for inconsistent data, it may produce a drastic result. So it is one of the big disadvantages of cross-validation, as there is no certainty of the type of data in machine learning.
• In predictive modeling, the data evolves over a period of time, due to which it may face differences between the training set and validation sets. For example, if we create a model for the prediction of stock market values and the data is trained on the previous 5 years' stock values, the realistic future values for the next 5 years may be drastically different, so it is difficult to expect the correct output in such situations.
Applications of Cross-Validation
• This technique can be used to compare the performance of different predictive modelling methods.
• It has great scope in the medical research field.
• It can also be used for meta-analysis, as it is already being used by data scientists in the field of medical statistics.
EXERCISE
1. What is machine learning? Explain the working of machine learning models with a suitable diagram.
2. What are the different features of ML?
3. Give the classification of machine learning techniques with suitable examples.
4. Describe any five applications of ML in brief.
5. What are the different types of learning in ML?
6. Explain any three popular supervised machine learning algorithms.
7. Enlist the various advantages and disadvantages of supervised learning algorithms.