
MODULE – 1

INTRODUCTION
TO
MACHINE LEARNING
MODULE 1 - SYLLABUS
❖ Machine Learning terminology
❖ Types of Machine Learning
❖ Issues in Machine Learning
❖ Applications of Machine Learning
❖ Steps in developing ML application
❖ How to choose the right algorithm
❖ Hypothesis Testing
Introduction
❖ Machine Learning is a subset of Artificial Intelligence (AI).
❖ It is mainly concerned with the development of algorithms that allow a
computer to learn from data and past experiences on its own.
❖ It improves its performance with experience and predicts outcomes without
being explicitly programmed.
Introduction
❖ Using sample historical data (training data), ML algorithms
build a mathematical model that helps in making
predictions
❖ It brings computer science and statistics together for
creating predictive models
❖ The more information we provide, the higher the performance.
Working of Machine Learning
❖ Learn
❖ Make decisions
❖ A ML system learns from historical data, builds the prediction
models, and whenever it receives new data, predicts the
output for it

❖ The accuracy of predicted output depends upon the amount of data.
Traditional Programming Vs. Machine Learning
Need of Machine Learning
❖ It is capable of doing tasks that are too complex for a person.
❖ Machine learning is widely used in many industries, including
healthcare, finance, and e-commerce
❖ Using Machine Learning, we can save both time and money.
❖ Machine learning is an important tool for data analysis and
visualization.
❖ Use Case:
❖ Self-driving cars,
❖ Cyber fraud detection,
❖ Friend suggestions by Facebook,
❖ Facial recognition systems, etc.
Advantages of Machine Learning
❖ Rapid increment in the production of data
❖ Solving complex problems, which are difficult for a human
❖ Decision making in various sectors, including finance
❖ Finding hidden patterns and extracting useful information from
data
AI, ML, DL
Types of Machine Learning
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Supervised learning
❖ Supervised learning is a type of machine learning.
❖ Machines are trained using well "labelled" training data, and on the
basis of that data, machines predict the output.
❖ The training data provided to the machine works as a supervisor that
teaches the machine to predict the output correctly.
Supervised learning
❖ Supervised learning is a process of providing input data as well
as correct output data to the machine learning model.
❖ The aim of a supervised learning algorithm is to find a mapping
function to map the input variable(x) with the output
variable(y).
❖ You are given reviews of a few Netflix series marked as positive,
negative, or neutral. Classifying the reviews of a new Netflix series is
an example of supervised learning.
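As a concrete (hypothetical) illustration of learning a mapping from inputs x to labels y, the sketch below trains a tiny review classifier with scikit-learn; the review strings and labels are invented toy data, not part of the slides.

```python
# Toy sketch: supervised text classification of (invented) series reviews.
# Assumes scikit-learn is installed; the data below is made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["loved every episode", "boring and slow", "decent but forgettable",
           "brilliant writing", "terrible acting", "okay for a weekend watch"]
labels  = ["positive", "negative", "neutral",
           "positive", "negative", "neutral"]   # the x -> y mapping to learn

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(reviews, labels)                      # learn the mapping f(x) ≈ y

print(model.predict(["the new season is brilliant"]))  # e.g. ['positive']
```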
Supervised learning
Types of Supervised learning
Types of Supervised learning
Classification:
➢ Classification algorithms are used to solve the classification problems in
which the output variable is categorical, such as "Yes" or "No", "Pass" or "Fail",
etc.
➢ The classification algorithms predict the categories present in the dataset.
➢ Some real-world examples of classification algorithms are Spam
Detection, Email filtering, etc.
Classification Types:
1. Binary Classification
2. Multi-Class Classification
3. Multi-Label Classification
4. Imbalanced Classification
1. Binary Classification:
• Binary classification refers to those classification tasks that have two
class labels.
• Examples include:
• Email spam detection (spam or not).
• Conversion prediction (buy or not).
Popular algorithms:
• Logistic Regression
• k-Nearest Neighbors
• Decision Trees
• Support Vector Machine
• Naive Bayes
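For instance, a minimal binary-classification sketch with logistic regression on synthetic data (an illustrative sketch, not the slides' own example):

```python
# Minimal binary classification sketch on synthetic data (two class labels: 0/1).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```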
2. Multi-Class Classification:
• Multi-class classification refers to those classification tasks that have
more than two class labels.
• Examples include:
• Face classification.
• Plant species classification.
• Optical character recognition.
Popular algorithms:
• k-Nearest Neighbors
• Decision Trees
• Naive Bayes
• Random Forest
• Gradient Boosting
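A minimal multi-class sketch, assuming scikit-learn and using the built-in three-class Iris dataset as a stand-in:

```python
# Multi-class sketch: k-Nearest Neighbors on the 3-class Iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)            # 3 class labels (species)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("test accuracy:", knn.score(X_te, y_te))
```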
3. Multi-Label Classification:
• Multi-label classification refers to those classification tasks that have
two or more class labels, where one or more class labels may be predicted
for each example.
• Examples include:
• Photo classification, where a given photo may have multiple objects in the
scene and a model may predict the presence of multiple known objects in the
photo, such as "bicycle," "apple," "person," etc.
Popular algorithms:
• Multi-label Decision Trees
• Multi-label Random Forests
• Multi-label Gradient Boosting
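A hedged multi-label sketch: scikit-learn's decision trees accept a binary indicator matrix as the target, so each sample can carry several labels at once; the synthetic data here is purely illustrative.

```python
# Multi-label sketch: each sample may have several labels simultaneously.
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, Y = make_multilabel_classification(n_samples=200, n_classes=4, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, Y_tr)  # Y is an indicator matrix
print(clf.predict(X_te[:1]))   # e.g. [[1 0 1 0]] -> labels 0 and 2 predicted
```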
4. Imbalanced Classification:
• Imbalanced classification refers to classification tasks where the number
of examples in each class is unequally distributed.
• Examples include:
• Fraud detection.
• Outlier detection.
• Medical diagnostic tests.
Popular algorithms:
• Cost-sensitive Logistic Regression
• Cost-sensitive Decision Trees
• Cost-sensitive Support Vector Machines
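A minimal cost-sensitive sketch, assuming scikit-learn: `class_weight="balanced"` makes errors on the rare class (e.g. fraud) count more heavily; the 95/5 class split is synthetic.

```python
# Cost-sensitive sketch for an imbalanced problem (rare positive class).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```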
Types of Supervised learning
Regression:
➢ Regression algorithms are used to solve regression problems in which a
relationship is learned between the input variables and a continuous
output variable.
➢ These are used to predict continuous output variables, such as
market trends, weather prediction, etc.
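A tiny regression sketch on made-up numbers (the "spend vs. sales" pairing is an invented example, not from the slides):

```python
# Regression sketch: predicting a continuous output with linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])      # e.g. advertising spend (toy numbers)
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])     # e.g. sales (continuous output)

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)             # learned slope and intercept
print(reg.predict([[6]]))                    # predicted value for a new input
```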
Supervised learning Applications
❖ Weather prediction
❖ Speech recognition
❖ Sales forecasting
❖ Medical diagnosis
❖ Stock price analysis
❖ Spam filtering
Supervised Learning Possible Classifiers
Unsupervised learning
➢ It is a learning method in which a machine learns without any
supervision.
➢ Unsupervised learning is a type of machine learning that uses
unlabeled data to train machines.
➢ In unsupervised learning, the models are trained with the data
that is neither classified nor labelled, and the model acts on that
data without any supervision.
➢ Unlabeled data doesn’t have a fixed output variable.
➢ The model learns from the data, discovers the patterns and
features in the data, and returns the output.
Unsupervised learning
❖ The main aim of an unsupervised learning algorithm is to group or
categorize the unsorted dataset according to similarities, patterns,
and differences.
❖ Machines are instructed to find the hidden patterns from the
input dataset.
Unsupervised learning
❖ Unsupervised Learning can be further classified into two types,
❖ Clustering
❖ Association
Unsupervised learning
Clustering
❖ The clustering technique is used when we want to find
the inherent groups from the data.
❖ It is a way to group the objects into a cluster such that
the objects with the most similarities remain in one
group and have fewer or no similarities with the
objects of other groups.
❖ An example of the clustering algorithm is grouping the
customers by their purchasing behavior.
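A minimal clustering sketch along those lines, assuming scikit-learn; the two customer features (annual spend, visits per month) are invented toy numbers.

```python
# Clustering sketch: grouping customers by purchasing behaviour with k-means.
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([[200, 2], [220, 3], [800, 10],
                      [750, 12], [300, 4], [900, 11]])   # [annual spend, visits/month]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment per customer, e.g. [0 0 1 1 0 1]
print(kmeans.cluster_centers_)  # the "typical" customer of each group
```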
Unsupervised learning
Association
❖ Association rule learning is an unsupervised learning
technique, which finds interesting relations among variables
within a large dataset.
❖ The main aim of this learning algorithm is to find the
dependency of one data item on another data item and map
those variables accordingly so that it can generate maximum
profit.
❖ This algorithm is mainly applied in Market Basket analysis,
Web usage mining, continuous production, etc.
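As a hand-rolled sketch of the support and confidence measures behind Apriori-style association rule learning (libraries such as mlxtend provide full implementations), with a handful of invented baskets:

```python
# Tiny association-rule sketch: support and confidence for {bread} -> {butter}.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

n = len(baskets)
support_bread        = sum("bread" in b for b in baskets) / n
support_bread_butter = sum({"bread", "butter"} <= b for b in baskets) / n
confidence = support_bread_butter / support_bread   # ≈ P(butter | bread)

print(f"support({{bread, butter}}) = {support_bread_butter:.2f}")   # 0.50
print(f"confidence(bread -> butter) = {confidence:.2f}")            # 0.67
```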
Application - Unsupervised learning
❖ Customer segmentation:
Based on customer behavior, likes, dislikes, and interests, you
can segment and cluster similar customers into a group.
❖ Recommendation Systems:
Recommendation systems widely use unsupervised learning
techniques for building recommendation applications for
different web applications and e-commerce websites.
Reinforcement learning
❖ It is a feedback-based learning method, in which a learning
agent gets a reward for each right action and gets a penalty for
each wrong action.
❖ Reinforcement learning follows trial and error methods to get
the desired result.
❖ After accomplishing a task, the agent receives a reward. An
example could be training a dog to catch a ball.
❖ If the dog learns to catch a ball, you give it a reward, such as a
biscuit.
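As a minimal sketch of the reward-and-penalty idea (not from the slides), the toy tabular Q-learning loop below teaches an agent to walk right along a short corridor; the states, rewards, and hyperparameters are all invented for illustration.

```python
# Minimal tabular Q-learning sketch on a toy 1-D corridor:
# states 0..4, reward +1 only when the agent reaches state 4.
import random

n_states, actions = 5, [-1, +1]          # move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, eps = 0.5, 0.9, 0.2        # learning rate, discount, exploration

for _ in range(500):                     # episodes of trial and error
    s = 0
    while s != n_states - 1:
        if random.random() < eps:
            a = random.choice(actions)                       # explore
        else:
            a = max(actions, key=lambda act: Q[(s, act)])    # exploit
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == n_states - 1 else 0.0               # reward for success
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2

# After training, the greedy policy should point right (+1) in every state.
print({s: max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)})
```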
Applications
❖ Reinforcement learning algorithms are widely used in the
gaming industries to build games.
❖ It is also used to train robots to do human tasks.
Popularly used Algorithms
❖ Supervised Learning: Linear Regression, Logistic Regression, Support
Vector Machines, Naive Bayes, Decision Tree, Random Forest
❖ Unsupervised Learning: K-means Clustering, Hierarchical Clustering,
DBSCAN, Apriori
❖ Reinforcement Learning: Q-learning, Deep Q-learning Neural Networks
Steps in developing ML application
❖ Data Collection – Quality data
❖ Data Pre-processing - Preparing/cleaning the Data
❖ Model Selection - Analyze Data/choose algorithm
❖ Model Training - Train the Data
❖ Evaluation - Test Algorithm
❖ Performance Tuning - Use in Application
❖ Prediction - Output
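A minimal scikit-learn sketch mapping these steps onto code; the built-in dataset and the logistic-regression model are stand-ins chosen for illustration, not prescribed by the slides.

```python
# End-to-end sketch: collect -> pre-process -> select model -> train -> evaluate -> predict.
from sklearn.datasets import load_breast_cancer           # 1. data collection (built-in stand-in)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler           # 2. pre-processing
from sklearn.linear_model import LogisticRegression        # 3. model selection
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)                                       # 4. model training
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))   # 5. evaluation
print("prediction for one new sample:", model.predict(X_te[:1]))  # 7. prediction
```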
Issues of Machine learning
❖ Choosing the right algorithm
❖ Size of the Data set
❖ Poor quality of Data
❖ Implementation speed
❖ Lack of skilled Resources
Bias/ Variance

❖ Bias: the difference between the predicted value and the actual value.
❖ High bias: the difference is large.
❖ Variance: how much the predicted values are scattered with respect to
each other.
❖ Low variance: the values are not widely scattered; they stay close
together.
Bias/Variance
❖ Bias: It is the error due to the model’s inability to represent the true relationship between input
and output accurately.
❖ When a model performs poorly on both the training and testing data, it has high bias
because the model is too simple, indicating underfitting.
❖ Variance: It is the error due to the model’s sensitivity to fluctuations in the training data. It’s the
variability of the model’s predictions for different instances of training data.
❖ High variance occurs when a model learns the training data’s noise and random fluctuations
rather than the underlying pattern.
❖ As a result, the model performs well on the training data but poorly on the testing data,
indicating overfitting
Issues in Machine Learning
❖ 1. Poor Quality of Data
Noisy data, incomplete data, inaccurate data, and unclean
data lead to less accuracy in classification and low-quality
results.
Issues in Machine Learning contd..
2. Overfitting of Training Data
❖ When a machine learning model is trained on data containing noise and inaccurate values, it
starts capturing that noise along with the signal, which negatively affects the performance of the
model.
❖ Overfitting occurs when a machine learning model tries to cover all the data points, or more than
the required data points, present in the given dataset.
❖ An overfitted regression model tries to pass through every data point in the scatter plot. It may
look efficient, but in reality it is not: the goal of the regression model is to find a best-fit line that
generalizes, and a curve that chases every point will generate prediction errors on new data.
❖ The overfitted model has low bias and high variance.
❖ Methods to reduce overfitting:
▪ Cross-Validation
▪ Training with more data
▪ Removing features
▪ Early stopping the training
▪ Regularization
▪ Ensembling
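A hedged sketch combining two of these remedies (cross-validation and regularization) with scikit-learn on synthetic data; the alpha values are arbitrary illustrations.

```python
# Sketch: 5-fold cross-validation to compare regularization strengths.
# Ridge's alpha controls the penalty; a larger alpha gives a simpler model.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=50, noise=10.0, random_state=0)

for alpha in [0.01, 1.0, 10.0]:
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5)   # 5-fold CV, R^2 scores
    print(f"alpha={alpha:>5}: mean R^2 = {scores.mean():.3f}")
```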
3. Underfitting of Training Data
❖ Underfitting occurs when a machine learning model is too simple or is trained
with too little data; as a result, it makes incomplete and inaccurate
predictions and destroys the accuracy of the machine learning model.
❖ This generally happens when we have limited data in the data set and we try
to build a linear model with non-linear data.
❖ Methods to reduce Underfitting:
▪ By increasing the training time of the model.
▪ By increasing the number of features.
Overfitting & Underfitting

Overfitting: the model fits the training set well but may be wrong on the testing set.
Underfitting: the model performs poorly on both the training and the testing set.
Real-life Example of overfitting and
underfitting
❖ Task: to identify whether an object is a ball or not
❖ Parameters:
❖ Sphere – this feature checks whether the object has a spherical shape.
❖ Play – this feature checks whether one can play with it.
❖ Eat – this feature checks whether one cannot eat it.
❖ Radius = 5 cm – this feature checks whether the object's radius is 5 cm or less.
❖ Overfitting case:
❖ If a ball with a 10 cm radius is passed to the classifier, it will be classified as "not a ball",
because the classifier is overly specific about the feature values.
❖ Underfitting case:
❖ If an orange is passed to the classifier, it will be classified as a ball, because the model is
over-generalized and uses too few parameters, i.e. "if the object is spherical, it is a ball".
Example:
Sphere   Play   Eat   Radius   Class
Yes      Yes    No    5        Ball
Yes      Yes    No    3        Ball
Yes      No     Yes   5        Fruit
Yes      No     Yes   10       Fruit
Yes      Yes    No    10       Ball

Underfitting: the model is built on two parameters only (Sphere & Radius), so when a fruit is passed
to the model it may classify it as a ball.
Overfitting: the model is specific to all parameter values (Yes, Yes, No, 5), so when a ball with
radius 10 is passed it may classify it as a fruit.
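The same under/overfitting trade-off can be shown numerically by varying model complexity; the sketch below (synthetic data, scikit-learn assumed) fits polynomials of increasing degree and compares training and test error.

```python
# Sketch: degree 1 underfits the curved signal; very high degrees can chase noise.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)   # noisy non-linear data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(f"degree {degree:>2}: "
          f"train MSE = {mean_squared_error(y_tr, model.predict(X_tr)):.3f}, "
          f"test MSE = {mean_squared_error(y_te, model.predict(X_te)):.3f}")
```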
Issues in Machine Learning contd..
❖ 4. Lack of Training Data
❖ We need to ensure that machine learning algorithms are trained
with a sufficient amount of data.
❖ 5. Imperfections in the Algorithm When Data Grows
❖ As the data grows and its distribution changes, an algorithm that performed
well earlier can degrade, so you need regular monitoring and maintenance to
keep the algorithm working. This is one of the most exhausting issues faced
by machine learning professionals.
How to choose the right Algorithm?
❖ Understand the project goal
❖ Type of Dataset
❖ Nature of the problem
❖ Nature of the Algorithm
❖ Performance comparison
How to choose the right Algorithm?
Understand the project goal: what kind of an output do you need?
❖ Do you need an algorithm for prediction based on the previous
data?
Turn to supervised forecasting algorithms.
❖ Are you looking for an image recognition model that will work
with poor-quality photos?
Dimensionality reduction in combination with classification will
help you with it.
❖ Do you need to teach your model to play a new game?
A reinforcement algorithm will be your best bet.
How to choose the right Algorithm?
Type of Dataset
❖ Numerical data set
SVM, Logistic regression, Decision tree , etc.
❖ Image Data set
CNN (Convolutional Neural Network)
❖ Speech/Text Data Set
RNN (Recurrent Neural Network)
How to choose the right Algorithm?
Analyze Your Dataset
❖ What is your data like?
❖ Is it raw, just collected from wherever, and requires processing?
❖ Is it biased, dirty, and unstructured?
❖ Do you have enough data or is additional collecting (or even
collecting from scratch) required?
❖ Do you need to spend time preparing your data for the training
process or are you good to go?
✓ Supervised algorithm
✓ UnSupervised algorithm
How to choose the right Algorithm?
Performance Comparison -Evaluate the Speed and Training Time
❖ Do you need it fast even if it means lower quality of training
(and, respectively, predictions)?
❖ Can you allocate the required time for proper training?
Hypothesis Testing
o Hypothesis Testing is a type of statistical analysis in
which you put your assumptions about a population
parameter to the test.
o It is used to estimate the relationship between 2
statistical variables.
o Example:
❖ A doctor believes that 3D (Diet, Dose, and Discipline) is
90% effective for diabetic patients.
How Hypothesis Testing Works?
❖ An analyst performs hypothesis testing on a statistical sample to present
evidence of the plausibility of the null hypothesis.
❖ Measurements and analyses are conducted on a random sample of the
population to test a theory.
❖ Analysts use a random population sample to test two hypotheses: the null
and alternative hypotheses.
❖ The null hypothesis is typically an equality hypothesis between population
parameters; for example, a null hypothesis may claim that the population
mean return equals zero.
❖ The alternate hypothesis is essentially the inverse of the null hypothesis
(e.g., the population mean return is not equal to zero).
❖ As a result, they are mutually exclusive, and only one can be correct. One of
the two possibilities, however, will always be correct.
Null Hypothesis and Alternate Hypothesis
❖ The Null Hypothesis is the assumption that the event will not occur. A null
hypothesis has no bearing on the study's outcome unless it is rejected.
❖ H0 is the symbol for it, and it is pronounced H-naught.
❖ The Alternate Hypothesis is the logical opposite of the null hypothesis. The
acceptance of the alternative hypothesis follows the rejection of the null
hypothesis.
❖ H1 is the symbol for it.
Let's understand this with an example.
❖ A sanitizer manufacturer claims that its product kills 95 percent of germs on
average.
❖ To put this company's claim to the test, create a null and alternate
hypothesis.
❖ H0 (Null Hypothesis): Average = 95%.
❖ Alternative Hypothesis (H1): The average is less than 95%.
❖ Another straightforward example to understand this concept is
determining whether or not a coin is fair and balanced.
❖ The null hypothesis states that the probability of the coin showing
heads is equal to the probability of it showing tails.
❖ In contrast, the alternative hypothesis states that the probabilities of
heads and tails would be very different.
Types of Hypothesis Testing
❖ Z Test
To determine whether a discovery or relationship is statistically significant, hypothesis testing uses a
z-test. It usually checks to see if two means are the same (the null hypothesis). Only when the
population standard deviation is known and the sample size is 30 data points or more, can a z-test
be applied.
❖ T Test
A statistical test called a t-test is employed to compare the means of two groups. To determine
whether two groups differ or if a procedure or treatment affects the population of interest, it is
frequently used in hypothesis testing.
❖ Chi-Square Test
You utilize a Chi-Square test for hypothesis testing concerning whether your data is as predicted. To
determine if the expected and observed results are well-fitted, the Chi-square test analyzes the
differences between categorical variables from a random sample. The test's fundamental premise is
that the observed values in your data should be compared to the predicted values that would be
present if the null hypothesis were true.
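A minimal sketch of one of these tests in code, assuming SciPy; the two groups of measurements are invented toy numbers.

```python
# Sketch of a two-sample t-test with SciPy:
# H0: the two groups have equal means; a small p-value -> reject H0.
from scipy import stats

group_a = [14.1, 15.0, 13.8, 14.6, 15.2, 14.9]
group_b = [16.0, 15.8, 16.4, 15.9, 16.7, 16.1]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the group means differ significantly.")
else:
    print("Fail to reject H0.")
```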
Hypothesis Testing Formula (Z test)
❖ Z = ( x̅ – μ0 ) / (σ /√n)
❖ Here, x̅ is the sample mean,
❖ μ0 is the population mean,
❖ σ is the population standard deviation,
❖ n is the sample size.
Hypothesis Testing Calculation With Examples (Z test)
❖ Let's consider a hypothesis test for the average height of women in the
United States. Suppose our null hypothesis is that the average height
is 5'4". We gather a sample of 100 women and determine that their
average height is 5'5". The standard deviation of population is 2.
❖ To calculate the z-score, we would use the following formula:
❖ z = ( x̅ – μ0 ) / (σ /√n)
z = (5'5" - 5'4") / (2" / √100)
z = 0.5 / (0.045)
z = 11.11
❖ We will reject the null hypothesis as the z-score of 11.11 is very large
and conclude that there is evidence to suggest that the average height
of women in the US is greater than 5'4".
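The same calculation expressed in code (a sketch: heights converted to inches, with SciPy assumed for the p-value):

```python
# Reproducing the height example as a one-sample z-test (heights in inches).
import math
from scipy.stats import norm

x_bar, mu0, sigma, n = 65.0, 64.0, 2.0, 100   # 5'5" sample mean, 5'4" under H0
z = (x_bar - mu0) / (sigma / math.sqrt(n))    # = 1 / 0.2 = 5.0
p_value = norm.sf(z)                          # one-sided P(Z > z)

print(f"z = {z:.2f}, p = {p_value:.2e}")      # z = 5.00, p ≈ 2.9e-07 -> reject H0
```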
Type 1 and Type 2 Error
❖ A hypothesis test can result in two types of errors.
❖ Type 1 Error: A Type-I error occurs when sample results reject the null hypothesis
despite being true.
❖ Type 2 Error: A Type-II error occurs when the null hypothesis is accepted when it is
false, unlike a Type-I error.
❖ Example:
❖ Suppose a teacher evaluates the examination paper to decide whether a student
passes or fails.
H0: Student has passed
H1: Student has failed
❖ Type I error will be the teacher failing the student [rejects H0] although the student
scored the passing marks [H0 was true].
❖ Type II error will be the case where the teacher passes the student [do not reject
H0] although the student did not score the passing marks [H1 is true].
Steps of Hypothesis Testing
❖ Step 1: Specify Your Null and Alternate Hypotheses
❖ It is critical to rephrase your original research hypothesis (the
prediction that you wish to study) as a null (H0) and alternative (H1)
hypothesis so that you can test it quantitatively. Your first hypothesis,
which predicts a link between variables, is generally your alternate
hypothesis. The null hypothesis predicts no link between the
variables of interest.
❖ Step 2: Gather Data
❖ For a statistical test to be legitimate, sampling and data collection
must be done in a way that is meant to test your hypothesis. You
cannot draw statistical conclusions about the population you are
interested in if your data is not representative.
Steps of Hypothesis Testing contd..
❖ Step 3: Conduct a Statistical Test
❖ A variety of statistical tests are available, but they all compare within-group
variance (how spread out the data is within a category) against between-
group variance (how different the categories are from one another).
❖ If the between-group variation is big enough that there is little or no overlap
between groups, your statistical test will display a low p-value to represent
this. This suggests that the disparities between these groups are unlikely to
have occurred by accident.
❖ Alternatively, if there is a large within-group variance and a low between-
group variance, your statistical test will show a high p-value.
❖ Any difference you find across groups is most likely attributable to chance.
The variety of variables and the level of measurement of your obtained data
will influence your statistical test selection.
Steps of Hypothesis Testing contd…
❖ Step 4: Determine Rejection Of Your Null Hypothesis
❖ Your statistical test results must determine whether your null hypothesis should
be rejected or not. In most circumstances, you will base your judgment on the p-
value provided by the statistical test. In most circumstances, your preset level of
significance for rejecting the null hypothesis will be 0.05 - that is, when there is
less than a 5% likelihood that these data would be seen if the null hypothesis were
true. In other circumstances, researchers use a lower level of significance, such as
0.01 (1%). This reduces the possibility of wrongly rejecting the null hypothesis.
❖ Step 5: Present Your Results
❖ The findings of hypothesis testing will be discussed in the results and discussion
portions of your research paper, dissertation, or thesis. You should include a
concise overview of the data and a summary of the findings of your statistical test
in the results section. You can talk about whether your results confirmed your
initial hypothesis or not in the conversation. Rejecting or failing to reject the null
hypothesis is a formal term used in hypothesis testing. This is likely a must for your
statistics assignments.
Level of Significance

❖ The alpha value is a criterion for determining whether a test statistic is
statistically significant.
❖ In a statistical test, Alpha represents an acceptable probability
of a Type I error. Because alpha is a probability, it can be
anywhere between 0 and 1.
❖ In practice, the most commonly used alpha values are 0.01, 0.05,
and 0.1, which represent a 1%, 5%, and 10% chance of a Type I
error, respectively (i.e. rejecting the null hypothesis when it is in
fact correct).
P-Value
❖ A p-value is a metric that expresses the likelihood that an observed
difference could have occurred by chance. As the p-value decreases the
statistical significance of the observed difference increases. If the p-value is
too low, you reject the null hypothesis.
❖ As an example, suppose you are testing whether a new advertising campaign
has increased the product's sales. The p-value is the likelihood that the
null hypothesis, which states that there is no change in sales due to the
new advertising campaign, is true.
❖ If the p-value is .30, then there is a 30% chance that there is no increase or
decrease in the product's sales. If the p-value is 0.03, then there is a 3%
probability that there is no increase or decrease in the sales value due to the
new advertising campaign.
❖ As you can see, the lower the p-value, the stronger the evidence against the
null hypothesis, i.e. the more support there is for the claim that the new
advertising campaign changed the sales.
Why is Hypothesis Testing Important in ML?
❖ Provides evidence-based conclusions: It allows researchers to make
objective conclusions based on empirical data, providing evidence to
support or refute their research hypotheses.
❖ Supports decision-making: It helps make informed decisions, such as
accepting or rejecting a new treatment, implementing policy
changes, or adopting new practices.
❖ Adds rigor and validity: It adds scientific rigor to research using
statistical methods to analyze data, ensuring that conclusions are
based on sound statistical evidence.
❖ Contributes to the advancement of knowledge: By testing
hypotheses, researchers contribute to the growth of knowledge in
their respective fields by confirming existing theories or discovering
new patterns and relationships.
Limitations of Hypothesis Testing
❖ It cannot prove or establish the truth: Hypothesis testing provides
evidence to support or reject a hypothesis, but it cannot confirm the
absolute truth of the research question.
❖ Results are sample-specific: Hypothesis testing is based on analyzing
a sample from a population, and the conclusions drawn are specific
to that particular sample.
❖ Possible errors: During hypothesis testing, there is a chance of
committing type I error (rejecting a true null hypothesis) or type II
error (failing to reject a false null hypothesis).
❖ Assumptions and requirements: Different tests have specific
assumptions and requirements that must be met to accurately
interpret results.
