AI Unit 4 QA
Machine Learning:
Supervised and unsupervised learning
Decision trees.
Statistical learning models
Learning with complete data - Naive Bayes models.
Ques 1. What are the different types of learning?
Ans : (a) Supervised Learning (b) Unsupervised Learning (c) Reinforcement Learning.
Ques 3. Define training set, positive examples, and negative examples in decision-tree learning.
Ans : These terms are used in decision-tree learning. An example consists of a vector of input attributes X and a single Boolean output y, e.g., a set of examples (X1, Y1), …, (X6, Y6). Positive examples are those in which the goal is true; negative examples are those in which the goal is false. The complete set of examples is called the training set.
Ques 4. Compare the decision-tree method with naïve Bayes learning.
Ans : (i) Naïve Bayes learning is slightly less effective than decision-tree learning on some problems.
(ii) Naïve Bayes learning works well for a wide range of applications as compared to decision trees.
(iii) Naïve Bayes scales well to very large problems: with n Boolean attributes, only 2n + 1 parameters are required.
Inductive Learning: This technique requires the use of inductive inference, a form of invalid but useful inference. We use inductive learning when we formulate a general concept after seeing a number of instances or examples of the concept, e.g., when we learn the concept of color or of sweet taste after experiencing the sensations associated with several objects.
Deductive Learning: This is performed through a sequence of deductive inference steps using known facts. From the known facts, new facts or relationships are logically derived. E.g., if we have the information that the weather is hot and humid, then we can infer that it may also rain. Another example: let P → Q and Q → R; then we can infer that P → R.
Ques 8. Write short notes on the following: (a) Statistical Learning (b) Naïve Bayes Model
Ans : (a) Statistical Learning: The key ideas in this technique are data and hypotheses. Here the data are evidence, i.e., instantiations of some or all of the random variables describing the domain. Bayesian learning calculates the probability of each hypothesis given the data and makes predictions on that basis.
Let D be the data set, with observed value d. Then the probability of each hypothesis hi is obtained by Bayes' rule:
P(hi | d) = α P(d | hi) P(hi), where α is a normalizing constant.
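As an illustration, here is a minimal Python sketch of this Bayesian update over a small discrete hypothesis space. The five hypotheses, their priors, and the candy flavors are invented for the example, not taken from the notes:

# Minimal sketch of Bayesian learning over a discrete hypothesis space.
# Each hypothesis h fixes P(lime) for candies drawn i.i.d. from a bag.
priors = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}
p_lime = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}

def posterior(data):
    """P(h | d) ∝ P(d | h) P(h) for i.i.d. observations."""
    unnorm = {}
    for h, prior in priors.items():
        likelihood = 1.0
        for candy in data:
            likelihood *= p_lime[h] if candy == "lime" else 1.0 - p_lime[h]
        unnorm[h] = likelihood * prior
    z = sum(unnorm.values())           # normalization constant (alpha = 1/z)
    return {h: v / z for h, v in unnorm.items()}

print(posterior(["lime"] * 5))         # posterior shifts toward all-lime hypotheses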
(b) Naïve Bayes Model: This is the most common Bayesian network model used in machine learning. In this model the class variable C (to be predicted) is the root and the attributes Xi are the leaves. The model is called naïve because it assumes that the attributes are conditionally independent of each other, given the class. Once the model has been trained using the maximum-likelihood technique, it can be used to classify new examples for which the class variable C is unobserved. For observed attribute values x1, x2, …, xn, the probability of each class is given by
P(C | x1, …, xn) = α P(C) ∏i P(xi | C)
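The following Python sketch shows this prediction rule on a toy spam/ham task; the class priors, attribute names, and conditional probabilities are all assumptions made for illustration:

import math

# Hedged sketch of naive Bayes prediction: P(C | x1..xn) ∝ P(C) ∏ P(xi | C).
p_class = {"spam": 0.4, "ham": 0.6}
# P(word present | class) for a few Boolean attributes (invented values)
p_attr = {
    "spam": {"offer": 0.7, "meeting": 0.1},
    "ham":  {"offer": 0.1, "meeting": 0.6},
}

def predict(observed):
    """observed: dict attribute -> True/False; returns the most probable class."""
    scores = {}
    for c, prior in p_class.items():
        log_score = math.log(prior)            # log P(C)
        for attr, present in observed.items():
            p = p_attr[c][attr]
            log_score += math.log(p if present else 1.0 - p)
        scores[c] = log_score
    return max(scores, key=scores.get)

print(predict({"offer": True, "meeting": False}))   # -> 'spam'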
Ques 9. What is learning with complete data? Explain maximum-likelihood parameter learning with a discrete model in detail.
Ans : Statistical learning methods begin with the simplest task: parameter learning with complete data. Parameter learning involves finding the numerical parameters for a probability model whose structure is fixed. E.g., in a Bayesian network, the conditional probabilities are learned for a given network structure. Data are complete when each data point contains values for every variable in the learning model.
Maximum Likelihood Parameter Learning: Suppose we buy a bag of lime and cherry candy from a new manufacturer whose lime–cherry proportions are completely unknown; that is, the fraction could be anywhere between 0 and 1. The parameter θ is the proportion of cherry candies, and the corresponding hypothesis is hθ (the proportion of limes is 1 − θ).
If we assume that all proportions are equally likely a priori, then a maximum-likelihood approach is reasonable. If we model the situation with a Bayesian network, we need just one random variable, Flavor (the flavor of a randomly chosen candy from the bag). It has values cherry and lime, where the probability of cherry is θ. Now suppose we unwrap N candies, of which c are cherries and ℓ = N − c are limes. The likelihood of this data set is:
P(d | hθ) = ∏ (j=1..N) P(dj | hθ) = θ^c (1 − θ)^ℓ
The maximum-likelihood hypothesis is given by the value of θ that maximizes this expression. Computing the log likelihood:
L(d | hθ) = log P(d | hθ) = c log θ + ℓ log(1 − θ)
(By taking logarithms, we reduce the product to a sum over the data, which is usually easier to maximize.) To find the maximum-likelihood value of θ, we differentiate L with respect to θ and set the resulting expression to zero:
dL(d | hθ)/dθ = c/θ − ℓ/(1 − θ) = 0, which gives θ = c/(c + ℓ) = c/N
The standard method for maximum-likelihood parameter learning is therefore:
1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.
For models with more than one parameter (e.g., θ, θ1, θ2), we again simplify the likelihood by taking the logarithm to obtain a sum, compute the first-order partial derivatives with respect to θ, θ1, and θ2, and equate them to zero; solving gives the maximum-likelihood values of the parameters.
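A short Python sketch of this recipe for the one-parameter candy model, using made-up counts; it also confirms numerically that the log likelihood peaks at θ = c/N:

import numpy as np

# Sketch of the ML recipe for the candy example: L(θ) = θ^c (1-θ)^ℓ, max at θ = c/N.
c, ell = 30, 70                     # observed cherries and limes (made-up counts)
theta_hat = c / (c + ell)           # closed-form ML estimate
print(theta_hat)                    # 0.3

# Numerical check: log likelihood c·log θ + ℓ·log(1-θ) peaks at the same value.
thetas = np.linspace(0.01, 0.99, 999)
log_lik = c * np.log(thetas) + ell * np.log(1 - thetas)
print(thetas[np.argmax(log_lik)])   # ≈ 0.3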
Ques 10. Write short notes on
(a) Continuous model for Maximum likelihood Estimation
(b) Learning with Hidden Variables.
(c) EM Algorithm.
Ans : (a) Continuous Model for Maximum-Likelihood Estimation: Continuous variables are very common in real-world applications, so it is important to know how to learn continuous models from data. The principles of maximum-likelihood learning are identical to those of the discrete case. Consider learning the parameters of a Gaussian density function on a single variable; that is, the data are generated as follows:
P(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))
The parameters of this model are the mean μ and the standard deviation σ. Let the observed values be x1, x2, …, xN. Then the log likelihood is
L = ∑ (j=1..N) log (1 / (σ√(2π))) e^(−(xj − μ)² / (2σ²)) = N(−log √(2π) − log σ) − ∑j (xj − μ)² / (2σ²)
Setting the first-order partial derivatives with respect to μ and σ equal to zero, we obtain
μ = (∑j xj) / N and σ = √( (∑j (xj − μ)²) / N )
That is, the maximum-likelihood value of the mean is the sample average, and the maximum-likelihood value of the standard deviation is the square root of the sample variance.
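A quick Python sketch of these closed-form estimates on synthetic data:

import numpy as np

# ML fit of a single-variable Gaussian: the ML mean is the sample average,
# the ML σ is the square root of the sample variance (dividing by N, not N-1).
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)    # synthetic observations

mu_ml = x.mean()                                   # μ = (1/N) Σ x_j
sigma_ml = np.sqrt(((x - mu_ml) ** 2).mean())      # σ = sqrt((1/N) Σ (x_j - μ)²)
print(mu_ml, sigma_ml)                             # ≈ 5.0, ≈ 2.0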
(b) Learning with Hidden Variables: Many real-world problems have hidden variables (also called latent variables), which are not observable in the given data samples.
Examples: (i) In medical diagnosis, records mostly consist of symptoms, the treatment used, and the outcome of the treatment, but seldom contain a direct observation of the disease itself.
(ii) In predicting traffic congestion at office hours, a hidden variable might be an unobservable “rainy day” causing much less traffic at peak hours.
Example: Let a Bayesian network for heart disease (a hidden variable) be as given in the figure below. In figure (a), each variable has three possible values and is labeled with the number of independent parameters in its conditional distribution. Figure (b) shows the equivalent network with Heart Disease removed; note that the symptom variables are no longer conditionally independent given their parents. Latent variables can therefore dramatically reduce the number of parameters required to specify a Bayesian network, which in turn reduces the amount of data needed to learn the parameters.
(c) EM Algorithm (Expectation–Maximization Algorithm): This algorithm is used to solve the problems that arise in learning with hidden variables. The basic idea is to pretend that we know the parameters of the model and then infer the probability that each data point belongs to each component; each component is then refitted to the entire data set, with each point weighted by the probability that it belongs to that component. Expectation–maximization is the process used for clustering the data samples.
For given data, EM has the ability to predict feature values for each class on the basis of the classification of examples, by learning the theory that specifies it. It starts with a random theory and randomly classified data and then repeats two steps: compute the expected values of the hidden variables for each example, then recompute the parameters using the expected values as if they were observed values. Let X be the observed values in all examples, Z the set of all hidden variables, and θ all the parameters of the probability model (for a Gaussian mixture, θ = {μ, Σ}).
E-step: Compute the summation, i.e., the expectation of the log likelihood of the completed data with respect to P(Z = z | x, θ⁽ⁱ⁾), the posterior over the hidden variables.
M-step: Find the new values of the parameters that maximize this expected log likelihood:
θ⁽ⁱ⁺¹⁾ = argmax_θ ∑z P(Z = z | x, θ⁽ⁱ⁾) L(x, Z = z | θ)
The EM algorithm increases the log likelihood of the data at every iteration, and under certain conditions it can be proven to reach a local maximum in likelihood. In this sense EM resembles a gradient-based hill-climbing algorithm.
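Below is a minimal Python sketch of the E-step/M-step loop for a two-component, one-dimensional Gaussian mixture, a standard illustration of EM; the data set, initial guesses, and iteration count are synthetic assumptions:

import numpy as np

# Minimal EM sketch for a two-component 1-D Gaussian mixture (synthetic data).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 200)])

w = np.array([0.5, 0.5])       # mixing weights
mu = np.array([-1.0, 1.0])     # initial means (a "pretended" guess)
sigma = np.array([1.0, 1.0])

def gauss(x, m, s):
    return np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    r = w * gauss(x[:, None], mu, sigma)           # shape (N, 2)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters with points weighted by responsibility
    n_k = r.sum(axis=0)
    w = n_k / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)

print(w, mu, sigma)    # ≈ [0.6, 0.4], [0, 5], [1, 1]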
Ques 11. Explain the reinforcement learning technique in detail. Also mention its applications in the field of artificial intelligence.
Ans : Reinforcement Learning: This type of learning is used when an agent must learn with no teacher telling it what action to take in each circumstance.
Example 1: Consider a chess-playing agent trained by supervised learning on examples of game situations along with the best moves for those situations. The agent can also try random moves, so it can eventually build a predictive model of its environment. The issue is that without some feedback about what is good and bad, the agent has no grounds for deciding which move to select. The agent needs to know that something good has happened when it wins and that something bad has happened when it loses. This kind of feedback is called a reward, or reinforcement.
A General Learning Model of Reinforcement Learning:
Policy: This defines the learning agent's behavior at a particular time. It is a mapping from perceived states of the environment to the actions to be taken when in those states. A policy can be a simple function, a lookup table, or even a search process.
Reward Function: This is used to define the goal. It maps each perceived state–action pair of the environment to a single number, a reward, that indicates the desirability of that state. The objective is to maximize the total reward received in the long run. Reward functions may be stochastic.
Value Function: Whereas the reward function indicates what is good in an immediate sense, the value function specifies what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
Model: This represents the behavior of the environment. Models are used for planning, i.e., a way of deciding on a course of actions by considering possible future situations.
Application areas of Reinforcement learning are as mentioned below:
1) The most recent version of DeepMind's AI system for playing Go means that interest in reinforcement learning (RL) is bound to increase.
2) RL requires a lot of data, and as such, it has often been associated with domains where simulated
data is available (gameplay, robotics).
3) Automation of well-defined tasks that would benefit from sequential decision-making; RL can help automate such tasks (or at least augment a human expert).
4) Industrial automation is another promising area. It appears that RL technologies from
DeepMind helped Google significantly reduce energy consumption (HVAC) in its own data centers.
5) The use of RL can lead to training systems that provide custom instruction and materials tuned to the
needs of individual students. A group of researchers is developing RL algorithms and statistical
methods that require less data for use in future tutoring systems.
6) Many RL applications in health care mostly pertain to finding optimal treatment policies.
7) Companies collect a lot of text, and good tools that can help unlock unstructured text will find users.
8) RL has been used as a technique for automatically generating summaries from text, based on content “abstracted” from some original text document.
9) A Financial Times article described an RL-based system for optimal trade execution. The system
(dubbed “LOXM”) is being used to execute trading orders at maximum speed and at the best
possible price.
10) Many warehousing facilities used by e-commerce sites and other supermarkets use intelligent robots for sorting millions of products every day and helping to deliver the right products to the right people. Tesla's factory, for example, comprises more than 160 robots that do a major part of the work on its cars, reducing the risk of defects.
11) Reinforcement learning algorithms can be built to reduce transit time for stocking as well as retrieving products in the warehouse, optimizing space utilization and warehouse operations.
12) Reinforcement learning and optimization techniques are utilized to assess the security of electric power systems and to enhance microgrid performance. Adaptive learning methods are employed to develop control and protection schemes.
Ques 12. Discuss the various types of reinforcement learning techniques.
Ans : Reinforcement learning is of the following three types:
(a) Passive Reinforcement Learning (b) Temporal Difference Learning (c) Active Reinforcement Learning.
Passive Reinforcement Learning: In this technique the agent's policy is fixed and the task is to learn the utilities of states (or state–action pairs). If the policy is π and the state is S, the agent always executes the action π(S). The goal is to learn how good the policy is, i.e., to learn the utility function Uπ(S). The passive learning agent does not know the transition model T(S, a, S′), which specifies the probability of reaching state S′ from state S after action a; nor does it know the reward function R(S). The utility is defined to be the expected sum of discounted rewards obtained if policy π is followed:
Uπ(S) = E[ ∑ (t=0..∞) γᵗ R(Sₜ) | π, S₀ = S ], where γ is a discount factor.
Temporal Difference Learning: When a transition occurs from state S to state S′, we update Uπ(S) as
Uπ(S) ← Uπ(S) + α ( R(S) + γ Uπ(S′) − Uπ(S) ), where α is the learning-rate parameter.
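A minimal Python sketch of this update rule on one illustrative transition; the state names, reward, and step sizes are assumptions:

# Sketch of the TD(0) utility update U(S) ← U(S) + α(R(S) + γ·U(S') − U(S)),
# applied after each observed transition under the agent's fixed policy.
def td_update(U, s, s_next, reward, alpha=0.1, gamma=0.9):
    """One temporal-difference update of the utility estimate for state s."""
    u_s = U.get(s, 0.0)
    u_next = U.get(s_next, 0.0)
    U[s] = u_s + alpha * (reward + gamma * u_next - u_s)

U = {}
td_update(U, "A", "B", -0.04)   # agent observes transition A -> B, reward -0.04
print(U)                        # {'A': -0.004}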
Active Reinforcement Learning: Here the agent must also decide which actions to take, rather than simply following a fixed policy. To cope with large state spaces, function approximation is used: the compression achieved by a function approximator allows the learning agent to generalize from states it has visited to states it has not visited. E.g., an evaluation function for chess can be represented as a weighted linear function of a set of features (basis functions) f1, f2, …, fn:
Û(s) = θ1 f1(s) + θ2 f2(s) + … + θn fn(s)
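A tiny Python sketch of such a weighted linear evaluation function; the feature meanings and weight values are hypothetical:

# Weighted linear evaluation Û(s) = Σ θ_i · f_i(s), used to generalize across states.
def linear_eval(features, weights):
    """Utility estimate as a dot product of feature values and weights."""
    return sum(w * f for w, f in zip(weights, features))

features = [2.0, 0.5]    # e.g. f1 = material advantage, f2 = mobility (assumed)
weights = [1.0, 0.3]     # θ parameters, adjusted by the learning algorithm
print(linear_eval(features, weights))   # 2.15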
Ques 13. Explain decision-tree learning with an example. Mention its application areas.
Ans : Example: Consider a decision tree learned to predict whether a potential customer will respond to a direct mailing. Given this classifier, the analyst can predict the response of a potential customer (by sorting it down the tree) and understand the behavioral characteristics of the entire population of potential customers regarding direct mailing. Each node is labeled with the attribute it tests, and its branches are labeled with the attribute's corresponding values. For example, one of the paths in the figure below can be converted into the rule: “If the customer's age is less than or equal to 30, and the customer is male, then the customer will respond to the mail.”
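This single path can be written directly as code; a minimal Python sketch with assumed attribute names:

# Hand-coded sketch of the one tree path quoted above:
# "age <= 30 and gender == 'Male' -> responds".
def will_respond(age, gender):
    """Follow one path of the direct-mailing decision tree."""
    if age <= 30:
        if gender == "Male":
            return True    # leaf: customer responds to the mail
    return False           # other paths of the tree omitted in this sketch

print(will_respond(25, "Male"))   # True
print(will_respond(40, "Male"))   # False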
Application Areas of Decision Tree Learning
1) Variable selection: The number of variables that are routinely monitored in clinical settings has
increased dramatically with the introduction of electronic data storage. Many of these variables are of
marginal relevance and, thus, should probably not be included in data mining exercises.
2) Handling of missing values: A common, but incorrect, method of handling missing data is to exclude cases with missing values; this is both inefficient and runs the risk of introducing bias into the analysis. Decision tree analysis can deal with missing data in two ways: it can either classify missing values as a separate category that can be analyzed alongside the other categories, or build a decision-tree model that sets the variable with many missing values as the target variable, makes a prediction, and replaces the missing values with the predicted value.
3) Prediction: This is one of the most important uses of decision tree models. Using the tree model derived from historical data, it is easy to predict the result for future records.
4) Data manipulation: Too many categories of one categorical variable or heavily skewed continuous
data are common in medical research.
Ques 14. Write short notes on the following: (A) Regression Trees (B) Bayesian Parameter Learning.
Ans : Regression Trees: Regression trees are commonly used to solve problems where the target variable is numerical/continuous instead of discrete. Regression trees possess the following properties:
a) Leaf nodes predict the average value of all instances reaching that leaf.
b) Splitting criterion: minimize the variance of the values in each subset Si.
c) Standard Deviation Reduction: SDR(A, S) = SD(S) − ∑i ( |Si| / |S| ) · SD(Si)
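A small Python sketch of the SDR computation on made-up values:

import statistics

# SDR(A, S) = SD(S) − Σ_i (|S_i| / |S|) · SD(S_i), for a candidate split on A.
def sdr(parent, subsets):
    """Standard deviation reduction achieved by splitting `parent` into `subsets`."""
    n = len(parent)
    weighted = sum(len(s) / n * statistics.pstdev(s) for s in subsets)
    return statistics.pstdev(parent) - weighted

S = [10, 12, 30, 32, 11, 31]
split = [[10, 12, 11], [30, 32, 31]]   # a candidate split on attribute A
print(sdr(S, split))                   # large reduction -> good split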
Bayesian Parameter Learning: This learning technique treats the parameters as random variables having some prior distribution. An optimal classifier can be designed using the class-conditional densities p(x | ωi). In a typical case we merely have some vague knowledge about the situation, together with a number of training samples. Observation of the samples converts the prior into a posterior density and revises our estimates of the true values of the parameters. In Bayesian learning the posterior density function is sharpened, causing it to peak near the true values.
• We assume the priors are known: P(ωi | D) = P(ωi).
• We also assume functional independence, so each class can be treated separately. Classification then uses
P(ωi | x, D) = p(x | ωi, D) P(ωi) / ∑ (j=1..c) p(x | ωj, D) P(ωj)
• Any information we have about θ prior to collecting the samples is contained in the prior density p(θ).
• Observation of the samples converts this into a posterior density p(θ | D), which we hope is sharply peaked around the true value of θ.
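As an illustrative sketch (not from the notes), the Bernoulli/Beta case shows this sharpening concretely: with a Beta prior on a Bernoulli parameter θ, the posterior after observing counts is again a Beta density that peaks near the true value. The prior values and counts below are assumptions:

# Bayesian parameter learning for a Bernoulli θ with a conjugate Beta prior:
# posterior = Beta(a + c, b + ℓ), which sharpens as samples accumulate.
a, b = 1.0, 1.0            # Beta(1,1) = uniform prior over θ (assumed)
c, ell = 30, 70            # observed counts (e.g. cherries and limes, made up)

a_post, b_post = a + c, b + ell
posterior_mean = a_post / (a_post + b_post)
posterior_mode = (a_post - 1) / (a_post + b_post - 2)   # MAP estimate

print(posterior_mean, posterior_mode)   # ≈ 0.304, 0.300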