CS 188 Spring 2023 Midterm Solutions
First name
Last name
SID
Honor code: “As a member of the UC Berkeley community, I act with honesty, integrity, and respect for others.”
By signing below, I affirm that all work on this exam is my own work, and honestly reflects my own understanding of the
course material. I have not referenced any outside materials (other than one double-sided cheat sheet), nor collaborated
with any other human being on this exam. I understand that if the exam proctor catches me cheating on the exam, that I
may face the penalty of an automatic "F" grade in this class and a referral to the Center for Student Conduct.
Signature:
Point Distribution
Q1. [16 pts] Search: Project Groups
You’re taking a class with 𝑛 students, and you want to find a 4-person project group (you and 3 other students). You decide to
formulate your problem as a search problem, as follows:
• State space: An unordered set of students in the group so far. For example, {you, Inky, Blinky} and {you, Blinky, Inky}
are equivalent states.
• Successor function: You pick someone from the class (who is not already in your group), and add them to your group.
If your group already has 4 students, there are no further successor states.
• Start state: The project group is just you, and nobody else.
• Goal test: Your group has exactly 4 students, and everyone in the group gets along. (You can assume that the goal test is
somehow able to determine whether a group gets along.)
# (𝑛 − 1)
# (𝑛 − 1)⁴
# $\binom{n-1}{3}$
$\binom{n-1}{0} + \binom{n-1}{1} + \binom{n-1}{2} + \binom{n-1}{3}$
# Infinitely many
Recall that each node of the search graph represents a state, so this question is asking how many states are in this problem.
The state space is a set of people in the group so far. This set could contain 1 person (you), or 2 people (you and one
other), or 3 people (you and two others), or 4 people (you and three others). Therefore, we need to count up how many
possible groups (always including yourself) of size up to 4 could be made from the class of 𝑛 students.
There is $\binom{n-1}{0} = 1$ possible 1-person group: it’s the group with just you.
There are $\binom{n-1}{1} = n - 1$ possible 2-person groups: you, plus any of the 𝑛 − 1 other people in class.
There are $\binom{n-1}{2}$ possible 3-person groups: you, plus some (unordered) group of 2 people among the 𝑛 − 1 others in class.
There are $\binom{n-1}{3}$ possible 4-person groups: you, plus some (unordered) group of 3 people among the 𝑛 − 1 others in class.
Note that there are no states with groups with more than 4 people. The successor function will not generate groups with
more than 4 people.
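As a quick sanity check, the closed-form count can be verified by brute force for small class sizes. The sketch below (Python, with illustrative helper names; the choice of small values of 𝑛 is ours, not part of the exam) enumerates every reachable group and compares the total against the sum of binomial coefficients above.

from itertools import combinations
from math import comb

def count_states(n):
    """Enumerate every reachable state: a group containing you plus 0 to 3 classmates."""
    classmates = range(n - 1)
    return sum(1 for k in range(4) for _ in combinations(classmates, k))

def closed_form(n):
    return sum(comb(n - 1, k) for k in range(4))   # C(n-1,0) + C(n-1,1) + C(n-1,2) + C(n-1,3)

for n in range(4, 10):
    assert count_states(n) == closed_form(n)
print("counts match for n = 4..9")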
(c) [2 pts] What is the maximum branching factor for this search tree?
(𝑛 − 1)
# (𝑛 − 1)⁴
# $\binom{n-1}{3}$
# $\binom{n-1}{0} + \binom{n-1}{1} + \binom{n-1}{2} + \binom{n-1}{3}$
# Could be infinite
The branching factor represents the number of actions available from a state.
From the root (just you in the group), there is a choice of 𝑛 − 1 people to add to your group. Therefore, there are 𝑛 − 1
possible actions from the root, and the branching factor from the root is 𝑛 − 1.
Future nodes deeper in the tree will have fewer choices of students to add to your group. For example, at depth 1, where
the group has two people (you and one other), there is only a choice of 𝑛 − 2 people to add to your group. Therefore, the
branching factors of nodes deeper in the tree will be less than 𝑛 − 1, and the maximum branching factor is from the node
at the root.
(d) [2 pts] What is the maximum depth of this search tree, assuming the root node is at depth 0?
# 1
# 2
3
# 𝑛−1
# (𝑛 − 1)⁴
# Could be infinite
Each extra layer of the tree corresponds to an additional action, so the maximum depth of the tree corresponds to the
maximum number of actions that could be taken in this problem.
It takes 3 actions to create a group of 4 people, and after that, there are no more actions or successor states available.
Now, suppose that you want 𝑘 students in your project group (𝑘 < 𝑛).
Also, suppose that any group of 𝑘 students passes the goal test.
(e) [2 pts] As 𝑛 and 𝑘 grow very large, which search algorithm will expand fewer nodes to find a solution?
Depth-first tree search
# Breadth-first tree search
# Not enough information
In this problem, all the solutions to the problem are at depth 𝑘 − 1 in the search tree, because it takes 𝑘 − 1 actions to build
a group of 𝑘 students.
Also, any node at depth 𝑘 − 1 is a solution to the problem, because any group of 𝑘 students is a valid solution.
BFS will look through all the nodes at depth 1 (groups of 2), then all the nodes at depth 2 (groups of 3), then all the nodes
at depth 3 (groups of 4), etc. before ever looking at a node at depth 𝑘 − 1 (group of 𝑘).
On the other hand, DFS will repeatedly expand the deepest node, expanding a single depth-0 node, then a single depth-1
node, then a single depth-2 node, etc. until it pops the first depth 𝑘 − 1 node off the fringe and returns it as a solution.
(f) [2 pts] Approximately how many nodes will be expanded by the search algorithm you selected before it finds a solution?
𝑂(𝑘)
# 𝑂(𝑛)
# 𝑂(𝑛𝑘)
# 𝑂(𝑘ⁿ)
# 𝑂(𝑛ᵏ)
As discussed in the previous subpart, DFS will expand one node at each depth before finding a solution. The depth of the
tree is 𝑂(𝑘), so that’s how many nodes DFS will expand.
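To illustrate the gap concretely, here is a small experiment (a sketch, not part of the official solution) that runs BFS and DFS tree search on the group-formation problem where any group of 𝑘 students passes the goal test, and counts expanded nodes. The values n = 12, k = 5 are arbitrary.

from collections import deque

def successors(group, n):
    """Add any classmate (labeled 1..n-1) not already in the group; student 0 is 'you'."""
    return [group | {s} for s in range(1, n) if s not in group]

def count_expansions(n, k, use_dfs):
    fringe = deque([frozenset({0})])      # start state: a group containing just you
    expanded = 0
    while fringe:
        group = fringe.pop() if use_dfs else fringe.popleft()
        if len(group) == k:               # any group of k students passes the goal test
            return expanded
        expanded += 1
        fringe.extend(successors(group, n))
    return None

n, k = 12, 5
print("BFS expansions:", count_expansions(n, k, use_dfs=False))   # grows quickly with n and k
print("DFS expansions:", count_expansions(n, k, use_dfs=True))    # about k - 1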
Now, suppose that some pairs of students in the class cannot work together because they dislike each other. We’d like to modify
our search problem definition to account for this.
(g) [4 pts] Which of the following changes would, if implemented individually, ensure that any complete search algorithm
still finds a correct solution, if one exists?
■ Modify the goal test to check that no such pairs exist in the state being tested.
□ Modify the successor function so that it only adds students who can work with anybody.
■ Modify the successor function so that it only adds students who can work with anybody already in the group.
□ Require that states be ordered lists rather than sets.
# None of the above. A search algorithm cannot solve this problem. You would need a CSP algorithm.
The additional information about which students cannot work together could be added to the goal test (first option) or the
successor function (third option).
The second option fails because it will prevent the algorithm from finding some solutions. For example, if half the class
refuses to work with the other half and vice versa, a solution still exists in either half if it is big enough. However, the
successor function will be unable to generate any new successors, because everybody in the class has at least one person
they cannot work with (i.e. there is no person that can work with anybody else in class).
Making states ordered lists does not encode any information about which pairs of students cannot work together, so the
last option is false.
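A minimal sketch of the third option is below; it assumes dislikes are given as a set of unordered pairs of students, which is a modeling choice rather than anything specified in the exam.

def successors(group, all_students, dislikes):
    """Yield groups formed by adding one student who gets along with
    everyone already in the group (the third option above)."""
    if len(group) >= 4:
        return                                    # groups are capped at 4 students
    for s in all_students:
        if s in group:
            continue
        if all(frozenset({s, member}) not in dislikes for member in group):
            yield group | {s}

# tiny example: student 0 is 'you'; students 1 and 2 dislike each other
dislikes = {frozenset({1, 2})}
for succ in successors(frozenset({0, 1}), range(5), dislikes):
    print(sorted(succ))      # [0, 1, 3] and [0, 1, 4]; student 2 is never added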
Q2. [18 pts] Search: Color a Dinosaur
Consider a computer screen with a grid of squares.
We start with a blank grid and the mouse cursor in the bottom-left square. At each time step, we have several actions to choose
from, each costing 1:
• Move the mouse cursor to an adjacent square (north, south, east, west).
• Click the mouse to fill in the currently selected square.
For example, starting from the bottom-left, the action sequence East, East, Click, North, Click produces the grid in Figure 1(a).
Figure 1: (a) Configuration after East, East, Click, North, Click; the arrow denotes the mouse cursor. (b) A possible goal configuration; the cursor location is ignored in the goal test. (c) Another possible goal configuration.
For each piece of information listed below, select the part of the search problem definition where that information should be
stored.
The information about how the world looks when the search problem starts is encoded in the start state.
The current location of the mouse cursor changes depending on the sequence of actions we’ve taken so far. This belongs
in the state space.
The list of squares colored so far changes depending on the sequence of actions we’ve taken so far. This belongs in the
state space.
# State space
Goal test
# Successor function
# Start state
In our goal test, we need to check whether the current state (the set of squares colored so far) matches the desired picture (the original list of the squares that need to be colored).
For each heuristic provided, select whether it is admissible, and whether it is consistent.
We need to click at least this many more times, so the true cost must be at least as great as this (since we also need to
add extra actions to move the mouse around).
This heuristic drops by at most 1 after each action; each action costs 1; so the triangle inequality is satisfied and the
heuristic is consistent.
In other words, the difference in heuristics between two adjacent states will never exceed 1. Therefore, the difference in
heuristics between two adjacent states will never overestimate the true cost between those two states (which is always 1).
If there is one unfinished square and the cursor is already in the right location, then the true cost is 1 but the heuristic says
2, so it is not admissible. Since consistent implies admissible, it cannot be consistent either.
(g) [3 pts] ℎ3 = maximum Manhattan distance between any two unfinished squares
We need to at least cover the maximum distance between any two unfinished squares. Therefore this is an underestimate
of the true cost and admissible.
The true action cost is 1. The maximum Manhattan distance between any pair of unfinished squares could drop drastically
when a square is finished, which means that this heuristic might overestimate the true cost of an action and does not satisfy
the triangle inequality.
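A concrete instance of this violation, assuming squares are represented as (x, y) grid coordinates and ℎ3 is computed exactly as defined above (a sketch, not part of the official solution):

def h3(unfinished):
    """Maximum Manhattan distance between any two unfinished squares (0 if fewer than two)."""
    return max((abs(ax - bx) + abs(ay - by)
                for (ax, ay) in unfinished for (bx, by) in unfinished), default=0)

# two unfinished squares 100 apart; the cursor is sitting on one of them
unfinished = {(0, 0), (100, 0)}
print(h3(unfinished))                   # 100
print(h3(unfinished - {(0, 0)}))        # 0 after a single Click, which only costs 1
# the heuristic drops by 100 > 1 in one step, so the triangle inequality fails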
□ ℎ1 dominates ℎ3 .
□ ℎ3 dominates ℎ1 .
■ Neither ℎ1 nor ℎ3 dominates the other.
■ If ℎ1 and ℎ3 are admissible, ℎ∗ dominates max(ℎ1 , ℎ3 ).
# None of the above.
Example where ℎ1 > ℎ3 : Consider a 4x4 block that still needs to be fully colored. Then ℎ1 = 16, but ℎ3 = 6.
Example where ℎ3 > ℎ1 : Suppose we still need to color two squares, spaced 100 squares apart from each other. Then
ℎ1 = 2, but ℎ3 = 100.
Because ℎ1 > ℎ3 is not always true, we cannot say that ℎ1 dominates ℎ3 . Similarly, because ℎ3 > ℎ1 is not always true,
we cannot say that ℎ3 dominates ℎ1 .
If ℎ1 and ℎ3 are both admissible, then max(ℎ1 , ℎ3 ) is admissible. By definition, ℎ∗ is the largest admissible heuristic.
Therefore, we have ℎ∗ ≥ max(ℎ1 , ℎ3 ), which means that ℎ∗ dominates max(ℎ1 , ℎ3 ).
□ Greedy tree search with ℎ4 will always find a solution (possibly not optimal) if one exists.
■ Greedy graph search with ℎ4 will always find a solution (possibly not optimal) if one exists.
■ If the goal is the rectangle in Figure 1(c), greedy tree search with ℎ4 will return the optimal solution.
# None of the above.
The first option is false. Consider a case with two unfinished squares, both very far away. Greedy search might move to
a square, and then avoid clicking it (because that would make ℎ4 increase by a lot), and move towards the other square.
Then, once the cursor reaches the other square, greedy search will again avoid clicking this square (which would increase
ℎ4 by a lot), and start moving back toward the original square. The path explored by greedy search will move back and
forth between the two squares forever and never terminate.
If the goal is to draw a rectangle, this greedy search will immediately move the mouse to the nearest unfinished square,
and then click on that square. Then, this search will move to an adjacent unfinished square (we know there has to be
an adjacent unfinished square because this is a rectangle), and click that square. This search will repeatedly move to an
adjacent unfinished square and click that square, moving around the perimeter of the rectangle until the entire rectangle
is drawn. This strategy of moving to the closest unfinished square, then clicking around the rectangle, happens to be the
optimal solution.
For any general picture, this greedy search may not find the optimal solution. For example, with a solid rectangle with 𝐾
squares, the optimal solution is the distance to the rectangle plus 2𝐾 − 1 actions to color it efficiently (e.g., in concentric
layers). (Side note: We get 2𝐾 − 1 from 𝐾 clicks, and 𝐾 − 1 moves to reach all the unfinished squares, since if you move
efficiently, every move will get you to an unfinished square.) Greedy might color a line across the middle of the rectangle;
then, no matter what, it will eventually have to cross over the already-colored middle line to finish the pattern, which is
inefficient.
Q3. [18 pts] Logic: Map Coloring
We’d like to express the map coloring problem in propositional logic. Consider the following map:
We’d like to assign colors to nodes 𝑋, 𝑌 , and 𝑍 such that no two adjacent nodes have the same color.
Let the proposition symbol 𝑁𝐶 represent node 𝑁 being assigned color 𝐶. For example, 𝑋𝑅 = 𝑡𝑟𝑢𝑒 means that node 𝑋 is colored
Red.
(a) [2 pts] How many symbols are needed to express this problem in propositional logic?
Your answer should be an integer.
9 (3 nodes × 3 colors)
(b) [1 pt] For a general map coloring problem with 𝑛 nodes and 𝑐 colors, how many symbols are needed to express the problem
in propositional logic?
Your answer should be an expression, possibly in terms of 𝑛 and/or 𝑐.
𝑛𝑐
Now we’ll formulate several logical sentences that represent this map coloring problem.
(c) [2 pts] Below is a sentence that represents part of this map coloring problem. Fill in the blanks with ∨ or ∧.
(𝑋𝑅 ______ 𝑋𝐺 ______ 𝑋𝐵 ) ______ (𝑌𝑅 ______ 𝑌𝐺 ______ 𝑌𝐵 ) ______ (𝑍𝑅 ______ 𝑍𝐺 ______ 𝑍𝐵 )
(𝑋𝑅 ∨ 𝑋𝐺 ∨ 𝑋𝐵 ) ∧ (𝑌𝑅 ∨ 𝑌𝐺 ∨ 𝑌𝐵 ) ∧ (𝑍𝑅 ∨ 𝑍𝐺 ∨ 𝑍𝐵 )
(d) [2 pts] What does the sentence in the previous subpart represent? You can answer in 10 words or fewer.
Every node is assigned at least one color.
(e) [2 pts] Write a logical sentence that encodes the constraint “𝑋 and 𝑌 cannot have the same color”. Your sentence does
not need to be in CNF.
¬(𝑋𝑅 ∧ 𝑌𝑅 ) ∧ ¬(𝑋𝐺 ∧ 𝑌𝐺 ) ∧ ¬(𝑋𝐵 ∧ 𝑌𝐵 )
In English: ¬(𝑋𝑅 ∧ 𝑌𝑅 ) says that (𝑋𝑅 ∧ 𝑌𝑅 ), the case where 𝑋 and 𝑌 are both Red, cannot be true. The clauses for “both
can’t be Green” and “both can’t be Blue” follow the same reasoning.
We AND the three clauses together because all three conditions must be true: 𝑋 and 𝑌 can’t both be Red, and they can’t
both be Green, and they can’t both be Blue.
Alternate answer (listing the allowed pairwise assignments to 𝑋 and 𝑌 ):
(𝑋𝑅 ∧ 𝑌𝐺 ) ∨ (𝑋𝐺 ∧ 𝑌𝑅 ) ∨ (𝑋𝑅 ∧ 𝑌𝐵 ) ∨ (𝑋𝐵 ∧ 𝑌𝑅 ) ∨ (𝑋𝐺 ∧ 𝑌𝐵 ) ∨ (𝑋𝐵 ∧ 𝑌𝐺 )
(f) [2 pts] Suppose we’ve additionally added sentences that encode “𝑋 and 𝑍 cannot have the same color,” and “𝑌 and 𝑍
cannot have the same color.”
This problem is still missing some sentence(s). Describe the missing sentence(s) (you can answer in 10 words or fewer):
Note that the sentence in subpart (c) only guarantees that each node takes on at least one color (and not zero colors). It
could be the case that a node takes on two colors, but this sentence is still satisfied: for example, 𝑋𝑅 and 𝑋𝐺 could both
be true. The sentence would still evaluate to true, but this would correspond to 𝑋 being colored both Red and Green,
which isn’t allowed.
Therefore, we need a sentence that says each node can have at most one color (and not more than one color).
(g) [1 pt] How many symbols are assigned to true in an assignment that solves the problem?
Your answer should be an integer.
3
For each of the 3 nodes, exactly one of the symbols (corresponding to one of the colors) will be assigned to true, which
tells us what color is assigned to that node.
Pacman would like to prove the following statement: “If 𝑋 is colored Red in this problem, then 𝑌 will not be colored Red,”
using only a SAT solver.
(h) [3 pts] In addition to the sentences defining the problem from subparts (c), (e), and (f), which other clauses should be part
of the input to the SAT solver to ensure that Pacman gets the right answer? Select all that apply.
■ (𝑋𝑅 )
□ (¬𝑋𝑅 )
■ (𝑌𝑅 )
□ (¬𝑌𝑅 )
□ (¬𝑋𝑅 ∨ ¬𝑌𝑅 )
□ (𝑋𝑅 ∨ 𝑌𝑅 )
# None of the above.
We want to prove that 𝐾𝐵 ⊧ 𝛼, where 𝛼 is (𝑋𝑅 ⟹ ¬𝑌𝑅 ). To do that, we give the SAT solver 𝐾𝐵 ∧ ¬𝛼 and show it’s
unsatisfiable. Now
𝐾𝐵 ∧ ¬𝛼 ≡ 𝐾𝐵 ∧ ¬(𝑋𝑅 ⟹ ¬𝑌𝑅 ) ≡ (𝐾𝐵 ∧ 𝑋𝑅 ∧ 𝑌𝑅 )
so we add both 𝑋𝑅 and 𝑌𝑅 .
Another way to think of this solution is to note that we’re proving the statement via contradiction. In other words, we’re
trying to find a scenario where 𝑋 is colored Red and 𝑌 is also colored Red. This corresponds to the two clauses selected
in the solution. If the SAT solver is unable to find such a contradiction, then we know that the statement is true. However,
if the SAT solver does find a scenario where 𝑋 is Red and 𝑌 is Red, then we’ve found a contradiction and the statement
must be false.
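For this small knowledge base, the refutation recipe can be checked by brute force over all 2⁹ assignments. The sketch below assumes the KB contains the at-least-one-color, at-most-one-color, and all three adjacency constraints described earlier; the helper names are illustrative, not official course code.

from itertools import product

nodes, colors = "XYZ", "RGB"
symbols = [(n, c) for n in nodes for c in colors]          # the 9 proposition symbols

def satisfies_kb(assign):
    """KB: each node has exactly one color, and adjacent nodes differ in color."""
    for node in nodes:
        if sum(assign[(node, c)] for c in colors) != 1:    # at-least-one and at-most-one
            return False
    for a, b in [("X", "Y"), ("X", "Z"), ("Y", "Z")]:      # all three edges of the map
        if any(assign[(a, c)] and assign[(b, c)] for c in colors):
            return False
    return True

# refutation: KB with X_R and Y_R added should be unsatisfiable
satisfiable = any(
    satisfies_kb(assign) and assign[("X", "R")] and assign[("Y", "R")]
    for values in product([False, True], repeat=len(symbols))
    for assign in [dict(zip(symbols, values))]
)
print("KB and X_R and Y_R satisfiable?", satisfiable)      # False, so KB entails (X_R implies not Y_R)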
■ A correct SAT representation of a discrete, finite CSP has exactly the same number of satisfying assignments
as the CSP has distinct solutions.
# None of the above.
The process we used to convert the map coloring problem to a SAT problem can be generalized to all discrete, finite
CSPs. We create symbols representing every possible assignment to every variable. Then we add sentences saying that
each variable can only take on exactly one value. Finally, we write sentences with the symbols that encode the constraints
between variables. We know that it will always be possible to encode constraints with the symbols because in the worst
case, we could explicitly write the constraint by listing out all possible assignments that satisfy the constraint, and then
express this list of possible assignments as a logical sentence.
To represent a SAT problem as a CSP, we make the symbols into variables. Each variable takes on either true or false as
values. Then, we encode the clauses in the SAT problem as constraints in the CSP.
If we use the process described above to represent a CSP as a SAT problem, then each solution to the CSP corresponds
to exactly one satisfying assignment of the SAT problem. In other words, for each satisfying assignment we find, we can
read off that assignment to get exactly one assignment to all the variables that satisfies all constraints, i.e. one solution to
the CSP.
Q4. [15 pts] CSPs: The Zookeeper
You are a newly appointed zookeeper, and your first task is to find rooms for all of the animals.
The zoo has three animals: the Iguana (𝐼), Jaguar (𝐽 ), and Koala (𝐾).
Each animal needs to be assigned to one of four rooms: the North room (N), East room (E), South room (S), or West room (W),
subject to the following constraints:
1. The jaguar cannot share a room with any other animal.
2. The iguana and koala must be in different rooms.
3. The koala can only be in the East room or the South room.
(a) [2 pts] Consider the first constraint: “The jaguar cannot share a room with any other animal.”
Can this constraint be expressed using only binary constraints on the three variables 𝐼, 𝐽 , 𝐾?
# Yes, it can be expressed as 1 binary constraint.
Yes, it can be expressed as 2 different binary constraints.
# Yes, it can be expressed as 4 different binary constraints.
# No, this is necessarily a unary constraint.
# No, this is necessarily a higher-order constraint.
Recall that a binary constraint relates two variables (in this problem, two animals). We can write this constraint as two
different binary constraints: The jaguar cannot share a room with the iguana. The jaguar cannot share a room with the
koala. 𝐽 ≠ 𝐼 and 𝐽 ≠ 𝐾.
(b) [3 pts] Suppose we enforce unary constraints, and then assign the jaguar to the South room. The remaining values in each
domain would be:
Iguana: North, East, South, West
Jaguar: South
Koala: East, South
In the table below, mark each value that would be removed by running forward-checking after this assignment.
Forward checking removes South from the Iguana’s domain and South from the Koala’s domain, since the jaguar cannot share a room with either animal.
(c) [3 pts] Regardless of your answer to the previous subpart, suppose we’ve done some filtering and the following values
remain in each variable’s domain:
Iguana: East, West
Jaguar: South
Koala: East
Are all the arcs in this CSP consistent?
# Yes, all arcs are consistent.
# No, only 𝐾 → 𝐼 is not consistent.
No, only 𝐼 → 𝐾 is not consistent.
# No, 𝐾 → 𝐼 and 𝐼 → 𝐾 are both not consistent.
# No, only 𝐽 → 𝐼 is not consistent.
# No, only 𝐼 → 𝐽 is not consistent.
# No, 𝐽 → 𝐼 and 𝐼 → 𝐽 are both not consistent.
The 𝐼 → 𝐾 arc is not consistent. If we assign Iguana to East, there is no valid assignment to Koala that would satisfy the
constraints (because of constraint 2).
All of the other arcs in this CSP are consistent. We can go through and check the arcs listed as options in this subpart:
𝐾 → 𝐼 is consistent because if we assign East at 𝐾, we can still assign West at 𝐼.
𝐽 → 𝐼 is consistent because if we assign South at 𝐽 , we can still assign East or West at 𝐼.
𝐼 → 𝐽 is still consistent because if we assign East at 𝐼, South is a valid assignment at 𝐽 . Also, if we assign West at 𝐼,
South is a valid assignment at 𝐽 .
For completeness, here are the remaining arcs:
𝐽 → 𝐾 is consistent. If we assign South at 𝐽 , it’s fine to assign East at 𝐾.
𝐾 → 𝐽 is consistent. If we assign East at 𝐾, it’s fine to assign South at 𝐽 .
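These checks can be mechanized. The sketch below (illustrative names, not official course code) tests every arc over the filtered domains from this subpart; the single constraint function works because constraints 1 and 2 together force every pair of animals into different rooms.

def consistent(x, vx, y, vy):
    """Binary constraints: the jaguar rooms alone (J != I, J != K) and I != K."""
    return vx != vy   # in this CSP every pair of animals must end up in different rooms

domains = {"I": {"E", "W"}, "J": {"S"}, "K": {"E"}}

def arc_consistent(x, y):
    """Arc x -> y holds if every value of x has at least one supporting value of y."""
    return all(any(consistent(x, vx, y, vy) for vy in domains[y]) for vx in domains[x])

for x in domains:
    for y in domains:
        if x != y:
            print(f"{x} -> {y}:", "consistent" if arc_consistent(x, y) else "NOT consistent")
# only I -> K comes out NOT consistent, matching the answer above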
The constraints, repeated here for your convenience:
1. The jaguar cannot share a room with any other animal.
2. The iguana and koala must be in different rooms.
3. The koala can only be in the East room or the South room.
(d) [2 pts] Regardless of your answer to the previous subpart, suppose we start over and just enforce the third constraint. Then
the remaining values in each domain are:
Iguana: North, East, South, West
Jaguar: North, East, South, West
Koala: East, South
What does the minimum remaining values (MRV) heuristic suggest doing next?
MRV suggests assigning a value to the Koala next, since its domain has the fewest remaining values (2, compared with 4 for the iguana and the jaguar).
(e) [2 pts] Again, consider the CSP after just enforcing the third constraint:
Which assignment would the least constraining value (LCV) heuristic prefer?
Assign North to Jaguar.
# Assign East to Jaguar.
# LCV is indifferent between these two assignments.
LCV suggests assigning the value that deletes the fewest values in the domains of other variables.
Assuming we use forward-checking to filter, assigning North to Jaguar causes Iguana to lose North as an option. Assigning
East to Jaguar causes both Iguana and Koala to lose East as an option.
Assigning North to Jaguar deletes the fewest values, so LCV gives this assignment higher priority.
Intuitively, North is a less constraining option than East for 𝐽 , because North leaves 𝐾 with two options, while East only
leaves 𝐾 with one option. Both North and East for 𝐽 leave 𝐼 with three options.
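A small sketch of the LCV counting argument, assuming forward checking prunes the assigned value from the Iguana’s and Koala’s domains (the Jaguar conflicts with both animals); the function name is illustrative.

domains = {"I": {"N", "E", "S", "W"},
           "J": {"N", "E", "S", "W"},
           "K": {"E", "S"}}

def pruned_if_jaguar_gets(value):
    """Values forward checking would remove from the other domains if J = value
    (the jaguar conflicts with both the iguana and the koala)."""
    return sum(1 for other in ("I", "K") if value in domains[other])

for value in ("N", "E"):
    print("Jaguar =", value, "prunes", pruned_if_jaguar_gets(value), "value(s)")
# Jaguar = N prunes 1 (North from the iguana); Jaguar = E prunes 2 (East from both)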
Recall that a binary constraint relates two variables (in this problem, two animals). We can write this constraint as three
different binary constraints: ¬(𝐽 = 𝑤 ∧ 𝐼 = 𝑤) and ¬(𝐽 = 𝑤 ∧ 𝐾 = 𝑤) and ¬(𝐾 = 𝑤 ∧ 𝐼 = 𝑤).
In other words, we go through all pairs of variables, and enforce that no pair of variables can both be assigned to West.
There are 3 possible pairs of variables that need to be considered in this problem.
“The iguana and jaguar cannot both be in the West room.”
“The jaguar and koala cannot both be in the West room.”
“The iguana and koala cannot both be in the West room.”
Q5. [16 pts] Games
Consider the following game tree where upward triangle nodes are max nodes, and downward triangle nodes are min nodes:
[Game tree: the root 𝐴 is a min node with max-node children 𝐵 and 𝐶. 𝐵 has min-node children 𝐷 and 𝐸; 𝐶 has min-node child 𝐹 . The terminal children are 𝑌 and 5 under 𝐷, 3 under 𝐸, and 6 and 14 under 𝐹 .]
𝐷 = 4, 𝐸 = 3, 𝐹 = 6
𝐵 = 4, 𝐶 = 6
𝐴=4
(b) [3 pts] What range of values for 𝑌 would cause the minimax value at the root 𝐴 to be equal to 𝑌 ?
Your answer should be an inequality using ≤, such as 𝑌 ≤ 188, or 188 ≤ 𝑌 ≤ 288, or 388 ≤ 𝑌 ≤ 388, or “impossible”
(if there are no such values for 𝑌 ).
3≤𝑌 ≤5
A useful way to work through these questions is to consider different values of 𝑌 , and eventually generalize to different
ranges of 𝑌 . For example, you could start with a very negative value like 𝑌 = −1000 and see if the minimax value at 𝐴
is equal to 𝑌 . Then, you could start generalizing by noting that every value of 𝑌 < 3 actually results in the same behavior
(there was nothing special about −1000 compared to any other negative number; the important thing was that it was less
than all the other terminal node values).
If 𝑌 < 3, then the value at 𝐷 will be equal to 𝑌 . Therefore, we have 𝐷 < 3. Then, the max node at 𝐵 will choose the
branch to 𝐸 (to get value 3) and never choose the branch 𝐷 (which gets value < 3). In summary, if 𝑌 < 3, the branch to
𝑌 will not be chosen, and the value at the root cannot be 𝑌 .
The only way for the max node at 𝐵 to choose the branch to 𝐷 over the branch to 𝐸 is if 𝐷 ≥ 3. Therefore, we at least
have 𝑌 ≥ 3. (In other words, since we ruled out all 𝑌 < 3 values in the previous paragraph, we’ve narrowed down our
range to 𝑌 ≥ 3.)
The next interesting value to consider is 𝑌 = 5, since this is the point at which different branches start getting chosen. (In
other words, 𝑌 = 3, 𝑌 = 3.001, 𝑌 = 4, etc. will not change the branches chosen.) In particular, when 𝑌 exceeds 5, 𝐷
starts to choose the 5 branch instead of the 𝑌 branch. In summary, if 𝑌 > 5, then 𝐷 will choose the 5 branch instead of 𝑌 ,
which prevents 𝑌 from being the value at the root. Therefore, we also have 𝑌 ≤ 5.
At this point, we’ve narrowed down our range to 3 ≤ 𝑌 ≤ 5. The last step is to verify that all the values in this range
actually cause the root value to equal 𝑌 .
From above, we already reasoned that 3 ≤ 𝑌 ≤ 5 will cause the value at 𝐵 to be 𝑌 . We know that 𝐶 = 6, so if 3 ≤ 𝑌 ≤ 5,
the min node at 𝐴 will choose 𝐵 over 𝐶. This confirms that the root value at 𝐴 will be 𝑌 (and not 6 from choosing 𝐶),
and we’re done.
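The range can also be confirmed numerically. The sketch below reconstructs the tree from the values used in this solution (𝐷 = min(𝑌 , 5), 𝐸 = 3, 𝐹 = min(6, 14), 𝐵 = max(𝐷, 𝐸), 𝐶 = 𝐹 , 𝐴 = min(𝐵, 𝐶)) and scans candidate values of 𝑌 ; the scan range is arbitrary.

def root_value(y):
    """Minimax value at the root A, reconstructed from this solution:
    D = min(y, 5); E = 3; F = min(6, 14); B = max(D, E); C = F; A = min(B, C)."""
    d, e, f = min(y, 5), 3, min(6, 14)
    b, c = max(d, e), f
    return min(b, c)

# scan candidate values of Y and keep those where the root value equals Y
matches = [y for y in (x / 2 for x in range(-10, 31)) if root_value(y) == y]
print(min(matches), max(matches))   # 3.0 5.0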
(c) [4 pts] Assume that 𝑌 = 4. If we run minimax with alpha-beta pruning, visiting nodes from left to right, which terminal
nodes will we not visit because of pruning? Select all that apply.
□ 𝑌
□ 5
□ 3
□ 6
□ 14
None of the above (we visit all terminal nodes).
One useful way to work through alpha-beta pruning is to think: if hypothetically, this terminal node had a very positive
or very negative value, would that affect the branches chosen in the tree? If the value at the terminal node could change
the branches chosen, then we have to check that terminal node and we cannot prune it.
We have to visit all the terminal nodes in the left-most branch (𝑌 and 5) to determine the value of 𝐷 first. Without checking
this first set of terminal nodes, we don’t have any bounds on the min or max nodes to prune with.
We have to visit the terminal node 3. This value could have been something very positive like 1000, which would have
changed the value at 𝐵. (𝐵 would have chosen to go to 𝐸 instead of 𝐷.)
We have to visit the terminal nodes 6 and 14, because these could have been very negative values like −1000, which would
have changed the value at 𝐴. (𝐴 would have chosen to go to 𝐶 instead of 𝐵.)
The quickest way to note that this is true is to note that 𝐴 must eventually take on one of the terminal node values, since
we’re just repeatedly applying mins and maxes to the terminal node values.
Therefore, 𝐴 is upper-bounded by the largest of the terminal node values, which happens to be max(𝑌 , 14).
(e) [2 pts] The root of this tree is a min node. You would like to represent this same game with a new tree that has a max
node at the root.
Select all changes that you would need to make in the new tree.
■ Convert all min nodes to max nodes.
■ Convert all max nodes to min nodes.
□ Add a layer of chance nodes to the new tree.
■ Multiply each of the terminal node values by −1.
# None of the above - it is impossible to represent this game with a new tree with a max node at the root.
This game has two agents: one agent is represented by the min nodes, and the other agent is represented by the max nodes.
This game is zero-sum. In a zero-sum game, the min and max agents have exactly opposite utilities. We denote this in
the game tree by writing out the utilities for one player (the max agent) at the terminal nodes. Then, we assume that the
max agent picks higher numbers and the min agent picks lower numbers. This correctly represents the min agent’s utility
because the min agent’s utility is the negative of the numbers at the terminal nodes, and maximizing the negative of the
numbers shown is equivalent to minimizing the numbers shown. As a concrete example: the min agent prefers 3 over 5.
This represents the fact that “3” on the tree is actually worth −3 to the min agent, and “5” on the tree is actually worth −5
to the min agent. Therefore, the min agent prefers the node labeled 3 because it gives the min agent higher utility.
There was nothing special about which agent we chose to be the max agent and which agent we chose to be the min agent;
the important thing was just that their utilities are opposites of each other. We can reinterpret the min agent as a max agent,
and reinterpret the max agent as a min agent, without changing the game being represented. This involves converting all
min nodes to max nodes, and converting all max nodes to min nodes.
In the original tree, the min agent was trying to minimize the numbers in the tree. Because we’ve swapped this agent to a
max agent looking to maximize the numbers in the tree, we can multiply all the numbers in the tree by −1 to preserve the
same ordering of preferences. (As a concrete example: in the original tree, the min agent preferred 3 over 5. In the new
tree, this is now a max agent that prefers −3 over −5.)
Multiplying the numbers by −1 also preserves the same ordering of preferences for the max agent who has now been
swapped into a min agent. In the original tree, the max agent preferred 5 over 3. In the new tree, this is now a min agent
that prefers −5 over −3.
(f) [4 pts] Suppose the max nodes 𝐵 and 𝐶 are replaced with chance nodes that select actions uniformly at random. What
range of values for 𝑌 would cause the value at the root 𝐴 to be equal to 𝑌 ?
Your answer should be an inequality using ≤, such as 𝑌 ≤ 188, or 188 ≤ 𝑌 ≤ 288, or 388 ≤ 𝑌 ≤ 388, or “impossible”
(if there are no such values for 𝑌 ).
3≤𝑌 ≤3
Solution 1:
For 𝑌 > 5, 𝐷 = 5, 𝐵 = 4, and 𝐴 = 4, so there is no value of 𝑌 such that 𝐴 = 𝑌 .
For 𝑌 ≤ 5, 𝐷 = 𝑌 and 𝐵 = (𝑌 +3)∕2. This is always less than 6, so 𝐴 = (𝑌 +3)∕2. For 𝐴 = 𝑌 , we must have 𝑌 = (𝑌 +3)∕2,
from which we obtain 𝑌 = 3.
Solution 2:
First, note that because 𝐵 and 𝐶 are chance nodes, it is no longer the case that 𝐴 is always exactly equal to one of the
terminal node values. In particular, 𝐵 is always going to be the average of 3 and the value at 𝐷. Therefore, we either have
𝐵 = (3 + 5)∕2, if 5 is selected at 𝐷, or we have 𝐵 = (3 + 𝑌 )∕2, if 𝑌 is selected at 𝐷.
In the case of 𝐵 = (3 + 5)∕2 = 4, the value at 𝐴 will also be 4. We’re looking for cases where the value at 𝐴 is equal to
𝑌 , so we check if it’s possible that 𝑌 = 4 here. However, in this case, 𝑌 could not have been 4, because we assumed that
𝐷 chose 5 over 𝑌 (and that would not happen if 𝑌 = 4). You could also just plug 𝑌 = 4 into the tree and notice that the
root value is not 4.
In the case of 𝐵 = (3 + 𝑌 )∕2, the value at 𝐴 could also potentially be (3 + 𝑌 )∕2 (depending on how this value compares
to 6, the value at 𝐶). As before, we check if it’s possible that 𝑌 = (3 + 𝑌 )∕2. Solving this equation gives us 𝑌 = 3.
Plugging this into the tree, we see that 𝐴 indeed chooses 𝐵 = 3 over 𝐶 = 6.
Finally, we can confirm that only one value of 𝑌 works, not a range. If we changed 𝑌 even a little bit to 𝑌 = 3.001, then
the value at 𝐵 (and later, 𝐴) would be the average (3 + 3.001)∕2, which is not equal to 𝑌 = 3.001.
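The same numerical check works for the chance-node version; the sketch below assumes 𝐵 averages its two children and 𝐶’s value is just 𝐹 = 6, matching Solution 2.

def root_value_chance(y):
    """Value at A when B and C are uniform chance nodes (Solution 2):
    D = min(y, 5); B = (D + 3) / 2; C = F = min(6, 14); A = min(B, C)."""
    b = (min(y, 5) + 3) / 2
    c = min(6, 14)
    return min(b, c)

matches = [y for y in (x / 4 for x in range(-20, 41)) if root_value_chance(y) == y]
print(matches)   # [3.0]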
Q6. [17 pts] MDPs: Flying Pacman
Pacman is in a 1-dimensional grid with squares labeled 0 through 𝑛, inclusive, as shown below:
0 1 2 3 4 5 … n-1 n
Pacman’s goal is to reach square 𝑛 as cheaply as possible. From state 𝑛, there are no more actions or rewards available.
At any given state, if Pacman is not in 𝑛, Pacman has two actions to choose from:
• Run: Pacman deterministically advances to the next state (i.e. from state 𝑖 to state 𝑖 + 1). This action costs Pacman $1.
• Fly: With probability 𝑝, Pacman directly reaches state 𝑛. With probability 1 − 𝑝, Pacman is stuck in the same state. This
action costs Pacman $2.
(a) [3 pts] Fill in the blank boxes below to define the MDP. 𝑖 represents an arbitrary state in the range {0, … , 𝑛 − 1}.
𝑠    𝑎    𝑠′    𝑇 (𝑠, 𝑎, 𝑠′ )    𝑅(𝑠, 𝑎, 𝑠′ )
𝑖    Run    𝑖 + 1    1.0    −1
𝑖    Fly    𝑖    1 − 𝑝    −2
𝑖    Fly    𝑛    𝑝    −2
𝑠′ is the resulting successor state after taking an action from a certain state. From state 𝑖, taking Fly either keeps you stuck
in state 𝑖 (row 2), or takes you directly to state 𝑛 (the blank in row 3).
𝑇 (𝑠, 𝑎, 𝑠′ ) represents the probability of starting in 𝑠, taking action 𝑎, and landing in 𝑠′ .
Row 1: The action Run is deterministic, so the probability of being in 𝑖, taking action Run, and landing in 𝑖 + 1 is 100%
or 1.0.
Rows 2-3: Being in 𝑖 and taking action Fly lands you in state 𝑖 with probability 1 − 𝑝, and lands you in state 𝑛 with
probability 𝑝.
Note that 𝑅(𝑠, 𝑎, 𝑠′ ) is negative because we’re trying to minimize costs. Representing costs as negative reward forces an
MDP (which maximizes rewards) to minimize those costs.
As a concrete example: this MDP prefers reward −1 over reward −2 because −1 > −2. This is consistent with the problem
statement, where Pacman would prefer to pay $1 over paying $2.
If we left the rewards positive, this MDP would prefer reward +2 over reward +1. This would be inconsistent with
Pacman’s preference of spending as little money as possible.
Let 𝜋𝑅 denote the policy of always selecting Run, and 𝜋𝐹 denote the policy of always selecting Fly.
Compute the values of these two policies. Your answer should be an expression, possibly in terms of 𝑛, 𝑝, and/or 𝑖.
(b) What is 𝑉 𝜋𝑅 (𝑖)?
𝑖 − 𝑛
This expression represents the expected reward of starting in state 𝑖 and taking action Run indefinitely.
If we start in 𝑖 and Run indefinitely, we will deterministically move to 𝑖 + 1, 𝑖 + 2, … until we arrive in state 𝑛. This takes
𝑛 − 𝑖 Run actions.
Each Run action incurs a reward of −1, and there’s no discount (𝛾 = 1), so the total reward of following this policy is
(−1) ⋅ (𝑛 − 𝑖) = 𝑖 − 𝑛.
(c) [2 pts] What is 𝑉 𝜋𝐹 (𝑖)?
Hint: Recall that the mean of a geometric distribution with success probability 𝑝 is $\sum_{k=1}^{\infty} k(1-p)^{k-1} p = 1/p$.
−2∕𝑝
We’re asked to calculate the expected total reward of starting in state 𝑖 and taking action Fly indefinitely.
At each time step, with probability 𝑝, we land in state 𝑛, and with probability 1 − 𝑝, we stay in the same place. On average,
using the formula given, it will take 1∕𝑝 Fly actions before one succeeds and we reach state 𝑛. Each of those actions costs
−2, and there’s no discount, so the total reward of following this policy is (−2) ⋅ (1∕𝑝) = −2∕𝑝.
Formally, to see how the formula results in 1∕𝑝 Fly actions on average, we can write out all the possible outcomes of
trying Fly indefinitely:
Case I: Our first Fly action succeeds (takes us to 𝑛). This happens with probability 𝑝, and results in a total of 1 action
taken.
Case II: Our first Fly action fails (we’re stuck), but our second Fly action succeeds. This happens with probability (1−𝑝)𝑝,
where the 1 − 𝑝 comes from failing once and the 𝑝 comes from succeeding. This results in a total of 2 actions taken.
Case III: We fail twice, and succeed on our third try. This happens with probability (1 − 𝑝)²𝑝, and results in 3 actions
taken.
Case IV: We fail three times, and succeed on our fourth try. This happens with probability (1 − 𝑝)³𝑝, and results in 4
actions taken.
There are infinitely many cases, corresponding to failing more and more times before we succeed.
In total, the expected number of tries before we succeed comes from multiplying each possible number of actions with
the probability of requiring that number of actions:
[1 ⋅ 𝑝] + [2 ⋅ (1 − 𝑝)𝑝] + [3 ⋅ (1 − 𝑝)²𝑝] + [4 ⋅ (1 − 𝑝)³𝑝] + …
Writing this as an infinite sum actually yields the exact formula provided to us:
$\sum_{k=1}^{\infty} k(1-p)^{k-1} p = 1/p$
This tells us that on average, it takes 1∕𝑝 Fly actions before one succeeds. This should match your intuition: for example,
if Fly succeeds 10% of the time, we expect to try 10 times on average before succeeding once.
Notice that the total reward is independent of 𝑖 (there’s no 𝑖 in the final expression). This should match your intuition that
no matter which state you’re in, the value of taking action Fly indefinitely is the same, because the Fly action behaves
exactly the same regardless of which state you’re in (either you’re stuck in the same place, or you directly reach 𝑛).
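The closed-form value can also be sanity-checked by simulation. The sketch below estimates the total reward of always flying by Monte Carlo (the choice 𝑝 = 0.25 and the trial count are arbitrary) and compares it to −2∕𝑝.

import random

def simulate_always_fly(p, trials=100_000, seed=0):
    """Monte Carlo estimate of the total (undiscounted) reward of flying until reaching n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        while True:
            total -= 2                  # every Fly attempt costs $2
            if rng.random() < p:        # with probability p we land in state n and stop
                break
    return total / trials

p = 0.25
print(simulate_always_fly(p), -2 / p)   # both are approximately -8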
(d) [4 pts] Given the results of the two previous subparts, we can now find the optimal policy for the MDP.
Which of the following are true? Select all that apply. (Hint: consider what value of 𝑖 makes 𝑉 𝜋𝑅 (𝑖) and 𝑉 𝜋𝐹 (𝑖) equal.)
Note: ⌈𝑥⌉ is the smallest integer greater than or equal to 𝑥.
Without looking at the answer choices, you can also use intuition to realize the same point. For low values of 𝑖, the value
of Run is low (because you’re far from the end), and for higher values of 𝑖, the value of Run gets higher (because you’re
closer to the end). However, no matter which 𝑖 you’re at, the value of Fly is the same (as we discussed in the previous
subpart).
——
Mathematically, 𝑉 𝜋𝑅 (𝑖) = 𝑖 − 𝑛 as a function of 𝑖 is increasing linearly; for higher values of 𝑖, 𝑉 𝜋𝑅 (𝑖) increases as well.
You could draw this as an upward-sloping line on a graph of state 𝑖 (x-axis) vs. value (y-axis).
However, 𝑉 𝜋𝐹 (𝑖) = −2∕𝑝 as a function of 𝑖 is a constant (it doesn’t depend on 𝑖). On the same graph of state vs. value,
you’d have a flat line.
The upwards-sloping line crosses the flat line at 𝑖 = 𝑛 − 2∕𝑝.
In summary: for higher states 𝑖 ≥ ⌈𝑛 − 2∕𝑝⌉, we have 𝑉 𝜋𝑅 (𝑖) > 𝑉 𝜋𝐹 (𝑖), which means that it’s better to Run in these states.
For lower states 𝑖 < ⌈𝑛 − 2∕𝑝⌉, we have 𝑉 𝜋𝑅 (𝑖) < 𝑉 𝜋𝐹 (𝑖), which means that it’s better to Fly in these states.
This should match your intuition: if you’re close to the end, Run is better. If you’re very far away, Fly is better.
——
The final thing we have to check is the 𝑝 < 2∕𝑛 condition that the answer choices suggest.
If we plug in 𝑝 = 2∕𝑛, we get 𝑖 = ⌈𝑛 − 2∕(2∕𝑛)⌉ = ⌈𝑛 − 𝑛⌉ = 0.
If we plug in some lower value like 𝑝 = 1∕𝑛, we get 𝑖 = ⌈𝑛 − 2∕(1∕𝑛)⌉ = ⌈𝑛 − 2𝑛⌉ = −2𝑛.
What does this 𝑖 value represent? It represents the first state where Run becomes better than Fly. But if 𝑝 < 2∕𝑛, we start
getting negative values of 𝑖! For example, when 𝑝 = 1∕𝑛, the first state where Run is better than Fly is −2𝑛, which isn’t
a valid state! In other words, for all states greater than −2𝑛, it’s better to Run than Fly. This actually means that for all
states in the MDP (0 through 𝑛 − 1, which are all non-negative and thus greater than −2𝑛), Run is better than Fly.
This should match your intuition: if the probability of succeeding with Fly is way too small, then it’s better to always Run,
no matter where you’re at.
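The threshold can also be recovered by running value iteration directly on this MDP; the sketch below uses arbitrary values 𝑛 = 20 and 𝑝 = 0.25, for which ⌈𝑛 − 2∕𝑝⌉ = 12.

import math

def value_iteration(n, p, iters=5000):
    """Value iteration for the Run/Fly MDP with gamma = 1; states 0..n, V(n) = 0."""
    V = [0.0] * (n + 1)
    for _ in range(iters):
        V = [max(-1 + V[i + 1],                      # Run: pay $1, move to i+1
                 -2 + p * V[n] + (1 - p) * V[i])     # Fly: pay $2, reach n with probability p
             for i in range(n)] + [0.0]
    return V

n, p = 20, 0.25
V = value_iteration(n, p)
policy = ["Run" if -1 + V[i + 1] >= -2 + (1 - p) * V[i] else "Fly" for i in range(n)]
print(policy.index("Run"), math.ceil(n - 2 / p))     # both 12: Fly below state 12, Run from 12 on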
Regardless of your answers to the previous parts, consider the following modified transition and reward functions (which may
not correspond to the original problem). As before, once Pacman reaches state 𝑛, no further actions or rewards are available.
For each modified MDP and discount factor, select whether value iteration will converge to a finite set of values.
(e) [2 pts] 𝛾 = 1
𝑠 𝑎 𝑠′ 𝑇 (𝑠, 𝑎, 𝑠′ ) 𝑅(𝑠, 𝑎, 𝑠′ )
𝑖 Run 𝑖+1 1.0 +5
𝑖 Fly 𝑖+1 1.0 +5
In these questions, it helps to recall that the value of a state is the expected, discounted sum of rewards for starting in that
state and acting optimally. Value iteration is trying to compute a value for every state in the MDP.
Value iteration converges if the value at every state is finite. If the value at some state is infinite, then value iteration will
never converge.
——
In this subpart, 𝑛 is an absorbing state: no matter how you act in this MDP, you eventually have to reach 𝑛 in a finite
number of time steps. Therefore, you only have a limited number of time steps to accumulate reward in this MDP, so the
expected discounted sum of rewards in this MDP is finite.
Intuitively, Pacman is only allowed to move forward, so Pacman will inevitably reach 𝑛 and run out of rewards to collect.
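A quick numerical check (a sketch, with an arbitrary 𝑛 = 6): the values settle at 𝑉 (𝑖) = 5(𝑛 − 𝑖), which is finite for every state, so value iteration converges here.

def value_iteration_forward(n, iters=100):
    """gamma = 1; both actions move from i to i+1 with reward +5; state n is terminal."""
    V = [0.0] * (n + 1)
    for _ in range(iters):
        V = [5 + V[i + 1] for i in range(n)] + [0.0]
    return V

print(value_iteration_forward(6))   # [30.0, 25.0, 20.0, 15.0, 10.0, 5.0, 0.0]: all finite, so VI converges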
(f) [2 pts] 𝛾 = 1
𝑠 𝑎 𝑠′ 𝑇 (𝑠, 𝑎, 𝑠′ ) 𝑅(𝑠, 𝑎, 𝑠′ )
𝑖 Run 𝑖+1 1.0 +5
𝑖 Fly 𝑖 − 1 1.0 +5
In this MDP, Pacman can collect +5 reward forever (for example, by alternating Run and Fly so that it never reaches 𝑛), and there is no discounting, so the values of those states are infinite and value iteration will not converge.