Homework 3 Electronic
VE 492, due June 10th, 2020 at 11:59 pm

1 Nash Equilibria
Compute all (mixed and pure) Nash equilibria for each of the following normal-form games:
Sample Answer:
(T,L),(T,R),((1/3,2/3),(3/4,1/4))
In particular, if there are no mixed equilibria, write:
(T,L),(T,R)
If one player is indifferent among all mixed strategies, write:
((any),(P,1-P))
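For each 2x2 game, a fully mixed equilibrium (if one exists) can be found by imposing indifference: each player mixes so that the other player's pure strategies all earn the same expected payoff. Below is a minimal sketch of that computation; the payoff matrices are placeholders, not the actual games from the figures.

```python
from fractions import Fraction as F

# Hypothetical 2x2 game with placeholder payoffs (NOT the matrices from the
# figures): the row player chooses T or B, the column player chooses L or R.
# A[i][j] is the row player's payoff, B[i][j] the column player's payoff.
A = [[F(3), F(0)],
     [F(1), F(2)]]
B = [[F(1), F(4)],
     [F(3), F(2)]]

# Column player mixes (q, 1-q) over (L, R) so the ROW player is indifferent
# between T and B:  q*A[0][0] + (1-q)*A[0][1] = q*A[1][0] + (1-q)*A[1][1]
q = (A[1][1] - A[0][1]) / (A[0][0] - A[0][1] - A[1][0] + A[1][1])

# Row player mixes (p, 1-p) over (T, B) so the COLUMN player is indifferent
# between L and R:  p*B[0][0] + (1-p)*B[1][0] = p*B[0][1] + (1-p)*B[1][1]
p = (B[1][1] - B[1][0]) / (B[0][0] - B[1][0] - B[0][1] + B[1][1])

print(f"row mixes ({p}, {1 - p}), column mixes ({q}, {1 - q})")
```

If either probability falls outside (0, 1), the game has no fully mixed equilibrium and only the pure equilibria need to be reported; for a degenerate game where one player is indifferent everywhere, the ((any),(P,1-P)) format above applies instead.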
2 Strategies
Consider the following game in matrix form with two players. Payoffs for the row player Izzy are
indicated first in each cell, and payoffs for the column player Jack are second.
        X        Y        Z
S     5, 2    10, 6   25, 10
T    10, 12    5, 6    0, 0
(a) This game has two pure strategy Nash equilibria. What are they (justify your answer)? Of the
two pure equilibria, which would Izzy prefer? Which would Jack prefer?
Sample Answer:
(S,X),(S,Y)
Izzy(S,X)
Jack(S,Y)
(b) Suppose Izzy plays a strictly mixed strategy, where both S and T are chosen with positive
probability. With what probability should Izzy choose S and T so that each of Jack’s three pure
strategies is a best response to Izzy’s mixed strategy?
Sample Answer:
(1/2,1/2)
Note that the former is the probability of S and the latter is the probability of T.
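One way to approach part (b) is to write Jack's expected payoff for each of X, Y, and Z as a function of the probability p that Izzy assigns to S, and solve for the p that equalizes them. A minimal sketch of that check with exact fractions (the variable names are just for illustration):

```python
from fractions import Fraction as F

# Jack's payoffs from the matrix above; rows are Izzy's actions (S, T),
# columns are Jack's actions (X, Y, Z).
jack = {('S', 'X'): F(2),  ('S', 'Y'): F(6), ('S', 'Z'): F(10),
        ('T', 'X'): F(12), ('T', 'Y'): F(6), ('T', 'Z'): F(0)}

def jack_payoffs(p):
    """Jack's expected payoff for each pure strategy when Izzy plays S with
    probability p and T with probability 1 - p."""
    return {a: p * jack[('S', a)] + (1 - p) * jack[('T', a)] for a in 'XYZ'}

# Setting Jack's payoffs for X and Z equal:
#   p*2 + (1 - p)*12 = p*10 + (1 - p)*0
# and solving the single linear equation for p:
num = jack[('T', 'X')] - jack[('T', 'Z')]
den = num + (jack[('S', 'Z')] - jack[('S', 'X')])
p = num / den
# The remaining check is that Y earns the same expected payoff at this p.
print("candidate p =", p, "Jack's expected payoffs:", jack_payoffs(p))
```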
(c) Suppose Jack wants to play a mixed strategy in which he selects X with probability 0.7. With
what probabilities should Jack play actions Y and Z so that both of Izzy’s pure strategies are
best responses to Jack’s mixed strategy?
Sample Answer:
(1/4,3/4)
Note that the former is the probability of Y and the latter is the probability of Z.
(d) Based on your responses above, describe a mixed strategy equilibrium for this game in which
both Jack and Izzy play each of their actions (pure strategies) with positive probability (you can
rely on the quantities computed in the prior parts of this question).
Sample Answer:
((1/4,3/4),(7/10,3/20,3/20))
Note that the first tuple is the strategy of Izzy and the second one is the strategy of Jack.
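To confirm that a candidate fully mixed profile is an equilibrium, it suffices to check that every action a player uses with positive probability achieves that player's maximum expected payoff against the opponent's mix. A minimal verification sketch; the profile passed in at the end is only a placeholder (a uniform mix), not the solution.

```python
from fractions import Fraction as F

# Payoff matrix from above: izzy[i][j] and jack[i][j] for Izzy's action
# i in (S, T) and Jack's action j in (X, Y, Z).
izzy = [[F(5), F(10), F(25)], [F(10), F(5), F(0)]]
jack = [[F(2), F(6), F(10)], [F(12), F(6), F(0)]]

def is_mixed_equilibrium(p_izzy, p_jack):
    """Check that every action played with positive probability is a best
    response to the opponent's mixed strategy."""
    # Izzy's expected payoff for each pure action against Jack's mix.
    izzy_vals = [sum(q * izzy[i][j] for j, q in enumerate(p_jack)) for i in range(2)]
    # Jack's expected payoff for each pure action against Izzy's mix.
    jack_vals = [sum(p * jack[i][j] for i, p in enumerate(p_izzy)) for j in range(3)]
    ok_izzy = all(v == max(izzy_vals) for i, v in enumerate(izzy_vals) if p_izzy[i] > 0)
    ok_jack = all(v == max(jack_vals) for j, v in enumerate(jack_vals) if p_jack[j] > 0)
    return ok_izzy and ok_jack

# Placeholder profile (uniform mixing), not the answer to part (d):
print(is_mixed_equilibrium((F(1, 2), F(1, 2)), (F(1, 3), F(1, 3), F(1, 3))))
```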
(e) If we swap two of Izzy’s payoffs in this matrix, in other words, if we replace one of his payoffs r
in the matrix with another of his payoffs t from the matrix and replace t with r, we can make one
of his strategies dominant. What swap should we make, and which strategy becomes dominant?
Sample Answer:
(S,X),(S,Y),S
Note: the two tuples are the strategies whose Izzy payoffs you want to exchange. Please write
them in dictionary order, e.g. (S, Y) before (T, X). The final S is the dominant strategy after
the payoff exchange.
3 Solving MDPs
Consider the gridworld MDP for which Left and Right actions are 100% successful.
Specifically, the available actions in each state are to move to the neighboring grid squares. From
state a, there is also an exit action available, which results in going to the terminal state and collecting
a reward of 10. Similarly, in state e, the reward for the exit action is 1. Exit actions are successful
100% of the time.
Let the discount factor γ = 1. Write the following quantities in one line.
V_0(d), V_1(d), V_2(d), V_3(d), V_4(d), V_5(d)
Sample Answer:
0,0,0,0,1,10
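As a cross-check, value iteration can be run directly on this MDP. A minimal sketch, assuming the five states a through e lie in a row, Left/Right moves are deterministic with reward 0, and the Exit actions give 10 from a and 1 from e (the gridworld figure itself is not reproduced here):

```python
# Value iteration for the chain a-e described above, with gamma = 1.
GAMMA = 1.0
STATES = ['a', 'b', 'c', 'd', 'e']

def successors(state):
    """Return a list of (reward, next_state) pairs; None marks the terminal state."""
    i = STATES.index(state)
    moves = []
    if state == 'a':
        moves.append((10.0, None))            # Exit from a
    if state == 'e':
        moves.append((1.0, None))             # Exit from e
    if i > 0:
        moves.append((0.0, STATES[i - 1]))    # Left
    if i < len(STATES) - 1:
        moves.append((0.0, STATES[i + 1]))    # Right
    return moves

V = {s: 0.0 for s in STATES}                  # V_0 is 0 everywhere
for k in range(1, 6):                         # compute V_1 through V_5
    V = {s: max(r + GAMMA * (V[n] if n is not None else 0.0)
                for r, n in successors(s))
         for s in STATES}
    print(f"V_{k}(d) = {V['d']}")
```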
4 Value Iteration Convergence Values
Consider the gridworld where Left and Right actions are successful 100% of the time.
Specifically, the available actions in each state are to move to the neighboring grid squares. From
state a, there is also an exit action available, which results in going to the terminal state and collecting
a reward of 10. Similarly, in state e, the reward for the exit action is 1. Exit actions are successful
100% of the time.
Let the discount factor γ = 0.2. Fill in the following quantities.
V*(a) = V∞(a) =
V*(b) = V∞(b) =
V*(c) = V∞(c) =
V*(d) = V∞(d) =
V*(e) = V∞(e) =
Write the following quantities in one line.
Sample Answer:
0,0,0,0,1
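Because 0 < γ < 1 here, the value-iteration update is a contraction, so it can simply be iterated until the values stop changing to obtain V∞. A self-contained sketch under the same assumption about the layout (states a through e in a row, deterministic Left/Right, Exit worth 10 from a and 1 from e):

```python
# Value iteration run to convergence on the chain a-e, gamma = 0.2.
GAMMA = 0.2
STATES = 'abcde'

def q_values(s, V):
    """All one-step lookahead values available in state s under values V."""
    i = STATES.index(s)
    vals = []
    if s == 'a':
        vals.append(10.0)                      # Exit from a
    if s == 'e':
        vals.append(1.0)                       # Exit from e
    if i > 0:
        vals.append(GAMMA * V[STATES[i - 1]])  # Left, reward 0
    if i < len(STATES) - 1:
        vals.append(GAMMA * V[STATES[i + 1]])  # Right, reward 0
    return vals

V = {s: 0.0 for s in STATES}
while True:
    V_new = {s: max(q_values(s, V)) for s in STATES}
    if max(abs(V_new[s] - V[s]) for s in STATES) < 1e-12:
        break
    V = V_new
print(V)   # approximately V_infinity(a), ..., V_infinity(e)
```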
5 Value Iteration Properties
Which of the following are true about value iteration? We assume the MDP has a finite number of
actions and states, and that the discount factor satisfies 0 < γ < 1.
A. Value iteration is guaranteed to converge.
B. Value iteration will converge to the same vector of values (V*) no matter what values we use
to initialize V.
C. None of the above.
6 Policy Iteration
Consider the gridworld where Left and Right actions are successful 100% of the time.
Specifically, the available actions in each state are to move to the neighboring grid squares. From
state a, there is also an exit action available, which results in going to the terminal state and collecting
a reward of 10. Similarly, in state e, the reward for the exit action is 1. Exit actions are successful
100% of the time.
The discount factor (γ) is 1.
(a) Consider the policy π1 shown below, and evaluate the following quantities for this policy. Write
your answers in one line.
V^π1(a) =
V^π1(b) =
V^π1(c) =
V^π1(d) =
V^π1(e) =
Sample Answer:
0,0,0,0,1
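Policy evaluation fixes π and computes V^π(s) = Σ_{s'} T(s, π(s), s')[R(s, π(s), s') + γ V^π(s')]. Since the figure showing π1 is not reproduced here, the sketch below evaluates a hypothetical policy on the same chain (every state moves Right except e, which Exits); substitute the actual π1 to answer the question.

```python
# Iterative policy evaluation on the chain a-e (deterministic Left/Right,
# Exit worth 10 from a and 1 from e), gamma = 1.
GAMMA = 1.0
STATES = 'abcde'
# Hypothetical policy standing in for pi_1 (the actual figure is not shown here).
policy = {'a': 'Right', 'b': 'Right', 'c': 'Right', 'd': 'Right', 'e': 'Exit'}

def step(s, action):
    """Deterministic transition: return (reward, next_state); None is terminal."""
    i = STATES.index(s)
    if action == 'Exit':                      # only a and e have Exit
        return (10.0 if s == 'a' else 1.0, None)
    if action == 'Left':
        return (0.0, STATES[i - 1])
    return (0.0, STATES[i + 1])               # Right

V = {s: 0.0 for s in STATES}
while True:
    V_new = {}
    for s in STATES:
        r, nxt = step(s, policy[s])
        V_new[s] = r + GAMMA * (V[nxt] if nxt is not None else 0.0)
    if V_new == V:
        break
    V = V_new
print(V)   # V^pi for the hypothetical policy
```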
(b) Consider the policy π2 shown below, and evaluate the following quantities for this policy. Write
your answers in one line, in the same format as in part (a).
V^π2(a) =
V^π2(b) =
V^π2(c) =
V^π2(d) =
V^π2(e) =
Sample Answer:
0,0,0,0,1
7 Policy Iteration
Consider the gridworld where Left and Right actions are successful 100% of the time.
Specifically, the available actions in each state are to move to the neighboring grid squares. From
state a, there is also an exit action available, which results in going to the terminal state and collecting
a reward of 10. Similarly, in state e, the reward for the exit action is 1. Exit actions are successful
100% of the time.
The discount factor (γ) is 0.9.
(a) Policy Iteration
Consider the policy πi shown below, and evaluate the following quantities for this policy. Write
your answers in one line.
V^πi(a) =
V^πi(b) =
V^πi(c) =
V^πi(d) =
V^πi(e) =
Sample Answer:
0,0,0,0,1
(b) Policy Improvement
Perform a policy improvement step. The current policy’s values are the ones from part (a) (so
make sure you first correctly answer part (a) before moving on to this part). Write your answers
in one line.
(i). π(i+1)(a) =
A. Exit
B. Right
(ii). π(i+1)(b) =
A. Exit
B. Right
(iii). π(i+1)(c) =
A. Exit
B. Right
(iv). π(i+1)(d) =
A. Exit
B. Right
(v). π(i+1)(e) =
A. Exit
B. Right
Sample Answer:
A,A,A,A,B
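Policy improvement then picks, in every state, the action with the best one-step lookahead value under the current policy's values: π_{i+1}(s) = argmax_a Σ_{s'} T(s, a, s')[R(s, a, s') + γ V^πi(s')]. A minimal sketch on the same chain with γ = 0.9; the values in V_pi below are placeholders, so plug in the V^πi found in part (a).

```python
# One policy-improvement step on the chain a-e, gamma = 0.9.
GAMMA = 0.9
STATES = 'abcde'
V_pi = {'a': 0.0, 'b': 0.0, 'c': 0.0, 'd': 0.0, 'e': 1.0}   # placeholder values

def available_actions(s):
    acts = []
    if s in ('a', 'e'):
        acts.append('Exit')
    if s != 'a':
        acts.append('Left')
    if s != 'e':
        acts.append('Right')
    return acts

def q_value(s, action, V):
    """One-step lookahead value of taking `action` in state s under values V."""
    if action == 'Exit':
        return 10.0 if s == 'a' else 1.0      # Exit rewards from a and e
    i = STATES.index(s)
    nxt = STATES[i - 1] if action == 'Left' else STATES[i + 1]
    return 0.0 + GAMMA * V[nxt]               # moving gives reward 0

new_policy = {s: max(available_actions(s), key=lambda a: q_value(s, a, V_pi))
              for s in STATES}
print(new_policy)
```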
8 Wrong Discount Factor
Bob notices that value iteration converges more quickly with smaller γ, so rather than using the true
discount factor γ, he decides to use a discount factor of αγ with 0 < α < 1 when running value
iteration. Write the options that are guaranteed to be true:
A. While Bob will not find the optimal value function, he could simply rescale the values he finds
by (1 − γ)/(1 − α) to find the optimal value function.
B. If the MDP’s transition model is deterministic and the MDP has zero rewards everywhere, except
for a single transition at the goal with a positive reward, then Bob will still find the optimal
policy.
C. If the MDP’s transition model is deterministic, then Bob will still find the optimal policy.
D. Bob’s policy will tend to more heavily favor short-term rewards over long-term rewards compared
to the optimal policy.
E. None of the above.
9 MDP Properties
(a) Which of the following statements are true for an MDP?
A. If the only difference between two MDPs is the value of the discount factor then they must
have the same optimal policy.
B. For an infinite horizon MDP with a finite number of states and actions and with a discount
factor γ that satisfies 0 < γ < 1, value iteration is guaranteed to converge.
C. When running value iteration, if the policy (the greedy policy with respect to the values)
has converged, the values must have converged as well.
D. None of the above.
(b) Which of the following statements are true for an MDP?
A. If one is using value iteration and the values have converged, the policy must have converged
as well.
B. Expectimax will generally run in the same amount of time as value iteration on a given
MDP.
C. For an infinite horizon MDP with a finite number of states and actions and with a discount
factor γ that satisfies 0 < γ < 1, policy iteration is guaranteed to converge.
D. None of the above.
10 Policies
John, James, Alvin and Michael all get to act in an MDP (S, A, T, γ, R, s0).
• John runs value iteration until he finds V* which satisfies
  ∀s ∈ S : V*(s) = max_{a∈A} Σ_{s'} T(s, a, s')[R(s, a, s') + γV*(s')]
  and acts according to
  πJohn(s) = argmax_{a∈A} Σ_{s'} T(s, a, s')[R(s, a, s') + γV*(s')].
• James acts according to an arbitrary policy πJames.
• Alvin takes James’s policy πJames and runs one round of policy iteration to find his policy πAlvin.
• Michael takes John’s policy and runs one round of policy iteration to find his policy πMichael.
Note: One round of policy iteration = performing policy evaluation followed by performing policy
improvement.
Write all of the options that are guaranteed to be true:
A. It is guaranteed that ∀s ∈ S : V^πJames(s) ≥ V^πAlvin(s).
B. It is guaranteed that ∀s ∈ S : V^πMichael(s) ≥ V^πAlvin(s).
C. It is guaranteed that ∀s ∈ S : V^πMichael(s) > V^πJohn(s).
D. It is guaranteed that ∀s ∈ S : V^πJames(s) ≥ V^πJohn(s).
E. None of the above.