VE 492 Homework 4: Electronic #4 (Due June 17th, 2020 at 11:59pm)
1 Model-Based RL: Grid
Assume we have the following four observed episodes for training:
1. Episode 1:A,south,C,-1;C,south,E,-1;E,exit,x,+10
2. Episode 2:B,east,C,-1;C,south,D,-1;D,exit,x,-10
3. Episode 3:B,east,C,-1;C,south,D,-1;D,exit,x,-10
4. Episode 4:A,south,C,-1;C,south,E,-1;E,exit,x,+10
What model would be learned from the above observed episodes?
T(A, south, C) =
T(B, east, C) =
T(C, south, E) =
T(C, south, D) =
(Your answer should be 1,0.5,0.25,0.35 for example)
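Model-based RL estimates the transition model by normalizing counts: T(s, a, s′) ≈ count(s, a, s′) / count(s, a). A minimal sketch under that reading, using the episode data transcribed from above (the names are illustrative, not from any course code):

```python
from collections import Counter

# Observed (s, a, s') transitions from the four episodes above.
transitions = [
    ("A", "south", "C"), ("C", "south", "E"), ("E", "exit", "x"),  # Episode 1
    ("B", "east",  "C"), ("C", "south", "D"), ("D", "exit", "x"),  # Episode 2
    ("B", "east",  "C"), ("C", "south", "D"), ("D", "exit", "x"),  # Episode 3
    ("A", "south", "C"), ("C", "south", "E"), ("E", "exit", "x"),  # Episode 4
]

# Count how often each (s, a) pair was tried and where it led.
sa_counts  = Counter((s, a) for s, a, _ in transitions)
sas_counts = Counter(transitions)

def T(s, a, s_next):
    """Empirical transition probability: count(s, a, s') / count(s, a)."""
    return sas_counts[(s, a, s_next)] / sa_counts[(s, a)]

print(T("A", "south", "C"), T("C", "south", "E"))
```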
2 Direct Evaluation
Consider the episodes in Problem 1. What are the estimates of the following quantities as obtained by direct evaluation?
V̂^π(A) =
V̂^π(B) =
V̂^π(C) =
V̂^π(D) =
V̂^π(E) =
(Your answer should be 1,-1,0,0,5 for example)
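Direct evaluation averages, for each state, the observed returns (total reward collected from that state to the end of its episode) over all visits. A sketch, assuming a discount of 1 since the excerpt does not state one:

```python
from collections import defaultdict

# Each episode as (state, reward-on-leaving-that-state) pairs, from Problem 1.
episodes = [
    [("A", -1), ("C", -1), ("E", +10)],
    [("B", -1), ("C", -1), ("D", -10)],
    [("B", -1), ("C", -1), ("D", -10)],
    [("A", -1), ("C", -1), ("E", +10)],
]

returns = defaultdict(list)
for ep in episodes:
    for i, (s, _) in enumerate(ep):
        # Return from state s = sum of rewards from here to the end of the episode.
        returns[s].append(sum(r for _, r in ep[i:]))

# Direct evaluation: average the observed returns per state.
V_hat = {s: sum(g) / len(g) for s, g in returns.items()}
```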
3 Temporal Difference Learning
V̂^π(A) =
V̂^π(B) =
V̂^π(C) =
V̂^π(D) =
V̂^π(E) =
(Your answers should be 1,-1,0,0,5 for example)
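The excerpt does not show the setup for this part (learning rate, initial values, or processing order), so the sketch below only illustrates the tabular TD(0) rule V(s) ← V(s) + α[r + γV(s′) − V(s)] applied to transitions in episode order; α, γ, the initialization, and the subset of episodes shown are assumptions:

```python
# Tabular TD(0) updates; alpha, gamma, and V's initialization are assumptions.
alpha, gamma = 0.5, 1.0
V = {s: 0.0 for s in "ABCDE"}

# (s, r, s') transitions in observed order (first two episodes only, for brevity);
# s' = None marks a terminal exit.
samples = [("A", -1, "C"), ("C", -1, "E"), ("E", +10, None),
           ("B", -1, "C"), ("C", -1, "D"), ("D", -10, None)]

for s, r, s_next in samples:
    target = r + (gamma * V[s_next] if s_next is not None else 0.0)
    V[s] += alpha * (target - V[s])
```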
4 Model-Free RL: Cycle
The current Q-values are:

                     A         B        C
Clockwise           -0.93      1.24     0.439
Counterclockwise    -5.178     5        3.14
The agent encounters the following samples:

    s    a           s'    r
    A    clockwise   C    -4
    C    clockwise   B     3
Process the samples given above in order. Fill in the Q-values after both samples have been accounted for.
Q(A,clockwise)=
Q(B,clockwise)=
Q(C,clockwise)=
Q(A,counterclockwise)=
Q(B,counterclockwise)=
Q(C,counterclockwise)=
(Your answer should be 1,-1,0,0,5,6 for example)
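Q-learning processes each sample with Q(s, a) ← (1 − α)Q(s, a) + α[r + γ max_{a′} Q(s′, a′)]. The excerpt does not show the α and γ used here, so the values in this sketch are assumptions:

```python
# Q-learning updates on the two samples; alpha and gamma are assumptions.
alpha, gamma = 0.5, 1.0
actions = ("clockwise", "counterclockwise")

Q = {("A", "clockwise"): -0.93, ("B", "clockwise"): 1.24, ("C", "clockwise"): 0.439,
     ("A", "counterclockwise"): -5.178, ("B", "counterclockwise"): 5.0,
     ("C", "counterclockwise"): 3.14}

for s, a, s_next, r in [("A", "clockwise", "C", -4), ("C", "clockwise", "B", 3)]:
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
```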
5 Q-Learning Properties
In general, for Q-Learning to converge to the optimal Q-values...
A. It is necessary that every state-action pair is visited infinitely often.
B. It is necessary that the learning rate α (weight given to new samples) is decreased to 0 over time.
C. It is necessary that the discount γ is less than 0.5.
D. It is necessary that actions get chosen according to arg max_a Q(s, a).
(Your answer should be ABCD for example)
6 Exploration and Exploitation
For each of the following action-selection methods, indicate which option describes it best.
A: With probability p, select arg max_a Q(s, a). With probability 1 − p, select a random action. p = 0.99.
A. Mostly exploration
B. Mostly exploitation
C. Mix of both
B: Select action a with probability
P(a|s) = e^{Q(s,a)/τ} / Σ_{a′} e^{Q(s,a′)/τ},
where τ is a temperature parameter that is decreased over time. (A sketch of rules A and B appears at the end of this section.)
A. Mostly exploration
B. Mostly exploitation
C. Mix of both
C: Always select a random action.
A. Mostly exploration
B. Mostly exploitation
C. Mix of both
D: Keep track of a count, K_{s,a}, for each state-action tuple (s, a) of the number of times that tuple has been seen, and select arg max_a [Q(s, a) − K_{s,a}].
A. Mostly exploration
B. Mostly exploitation
C. Mix of both
Which method(s) would be advisable to use when doing Q-Learning?
(Your answers should be A,B,C,C,ABCD for example)
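For reference, a minimal sketch of selection rules A (greedy with probability p) and B (softmax / Boltzmann); the function names and default parameter values are illustrative:

```python
import math
import random

def method_a(Q, s, actions, p=0.99):
    """Method A: with probability p pick arg max_a Q(s, a), else a random action."""
    if random.random() < p:
        return max(actions, key=lambda a: Q[(s, a)])
    return random.choice(actions)

def method_b(Q, s, actions, tau=1.0):
    """Method B: sample a with probability proportional to exp(Q(s, a) / tau)."""
    weights = [math.exp(Q[(s, a)] / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```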
7 Feature-Based Representation: Actions
[Figure: the four candidate actions from state A (STOP, RIGHT, LEFT, DOWN); not reproduced in this excerpt]
Using the weight vector w = [0.2,-1], which action, of the ones shown above, would the agent take from
state A?
A. STOP
B. RIGHT
C. LEFT
D. DOWN
(Your answer should be A,D for example)
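With a linear Q-function, the agent evaluates Q(s, a) = w · f(s, a) for each candidate action and takes the arg max. The feature values for state A come from the figure, which is not reproduced in this excerpt, so the sketch below leaves them as an input:

```python
w = [0.2, -1.0]  # weight vector from the problem

def q_value(features):
    """Q(s, a) = w . f(s, a) for a single action's feature vector."""
    return sum(wi * fi for wi, fi in zip(w, features))

def best_action(features_by_action):
    """Pick the action whose feature vector f(s, a) gives the highest Q-value."""
    return max(features_by_action, key=lambda a: q_value(features_by_action[a]))
```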
8 Feature-Based Representation: Update
Consider the following feature-based representation of the Q-function:
Q(s, a) = w1f1(s, a) + w2f2(s, a)
with:
f1(s, a) = 1/(Manhattan distance to nearest dot after having executed action a in state s)
f2(s, a) =(Manhattan distance to nearest ghost after having executed action a in state s)
Part 1
Assume w1 = 2, w2 = 5. For the state s shown below, find the following quantities. Assume that the red and blue ghosts are both sitting on top of a dot.
Q(s,West)=
Q(s,South)=
Based on this approximate Q-function, which action would be chosen?
A. West
B. South
Part 2
Assume Pac-Man moves West. This results in the state s′ shown below.
Q(s’,West)=
Q(s’,East)=
What is the sample value (assuming γ=1)?
Sample = r + γ max_{a′} Q(s′, a′) =
Part 3
Now let's compute the update to the weights. Let α = 0.5.
difference = [r + γ max_{a′} Q(s′, a′)] − Q(s, a) =
w1 ← w1 + α · (difference) · f1(s, a) =
w2 ← w2 + α · (difference) · f2(s, a) =
(Your answer should be 1,2,A,1,2,3,1,2,3 for example)
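Putting Parts 1 through 3 together, a sketch of one approximate Q-learning step; the feature values, reward, and next-state Q-values below are placeholders, since the actual numbers come from the figures that are not reproduced here:

```python
# Approximate Q-learning update for the two-feature Q-function above.
alpha, gamma = 0.5, 1.0          # alpha from Part 3; gamma = 1 from Part 2
w1, w2 = 2.0, 5.0                # initial weights from Part 1

def Q(f1, f2):
    return w1 * f1 + w2 * f2

# Placeholder values -- read the actual ones off the figures.
f1_sa, f2_sa = 0.5, 3.0          # f1(s, a), f2(s, a) for the action taken
r = -1.0                         # reward received
max_q_next = 0.0                 # max_a' Q(s', a') in the resulting state

sample = r + gamma * max_q_next
difference = sample - Q(f1_sa, f2_sa)

# Each weight moves in proportion to its own feature value.
w1 += alpha * difference * f1_sa
w2 += alpha * difference * f2_sa
```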