Assignment 3
COMP7607: Natural Language Processing
Question 1: A long short-term memory (LSTM) is defined as follows. At time t it receives an input vector x_t ∈ R^k of observations, an input vector h_{t−1} ∈ R^k representing the previous hidden state, and a memory state c_{t−1} ∈ R^k from the previous time step. The LSTM computes three gates i_t, f_t, and o_t, controlling, respectively, the input, the forgetting of the previous memory, and the output. It additionally computes a new value for the memory c_t and a new hidden representation as follows:
i_t = σ(I_x x_t + I_h h_{t−1} + b_i)
f_t = σ(F_x x_t + F_h h_{t−1} + b_f)
o_t = σ(O_x x_t + O_h h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g(C_x x_t + C_h h_{t−1} + b_c)
h_t = o_t ⊙ g(c_t)
where σ is the element-wise logistic sigmoid function and g is an element-wise nonlinearity (e.g., tanh). The behavior of the network is controlled by the parameters I_x, I_h, F_x, F_h, O_x, O_h, C_x, and C_h, which are all in R^{k×k}. The base values are h_0 = c_0 = 0. Finally, a new output is computed:

y_t = f(W h_t + b)
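For concreteness, here is a minimal NumPy sketch of a single LSTM time step following the equations above (the parameter names mirror the notation; the random initialization and the choice k = 3 are only for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params, g=np.tanh):
    """One LSTM time step, following the equations in Question 1."""
    i_t = sigmoid(params["Ix"] @ x_t + params["Ih"] @ h_prev + params["bi"])  # input gate
    f_t = sigmoid(params["Fx"] @ x_t + params["Fh"] @ h_prev + params["bf"])  # forget gate
    o_t = sigmoid(params["Ox"] @ x_t + params["Oh"] @ h_prev + params["bo"])  # output gate
    c_t = f_t * c_prev + i_t * g(params["Cx"] @ x_t + params["Ch"] @ h_prev + params["bc"])  # new memory
    h_t = o_t * g(c_t)                                                        # new hidden state
    return h_t, c_t

# Illustrative example with k = 3; h_0 = c_0 = 0 as in the definition above.
k = 3
rng = np.random.default_rng(0)
params = {name: rng.standard_normal((k, k))
          for name in ["Ix", "Ih", "Fx", "Fh", "Ox", "Oh", "Cx", "Ch"]}
params.update({name: np.zeros(k) for name in ["bi", "bf", "bo", "bc"]})
h, c = np.zeros(k), np.zeros(k)
h, c = lstm_step(rng.standard_normal(k), h, c, params)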
Question 1a: Please briefly explain the functionality of the three gates i_t, f_t, and o_t. How do they control the LSTM?
Question 1b: Please briefly explain why vanishing or exploding gradients are an issue for RNNs, and how the LSTM addresses this problem.
Question 2: Let Q ∈ R^{N×d} denote a set of N query vectors, which attend to M key and value vectors, denoted by matrices K ∈ R^{M×d} and V ∈ R^{M×c}, respectively. For a query vector q_n at position n, the softmax attention function computes the following quantity:

Attn(q_n, K, V) = Σ_{m=1}^{M} [ exp(q_n^⊤ k_m) / Σ_{m′=1}^{M} exp(q_n^⊤ k_{m′}) ] v_m^⊤ := V^⊤ softmax(K q_n)

which is an average of the set of value vectors V weighted by the normalized similarity between different queries and keys.
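For concreteness, a minimal NumPy sketch of this attention computation applied to all N queries at once (shapes follow the notation above; the example dimensions are arbitrary):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Softmax attention: row n of the output is Attn(q_n, K, V)."""
    scores = Q @ K.T                    # (N, M): q_n^T k_m for every query/key pair
    weights = softmax(scores, axis=-1)  # normalize each row over the M keys
    return weights @ V                  # (N, c): weighted averages of the value vectors

# Example shapes: N = 4 queries, M = 6 keys/values, d = 8, c = 5.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 5))
out = attention(Q, K, V)                # shape (4, 5)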
Question 2a: Please briefly explain the time and space complexity of the attention computation from the queries Q to the keys K and values V, using big O notation.
Question 3: Consider a context-free grammar with the following rules (assume that S is the start symbol):
S → NP VP
NP → NP PP
PP → IN NP
VP → V NP
VP → VP PP
NP → we
NP → sushi
NP → chopsticks
IN → with
V → eat
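For illustration only, a grammar like this can be encoded and parsed mechanically with NLTK (assuming the nltk package is available); a minimal sketch:

import nltk

# The grammar rules from Question 3, in NLTK's CFG notation.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> NP PP
PP -> IN NP
VP -> V NP
VP -> VP PP
NP -> 'we' | 'sushi' | 'chopsticks'
IN -> 'with'
V -> 'eat'
""")

# A chart parser enumerates every parse licensed by the grammar.
parser = nltk.ChartParser(grammar)
for tree in parser.parse("we eat sushi with chopsticks".split()):
    tree.pretty_print()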
Question 3a: How many parse trees are there under this grammar for the following sentence? Draw
them.
we eat sushi with chopsticks
Question 4: Consider a probabilistic context-free grammar with the following rules (assume that S is
the start symbol):
S → NP VP 1.0
VP → Vt NP 0.7
VP → VP PP 0.3
NP → DT NN 0.8
NP → NP PP 0.2
PP → IN NP 1.0
Vi → sleeps 1.0
Vt → saw 1.0
NN → man 0.1
NN → woman 0.1
NN → telescope 0.3
NN → dog 0.5
DT → the 1.0
IN → with 0.6
IN → in 0.4
Question 4a: What’s the most likely parse tree for the following sentence under this PCFG? Show the
CYK chart you developed below.
the man saw the woman with the dog
Question 4b: What’s the (marginal) probability of the following sentence under this PCFG?
the man saw the woman with the dog
Question 4c: For an input of length M and a grammar with R productions and N non-terminals, please briefly explain the time and space complexity of the CYK algorithm, using big O notation.
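To make the loop structure of CYK concrete, here is a minimal sketch for a PCFG in Chomsky normal form, using log probabilities; the `lexical` and `binary` data structures are illustrative assumptions, not part of the assignment:

import math
from collections import defaultdict

def cyk(words, lexical, binary):
    """Viterbi-CYK for a PCFG in Chomsky normal form (log probabilities).

    lexical: dict mapping word -> {nonterminal: logprob}      (unary lexical rules)
    binary:  list of (lhs, rhs1, rhs2, logprob) binary rules
    Returns chart[(i, j)] = {nonterminal: best logprob covering words[i:j]}.
    """
    M = len(words)
    chart = defaultdict(dict)
    for i, w in enumerate(words):                        # width-1 spans from lexical rules
        for nt, lp in lexical.get(w, {}).items():
            chart[(i, i + 1)][nt] = lp
    for width in range(2, M + 1):                        # span widths
        for i in range(M - width + 1):                   # span start positions
            j = i + width
            for split in range(i + 1, j):                # split points inside the span
                for lhs, r1, r2, lp in binary:           # binary grammar rules
                    if r1 in chart[(i, split)] and r2 in chart[(split, j)]:
                        score = lp + chart[(i, split)][r1] + chart[(split, j)][r2]
                        if score > chart[(i, j)].get(lhs, -math.inf):
                            chart[(i, j)][lhs] = score
    return chart   # chart[(0, M)].get("S") is the log probability of the best parse, if any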
Question 5: A trigram language model is also often referred to as a second-order Markov language model. It has the following form:

P(X_1 = x_1, …, X_n = x_n) = Π_{i=1}^{n} P(X_i = x_i | X_{i−2} = x_{i−2}, X_{i−1} = x_{i−1})
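As a small illustration, a maximum-likelihood trigram model can be estimated directly from counts; the start/stop padding below is one common convention, assumed here only for the sketch:

from collections import Counter

def train_trigram(corpus):
    """MLE trigram model: P(w | u, v) = count(u, v, w) / count(u, v)."""
    tri, bi = Counter(), Counter()
    for sentence in corpus:
        toks = ["<s>", "<s>"] + sentence + ["</s>"]      # pad with start/stop symbols
        for i in range(2, len(toks)):
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
            bi[(toks[i - 2], toks[i - 1])] += 1
    return lambda w, u, v: tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

# Toy usage on a two-sentence corpus:
# p = train_trigram([["we", "eat", "sushi"], ["we", "eat", "sushi", "with", "chopsticks"]])
# p("sushi", "we", "eat")  -> 1.0 under MLE on this toy corpus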
Question 5a: Could you briefly explain the advantages and disadvantages of a higher-order Markov language model compared to the second-order one?
Question 5b: Could you give some examples where English grammar suggests that the second-order Markov assumption is clearly violated?
Question 6: The goal of sequence labeling is to assign tags to words, or more generally, to assign discrete labels to the discrete elements of a sequence. Given a sequence of n words x, assign each a label from the label set ℒ, and let L = |ℒ|. In this question, we will explore different approaches, all of which cast the problem as:

ŷ = argmax_{y ∈ ℒ^n} Score(x, y; θ)
     he   raises   purses
N    −2   −3       −2
V    −7   −2       −4

(a) Weights for emission features.

     N    V      ♦
♢    −1   −1     −∞
N    −1   −1.5   −2
V    −2   −2.5   −0.5

(b) Weights for transition features. The "from" tags are on the rows, and the "to" tags are on the columns. The transition weight from START ♢ to STOP ♦ is implicitly set to −∞.

Table 1: Consider the minimal tagset { N, V }, corresponding to nouns and verbs. For the following question parts, assume that the log probabilities (log base 2) are as above.
Question 6a: Local classifier. Define the score of a word x_i getting label y ∈ ℒ in context: score(x, i, y; θ), for example through a feature vector f(x, i, y). (Here, "i" indicates the position of the input word to be classified.) Train a classifier to decode locally, i.e.,

ŷ_i = argmax_{y ∈ ℒ} score(x, i, y; θ)
The classifier is applied to each x_1, x_2, … in turn, but all the words can be made available at each position.
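A minimal sketch of this local decoding rule, where each word is tagged independently by its highest-scoring label (the emission scores below are hypothetical placeholders, not the values from Table 1):

def local_decode(words, emission):
    """Tag each word independently with the label that has the highest emission score."""
    return [max(emission[w], key=emission[w].get) for w in words]

# Hypothetical emission scores for illustration only:
emission = {"he":   {"N": -1.0, "V": -5.0},
            "runs": {"N": -4.0, "V": -1.5}}
print(local_decode(["he", "runs"], emission))   # ['N', 'V']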
Question 6a1: Based on the provided features, what is the tag sequence for this sentence with this local classifier? Report the tag sequence and probability. (There is no need to include the START ♢ and STOP ♦ symbols; just give the tag sequence for the three words.)
Question 6b: Sequential classifier. Define the score of a word x_i getting label y in context, including the previous labels: score(x, i, ŷ_{1:i−1}, y; θ). (From here on, we won't always write θ, but the dependence remains.) Train a classifier, e.g.,

ŷ_i = argmax_{y ∈ ℒ} score(x, i, ŷ_{1:i−1}, y)

The classifier is applied to each x_1, x_2, … in turn. Each decision depends on the outputs of the preceding iterations.
Algorithm 1: Beam Search for a Sequential Classifier
Data: x (length n), a sequential classifier's scoring function score, and beam width k
Result: the best-scored element of H_n
begin
    Let H_0 score hypotheses at position 0, defining only H_0(⟨⟩) = 0;
    for i ← 1 to n do
        C ← ∅;
        foreach hypothesis ŷ_{1:i−1} scored by H_{i−1} do
            foreach y ∈ ℒ do
                place the new hypothesis ŷ_{1:i−1} y, with score H_i(ŷ_{1:i−1} y) = H_{i−1}(ŷ_{1:i−1}) + score(x, i, ŷ_{1:i−1}, y), into C;
            end
        end
        Let H_i be the k best-scored elements of C;
    end
end
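A minimal Python sketch of Algorithm 1, assuming a scoring function with the signature score(x, i, ŷ_{1:i−1}, y) used above:

import heapq

def beam_search(x, labels, score, k):
    """Beam search for a sequential classifier (Algorithm 1).

    score(x, i, prev_tags, y) returns the local score of labeling position i
    (1-indexed) with y, given the previously predicted tags prev_tags.
    """
    hypotheses = {(): 0.0}                              # H_0: only the empty hypothesis, score 0
    for i in range(1, len(x) + 1):
        candidates = {}
        for prev, prev_score in hypotheses.items():     # extend every surviving hypothesis
            for y in labels:
                candidates[prev + (y,)] = prev_score + score(x, i, prev, y)
        # keep the k best-scored hypotheses as H_i
        best = heapq.nlargest(k, candidates.items(), key=lambda kv: kv[1])
        hypotheses = dict(best)
    # best-scored element of H_n: (tag tuple, total score)
    return max(hypotheses.items(), key=lambda kv: kv[1])

# Note: greedy search is the special case k = 1.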
Question 6b1: Greedy search. Using greedy search, what is the highest posterior probability tag sequence for this sentence with this sequential classifier? Report the tag sequence and probability (including the ♦ transition).
Question 6b2: Beam search. What is the tag sequence obtained by beam search with a beam size of 2? Report the tag sequence and probability (including the ♦ transition).
Question 6b3: For a sequence x of length n, label space ℒ, and beam width k, please explain the time and space complexity of the beam search algorithm.
Question 6c: Hidden Markov Model The Hidden Markov Model (HMM) approach should remind
you of language models. The HMM is based on augmenting the Markov chain. A Markov chain makes
a very strong assumption that if we want to predict the future in the sequence, all that matters is the
current state. All the states before the current state have no impact on the future except via the current
state.
ŷ_i = argmax_{y ∈ ℒ} score(x, i, y_{i−1}, y)
Generally, we use the Viterbi algorithm for choosing yˆ. It is an efficient algorithm using dynamic
programming, an algorithmic technique for reusing work in recurrent computations. We begin by solving
an auxiliary problem: rather than finding the best tag sequence, we compute the score of the best tag
sequence.
Algorithm 2: The Viterbi algorithm.
Data: the scores s(x, i, y, y′), for all i ∈ {0, …, n} and y, y′ ∈ ℒ
Result: ŷ
begin
    create a matrix of Viterbi variables v[L, n + 1];
    foreach y ∈ ℒ do
        v[1, y] = s(x, 0, ♢, y);
        bp[1, y] = 0;
    end
    for i ← 2 to n + 1 do
        foreach y ∈ ℒ do
            v[i, y] = max_{y_{i−1} ∈ ℒ} ( s(x, i − 1, y_{i−1}, y) + v[i − 1, y_{i−1}] );
            bp[i, y] = argmax_{y_{i−1} ∈ ℒ} ( s(x, i − 1, y_{i−1}, y) + v[i − 1, y_{i−1}] );
        end
    end
    ŷ_{n+1} ← ♦;
    for i ∈ {n, …, 1} do
        ŷ_i ← bp[i + 1, ŷ_{i+1}];
    end
    return ŷ;
end
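A minimal Python sketch of Algorithm 2, assuming a scoring function s that mirrors s(x, i, y′, y) above; here the final step transitions directly to the STOP symbol ♦, which the pseudocode handles by setting ŷ_{n+1} ← ♦, and `labels` is assumed to exclude the start and stop symbols:

def viterbi(x, labels, s, start="<start>", stop="<stop>"):
    """Viterbi decoding (Algorithm 2). s(x, i, prev_y, y) is the local score."""
    n = len(x)
    v = [{} for _ in range(n + 2)]     # v[i][y]: best score of a length-i prefix ending in y
    bp = [{} for _ in range(n + 2)]    # backpointers
    for y in labels:
        v[1][y] = s(x, 0, start, y)
        bp[1][y] = start               # mirrors bp[1, y] = 0; never used in backtracking
    for i in range(2, n + 2):
        ys = [stop] if i == n + 1 else labels    # last step transitions to the stop symbol
        for y in ys:
            best_prev = max(labels, key=lambda yp: s(x, i - 1, yp, y) + v[i - 1][yp])
            v[i][y] = s(x, i - 1, best_prev, y) + v[i - 1][best_prev]
            bp[i][y] = best_prev
    # follow the backpointers from the stop symbol back to position 1
    tags = [stop]
    for i in range(n + 1, 1, -1):
        tags.append(bp[i][tags[-1]])
    return list(reversed(tags))[:-1], v[n + 1][stop]   # tag sequence for x, best total score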
Question 6c1: Fill in the Viterbi matrix v. Report the tag sequence and probability.

     he   raises   purses
N
V
Question 6c2: For a sequence x of length n and label space ℒ, please explain the time and space complexity of the Viterbi algorithm.
Question 7: Could you briefly compare and explain the differences between ELMo, BERT, GPT-2, and GPT-3 in aspects such as architecture, parameter count, capabilities, pretraining objectives, training data, etc.?