CS 224n: Assignment #4

This assignment is split into two sections: Neural Machine Translation with RNNs and Analyzing NMT Systems. The first is primarily coding and implementation focused, whereas the second consists entirely of written analysis questions. If you get stuck on the first section, you can always work on the second, as the two sections are independent of each other. The notation and implementation of the NMT system are a bit tricky, so if you ever get stuck along the way, please come to Office Hours so that the TAs can support you. We also highly recommend reading Zhang et al. (2020) to better understand the Cherokee-to-English translation task, which served as inspiration for this assignment.
1. Neural Machine Translation with RNNs (45 points)
In Machine Translation, our goal is to convert a sentence from the source language (e.g. Cherokee)
to the target language (e.g. English). In this assignment, we will implement a sequence-to-sequence
(Seq2Seq) network with attention, to build a Neural Machine Translation (NMT) system. In this section, we describe the training procedure for the proposed NMT system, which uses a Bidirectional
LSTM Encoder and a Unidirectional LSTM Decoder.
Figure 1: Seq2Seq Model with Multiplicative Attention, shown on the third step of the decoder. Hidden states $h_i^{\text{enc}}$ and cell states $c_i^{\text{enc}}$ are defined below.
Model description (training procedure)
Given a sentence in the source language, we look up the subword embeddings from an embeddings matrix, yielding $x_1, \ldots, x_m$ ($x_i \in \mathbb{R}^{e \times 1}$), where $m$ is the length of the source sentence and $e$ is the embedding size. We feed these embeddings to the bidirectional encoder, yielding hidden states and cell states for both the forwards ($\rightarrow$) and backwards ($\leftarrow$) LSTMs. The forwards and backwards versions are concatenated to give hidden states $h_i^{\text{enc}}$ and cell states $c_i^{\text{enc}}$:

$$h_i^{\text{enc}} = [\overleftarrow{h_i^{\text{enc}}}; \overrightarrow{h_i^{\text{enc}}}] \quad \text{where } h_i^{\text{enc}} \in \mathbb{R}^{2h \times 1},\ \overleftarrow{h_i^{\text{enc}}}, \overrightarrow{h_i^{\text{enc}}} \in \mathbb{R}^{h \times 1}, \quad 1 \le i \le m \tag{1}$$

$$c_i^{\text{enc}} = [\overleftarrow{c_i^{\text{enc}}}; \overrightarrow{c_i^{\text{enc}}}] \quad \text{where } c_i^{\text{enc}} \in \mathbb{R}^{2h \times 1},\ \overleftarrow{c_i^{\text{enc}}}, \overrightarrow{c_i^{\text{enc}}} \in \mathbb{R}^{h \times 1}, \quad 1 \le i \le m \tag{2}$$
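For intuition, here is a minimal PyTorch sketch of equations (1)-(2). It is not the assignment's starter code; the toy sizes and variable names are made up, and it simply shows that a bidirectional nn.LSTM already produces the concatenated per-position states $h_i^{\text{enc}} \in \mathbb{R}^{2h}$.

import torch
import torch.nn as nn

# Toy sizes: m = source length, e = embedding size, h = hidden size (made up).
m, batch, e, h = 5, 2, 4, 3
x = torch.randn(m, batch, e)                       # embeddings x_1, ..., x_m

enc = nn.LSTM(input_size=e, hidden_size=h, bidirectional=True)
enc_hiddens, (h_n, c_n) = enc(x)

# enc_hiddens[i] plays the role of h_i^enc in R^{2h}: PyTorch concatenates the
# forward and backward states per position (its ordering is [forward; backward],
# whereas equation (1) writes [backward; forward]; the content is the same).
print(enc_hiddens.shape)       # torch.Size([5, 2, 6]) = (m, batch, 2h)

# h_n / c_n hold each direction's final state: index 0 is the forward LSTM after
# reading x_m, index 1 is the backward LSTM after reading x_1. These are exactly
# the pieces used in equations (3)-(4) below.
print(h_n.shape, c_n.shape)    # (2, 2, 3) each = (directions, batch, h)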
We then initialize the decoder's first hidden state $h_0^{\text{dec}}$ and cell state $c_0^{\text{dec}}$ with a linear projection of the encoder's final hidden state and final cell state.¹

$$h_0^{\text{dec}} = W_h[\overleftarrow{h_1^{\text{enc}}}; \overrightarrow{h_m^{\text{enc}}}] \quad \text{where } h_0^{\text{dec}} \in \mathbb{R}^{h \times 1},\ W_h \in \mathbb{R}^{h \times 2h} \tag{3}$$

$$c_0^{\text{dec}} = W_c[\overleftarrow{c_1^{\text{enc}}}; \overrightarrow{c_m^{\text{enc}}}] \quad \text{where } c_0^{\text{dec}} \in \mathbb{R}^{h \times 1},\ W_c \in \mathbb{R}^{h \times 2h} \tag{4}$$

¹ If it's not obvious, think about why we regard $[\overleftarrow{h_1^{\text{enc}}}; \overrightarrow{h_m^{\text{enc}}}]$ as the 'final hidden state' of the Encoder.
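Continuing the toy example above, a rough sketch of equations (3)-(4) follows. The layer names are ours and need not match those in nmt_model.py; bias terms are omitted because the equations show none.

import torch
import torch.nn as nn

batch, h = 2, 3
# Final states of the two encoder directions, as returned by nn.LSTM:
# index 0 = forward (has read x_m), index 1 = backward (has read x_1).
h_n = torch.randn(2, batch, h)
c_n = torch.randn(2, batch, h)

W_h = nn.Linear(2 * h, h, bias=False)
W_c = nn.Linear(2 * h, h, bias=False)

# Concatenate the backward final state (position 1) with the forward final
# state (position m), then project, as in equations (3)-(4).
init_h = W_h(torch.cat([h_n[1], h_n[0]], dim=1))   # h_0^dec, shape (batch, h)
init_c = W_c(torch.cat([c_n[1], c_n[0]], dim=1))   # c_0^dec, shape (batch, h)
print(init_h.shape, init_c.shape)                  # (2, 3) (2, 3)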
With the decoder initialized, we must now feed it a target sentence. On the $t^{\text{th}}$ step, we look up the embedding for the $t^{\text{th}}$ subword, $y_t \in \mathbb{R}^{e \times 1}$. We then concatenate $y_t$ with the combined-output vector $o_{t-1} \in \mathbb{R}^{h \times 1}$ from the previous timestep (we will explain what this is below!) to produce $\overline{y}_t \in \mathbb{R}^{(e+h) \times 1}$. Note that for the first target subword (i.e. the start token) $o_0$ is a zero-vector. We then feed $\overline{y}_t$ as input to the decoder.

$$h_t^{\text{dec}}, c_t^{\text{dec}} = \text{Decoder}(\overline{y}_t, h_{t-1}^{\text{dec}}, c_{t-1}^{\text{dec}}) \quad \text{where } h_t^{\text{dec}} \in \mathbb{R}^{h \times 1},\ c_t^{\text{dec}} \in \mathbb{R}^{h \times 1} \tag{5}$$
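As an illustration of the decoder input and equation (5) (again with made-up sizes and names, not the official implementation):

import torch
import torch.nn as nn

batch, e, h = 2, 4, 3
decoder = nn.LSTMCell(input_size=e + h, hidden_size=h)

y_t = torch.randn(batch, e)      # embedding of the t-th target subword
o_prev = torch.zeros(batch, h)   # o_0 is a zero vector for the start token
h_prev = torch.randn(batch, h)   # h_{t-1}^dec
c_prev = torch.randn(batch, h)   # c_{t-1}^dec

ybar_t = torch.cat([y_t, o_prev], dim=1)        # ybar_t in R^{(e+h)}
h_t, c_t = decoder(ybar_t, (h_prev, c_prev))    # equation (5)
print(h_t.shape, c_t.shape)                     # (2, 3) (2, 3)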
We then use $h_t^{\text{dec}}$ to compute multiplicative attention over $h_1^{\text{enc}}, \ldots, h_m^{\text{enc}}$:

$$e_{t,i} = (h_t^{\text{dec}})^T W_{\text{attProj}} h_i^{\text{enc}} \quad \text{where } e_t \in \mathbb{R}^{m \times 1},\ W_{\text{attProj}} \in \mathbb{R}^{h \times 2h}, \quad 1 \le i \le m \tag{7}$$

$$\alpha_t = \text{softmax}(e_t) \quad \text{where } \alpha_t \in \mathbb{R}^{m \times 1} \tag{8}$$

$$a_t = \sum_{i=1}^{m} \alpha_{t,i} h_i^{\text{enc}} \quad \text{where } a_t \in \mathbb{R}^{2h \times 1} \tag{9}$$
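The three attention equations can be batched with torch.bmm. The following sketch (illustrative names and sizes, not the reference solution) computes $e_t$, $\alpha_t$, and $a_t$ for a whole batch at once:

import torch
import torch.nn as nn
import torch.nn.functional as F

batch, m, h = 2, 5, 3
enc_hiddens = torch.randn(batch, m, 2 * h)     # h_1^enc, ..., h_m^enc
dec_hidden = torch.randn(batch, h)             # h_t^dec

W_attProj = nn.Linear(2 * h, h, bias=False)    # equation (7) shows no bias
proj = W_attProj(enc_hiddens)                  # W_attProj h_i^enc, (batch, m, h)

# e_{t,i} = (h_t^dec)^T W_attProj h_i^enc for every i, in one batched matmul:
e_t = torch.bmm(proj, dec_hidden.unsqueeze(2)).squeeze(2)      # (batch, m), eq. (7)

alpha_t = F.softmax(e_t, dim=1)                                # (batch, m), eq. (8)

# a_t = sum_i alpha_{t,i} h_i^enc, again as a batched matrix product:
a_t = torch.bmm(alpha_t.unsqueeze(1), enc_hiddens).squeeze(1)  # (batch, 2h), eq. (9)
print(e_t.shape, alpha_t.shape, a_t.shape)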
We now concatenate the attention output $a_t$ with the decoder hidden state $h_t^{\text{dec}}$ and pass this through a linear layer, tanh, and dropout to attain the combined-output vector $o_t$.

$$u_t = [a_t; h_t^{\text{dec}}] \quad \text{where } u_t \in \mathbb{R}^{3h \times 1} \tag{10}$$

$$v_t = W_u u_t \quad \text{where } v_t \in \mathbb{R}^{h \times 1},\ W_u \in \mathbb{R}^{h \times 3h} \tag{11}$$

$$o_t = \text{dropout}(\tanh(v_t)) \quad \text{where } o_t \in \mathbb{R}^{h \times 1} \tag{12}$$
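A corresponding sketch of equations (10)-(12) (the dropout rate is a made-up value; names are illustrative):

import torch
import torch.nn as nn

batch, h = 2, 3
a_t = torch.randn(batch, 2 * h)     # attention output, eq. (9)
h_dec_t = torch.randn(batch, h)     # decoder hidden state, eq. (5)

W_u = nn.Linear(3 * h, h, bias=False)   # eq. (11) shows no bias
dropout = nn.Dropout(p=0.3)             # illustrative dropout rate

u_t = torch.cat([a_t, h_dec_t], dim=1)  # eq. (10), (batch, 3h)
v_t = W_u(u_t)                          # eq. (11), (batch, h)
o_t = dropout(torch.tanh(v_t))          # eq. (12), (batch, h)
print(o_t.shape)                        # torch.Size([2, 3])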
Then, we produce a probability distribution $P_t$ over target subwords at the $t^{\text{th}}$ timestep:

$$P_t = \text{softmax}(W_{\text{vocab}} o_t) \quad \text{where } P_t \in \mathbb{R}^{V_t \times 1},\ W_{\text{vocab}} \in \mathbb{R}^{V_t \times h} \tag{13}$$

Here, $V_t$ is the size of the target vocabulary. Finally, to train the network we then compute the softmax cross entropy loss between $P_t$ and $g_t$, where $g_t$ is the one-hot vector of the target subword at timestep $t$:

$$J_t(\theta) = \text{CrossEntropy}(P_t, g_t) \tag{14}$$

Here, $\theta$ represents all the parameters of the model and $J_t(\theta)$ is the loss on step $t$ of the decoder. Now that we have described the model, let's try implementing it for Cherokee-to-English translation!
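Finally, a sketch of equations (13)-(14). In practice the unnormalized scores are passed straight to a cross-entropy loss (which applies log-softmax internally); the explicit softmax below only mirrors equation (13). Sizes and names are made up.

import torch
import torch.nn as nn
import torch.nn.functional as F

batch, h, V_t = 2, 3, 11
o_t = torch.randn(batch, h)
W_vocab = nn.Linear(h, V_t, bias=False)

logits = W_vocab(o_t)                  # unnormalized scores over the vocabulary
P_t = F.softmax(logits, dim=1)         # equation (13)

target = torch.tensor([4, 7])          # indices of the gold subwords g_t
J_t = F.cross_entropy(logits, target)  # equation (14), averaged over the batch
print(P_t.shape, J_t.item())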
Setting up your Virtual Machine
Follow the instructions in the CS224n Azure Guide (link also provided on website and Ed) in order
to create your VM instance. This should take you approximately 45 minutes. Though you will need
the GPU to train your model, we strongly advise that you first develop the code locally and ensure
that it runs, before attempting to train it on your VM. GPU time is expensive and limited. It takes
approximately 30 minutes to 1 hour to train the NMT system. We don’t want you to accidentally
use all your GPU time for debugging your model rather than training and evaluating it. Finally, make
sure that your VM is turned off whenever you are not using it.
If your Azure subscription runs out of money, your VM will be temporarily locked and
inaccessible. If that happens, please fill out a request form here.
In order to run the model code on your local machine, please run the following command to create the
proper virtual environment:
conda env create --file local_env.yml
Note that this virtual environment will not be needed on the VM.
Implementation and written questions
(a) (2 points) (coding) In order to apply tensor operations, we must ensure that the sentences in a given batch are of the same length. Thus, we must identify the longest sentence in a batch and pad others to be the same length. Implement the pad_sents function in utils.py, which shall produce these padded sentences.
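To illustrate the idea, here is a rough sketch of the padding described in (a); it is not the starter code, and the real pad_sents signature may differ.

def pad_sents_sketch(sents, pad_token):
    """Pad every sentence (a list of tokens) to the length of the longest one."""
    max_len = max(len(s) for s in sents)
    return [s + [pad_token] * (max_len - len(s)) for s in sents]

# Example:
print(pad_sents_sketch([['a', 'b', 'c'], ['d']], '<pad>'))
# [['a', 'b', 'c'], ['d', '<pad>', '<pad>']]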
(b) (3 points) (coding) Implement the __init__ function in model_embeddings.py to initialize the necessary source and target embeddings.
(c) (4 points) (coding) Implement the __init__ function in nmt_model.py to initialize the necessary model embeddings (using the ModelEmbeddings class from model_embeddings.py) and layers (LSTM, projection, and dropout) for the NMT system.
(d) (8 points) (coding) Implement the encode function in nmt_model.py. This function converts the padded source sentences into the tensor $X$, generates $h_1^{\text{enc}}, \ldots, h_m^{\text{enc}}$, and computes the initial state $h_0^{\text{dec}}$ and initial cell $c_0^{\text{dec}}$ for the Decoder. You can run a non-comprehensive sanity check by executing:
python sanity_check.py 1d
(e) (8 points) (coding) Implement the decode function in nmt_model.py. This function constructs $\overline{y}$ and runs the step function over every timestep for the input. You can run a non-comprehensive
sanity check by executing:
python sanity_check.py 1e
(f) (10 points) (coding) Implement the step function in nmt_model.py. This function applies the Decoder's LSTM cell for a single timestep, computing the encoding of the target subword $h_t^{\text{dec}}$, the attention scores $e_t$, the attention distribution $\alpha_t$, the attention output $a_t$, and finally the combined output $o_t$. You can run a non-comprehensive sanity check by executing:
python sanity_check.py 1f
(g) (3 points) (written) The generate_sent_masks() function in nmt_model.py produces a tensor called enc_masks. It has shape (batch size, max source sentence length) and contains 1s in positions corresponding to 'pad' tokens in the input, and 0s for non-pad tokens. Look at how the masks are used during the attention computation in the step() function.
First explain (in around three sentences) what effect the masks have on the entire attention computation. Then explain (in one or two sentences) why it is necessary to use the masks in this way.
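For reference, the usual masking pattern looks like the sketch below (illustrative only; it is not copied from the provided step()): scores at pad positions are set to $-\infty$ before the softmax.

import torch
import torch.nn.functional as F

e_t = torch.tensor([[1.0, 2.0, 0.5, 0.1]])                  # attention scores, (batch, m)
enc_masks = torch.tensor([[0, 0, 1, 1]], dtype=torch.bool)  # True = pad position

e_t = e_t.masked_fill(enc_masks, -float('inf'))
alpha_t = F.softmax(e_t, dim=1)
print(alpha_t)   # the pad positions receive attention weight 0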
Now it’s time to get things running! Execute the following to generate the necessary vocab file:
sh run.sh vocab
Or, if you are on Windows, use the following command instead. Make sure you execute this in an environment that has python on its PATH; for example, you can run it in the terminal of your IDE or your Anaconda prompt.
run.bat vocab
As noted earlier, we recommend that you develop the code on your personal computer. Confirm that
you are running in the proper conda environment and then execute the following command to train the
model on your local machine:
sh run.sh train_local
(Windows) run.bat train_local
You should see a significant decrease in loss during the initial iterations. Once you have ensured that
your code does not crash (i.e. let it run till iter 10 or iter 20), power on your VM from the Azure
Web Portal. Then read the Managing Code Deployment to a VM section of our Practical Guide to VMs
(link also given on website and Ed) for instructions on how to upload your code to the VM.
Next, install necessary packages to your VM by running:
pip install -r gpu_requirements.txt
Finally, turn to the Managing Processes on a VM section of the Practical Guide and follow the instructions to create a new tmux session. Concretely, run the following command to create a tmux session called nmt.
tmux new -s nmt
Once your VM is configured and you are in a tmux session, execute:
sh run.sh train
(Windows) run.bat train
Once you know your code is running properly, you can detach from the session and close your ssh connection to the server. To detach from the session, run:
tmux detach
You can return to your training model by ssh-ing back into the server and attaching to the tmux session
by running:
tmux a -t nmt
(h) (3 points) Once your model is done training (this should take under 1 hour on the VM),
execute the following command to test the model:
sh run.sh test
(Windows) run.bat test
Please report the model’s corpus BLEU Score. It should be larger than 10.
(i) (4 points) (written) In class, we learned about dot product attention, multiplicative attention, and additive attention. As a reminder, dot product attention is $e_{t,i} = s_t^T h_i$, multiplicative attention is $e_{t,i} = s_t^T W h_i$, and additive attention is $e_{t,i} = v^T \tanh(W_1 h_i + W_2 s_t)$. (A small numeric sketch of the three scoring functions follows the sub-questions below.)
i. (2 points) Explain one advantage and one disadvantage of dot product attention compared to
multiplicative attention.
ii. (2 points) Explain one advantage and one disadvantage of additive attention compared to multiplicative attention.
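A small numeric sketch of the three scoring functions above, using toy dimensions; $W$, $W_1$, $W_2$, and $v$ are illustrative parameters, not part of the assignment code.

import torch

h_dim, s_dim, att_dim = 4, 3, 5
h_i = torch.randn(h_dim)                 # an encoder hidden state
s_t = torch.randn(s_dim)                 # a decoder state

# Dot product attention needs s_t and h_i to have the same dimension, so we use
# a stand-in decoder state of matching size here.
e_dot = torch.dot(torch.randn(h_dim), h_i)

# Multiplicative attention: a single weight matrix maps between the two spaces.
W = torch.randn(s_dim, h_dim)
e_mult = s_t @ (W @ h_i)

# Additive attention: two projections, a nonlinearity, and a separate vector v.
W1 = torch.randn(att_dim, h_dim)
W2 = torch.randn(att_dim, s_dim)
v = torch.randn(att_dim)
e_add = v @ torch.tanh(W1 @ h_i + W2 @ s_t)

print(e_dot.item(), e_mult.item(), e_add.item())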
2. Analyzing NMT Systems (30 points)
(a) (2 points) In part 1, we modeled our NMT problem at a subword-level. That is, given a sentence in
the source language, we looked up subword components from an embeddings matrix. Alternatively,
we could have modeled the NMT problem at the word-level, by looking up whole words from the
embeddings matrix. Why might it be important to model our Cherokee-to-English NMT problem
at the subword-level vs. the whole word-level? (Hint: Cherokee is a polysynthetic language.)
(b) (2 points) Character-level and subword embeddings are often smaller than whole word embeddings.
In 1-2 sentences, explain one reason why this might be the case.
(c) (2 points) One challenge of training successful NMT models is the lack of language data, particularly for resource-scarce languages like Cherokee. One way of addressing this challenge is with multilingual training, where we train our NMT model on multiple languages (including Cherokee). You can read more about multilingual training here. How does multilingual training help improve NMT performance with low-resource languages?
(d) (6 points) Here we present three examples of errors we found in the outputs of our NMT model
(which is the same as the one you just trained). The errors are underlined in the NMT translation
sentence. For each example of a source sentence, reference (i.e., ‘gold’) English translation, and
NMT (i.e., ‘model’) English translation, please:
1. Provide possible reason(s) why the model may have made the error (either due to a specific
linguistic construct or a specific model limitation).
2. Describe one possible way we might alter the NMT system to fix the observed error. There is more than one possible fix for each error. For example, it could be tweaking the size of the hidden layers or changing the attention mechanism.
Below are the translations that you should analyze as described above. Only analyze the underlined
error in each sentence. Rest assured that you don’t need to know Cherokee to answer these questions.
You just need to know English! If, however, you would like additional color on the source sentences,
feel free to use resources like https://www.cherokeedictionary.net/ to look up words.
i. (2 points) Source Sentence: Yona utsesdo ustiyegv anitsilvsgi digvtanv uwoduisdei.
Reference Translation: Fern had a crown of daisies in her hair.
NMT Translation: Fern had her hair with her hair.
ii. (2 points) Source Sentence: Ulihelisdi nigalisda.
Reference Translation: She is very excited.
NMT Translation: It’s joy.
iii. (2 points) Source Sentence: Tsesdi hana yitsadawoesdi usdi atsadi!
Reference Translation: Don’t swim there, Littlefish!
NMT Translation: Don’t know how a small fish!
(e) (4 points) Now it is time to explore the outputs of the model that you have trained! The test-set translations your model produced in question 1 (h) should be located in outputs/test_outputs.txt.
i. (2 points) Find a line where the predicted translation is correct for a long (4 or 5 word) sequence
of words. Check the training target file (English); does the training file contain that string
(almost) verbatim? If so or if not, what does this say about what the MT system learned to
do?
ii. (2 points) Find a line where the predicted translation starts off correct for a long (4 or 5
word) sequence of words, but then diverges (where the latter part of the sentence seems totally
unrelated). What does this say about the model’s decoding behavior?
(f) (14 points) BLEU score is the most commonly used automatic evaluation metric for NMT systems. It is usually calculated across the entire test set, but here we will consider BLEU defined for a single example.² Suppose we have a source sentence $s$, a set of $k$ reference translations $r_1, \ldots, r_k$, and a candidate translation $c$. To compute the BLEU score of $c$, we first compute the modified $n$-gram precision $p_n$ of $c$, for each of $n = 1, 2, 3, 4$, where $n$ is the $n$ in $n$-gram:

$$p_n = \frac{\displaystyle\sum_{\text{ngram} \in c} \min\Big(\max_{i=1,\ldots,k} \mathrm{Count}_{r_i}(\text{ngram}),\ \mathrm{Count}_{c}(\text{ngram})\Big)}{\displaystyle\sum_{\text{ngram} \in c} \mathrm{Count}_{c}(\text{ngram})} \tag{15}$$
Here, for each of the n-grams that appear in the candidate translation c, we count the maximum number of times it appears in any one reference translation, capped by the number of times
it appears in c (this is the numerator). We divide this by the number of n-grams in c (denominator).
Next, we compute the brevity penalty BP. Let $\mathrm{len}(c)$ be the length of $c$ and let $\mathrm{len}(r)$ be the length of the reference translation that is closest to $\mathrm{len}(c)$ (in the case of two equally-close reference translation lengths, choose $\mathrm{len}(r)$ as the shorter one).

$$BP = \begin{cases} 1 & \text{if } \mathrm{len}(c) \ge \mathrm{len}(r) \\[4pt] \exp\!\left(1 - \dfrac{\mathrm{len}(r)}{\mathrm{len}(c)}\right) & \text{otherwise} \end{cases} \tag{16}$$
Lastly, the BLEU score for candidate $c$ with respect to $r_1, \ldots, r_k$ is:

$$\text{BLEU} = BP \times \exp\!\left(\sum_{n=1}^{4} \lambda_n \log p_n\right) \tag{17}$$

where $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ are weights that sum to 1. The log here is natural log.
²This definition of sentence-level BLEU score matches the sentence_bleu() function in the nltk Python package. Note that the NLTK function is sensitive to capitalization. In this question, all text is lowercased, so capitalization is irrelevant.
http://www.nltk.org/api/nltk.translate.html#nltk.translate.bleu_score.sentence_ble
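To make equations (15)-(17) concrete, here is a from-scratch sketch of the single-example BLEU computation. It mirrors the formulas above under the assumption that the candidate has at least four tokens and every $p_n$ is nonzero (otherwise smoothing would be needed); the function name and the toy sentences are made up, and it is not a substitute for nltk's sentence_bleu().

from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu_sketch(refs, cand, weights=(0.25, 0.25, 0.25, 0.25)):
    """refs: list of reference token lists; cand: candidate token list."""
    log_p = 0.0
    for n, lam in enumerate(weights, start=1):
        cand_counts = Counter(ngrams(cand, n))
        max_ref_counts = Counter()
        for ref in refs:
            ref_counts = Counter(ngrams(ref, n))
            for g in cand_counts:
                max_ref_counts[g] = max(max_ref_counts[g], ref_counts[g])
        clipped = sum(min(cand_counts[g], max_ref_counts[g]) for g in cand_counts)
        total = sum(cand_counts.values())
        p_n = clipped / total      # modified n-gram precision, equation (15)
        log_p += lam * log(p_n)    # assumes p_n > 0; real BLEU needs smoothing otherwise
    # Brevity penalty, equation (16): len(r) is the reference length closest to
    # len(cand), with ties broken toward the shorter reference.
    len_c = len(cand)
    len_r = min((abs(len(r) - len_c), len(r)) for r in refs)[1]
    bp = 1.0 if len_c >= len_r else exp(1 - len_r / len_c)
    return bp * exp(log_p)         # equation (17)

# Usage with lowercased, whitespace-tokenized strings:
refs = ['the cat is on the mat'.split(), 'there is a cat on the mat'.split()]
cand = 'the cat is on mat'.split()
print(round(sentence_bleu_sketch(refs, cand), 4))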