NLP-Assignment4

Assignment 4 Walkthrough

RNNs and Neural Machine Translation

Machine translation means building a system that maps a sentence in a source language to a sentence in a target language: given a source sentence (say, in Spanish), the system outputs a target sentence (say, in English). In this assignment we implement a Seq2Seq model with attention to build a neural machine translation (NMT) system. We first look at how the NMT system is trained; it uses a bidirectional LSTM as the encoder and a unidirectional LSTM as the decoder.

Given a source sentence of length m, the embedding layer produces the input sequence \(x_1, x_2, \ldots, x_m \in \mathbb{R}^{e \times 1}\), where \(e\) is the word-embedding size. Running this sequence through the bidirectional encoder yields the hidden and cell states of the forward (→) and backward (←) LSTMs; concatenating the two directions gives the encoder hidden state \(h_i^{enc}\) and cell state \(c_i^{enc}\) at time step \(i\): \[ \mathbf{h}_{i}^{\text{enc}}=\left[\overleftarrow{\mathbf{h}_{i}^{\text{enc}}} ; \overrightarrow{\mathbf{h}_{i}^{\text{enc}}}\right] \text{ where } \mathbf{h}_{i}^{\text{enc}} \in \mathbb{R}^{2 h \times 1},\ \overleftarrow{\mathbf{h}_{i}^{\text{enc}}}, \overrightarrow{\mathbf{h}_{i}^{\text{enc}}} \in \mathbb{R}^{h \times 1} \quad 1 \leq i \leq m \]

\[ \mathbf{c}_{i}^{\text{enc}}=\left[\overleftarrow{\mathbf{c}_{i}^{\text{enc}}} ; \overrightarrow{\mathbf{c}_{i}^{\text{enc}}}\right] \text{ where } \mathbf{c}_{i}^{\text{enc}} \in \mathbb{R}^{2 h \times 1},\ \overleftarrow{\mathbf{c}_{i}^{\text{enc}}}, \overrightarrow{\mathbf{c}_{i}^{\text{enc}}} \in \mathbb{R}^{h \times 1} \quad 1 \leq i \leq m \]

Next, we use a linear layer over the encoder's final states to initialize the decoder's hidden and cell states: \[ \mathbf{h}_{0}^{\text{dec}}=\mathbf{W}_{h}\left[\overleftarrow{\mathbf{h}_{1}^{\text{enc}}} ; \overrightarrow{\mathbf{h}_{m}^{\text{enc}}}\right] \text{ where } \mathbf{h}_{0}^{\text{dec}} \in \mathbb{R}^{h \times 1}, \mathbf{W}_{h} \in \mathbb{R}^{h \times 2 h} \]

\[ \mathbf{c}_{0}^{\text {dec }}=\mathbf{W}_{c}\left[\overleftarrow{\mathbf{c}_{1}^{\text {enc }}} ; \overrightarrow{\mathbf{c}_{m}^{\text {enc }}}\right] \text { where } \mathbf{c}_{0}^{\text {dec }} \in \mathbb{R}^{h \times 1}, \mathbf{W}_{c} \in \mathbb{R}^{h \times 2 h} \]

The decoder's input at time step \(t\) is \(\bar{y}_t\), obtained by concatenating the embedding of the target word \(y_t\) with the previous step's combined output \(o_{t-1}\) (with \(o_0\) the zero vector), so \(\bar{y}_t \in \mathbb{R}^{(e + h) \times 1}\):

\[ \mathbf{h}_{t}^{\mathrm{dec}}, \mathbf{c}_{t}^{\mathrm{dec}}=\operatorname{Decoder}\left(\overline{\mathbf{y}_{t}}, \mathbf{h}_{t-1}^{\mathrm{dec}}, \mathbf{c}_{t-1}^{\mathrm{dec}}\right) \text{ where } \mathbf{h}_{t}^{\mathrm{dec}} \in \mathbb{R}^{h \times 1}, \mathbf{c}_{t}^{\mathrm{dec}} \in \mathbb{R}^{h \times 1} \]

We then use \(h^{dec}_t\) to compute multiplicative attention over \(h^{enc}_1, h^{enc}_2, \ldots, h^{enc}_m\):

\[ \begin{array}{c} \mathbf{e}_{t, i}=\left(\mathbf{h}_{t}^{\mathrm{dec}}\right)^{T} \mathbf{W}_{\text{attProj}} \mathbf{h}_{i}^{\text{enc}} \text{ where } \mathbf{e}_{t} \in \mathbb{R}^{m \times 1}, \mathbf{W}_{\text{attProj}} \in \mathbb{R}^{h \times 2 h} \quad 1 \leq i \leq m \\ \alpha_{t}=\operatorname{softmax}\left(\mathbf{e}_{t}\right) \text{ where } \alpha_{t} \in \mathbb{R}^{m \times 1} \\ \mathbf{a}_{t}=\sum_{i=1}^{m} \alpha_{t, i} \mathbf{h}_{i}^{\text{enc}} \text{ where } \mathbf{a}_{t} \in \mathbb{R}^{2 h \times 1} \end{array} \]

Next, we concatenate the attention output \(a_t\) with the decoder hidden state \(h^{dec}_t\) and feed the result through a linear layer, tanh, and dropout to obtain the combined-output vector \(o_t\):

\[ \begin{array}{r} \mathbf{u}_{t}=\left[\mathbf{a}_{t} ; \mathbf{h}_{t}^{\mathrm{dec}}\right] \text{ where } \mathbf{u}_{t} \in \mathbb{R}^{3 h \times 1} \\ \mathbf{v}_{t}=\mathbf{W}_{u} \mathbf{u}_{t} \text{ where } \mathbf{v}_{t} \in \mathbb{R}^{h \times 1}, \mathbf{W}_{u} \in \mathbb{R}^{h \times 3 h} \\ \mathbf{o}_{t}=\operatorname{Dropout}\left(\tanh \left(\mathbf{v}_{t}\right)\right) \text{ where } \mathbf{o}_{t} \in \mathbb{R}^{h \times 1} \end{array} \]

The probability distribution over target words at step \(t\) is then:

\[ \mathbf{P}_{t}=\operatorname{softmax}\left(\mathbf{W}_{\text{vocab}} \mathbf{o}_{t}\right) \text{ where } \mathbf{P}_{t} \in \mathbb{R}^{V_{t} \times 1}, \mathbf{W}_{\text{vocab}} \in \mathbb{R}^{V_{t} \times h} \]

where \(V_t\) is the target vocabulary size. The training objective at step \(t\) is the cross-entropy between \(P_t\) and \(g_t\), the one-hot vector of the gold target word:

\[ J_{t}(\theta)=\operatorname{CrossEntropy}\left(\mathbf{P}_{t}, \mathbf{g}_{t}\right) \]

For the implementation, the key is keeping track of the tensor dimensions at every step; once the dimensions match up, the whole pipeline is straightforward to write.
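
To make the dimension bookkeeping concrete, below is a minimal shape-check of one attention step in PyTorch. The sizes (b = 4, m = 7, e = 256, h = 512) are made-up values for illustration, not the assignment's configuration.

import torch

# Hypothetical sizes for a quick shape check (not the assignment's defaults).
b, m, e, h = 4, 7, 256, 512

h_enc = torch.randn(b, m, 2 * h)        # concatenated bidirectional encoder states
h_dec = torch.randn(b, h)               # decoder hidden state at step t
W_attProj = torch.nn.Linear(2 * h, h, bias=False)

e_t = torch.bmm(W_attProj(h_enc), h_dec.unsqueeze(2)).squeeze(2)  # (b, m) attention scores
alpha_t = torch.softmax(e_t, dim=1)                               # (b, m) attention distribution
a_t = torch.bmm(alpha_t.unsqueeze(1), h_enc).squeeze(1)           # (b, 2h) attention output
u_t = torch.cat([a_t, h_dec], dim=1)                              # (b, 3h) input to the combined-output projection

print(e_t.shape, alpha_t.shape, a_t.shape, u_t.shape)
# torch.Size([4, 7]) torch.Size([4, 7]) torch.Size([4, 1024]) torch.Size([4, 1536])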

Part 1: Neural Machine Translation System Implementation

(a) pad_sents

In order to apply tensor operations, we must ensure that the sentences in a given batch are of the same length. Thus, we must identify the longest sentence in a batch and pad others to be the same length. Implement the pad_sents function in utils.py, which shall produce these padded sentences.

def pad_sents(sents, pad_token):
    """ Pad list of sentences according to the longest sentence in the batch.
    @param sents (list[list[str]]): list of sentences, where each sentence
        is represented as a list of words
    @param pad_token (str): padding token
    @returns sents_padded (list[list[str]]): list of sentences where sentences shorter
        than the max length sentence are padded out with the pad_token, such that
        each sentence in the batch now has equal length.
    """
    sents_padded = []

    # YOUR CODE HERE (~6 Lines)
    corpus_size = len(sents)
    lens = [len(s) for s in sents]  # every sentence's length
    max_lens = max(lens)
    sents_padded = [sents[i] + [pad_token] * (max_lens - lens[i]) for i in range(corpus_size)]  # shape: N x max_lens
    # END YOUR CODE

    return sents_padded
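
A quick usage example of pad_sents (the toy sentences and the '<pad>' token are just for illustration):

sents = [['I', 'love', 'NLP'], ['Hello'], ['Neural', 'machine', 'translation', 'rocks']]
print(pad_sents(sents, '<pad>'))
# [['I', 'love', 'NLP', '<pad>'],
#  ['Hello', '<pad>', '<pad>', '<pad>'],
#  ['Neural', 'machine', 'translation', 'rocks']]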

(b) ModelEmbeddings

Implement the __init__ function in model_embeddings.py to initialize the necessary source and target embeddings.

class ModelEmbeddings(nn.Module):
    """
    Class that converts input words to their embeddings.
    """
    def __init__(self, embed_size, vocab):
        """
        Init the Embedding layers.

        @param embed_size (int): Embedding size (dimensionality)
        @param vocab (Vocab): Vocabulary object containing src and tgt languages
                              See vocab.py for documentation.
        """
        super(ModelEmbeddings, self).__init__()
        self.embed_size = embed_size

        # default values
        self.source = None
        self.target = None

        src_pad_token_idx = vocab.src['<pad>']
        tgt_pad_token_idx = vocab.tgt['<pad>']

        # YOUR CODE HERE (~2 Lines)
        # TODO - Initialize the following variables:
        #   self.source (Embedding Layer for source language)
        #   self.target (Embedding Layer for target language)
        #
        # Note:
        #   1. `vocab` object contains two vocabularies:
        #      `vocab.src` for source
        #      `vocab.tgt` for target
        #   2. You can get the length of a specific vocabulary by running:
        #      `len(vocab.<specific_vocabulary>)`
        #   3. Remember to include the padding token for the specific vocabulary
        #      when creating your Embedding.
        #
        # Use the following docs to properly initialize these variables:
        #   Embedding Layer:
        #     https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding

        self.source = nn.Embedding(len(vocab.src), embed_size, padding_idx=src_pad_token_idx)
        self.target = nn.Embedding(len(vocab.tgt), embed_size, padding_idx=tgt_pad_token_idx)

        # END YOUR CODE
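
As a quick illustration of what these embedding layers do (the vocabulary size and word indices below are made-up numbers, not taken from the assignment's Vocab): padding_idx keeps the <pad> row at zero and blocks its gradient.

import torch
import torch.nn as nn

# Standalone sketch: an embedding table with a dedicated padding index.
# 1000 is a made-up vocabulary size; 0 is assumed to be the index of '<pad>'.
emb = nn.Embedding(num_embeddings=1000, embedding_dim=256, padding_idx=0)

word_ids = torch.tensor([[5, 7], [42, 0], [0, 0]])  # (src_len=3, b=2); id 0 marks padding
vectors = emb(word_ids)
print(vectors.shape)                     # torch.Size([3, 2, 256])
print(vectors[2, 1].abs().sum().item())  # 0.0 -- the <pad> row stays all zeros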

(c) NMT

Implement the __init__ function in nmt_model.py to initialize the necessary model embeddings (using the ModelEmbeddings class from model_embeddings.py) and layers (LSTM, projection, and dropout) for the NMT system.

class NMT(nn.Module):
    """ Simple Neural Machine Translation Model:
        - Bidirectional LSTM Encoder
        - Unidirectional LSTM Decoder
        - Global Attention Model (Luong, et al. 2015)
    """
    def __init__(self, embed_size, hidden_size, vocab, dropout_rate=0.2):
        """ Init NMT Model.

        @param embed_size (int): Embedding size (dimensionality)
        @param hidden_size (int): Hidden Size (dimensionality)
        @param vocab (Vocab): Vocabulary object containing src and tgt languages
                              See vocab.py for documentation.
        @param dropout_rate (float): Dropout probability, for attention
        """
        super(NMT, self).__init__()
        self.model_embeddings = ModelEmbeddings(embed_size, vocab)
        self.hidden_size = hidden_size
        self.dropout_rate = dropout_rate
        self.vocab = vocab

        # default values
        self.encoder = None
        self.decoder = None
        self.h_projection = None
        self.c_projection = None
        self.att_projection = None
        self.combined_output_projection = None
        self.target_vocab_projection = None
        self.dropout = None

        # YOUR CODE HERE (~8 Lines)
        # TODO - Initialize the following variables:
        #   self.encoder (Bidirectional LSTM with bias)
        #   self.decoder (LSTM Cell with bias)
        #   self.h_projection (Linear Layer with no bias), called W_{h} in the PDF.
        #   self.c_projection (Linear Layer with no bias), called W_{c} in the PDF.
        #   self.att_projection (Linear Layer with no bias), called W_{attProj} in the PDF.
        #   self.combined_output_projection (Linear Layer with no bias), called W_{u} in the PDF.
        #   self.target_vocab_projection (Linear Layer with no bias), called W_{vocab} in the PDF.
        #   self.dropout (Dropout Layer)
        #
        # Use the following docs to properly initialize these variables:
        #   LSTM:
        #     https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM
        #   LSTM Cell:
        #     https://pytorch.org/docs/stable/nn.html#torch.nn.LSTMCell
        #   Linear Layer:
        #     https://pytorch.org/docs/stable/nn.html#torch.nn.Linear
        #   Dropout Layer:
        #     https://pytorch.org/docs/stable/nn.html#torch.nn.Dropout

        self.encoder = nn.LSTM(embed_size, hidden_size, bias=True, bidirectional=True)
        self.decoder = nn.LSTMCell(embed_size + hidden_size, hidden_size, bias=True)
        self.h_projection = nn.Linear(hidden_size * 2, hidden_size, bias=False)  # project the encoder's final hidden state (R^2h) to R^h
        self.c_projection = nn.Linear(hidden_size * 2, hidden_size, bias=False)
        self.att_projection = nn.Linear(hidden_size * 2, hidden_size, bias=False)  # e_t,i = (1 x 2h) (h_enc_i) * (2h x h) (W) * (h x 1) (h_dec_t)
        self.combined_output_projection = nn.Linear(hidden_size * 3, hidden_size, bias=False)  # applied after concatenating the attention output and h_dec
        self.target_vocab_projection = nn.Linear(hidden_size, len(vocab.tgt), bias=False)  # for the final softmax over the target vocabulary
        self.dropout = nn.Dropout(self.dropout_rate)
        # END YOUR CODE

    def forward(self, source: List[List[str]], target: List[List[str]]) -> torch.Tensor:
        """ Take a mini-batch of source and target sentences, compute the log-likelihood of
        target sentences under the language models learned by the NMT system.

        @param source (List[List[str]]): list of source sentence tokens
        @param target (List[List[str]]): list of target sentence tokens, wrapped by `<s>` and `</s>`

        @returns scores (Tensor): a variable/tensor of shape (b, ) representing the
                                  log-likelihood of generating the gold-standard target sentence for
                                  each example in the input batch. Here b = batch size.
        """
        # Compute sentence lengths
        source_lengths = [len(s) for s in source]

        # Convert list of lists into tensors
        source_padded = self.vocab.src.to_input_tensor(source, device=self.device)  # Tensor: (src_len, b)
        target_padded = self.vocab.tgt.to_input_tensor(target, device=self.device)  # Tensor: (tgt_len, b)

        # Run the network forward:
        # 1. Apply the encoder to `source_padded` by calling `self.encode()`
        # 2. Generate sentence masks for `source_padded` by calling `self.generate_sent_masks()`
        # 3. Apply the decoder to compute combined-output by calling `self.decode()`
        # 4. Compute log probability distribution over the target vocabulary using the
        #    combined_outputs returned by the `self.decode()` function.

        enc_hiddens, dec_init_state = self.encode(source_padded, source_lengths)
        enc_masks = self.generate_sent_masks(enc_hiddens, source_lengths)
        combined_outputs = self.decode(enc_hiddens, enc_masks, dec_init_state, target_padded)
        P = F.log_softmax(self.target_vocab_projection(combined_outputs), dim=-1)

        # Zero out probabilities for positions where we have nothing in the target text
        target_masks = (target_padded != self.vocab.tgt['<pad>']).float()

        # Compute log probability of generating true target words
        target_gold_words_log_prob = torch.gather(P, index=target_padded[1:].unsqueeze(-1), dim=-1).squeeze(-1) * target_masks[1:]
        scores = target_gold_words_log_prob.sum(dim=0)
        return scores
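
For orientation, this is roughly how forward() gets used during training: scores holds one sentence-level log-likelihood per example, so the batch loss is its negated sum. The snippet below is a hedged sketch rather than the assignment's actual run.py; vocab, src_sents, and tgt_sents are assumed to come from the assignment's data pipeline.

# Hypothetical single training step with the NMT model above.
model = NMT(embed_size=256, hidden_size=256, vocab=vocab, dropout_rate=0.2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

scores = model(src_sents, tgt_sents)   # (b,) log-likelihood of each gold target sentence
loss = -scores.sum()                   # negative log-likelihood over the batch

optimizer.zero_grad()
loss.backward()
optimizer.step()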

(d) encode

Implement the encode function in nmt_model.py. This function converts the padded source sentences into the tensor \(X\), generates \(h^{enc}_1, \ldots, h^{enc}_m\), and computes the initial state \(h^{dec}_0\) and initial cell \(c^{dec}_0\) for the decoder. You can run a non-comprehensive sanity check by executing: python sanity_check.py 1d

def encode(self, source_padded: torch.Tensor, source_lengths: List[int]) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
    """ Apply the encoder to source sentences to obtain encoder hidden states.
        Additionally, take the final states of the encoder and project them to obtain initial states for decoder.

    @param source_padded (Tensor): Tensor of padded source sentences with shape (src_len, b), where
                                   b = batch_size, src_len = maximum source sentence length. Note that
                                   these have already been sorted in order of longest to shortest sentence.
    @param source_lengths (List[int]): List of actual lengths for each of the source sentences in the batch
    @returns enc_hiddens (Tensor): Tensor of hidden units with shape (b, src_len, h*2), where
                                   b = batch size, src_len = maximum source sentence length, h = hidden size.
    @returns dec_init_state (tuple(Tensor, Tensor)): Tuple of tensors representing the decoder's initial
                                                     hidden state and cell.
    """
    enc_hiddens, dec_init_state = None, None

    # YOUR CODE HERE (~ 8 Lines)
    # TODO:
    #   1. Construct Tensor `X` of source sentences with shape (src_len, b, e) using the source model embeddings.
    #      src_len = maximum source sentence length, b = batch size, e = embedding size. Note
    #      that there is no initial hidden state or cell for the decoder.
    #   2. Compute `enc_hiddens`, `last_hidden`, `last_cell` by applying the encoder to `X`.
    #      - Before you can apply the encoder, you need to apply the `pack_padded_sequence` function to X.
    #      - After you apply the encoder, you need to apply the `pad_packed_sequence` function to enc_hiddens.
    #      - Note that the shape of the tensor returned by the encoder is (src_len, b, h*2) and we want to
    #        return a tensor of shape (b, src_len, h*2) as `enc_hiddens`.
    #   3. Compute `dec_init_state` = (init_decoder_hidden, init_decoder_cell):
    #      - `init_decoder_hidden`:
    #        `last_hidden` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
    #        Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
    #        Apply the h_projection layer to this in order to compute init_decoder_hidden.
    #        This is h_0^{dec} in the PDF. Here b = batch size, h = hidden size
    #      - `init_decoder_cell`:
    #        `last_cell` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
    #        Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
    #        Apply the c_projection layer to this in order to compute init_decoder_cell.
    #        This is c_0^{dec} in the PDF. Here b = batch size, h = hidden size
    #
    # See the following docs, as you may need to use some of the following functions in your implementation:
    #   Pack the padded sequence X before passing to the encoder:
    #     https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pack_padded_sequence
    #   Pad the packed sequence, enc_hiddens, returned by the encoder:
    #     https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pad_packed_sequence
    #   Tensor Concatenation:
    #     https://pytorch.org/docs/stable/torch.html#torch.cat
    #   Tensor Permute:
    #     https://pytorch.org/docs/stable/tensors.html#torch.Tensor.permute
    X = self.model_embeddings.source(source_padded)  # (src_len, b, e)
    X = pack_padded_sequence(X, lengths=source_lengths)  # a packed batch lets the RNN skip computation for pad positions
    # pack_padded_sequence and pad_packed_sequence example:
    #   https://github.com/HarshTrivedi/packing-unpacking-pytorch-minimal-tutorial
    # PackedSequence: named tuple with two attributes, data & batch_sizes
    #   data: shape (sum of all sentence lengths, embed_dim)
    #   batch_sizes: number of active sequences at each time step (max = batch_size for the first word, min = 1)

    # Feeding a PackedSequence to the LSTM returns a PackedSequence with the same attributes: data & batch_sizes
    enc_hiddens, (last_hidden, last_cell) = self.encoder(X)
    # pad_packed_sequence unpacks the PackedSequence, i.e. (data & batch_sizes) -> (src_len, b, h * 2)
    # padded positions are filled with 0s
    enc_hiddens, _ = pad_packed_sequence(enc_hiddens)
    enc_hiddens = enc_hiddens.transpose(0, 1)  # swap dims -> (b, src_len, h * 2)

    last_hidden = torch.cat((last_hidden[0], last_hidden[1]), 1)  # (2, b, h) -> (b, h * 2)
    init_decoder_hidden = self.h_projection(last_hidden)

    last_cell = torch.cat((last_cell[0], last_cell[1]), 1)
    init_decoder_cell = self.c_projection(last_cell)

    dec_init_state = (init_decoder_hidden, init_decoder_cell)
    # END YOUR CODE

    return enc_hiddens, dec_init_state
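
Since packing is often the most confusing part of encode, here is a tiny standalone demonstration of pack_padded_sequence / pad_packed_sequence on toy data (all sizes are arbitrary; note the batch must be sorted longest-to-shortest, just as the model assumes).

import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Toy batch of embeddings with shape (src_len=4, b=2, e=3);
# the second "sentence" has true length 2, the rest is padding.
X = torch.randn(4, 2, 3)
lengths = [4, 2]

packed = pack_padded_sequence(X, lengths)
print(packed.batch_sizes)    # tensor([2, 2, 1, 1]) -- active sequences at each time step

lstm = torch.nn.LSTM(input_size=3, hidden_size=5, bidirectional=True)
out_packed, (h_n, c_n) = lstm(packed)

out, out_lengths = pad_packed_sequence(out_packed)
print(out.shape)             # torch.Size([4, 2, 10]) -- (src_len, b, 2*h), pad positions are zeros
print(h_n.shape)             # torch.Size([2, 2, 5])  -- (num_directions, b, h)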

(e) decode

Implement the decode function in nmt_model.py. This function constructs \(\bar{y}\) and runs the step function over every timestep of the input. You can run a non-comprehensive sanity check by executing: python sanity_check.py 1e

def decode(self, enc_hiddens: torch.Tensor, enc_masks: torch.Tensor,
           dec_init_state: Tuple[torch.Tensor, torch.Tensor], target_padded: torch.Tensor) -> torch.Tensor:
    """Compute combined output vectors for a batch.

    @param enc_hiddens (Tensor): Hidden states (b, src_len, h*2), where
                                 b = batch size, src_len = maximum source sentence length, h = hidden size.
    @param enc_masks (Tensor): Tensor of sentence masks (b, src_len), where
                               b = batch size, src_len = maximum source sentence length.
    @param dec_init_state (tuple(Tensor, Tensor)): Initial state and cell for decoder
    @param target_padded (Tensor): Gold-standard padded target sentences (tgt_len, b), where
                                   tgt_len = maximum target sentence length, b = batch size.

    @returns combined_outputs (Tensor): combined output tensor (tgt_len, b, h), where
                                        tgt_len = maximum target sentence length, b = batch size, h = hidden size
    """
    # Chop off the <END> token for max length sentences.
    target_padded = target_padded[:-1]

    # Initialize the decoder state (hidden and cell)
    dec_state = dec_init_state

    # Initialize previous combined output vector o_{t-1} as zero
    batch_size = enc_hiddens.size(0)
    o_prev = torch.zeros(batch_size, self.hidden_size, device=self.device)

    # Initialize a list we will use to collect the combined output o_t on each step
    combined_outputs = []

    # YOUR CODE HERE (~9 Lines)
    # TODO:
    #   1. Apply the attention projection layer to `enc_hiddens` to obtain `enc_hiddens_proj`,
    #      which should be shape (b, src_len, h),
    #      where b = batch size, src_len = maximum source length, h = hidden size.
    #      This is applying W_{attProj} to h^enc, as described in the PDF.
    #   2. Construct tensor `Y` of target sentences with shape (tgt_len, b, e) using the target model embeddings.
    #      where tgt_len = maximum target sentence length, b = batch size, e = embedding size.
    #   3. Use the torch.split function to iterate over the time dimension of Y.
    #      Within the loop, this will give you Y_t of shape (1, b, e) where b = batch size, e = embedding size.
    #      - Squeeze Y_t into a tensor of dimension (b, e).
    #      - Construct Ybar_t by concatenating Y_t with o_prev.
    #      - Use the step function to compute the Decoder's next (cell, state) values
    #        as well as the new combined output o_t.
    #      - Append o_t to combined_outputs
    #      - Update o_prev to the new o_t.
    #   4. Use torch.stack to convert combined_outputs from a list length tgt_len of
    #      tensors shape (b, h), to a single tensor shape (tgt_len, b, h)
    #      where tgt_len = maximum target sentence length, b = batch size, h = hidden size.
    #
    # Note:
    #    - When using the squeeze() function make sure to specify the dimension you want to squeeze
    #      over. Otherwise, you will remove the batch dimension accidentally, if batch_size = 1.
    #
    # Use the following docs to implement this functionality:
    #   Zeros Tensor:
    #     https://pytorch.org/docs/stable/torch.html#torch.zeros
    #   Tensor Splitting (iteration):
    #     https://pytorch.org/docs/stable/torch.html#torch.split
    #   Tensor Dimension Squeezing:
    #     https://pytorch.org/docs/stable/torch.html#torch.squeeze
    #   Tensor Concatenation:
    #     https://pytorch.org/docs/stable/torch.html#torch.cat
    #   Tensor Stacking:
    #     https://pytorch.org/docs/stable/torch.html#torch.stack

    # 1.
    enc_hiddens_proj = self.att_projection(enc_hiddens)  # (b, src_len, h * 2) x (h * 2, h) -> (b, src_len, h)
    # 2.
    Y = self.model_embeddings.target(target_padded)  # (tgt_len, b, e)
    # 3.
    for Y_t in torch.split(Y, 1, dim=0):
        squeezed = torch.squeeze(Y_t, dim=0)  # (1, b, e) -> (b, e)
        Ybar_t = torch.cat((squeezed, o_prev), dim=1)  # shape (b, e + h)
        dec_state, o_t, _ = self.step(Ybar_t, dec_state, enc_hiddens, enc_hiddens_proj, enc_masks)
        combined_outputs.append(o_t)
        o_prev = o_t
    # 4.
    combined_outputs = torch.stack(combined_outputs, dim=0)
    # END YOUR CODE

    return combined_outputs
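
A small illustration of the torch.split iteration pattern used above, with made-up shapes (tgt_len = 3, b = 2, e = 4); note that squeeze is given an explicit dimension so a batch of size 1 would not be squeezed away.

import torch

Y = torch.arange(24, dtype=torch.float).reshape(3, 2, 4)  # (tgt_len, b, e)

for t, Y_t in enumerate(torch.split(Y, 1, dim=0)):
    Y_t = Y_t.squeeze(dim=0)     # (1, b, e) -> (b, e)
    print(t, Y_t.shape)          # 0 torch.Size([2, 4]) / 1 torch.Size([2, 4]) / 2 torch.Size([2, 4])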

(f) step

Implement the step function in nmt_model.py. This function applies the decoder's LSTM cell for a single timestep, computing the decoder hidden state \(h^{dec}_t\), the attention scores \(e_t\), the attention distribution \(\alpha_t\), the attention output \(a_t\), and finally the combined output \(o_t\). You can run a non-comprehensive sanity check by executing: python sanity_check.py 1f

def step(self, Ybar_t: torch.Tensor,
         dec_state: Tuple[torch.Tensor, torch.Tensor],
         enc_hiddens: torch.Tensor,
         enc_hiddens_proj: torch.Tensor,
         enc_masks: torch.Tensor) -> Tuple[Tuple, torch.Tensor, torch.Tensor]:
    """ Compute one forward step of the LSTM decoder, including the attention computation.

    @param Ybar_t (Tensor): Concatenated Tensor of [Y_t, o_prev], with shape (b, e + h). The input for the decoder,
                            where b = batch size, e = embedding size, h = hidden size.
    @param dec_state (tuple(Tensor, Tensor)): Tuple of tensors both with shape (b, h), where b = batch size, h = hidden size.
                                              First tensor is decoder's prev hidden state, second tensor is decoder's prev cell.
    @param enc_hiddens (Tensor): Encoder hidden states Tensor, with shape (b, src_len, h * 2), where b = batch size,
                                 src_len = maximum source length, h = hidden size.
    @param enc_hiddens_proj (Tensor): Encoder hidden states Tensor, projected from (h * 2) to h. Tensor is with shape (b, src_len, h),
                                      where b = batch size, src_len = maximum source length, h = hidden size.
    @param enc_masks (Tensor): Tensor of sentence masks shape (b, src_len),
                               where b = batch size, src_len is maximum source length.

    @returns dec_state (tuple (Tensor, Tensor)): Tuple of tensors both shape (b, h), where b = batch size, h = hidden size.
                                                 First tensor is decoder's new hidden state, second tensor is decoder's new cell.
    @returns combined_output (Tensor): Combined output Tensor at timestep t, shape (b, h), where b = batch size, h = hidden size.
    @returns e_t (Tensor): Tensor of shape (b, src_len). It is attention scores distribution.
                           Note: You will not use this outside of this function.
                                 We are simply returning this value so that we can sanity check
                                 your implementation.
    """

    combined_output = None

    # YOUR CODE HERE (~3 Lines)
    # TODO:
    #   1. Apply the decoder to `Ybar_t` and `dec_state` to obtain the new dec_state.
    #   2. Split dec_state into its two parts (dec_hidden, dec_cell)
    #   3. Compute the attention scores e_t, a Tensor shape (b, src_len).
    #      Note: b = batch_size, src_len = maximum source length, h = hidden size.
    #
    # Hints:
    #   - dec_hidden is shape (b, h) and corresponds to h^dec_t in the PDF (batched)
    #   - enc_hiddens_proj is shape (b, src_len, h) and corresponds to W_{attProj} h^enc (batched).
    #   - Use batched matrix multiplication (torch.bmm) to compute e_t.
    #   - To get the tensors into the right shapes for bmm, you will need to do some squeezing and unsqueezing.
    #   - When using the squeeze() function make sure to specify the dimension you want to squeeze
    #     over. Otherwise, you will remove the batch dimension accidentally, if batch_size = 1.
    #
    # Use the following docs to implement this functionality:
    #   Batch Multiplication:
    #     https://pytorch.org/docs/stable/torch.html#torch.bmm
    #   Tensor Unsqueeze:
    #     https://pytorch.org/docs/stable/torch.html#torch.unsqueeze
    #   Tensor Squeeze:
    #     https://pytorch.org/docs/stable/torch.html#torch.squeeze

    # 1 & 2.
    dec_state = self.decoder(Ybar_t, dec_state)
    (dec_hidden, dec_cell) = dec_state
    # 3. (b, src_len, h) bmm (b, h, 1) -> (b, src_len, 1) -> (b, src_len)
    e_t = enc_hiddens_proj.bmm(dec_hidden.unsqueeze(2)).squeeze(2)
    # END YOUR CODE

    # Set e_t to -inf where enc_masks has 1
    if enc_masks is not None:
        e_t.data.masked_fill_(enc_masks.byte(), -float('inf'))  # fill pad positions with -inf, so exp(-inf) = 0 after softmax

    # YOUR CODE HERE (~6 Lines)
    # TODO:
    #   1. Apply softmax to e_t to yield alpha_t
    #   2. Use batched matrix multiplication between alpha_t and enc_hiddens to obtain the
    #      attention output vector, a_t.
    #      Hints:
    #        - alpha_t is shape (b, src_len)
    #        - enc_hiddens is shape (b, src_len, 2h)
    #        - a_t should be shape (b, 2h)
    #        - You will need to do some squeezing and unsqueezing.
    #      Note: b = batch size, src_len = maximum source length, h = hidden size.
    #
    #   3. Concatenate dec_hidden with a_t to compute tensor U_t
    #   4. Apply the combined output projection layer to U_t to compute tensor V_t
    #   5. Compute tensor O_t by first applying the Tanh function and then the dropout layer.
    #
    # Use the following docs to implement this functionality:
    #   Softmax:
    #     https://pytorch.org/docs/stable/nn.html#torch.nn.functional.softmax
    #   Batch Multiplication:
    #     https://pytorch.org/docs/stable/torch.html#torch.bmm
    #   Tensor View:
    #     https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view
    #   Tensor Concatenation:
    #     https://pytorch.org/docs/stable/torch.html#torch.cat
    #   Tanh:
    #     https://pytorch.org/docs/stable/torch.html#torch.tanh

    # 1. apply softmax to e_t
    alpha_t = F.softmax(e_t, dim=1)  # (b, src_len)
    # 2. (b, 1, src_len) bmm (b, src_len, 2h) = (b, 1, 2h) -> (b, 2h)
    att_view = (alpha_t.size(0), 1, alpha_t.size(1))
    a_t = torch.bmm(alpha_t.view(*att_view), enc_hiddens).squeeze(1)

    # 3. concatenate a_t (b, 2h) and dec_hidden (b, h) into U_t (b, 3h)
    U_t = torch.cat((a_t, dec_hidden), dim=1)
    # 4. apply the combined output projection to U_t -> V_t, shape (b, h)
    V_t = self.combined_output_projection(U_t)
    # 5. tanh, then dropout
    O_t = self.dropout(torch.tanh(V_t))

    # END YOUR CODE

    combined_output = O_t
    return dec_state, combined_output, e_t
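
The masking line above is easiest to see on a toy example: filling the pad positions of e_t with -inf drives their softmax weights to exactly zero, so the attention output only mixes real encoder states (the numbers below are arbitrary).

import torch
import torch.nn.functional as F

e_t = torch.tensor([[2.0, 1.0, 0.5, 0.3]])                 # (b=1, src_len=4) raw attention scores
enc_mask = torch.tensor([[0, 0, 1, 1]], dtype=torch.bool)  # last two source positions are <pad>

e_t = e_t.masked_fill(enc_mask, -float('inf'))
alpha_t = F.softmax(e_t, dim=1)
print(alpha_t)   # tensor([[0.7311, 0.2689, 0.0000, 0.0000]]) -- pad positions get zero attention weight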

(g) generate_sent_masks

The generate_sent_masks() function in nmt_model.py produces a tensor called enc_masks. It has shape (batch size, max source sentence length) and contains 1s in positions corresponding to '<pad>' tokens in the input, and 0s for non-pad tokens. Look at how the masks are used during the attention computation in the step() function (lines 295-296). First explain (in around three sentences) what effect the masks have on the entire attention computation. Then explain (in one or two sentences) why it is necessary to use the masks in this way.

def generate_sent_masks(self, enc_hiddens: torch.Tensor, source_lengths: List[int]) -> torch.Tensor:
    """ Generate sentence masks for encoder hidden states.

    @param enc_hiddens (Tensor): encodings of shape (b, src_len, 2*h), where b = batch size,
                                 src_len = max source length, h = hidden size.
    @param source_lengths (List[int]): List of actual lengths for each of the sentences in the batch.

    @returns enc_masks (Tensor): Tensor of sentence masks of shape (b, src_len),
                                 where src_len = max source length, h = hidden size.
    """
    enc_masks = torch.zeros(enc_hiddens.size(0), enc_hiddens.size(1), dtype=torch.float)
    for e_id, src_len in enumerate(source_lengths):
        enc_masks[e_id, src_len:] = 1
    return enc_masks.to(self.device)
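
For reference, here is what such a mask looks like for a toy batch with made-up lengths: positions beyond each sentence's true length are 1 (and will later be filled with -inf in step()), everything else is 0.

import torch

source_lengths = [5, 3, 1]   # made-up true lengths; max source length = 5
enc_masks = torch.zeros(len(source_lengths), max(source_lengths))
for e_id, src_len in enumerate(source_lengths):
    enc_masks[e_id, src_len:] = 1
print(enc_masks)
# tensor([[0., 0., 0., 0., 0.],
#         [0., 0., 0., 1., 1.],
#         [0., 1., 1., 1., 1.]])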