In order to apply tensor operations, we must ensure that the
sentences in a given batch are of the same length. Thus, we must
identify the longest sentence in a batch and pad the others to the same
length. Implement the pad_sents function in utils.py, which produces
these padded sentences.
def pad_sents(sents, pad_token):
    """ Pad list of sentences according to the longest sentence in the batch.
    @param sents (list[list[str]]): list of sentences, where each sentence
        is represented as a list of words
    @param pad_token (str): padding token
    @returns sents_padded (list[list[str]]): list of sentences where sentences shorter
        than the max-length sentence are padded out with the pad_token, such that
        each sentence in the batch now has equal length.
    """
    sents_padded = []

    # YOUR CODE HERE (~6 Lines)
    corpus_size = len(sents)
    lens = [len(s) for s in sents]  # every sentence's length
    max_len = max(lens)
    sents_padded = [sents[i] + [pad_token] * (max_len - lens[i])
                    for i in range(corpus_size)]  # shape: N x max_len
    # END YOUR CODE

    return sents_padded
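As a quick check (a toy batch made up for illustration, not part of the assignment's test suite), every sentence is padded out to the batch maximum:

batch = [["I", "like", "tea"], ["hello"]]
print(pad_sents(batch, "<pad>"))
# [['I', 'like', 'tea'], ['hello', '<pad>', '<pad>']]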
(b) ModelEmbeddings
Implement the __init__ function in model_embeddings.py
to initialize the necessary source and target embeddings.
class ModelEmbeddings(nn.Module):
    """
    Class that converts input words to their embeddings.
    """
    def __init__(self, embed_size, vocab):
        """
        Init the Embedding layers.

        @param embed_size (int): Embedding size (dimensionality)
        @param vocab (Vocab): Vocabulary object containing src and tgt languages
                              See vocab.py for documentation.
        """
        super(ModelEmbeddings, self).__init__()
        self.embed_size = embed_size

        # YOUR CODE HERE (~2 Lines)
        # TODO - Initialize the following variables:
        #   self.source (Embedding Layer for source language)
        #   self.target (Embedding Layer for target language)
        #
        # Note:
        #   1. `vocab` object contains two vocabularies:
        #        `vocab.src` for source
        #        `vocab.tgt` for target
        #   2. You can get the length of a specific vocabulary by running:
        #        `len(vocab.<specific_vocabulary>)`
        #   3. Remember to include the padding token for the specific vocabulary
        #      when creating your Embedding.
        #
        # Use the following docs to properly initialize these variables:
        #   Embedding Layer:
        #     https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding
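        # The two filled-in lines are not shown in this excerpt; a minimal sketch
        # (an assumption, not necessarily the author's exact solution) that assumes
        # each vocabulary exposes its padding index via `vocab.src['<pad>']` /
        # `vocab.tgt['<pad>']`:
        src_pad_token_idx = vocab.src['<pad>']
        tgt_pad_token_idx = vocab.tgt['<pad>']
        self.source = nn.Embedding(len(vocab.src), embed_size, padding_idx=src_pad_token_idx)
        self.target = nn.Embedding(len(vocab.tgt), embed_size, padding_idx=tgt_pad_token_idx)
        # END YOUR CODE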
Implement the __init__ function in nmt_model.py to
initialize the necessary model embeddings (using the ModelEmbeddings
class from model_embeddings.py) and layers (LSTM, projection, and
dropout) for the NMT system.
# YOUR CODE HERE (~8 Lines)
# TODO - Initialize the following variables:
#   self.encoder (Bidirectional LSTM with bias)
#   self.decoder (LSTM Cell with bias)
#   self.h_projection (Linear Layer with no bias), called W_{h} in the PDF.
#   self.c_projection (Linear Layer with no bias), called W_{c} in the PDF.
#   self.att_projection (Linear Layer with no bias), called W_{attProj} in the PDF.
#   self.combined_output_projection (Linear Layer with no bias), called W_{u} in the PDF.
#   self.target_vocab_projection (Linear Layer with no bias), called W_{vocab} in the PDF.
#   self.dropout (Dropout Layer)
#
# Use the following docs to properly initialize these variables:
#   LSTM:
#     https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM
#   LSTM Cell:
#     https://pytorch.org/docs/stable/nn.html#torch.nn.LSTMCell
#   Linear Layer:
#     https://pytorch.org/docs/stable/nn.html#torch.nn.Linear
#   Dropout Layer:
#     https://pytorch.org/docs/stable/nn.html#torch.nn.Dropout

self.encoder = nn.LSTM(embed_size, hidden_size, bias=True, bidirectional=True)
self.decoder = nn.LSTMCell(embed_size + hidden_size, hidden_size, bias=True)
# Project the encoder's final hidden/cell state from R^{2h} down to R^h.
self.h_projection = nn.Linear(hidden_size * 2, hidden_size, bias=False)
self.c_projection = nn.Linear(hidden_size * 2, hidden_size, bias=False)
# Attention: (1 x 2h) h^enc_i  x  (2h x h) W_{attProj}  x  (h x 1) h^dec_t  ->  scalar e_{t,i}
self.att_projection = nn.Linear(hidden_size * 2, hidden_size, bias=False)
# Applied to the concatenation of the attention output and h^dec_t.
self.combined_output_projection = nn.Linear(hidden_size * 3, hidden_size, bias=False)
# Projects to target-vocabulary logits before the final softmax.
self.target_vocab_projection = nn.Linear(hidden_size, len(vocab.tgt), bias=False)
self.dropout = nn.Dropout(self.dropout_rate)
# END YOUR CODE
def forward(self, source: List[List[str]], target: List[List[str]]) -> torch.Tensor:
    """ Take a mini-batch of source and target sentences, compute the log-likelihood of
    target sentences under the language models learned by the NMT system.

    @param source (List[List[str]]): list of source sentence tokens
    @param target (List[List[str]]): list of target sentence tokens, wrapped by `<s>` and `</s>`

    @returns scores (Tensor): a variable/tensor of shape (b, ) representing the
        log-likelihood of generating the gold-standard target sentence for
        each example in the input batch. Here b = batch size.
    """
    # Compute sentence lengths
    source_lengths = [len(s) for s in source]

    # Convert list of lists into tensors
    source_padded = self.vocab.src.to_input_tensor(source, device=self.device)  # Tensor: (src_len, b)
    target_padded = self.vocab.tgt.to_input_tensor(target, device=self.device)  # Tensor: (tgt_len, b)

    # Run the network forward:
    # 1. Apply the encoder to `source_padded` by calling `self.encode()`
    # 2. Generate sentence masks for `source_padded` by calling `self.generate_sent_masks()`
    # 3. Apply the decoder to compute combined-output by calling `self.decode()`
    # 4. Compute log probability distribution over the target vocabulary using the
    #    combined_outputs returned by the `self.decode()` function.
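    # The four numbered steps above are not shown in this excerpt; a minimal sketch
    # (an assumption, not necessarily the author's exact lines) that produces the
    # log-probability tensor `P` consumed below, assuming `torch.nn.functional`
    # is imported as `F` in the starter code:
    enc_hiddens, dec_init_state = self.encode(source_padded, source_lengths)
    enc_masks = self.generate_sent_masks(enc_hiddens, source_lengths)
    combined_outputs = self.decode(enc_hiddens, enc_masks, dec_init_state, target_padded)
    P = F.log_softmax(self.target_vocab_projection(combined_outputs), dim=-1)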
    # Zero out probabilities for positions that are padding in the target text
    target_masks = (target_padded != self.vocab.tgt['<pad>']).float()

    # Compute log probability of generating true target words
    target_gold_words_log_prob = torch.gather(P, index=target_padded[1:].unsqueeze(-1), dim=-1).squeeze(-1) * target_masks[1:]
    scores = target_gold_words_log_prob.sum(dim=0)
    return scores
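The gather call above deserves a closer look. Here is a toy illustration (made-up numbers, separate from the assignment code) of how torch.gather with dim=-1 picks out the log-probability of each gold token id from the vocabulary dimension:

import torch

P = torch.log(torch.tensor([[[0.7, 0.2, 0.1]],
                            [[0.1, 0.6, 0.3]]]))  # (tgt_len=2, b=1, vocab=3)
gold = torch.tensor([[0], [1]])                   # gold token ids, shape (2, 1)
picked = torch.gather(P, index=gold.unsqueeze(-1), dim=-1).squeeze(-1)
print(picked)  # [[log 0.7], [log 0.6]], shape (2, 1)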
(d) encode
Implement the encode function in nmt_model.py. This
function converts the padded source sentences into the tensor \(X\),
generates \(h^{enc}_1, \ldots, h^{enc}_m\), and computes
the initial state \(h^{dec}_0\) and
initial cell \(c^{dec}_0\) for the
Decoder. You can run a non-comprehensive sanity check by
executing: python sanity_check.py 1d
def encode(self, source_padded: torch.Tensor, source_lengths: List[int]) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
    """ Apply the encoder to source sentences to obtain encoder hidden states.
    Additionally, take the final states of the encoder and project them to obtain
    initial states for the decoder.

    @param source_padded (Tensor): Tensor of padded source sentences with shape (src_len, b),
        where b = batch_size, src_len = maximum source sentence length. Note that
        these have already been sorted in order of longest to shortest sentence.
    @param source_lengths (List[int]): List of actual lengths for each of the source sentences in the batch
    @returns enc_hiddens (Tensor): Tensor of hidden units with shape (b, src_len, h*2),
        where b = batch size, src_len = maximum source sentence length, h = hidden size.
    @returns dec_init_state (tuple(Tensor, Tensor)): Tuple of tensors representing the decoder's
        initial hidden state and cell.
    """
    enc_hiddens, dec_init_state = None, None
    # YOUR CODE HERE (~ 8 Lines)
    # TODO:
    #   1. Construct Tensor `X` of source sentences with shape (src_len, b, e) using the source model embeddings.
    #      src_len = maximum source sentence length, b = batch size, e = embedding size. Note
    #      that there is no initial hidden state or cell for the decoder.
    #   2. Compute `enc_hiddens`, `last_hidden`, `last_cell` by applying the encoder to `X`.
    #      - Before you can apply the encoder, you need to apply the `pack_padded_sequence` function to X.
    #      - After you apply the encoder, you need to apply the `pad_packed_sequence` function to enc_hiddens.
    #      - Note that the shape of the tensor returned by the encoder is (src_len, b, h*2) and we want to
    #        return a tensor of shape (b, src_len, h*2) as `enc_hiddens`.
    #   3. Compute `dec_init_state` = (init_decoder_hidden, init_decoder_cell):
    #      - `init_decoder_hidden`:
    #        `last_hidden` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
    #        Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
    #        Apply the h_projection layer to this in order to compute init_decoder_hidden.
    #        This is h_0^{dec} in the PDF. Here b = batch size, h = hidden size.
    #      - `init_decoder_cell`:
    #        `last_cell` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
    #        Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
    #        Apply the c_projection layer to this in order to compute init_decoder_cell.
    #        This is c_0^{dec} in the PDF. Here b = batch size, h = hidden size.
    #
    # See the following docs, as you may need to use some of the following functions in your implementation:
    #   Pack the padded sequence X before passing to the encoder:
    #     https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pack_padded_sequence
    #   Pad the packed sequence, enc_hiddens, returned by the encoder:
    #     https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pad_packed_sequence
    #   Tensor Concatenation:
    #     https://pytorch.org/docs/stable/torch.html#torch.cat
    #   Tensor Permute:
    #     https://pytorch.org/docs/stable/tensors.html#torch.Tensor.permute

    X = self.model_embeddings.source(source_padded)  # (src_len, b, e)
    # Packing means the RNN never computes outputs for pad positions.
    # pack_padded_sequence / pad_packed_sequence example:
    #   https://github.com/HarshTrivedi/packing-unpacking-pytorch-minimal-tutorial
    # A PackedSequence is a named tuple with two attributes, data and batch_sizes:
    #   data: shape (sum of sentence lengths, embed_dim)
    #   batch_sizes: number of active sentences at each time step when fed to the LSTM
    #     (max = batch size for the first word of every sentence, min = 1 when only the
    #     longest sentence still has words left)
    X = pack_padded_sequence(X, lengths=source_lengths)

    # Feeding a PackedSequence to the LSTM returns another PackedSequence with the same attributes.
    enc_hiddens, (last_hidden, last_cell) = self.encoder(X)
    # pad_packed_sequence unpacks the PackedSequence back into a (src_len, b, h*2) tensor;
    # padded positions are filled with zeros.
    enc_hiddens, _ = pad_packed_sequence(enc_hiddens)
    enc_hiddens = enc_hiddens.transpose(0, 1)  # swap dimensions to (b, src_len, h*2)
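    # The lines computing `init_decoder_hidden` and `init_decoder_cell` are not shown
    # in this excerpt; a sketch that follows the recipe in the TODO above
    # (concatenate the forward/backward final states, then project):
    init_decoder_hidden = self.h_projection(torch.cat((last_hidden[0], last_hidden[1]), dim=1))  # (b, 2h) -> (b, h)
    init_decoder_cell = self.c_projection(torch.cat((last_cell[0], last_cell[1]), dim=1))        # (b, 2h) -> (b, h)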
    dec_init_state = (init_decoder_hidden, init_decoder_cell)
    # END YOUR CODE

    return enc_hiddens, dec_init_state
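To see what packing does concretely, here is a standalone toy example (random tensors, not assignment code) of the pack/unpack round trip used above:

import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

X = torch.randn(4, 2, 3)                  # (src_len=4, b=2, e=3), batch already sorted by length
packed = pack_padded_sequence(X, lengths=[4, 2])
print(packed.data.shape)                  # torch.Size([6, 3]): 4 + 2 real timesteps, pads skipped
unpacked, lens = pad_packed_sequence(packed)
print(unpacked.shape, lens)               # torch.Size([4, 2, 3]) tensor([4, 2])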
(e) decode
Implement the decode function in nmt_model.py. This
function constructs \(\bar{y}\) and runs the step function over every timestep
for the input. You can run a non-comprehensive sanity check by
executing: python sanity_check.py 1e
def decode(self, enc_hiddens: torch.Tensor, enc_masks: torch.Tensor,
           dec_init_state: Tuple[torch.Tensor, torch.Tensor], target_padded: torch.Tensor) -> torch.Tensor:
    """Compute combined output vectors for a batch.

    @param enc_hiddens (Tensor): Hidden states (b, src_len, h*2), where b = batch size,
        src_len = maximum source sentence length, h = hidden size.
    @param enc_masks (Tensor): Tensor of sentence masks (b, src_len), where b = batch size,
        src_len = maximum source sentence length.
    @param dec_init_state (tuple(Tensor, Tensor)): Initial state and cell for decoder
    @param target_padded (Tensor): Gold-standard padded target sentences (tgt_len, b), where
        tgt_len = maximum target sentence length, b = batch size.

    @returns combined_outputs (Tensor): combined output tensor (tgt_len, b, h), where
        tgt_len = maximum target sentence length, b = batch_size, h = hidden size
    """
    # Chop off the <END> token for max length sentences.
    target_padded = target_padded[:-1]

    # Initialize the decoder state (hidden and cell)
    dec_state = dec_init_state

    # Initialize previous combined output vector o_{t-1} as zero
    batch_size = enc_hiddens.size(0)
    o_prev = torch.zeros(batch_size, self.hidden_size, device=self.device)

    # Initialize a list we will use to collect the combined output o_t on each step
    combined_outputs = []

    # YOUR CODE HERE (~9 Lines)
    # TODO:
    #   1. Apply the attention projection layer to `enc_hiddens` to obtain `enc_hiddens_proj`,
    #      which should be shape (b, src_len, h),
    #      where b = batch size, src_len = maximum source length, h = hidden size.
    #      This is applying W_{attProj} to h^enc, as described in the PDF.
    #   2. Construct tensor `Y` of target sentences with shape (tgt_len, b, e) using the target model embeddings,
    #      where tgt_len = maximum target sentence length, b = batch size, e = embedding size.
    #   3. Use the torch.split function to iterate over the time dimension of Y.
    #      Within the loop, this will give you Y_t of shape (1, b, e) where b = batch size, e = embedding size.
    #      - Squeeze Y_t into a tensor of dimension (b, e).
    #      - Construct Ybar_t by concatenating Y_t with o_prev.
    #      - Use the step function to compute the Decoder's next (cell, state) values
    #        as well as the new combined output o_t.
    #      - Append o_t to combined_outputs
    #      - Update o_prev to the new o_t.
    #   4. Use torch.stack to convert combined_outputs from a list length tgt_len of
    #      tensors shape (b, h), to a single tensor shape (tgt_len, b, h)
    #      where tgt_len = maximum target sentence length, b = batch size, h = hidden size.
    #
    # Note:
    #   - When using the squeeze() function make sure to specify the dimension you want to squeeze
    #     over. Otherwise, you will remove the batch dimension accidentally, if batch_size = 1.
    #
    # Use the following docs to implement this functionality:
    #   Zeros Tensor:
    #     https://pytorch.org/docs/stable/torch.html#torch.zeros
    #   Tensor Splitting (iteration):
    #     https://pytorch.org/docs/stable/torch.html#torch.split
    #   Tensor Dimension Squeezing:
    #     https://pytorch.org/docs/stable/torch.html#torch.squeeze
    #   Tensor Concatenation:
    #     https://pytorch.org/docs/stable/torch.html#torch.cat
    #   Tensor Stacking:
    #     https://pytorch.org/docs/stable/torch.html#torch.stack
    # 1.
    enc_hiddens_proj = self.att_projection(enc_hiddens)  # (b, src_len, h*2) x (h*2, h) -> (b, src_len, h)
    # 2.
    Y = self.model_embeddings.target(target_padded)  # (tgt_len, b, e)
    # 3.
    for Y_t in torch.split(Y, 1, dim=0):
        squeezed = torch.squeeze(Y_t, dim=0)  # shape (b, e); specify dim so batch_size = 1 survives
        Ybar_t = torch.cat((squeezed, o_prev), dim=1)  # shape (b, e + h)
        dec_state, o_t, _ = self.step(Ybar_t, dec_state, enc_hiddens, enc_hiddens_proj, enc_masks)
        combined_outputs.append(o_t)
        o_prev = o_t
    # 4.
    combined_outputs = torch.stack(combined_outputs, dim=0)  # (tgt_len, b, h)
    # END YOUR CODE

    return combined_outputs
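A standalone toy example (made-up tensor, not assignment code) of the split-and-squeeze pattern in the loop above, with dim=0 specified so that a batch of size 1 would keep its batch dimension:

import torch

Y = torch.arange(12.).reshape(3, 2, 2)      # (tgt_len=3, b=2, e=2)
for Y_t in torch.split(Y, 1, dim=0):        # Y_t has shape (1, b, e)
    print(torch.squeeze(Y_t, dim=0).shape)  # torch.Size([2, 2]) at every step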
(f) step
Implement the step function in nmt_model.py. This
function applies the Decoder's LSTM cell for a single timestep,
computing the encoding of the target word \(h^{dec}_t\), the attention scores
\(e_t\), attention distribution \(\alpha_t\), the attention output \(a_t\), and finally the
combined output \(o_t\). You can run a non-comprehensive sanity check by
executing: python sanity_check.py 1f
def step(self, Ybar_t: torch.Tensor,
         dec_state: Tuple[torch.Tensor, torch.Tensor],
         enc_hiddens: torch.Tensor,
         enc_hiddens_proj: torch.Tensor,
         enc_masks: torch.Tensor) -> Tuple[Tuple, torch.Tensor, torch.Tensor]:
    """ Compute one forward step of the LSTM decoder, including the attention computation.

    @param Ybar_t (Tensor): Concatenated Tensor of [Y_t o_prev], with shape (b, e + h). The input for the decoder,
        where b = batch size, e = embedding size, h = hidden size.
    @param dec_state (tuple(Tensor, Tensor)): Tuple of tensors both with shape (b, h), where b = batch size, h = hidden size.
        First tensor is decoder's prev hidden state, second tensor is decoder's prev cell.
    @param enc_hiddens (Tensor): Encoder hidden states Tensor, with shape (b, src_len, h * 2), where b = batch size,
        src_len = maximum source length, h = hidden size.
    @param enc_hiddens_proj (Tensor): Encoder hidden states Tensor, projected from (h * 2) to h. Tensor is with shape
        (b, src_len, h), where b = batch size, src_len = maximum source length, h = hidden size.
    @param enc_masks (Tensor): Tensor of sentence masks shape (b, src_len),
        where b = batch size, src_len is maximum source length.

    @returns dec_state (tuple (Tensor, Tensor)): Tuple of tensors both shape (b, h), where b = batch size, h = hidden size.
        First tensor is decoder's new hidden state, second tensor is decoder's new cell.
    @returns combined_output (Tensor): Combined output Tensor at timestep t, shape (b, h), where b = batch size, h = hidden size.
    @returns e_t (Tensor): Tensor of shape (b, src_len). It is the attention scores distribution.
        Note: You will not use this outside of this function.
              We are simply returning this value so that we can sanity check your implementation.
    """

    combined_output = None

    # YOUR CODE HERE (~3 Lines)
    # TODO:
    #   1. Apply the decoder to `Ybar_t` and `dec_state` to obtain the new dec_state.
    #   2. Split dec_state into its two parts (dec_hidden, dec_cell)
    #   3. Compute the attention scores e_t, a Tensor shape (b, src_len).
    #      Note: b = batch_size, src_len = maximum source length, h = hidden size.
    #
    # Hints:
    #   - dec_hidden is shape (b, h) and corresponds to h^dec_t in the PDF (batched)
    #   - enc_hiddens_proj is shape (b, src_len, h) and corresponds to W_{attProj} h^enc (batched).
    #   - Use batched matrix multiplication (torch.bmm) to compute e_t.
    #   - To get the tensors into the right shapes for bmm, you will need to do some squeezing and unsqueezing.
    #   - When using the squeeze() function make sure to specify the dimension you want to squeeze
    #     over. Otherwise, you will remove the batch dimension accidentally, if batch_size = 1.
    #
    # Use the following docs to implement this functionality:
    #   Batch Multiplication:
    #     https://pytorch.org/docs/stable/torch.html#torch.bmm
    #   Tensor Unsqueeze:
    #     https://pytorch.org/docs/stable/torch.html#torch.unsqueeze
    #   Tensor Squeeze:
    #     https://pytorch.org/docs/stable/torch.html#torch.squeeze
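    # The three filled-in lines are not shown in this excerpt; a sketch that follows
    # the TODO above (run the LSTM cell, unpack its state, then score each encoder
    # position against the new hidden state with a batched matrix product):
    dec_state = self.decoder(Ybar_t, dec_state)
    (dec_hidden, dec_cell) = dec_state
    e_t = torch.bmm(enc_hiddens_proj, dec_hidden.unsqueeze(2)).squeeze(2)  # (b, src_len, h) x (b, h, 1) -> (b, src_len)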
    # Set e_t to -inf where enc_masks has 1
    if enc_masks is not None:
        # Pad positions (mask value 1) are filled with -inf so that exp(-inf) = 0 after the softmax.
        e_t.data.masked_fill_(enc_masks.byte(), -float('inf'))
    # YOUR CODE HERE (~6 Lines)
    # TODO:
    #   1. Apply softmax to e_t to yield alpha_t
    #   2. Use batched matrix multiplication between alpha_t and enc_hiddens to obtain the
    #      attention output vector, a_t.
    #      Hints:
    #        - alpha_t is shape (b, src_len)
    #        - enc_hiddens is shape (b, src_len, 2h)
    #        - a_t should be shape (b, 2h)
    #        - You will need to do some squeezing and unsqueezing.
    #      Note: b = batch size, src_len = maximum source length, h = hidden size.
    #
    #   3. Concatenate dec_hidden with a_t to compute tensor U_t
    #   4. Apply the combined output projection layer to U_t to compute tensor V_t
    #   5. Compute tensor O_t by first applying the Tanh function and then the dropout layer.
    #
    # Use the following docs to implement this functionality:
    #   Softmax:
    #     https://pytorch.org/docs/stable/nn.html#torch.nn.functional.softmax
    #   Batch Multiplication:
    #     https://pytorch.org/docs/stable/torch.html#torch.bmm
    #   Tensor View:
    #     https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view
    #   Tensor Concatenation:
    #     https://pytorch.org/docs/stable/torch.html#torch.cat
    #   Tanh:
    #     https://pytorch.org/docs/stable/torch.html#torch.tanh
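    # The six filled-in lines are not shown in this excerpt; a sketch that follows the
    # numbered steps above, assuming `torch.nn.functional` is imported as `F`:
    alpha_t = F.softmax(e_t, dim=1)                                # (b, src_len)
    a_t = torch.bmm(alpha_t.unsqueeze(1), enc_hiddens).squeeze(1)  # (b, 1, src_len) x (b, src_len, 2h) -> (b, 2h)
    U_t = torch.cat((dec_hidden, a_t), dim=1)                      # (b, 3h)
    V_t = self.combined_output_projection(U_t)                     # (b, h)
    O_t = self.dropout(torch.tanh(V_t))                            # (b, h)
    # The starter code then presumably sets combined_output = O_t and
    # returns (dec_state, combined_output, e_t).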
The generate_sent_masks() function in nmt_model.py produces a tensor
called enc_masks. It has shape (batch size, max source sentence length)
and contains 1s in positions corresponding to 'pad' tokens in the input,
and 0s for non-pad tokens. Look at how the masks are used during the
attention computation in the step() function (lines 295-296). First
explain (in around three sentences) what effect the masks have on the
entire attention computation. Then explain (in one or two sentences) why
it is necessary to use the masks in this way.
def generate_sent_masks(self, enc_hiddens: torch.Tensor, source_lengths: List[int]) -> torch.Tensor:
    """ Generate sentence masks for encoder hidden states.

    @param enc_hiddens (Tensor): encodings of shape (b, src_len, 2*h), where b = batch size,
        src_len = max source length, h = hidden size.
    @param source_lengths (List[int]): List of actual lengths for each of the sentences in the batch.

    @returns enc_masks (Tensor): Tensor of sentence masks of shape (b, src_len),
        where b = batch size, src_len = max source length.
    """
    enc_masks = torch.zeros(enc_hiddens.size(0), enc_hiddens.size(1), dtype=torch.float)
    for e_id, src_len in enumerate(source_lengths):
        # Positions at or beyond the sentence's true length are padding -> mask value 1.
        enc_masks[e_id, src_len:] = 1
    return enc_masks.to(self.device)
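To make the mask's effect concrete, here is a toy illustration (made-up scores, separate from the assignment code) of what happens to the attention distribution when a pad position is filled with -inf before the softmax:

import torch
import torch.nn.functional as F

e_t = torch.tensor([[2.0, 1.0, 0.5]])                    # attention scores, src_len = 3
enc_masks = torch.tensor([[0, 0, 1]], dtype=torch.bool)  # last position is padding
e_t = e_t.masked_fill(enc_masks, -float('inf'))
alpha_t = F.softmax(e_t, dim=1)
print(alpha_t)  # the pad position gets exactly 0 weight; the real tokens share all the mass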