Neural Machine Translation with Bahdanau’s Attention Using TensorFlow and Keras
Imagine it is your exam tomorrow, and you are yet to cover a lot of chapters. You pause for a while and then take the smarter option of prioritizing. You pay attention to the chapters that (according to you) have greater weightage than the others.
We do not advise prioritizing on the last day of your semester, but hey, we all did it! Unless your assessment of which chapters carry more weight was totally off, you probably scored better than you would have by trying to cover all of them.
Taking this analogy of paying attention a little further, today, we apply this mechanism to the task of Neural Machine Translation.
In this tutorial, you will learn how to apply Bahdanau’s attention to the Neural Machine Translation task.
This lesson is the first of a 2-part series on NLP 103:
- Neural Machine Translation with Bahdanau’s Attention Using TensorFlow and Keras (this tutorial)
- Neural Machine Translation with Luong’s Attention Using TensorFlow and Keras
To learn how to apply Bahdanau’s attention to the Neural Machine Translation task, just keep reading.
Looking for the source code to this post?
Jump Right To The Downloads Section
Neural Machine Translation with Bahdanau’s Attention Using TensorFlow and Keras
In a previous blog post, we covered the mathematical intuition behind Neural Machine Translation. We request you to look at the blog post to gain in-depth knowledge about the task.
In this tutorial, we will tackle the problem of translating from a source language (English) to a target language (French) with the help of attention. We have also built an interactive demo for the translation task with the help of the model we train in this tutorial.
Introduction
We have often been astonished by Google Translate. A deep learning model can translate from any language to any other. As you might already have guessed, it is a Neural Machine Translation model.
A Neural Machine Translation model has an encoder and a decoder. The encoder encodes the source sentence into a rich representation. The decoder accepts the encoded representation and decodes it into the target language.
Bahdanau et al., in their academic paper, “Neural Machine Translation by Jointly Learning to Align and Translate,” propose to build the encoder representation each time a word is decoded in the decoder.
This dynamic representation will depend on the parts of the input sentence most relevant to the current decoded word. We attend to the most relevant parts of the input sentence to decode the target sentence.
In this tutorial, we not only explain the concept of attention but also build a model in TensorFlow and Keras. Our task today is to have a model that translates from English to French.
Configuring Your Development Environment
To follow this guide, you need to have tensorflow
and tensorflow-text
installed on your system.
Luckily, TensorFlow is pip-installable:
$ pip install tensorflow==2.8.0
$ pip install tensorflow-text==2.8.0
Having Problems Configuring Your Development Environment?
All that said, are you:
- Short on time?
- Learning on your employer’s administratively locked system?
- Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
- Ready to run the code right now on your Windows, macOS, or Linux system?
Then join PyImageSearch University today!
Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides that are pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.
And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!
Project Structure
We first review our project directory structure.
Start by accessing the “Downloads” section of this tutorial to retrieve the source code and example images.
From there, take a look at the directory structure:
├── download.sh
├── inference.py
├── output
│   └── loss.png
├── pyimagesearch
│   ├── config.py
│   ├── dataset.py
│   ├── __init__.py
│   ├── loss.py
│   ├── models.py
│   ├── schedule.py
│   └── translator.py
├── requirements.txt
└── train.py
In the pyimagesearch directory, we have:
- config.py: The configuration file for the task.
- dataset.py: The utilities for the dataset pipeline.
- loss.py: Holds the code snippet for the losses needed to train the model.
- models.py: The Encoder and Decoder for the translation model.
- schedule.py: The learning rate scheduler for the training pipeline.
- translator.py: The train and inference models.
In the core directory, we have four scripts:
- download.sh: A shell script to download the training data.
- requirements.txt: The Python packages that are required for this tutorial.
- train.py: The script to train the model.
- inference.py: The inference script.
RNN Encoder-Decoder
In neural machine translation, we have an encoder and a decoder. The encoder builds a rich representation of the source sentence, while the decoder decodes the encoded representation and produces the target sentence.
As the source and the target sentences are sequential data, we can take help from our trusted sequential models. To get a primer on sequential modeling, follow our previous blog post.
While the representation that the encoder builds is quite useful, it does not capture every important aspect of the source sentence. With a static, fixed encoder representation, we are constrained to use the same summary of the source sentence for every decoded word.
In the next section, we will discuss the encoder, decoder, and attention module. The attention module will help gather the important information from the source sentence while decoding.
Learning to Align and Translate
Encoder
A recurrent neural network takes the present input $x_t$ and the previous hidden state $h_{t-1}$ to model the present hidden state $h_t$. This forms a recurrence chain, which helps RNNs model sequential data.
For the encoder, the authors have suggested a Bidirectional RNN. This way, the RNN provides two sets of hidden states: the forward states $\overrightarrow{h_j}$ and the backward states $\overleftarrow{h_j}$. The authors suggest that concatenating the two, $h_j = [\overrightarrow{h_j}; \overleftarrow{h_j}]$, gives a richer and better representation of each source word.
Decoder
Without attention: The decoder uses a Recurrent Neural Network to decode the encoded representation into the target sentence.
The last encoded hidden state theoretically contains information of the entire source sentence. The decoder is fed with the previous decoder hidden state, the previous decoded word, and the last encoded hidden state.
The equation below gives the conditional probability that needs to be maximized for the decoder to translate properly:

$$p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$$

where $s_t$ is the hidden state of the decoder and $c$ is the context vector, which here is simply the last encoder hidden state.
With attention: The only difference between the two settings is the context vector. With attention in place, the decoder receives a newly built context vector $c_t$ for each decoded step. Each context vector gathers the information from the parts of the source sentence that are most relevant to the current step.
The equation for the conditional probability now changes to the following:

$$p(y_t \mid \{y_1, \ldots, y_{t-1}\}, x) = g(y_{t-1}, s_t, c_t)$$
We have the entire set of hidden states (known as annotations) $(h_1, \ldots, h_{T_x})$ from the encoder. We need a mechanism to evaluate the importance of each annotation for a decoded word. We can either formulate this behavior with hard-coded equations or let another model figure it out. Turns out, being of the lazy kind, we delegate the entire workload to another model. This model is our attention module.
The attention layer takes a tuple as input: the annotations $h_j$ and the previously decoded hidden state $s_{t-1}$. This provides a set of unnormalized importance scores, $e_{tj} = a(s_{t-1}, h_j)$. Intuitively, this answers the question, “How important is $h_j$ for the word decoded at step $t$?”
With the knowledge that neural networks do great with well-defined distributions, we apply a softmax on the unnormalized importance scores to obtain a well-defined probability distribution of importance, as shown in Figure 2:

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T_x} \exp(e_{tk})}$$
The unnormalized importance scores $e_{tj}$ are called the energy matrix.
The normalized importance scores $\alpha_{tj}$ are the attention weights.
Now that we have the attention weights in hand, we pointwise multiply the weights with the annotations to retain the important annotations while suppressing the unimportant ones. Then we sum the weighted annotations to obtain a single feature-rich context vector, $c_t = \sum_{j=1}^{T_x} \alpha_{tj} h_j$.
We build the context vector for each step of the decoder and attend to specific annotations, as shown in Figure 3.
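Before we jump into the Keras layers, here is a minimal NumPy sketch of a single attention step. It is purely illustrative: the matrices Wa, Ua, and va are random stand-ins for learned projections, the toy sizes are arbitrary, and the actual model built below relies on tf.keras.layers.AdditiveAttention instead of hand-rolled math.

# a minimal sketch of one Bahdanau attention step (illustration only)
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

Tx, encDim, decDim, attnUnits = 5, 8, 8, 4    # toy sizes, chosen arbitrarily
annotations = np.random.randn(Tx, encDim)     # encoder annotations h_1 ... h_Tx
prevDecState = np.random.randn(decDim)        # previous decoder state s_{t-1}

# random stand-ins for the learned projections W_a, U_a, and v_a
Wa = np.random.randn(decDim, attnUnits)
Ua = np.random.randn(encDim, attnUnits)
va = np.random.randn(attnUnits)

# energies e_tj = v_a^T tanh(W_a s_{t-1} + U_a h_j), one score per annotation
energies = np.tanh(prevDecState @ Wa + annotations @ Ua) @ va

# attention weights alpha_tj and the context vector c_t
attnWeights = softmax(energies)
contextVector = (attnWeights[:, None] * annotations).sum(axis=0)
print(attnWeights.shape, contextVector.shape)  # (5,) and (8,)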
Code Walkthrough
Configuring the Prerequisites
Before we start our implementation, let’s go over the configuration pipeline of our project. For that, we will move on to the config.py
script located in the pyimagesearch
directory.
# define the data file name
DATA_FNAME = "fra.txt"

# define the batch size
BATCH_SIZE = 512

# define the vocab size for the source and the target
# text vectorization layers
SOURCE_VOCAB_SIZE = 15_000
TARGET_VOCAB_SIZE = 15_000

# define the encoder configurations
ENCODER_EMBEDDING_DIM = 512
ENCODER_UNITS = 512

# define the attention configuration
ATTENTION_UNITS = 512

# define the decoder configurations
DECODER_EMBEDDING_DIM = 512
DECODER_UNITS = 1024

# define the training configurations
EPOCHS = 100
LR_START = 1e-4
LR_MAX = 1e-3
WARMUP_PERCENT = 0.15

# define the patience for early stopping
PATIENCE = 10

# define the output path
OUTPUT_PATH = "output"
On Line 2, we have the data file name referencing the dataset we will use in our project. This dataset has input English sentences and their corresponding French-translated sentences.
The batch size of our data is defined on Line 5. This is followed by defining the source and target vocabulary sizes (Lines 9 and 10). You are free to adjust this to your liking for varying results.
The encoder model of our architecture will use its own embedding space for the source language. We have defined the number of dimensions for that embedding space on Line 13, followed by the encoder hidden state dimension on Line 14.
On Line 17, the number of attention units used is defined.
The decoder configurations are defined on Lines 20 and 21, mirroring the encoder configurations.
The training hyperparameters, namely the number of epochs and specifications of the learning rate, are defined on Lines 24-27. Since we will be using the early stopping callback, the patience is set on Line 30.
The final step of the configuration pipeline is setting the output path for our results (Line 33).
Configuring the Dataset
As mentioned earlier, we need a dataset containing source language-target language sentence pairs. To configure and pre-process a dataset like that, we have prepared the dataset.py
script situated in the pyimagesearch
directory.
# import the necessary packages
import tensorflow_text as tf_text
import tensorflow as tf
import random

# define a module level autotune
_AUTO = tf.data.AUTOTUNE

def load_data(fname):
    # open the file with utf-8 encoding
    with open(fname, "r", encoding="utf-8") as textFile:
        # the source and the target sentence is demarcated with tab
        # iterate over each line and split the sentences to get
        # the individual source and target sentence pairs
        lines = textFile.readlines()
        pairs = [line.split("\t")[:-1] for line in lines]

        # randomly shuffle the pairs
        random.shuffle(pairs)

        # collect the source sentences and target sentences into
        # respective lists
        source = [src for src, _ in pairs]
        target = [trgt for _, trgt in pairs]

    # return the list of source and target sentences
    return (source, target)
We start by defining the AUTOTUNE
constant for efficient training on Line 7.
The first function in this script is load_data
(Line 9), which takes in the filename as its argument. We use the open
functionality to read our dataset file with utf-8
encoding (Line 11).
We loop over each line in the text file (Line 15) and create the pairs based on the premise that each source and target pair is separated by a tab (Line 16).
To reduce bias, we randomly shuffle the pairs on Line 19. We store the split source and target sentences into corresponding variables and return them (Lines 23-27).
def splitting_dataset(source, target):
    # calculate the training and validation size
    trainSize = int(len(source) * 0.8)
    valSize = int(len(source) * 0.1)

    # split the inputs into train, val, and test
    (trainSource, trainTarget) = (source[: trainSize],
        target[: trainSize])
    (valSource, valTarget) = (source[trainSize : trainSize + valSize],
        target[trainSize : trainSize + valSize])
    (testSource, testTarget) = (source[trainSize + valSize :],
        target[trainSize + valSize :])

    # return the splits
    return (
        (trainSource, trainTarget),
        (valSource, valTarget),
        (testSource, testTarget),
    )
On Line 29, we have the splitting_dataset
function, which takes in the source and target sentences. The next step is to calculate the training dataset and validation dataset sizes (Lines 31 and 32). Using these values, we split the source and target sentences into training, test, and validation sets (Lines 35-47).
def make_dataset(splits, batchSize, train=False):
    # build a TensorFlow dataset from the input and target
    (source, target) = splits
    dataset = tf.data.Dataset.from_tensor_slices((source, target))

    # check if this is the training dataset, if so, shuffle, batch,
    # and prefetch it
    if train:
        dataset = (
            dataset
            .shuffle(dataset.cardinality().numpy())
            .batch(batchSize)
            .prefetch(_AUTO)
        )

    # otherwise, just batch the dataset
    else:
        dataset = (
            dataset
            .batch(batchSize)
            .prefetch(_AUTO)
        )

    # return the dataset
    return dataset
On Line 49, we have the make_dataset
function, which takes in the following arguments:
- splits: One of the three (train, test, and val) dataset splits.
- batchSize: The batch size we want our dataset to have.
- train: A bool variable to indicate if the dataset in question is the training dataset.
We separate the source and target sentences from the current split (Line 51) and build a TensorFlow dataset using tf.data.Dataset.from_tensor_slices
on Line 52.
On Lines 56-62, we have an if statement that checks if the train
bool is set to true. If the condition is met, we shuffle, batch, and prefetch the dataset. The else
condition is for the test and validation sets, where we only batch and prefetch the dataset (Lines 65-73).
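As a quick, hypothetical sanity check (not part of the project scripts), you could wire these utilities together and look at one batch. Note that the batches still contain raw strings; tokenization happens later inside the model via the TextVectorization layers.

from pyimagesearch.dataset import load_data, splitting_dataset, make_dataset

# load and split the sentence pairs (assumes fra.txt has been downloaded)
(source, target) = load_data(fname="fra.txt")
(train, val, test) = splitting_dataset(source=source, target=target)

# build the training dataset and inspect a single batch of raw strings
trainDs = make_dataset(splits=train, batchSize=64, train=True)
for (sourceBatch, targetBatch) in trainDs.take(1):
    print(sourceBatch.shape, targetBatch.shape)  # (64,) and (64,)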
def tf_lower_and_split_punct(text):
    # split accented characters
    text = tf_text.normalize_utf8(text, "NFKD")
    text = tf.strings.lower(text)

    # keep space, a to z, and selected punctuations
    text = tf.strings.regex_replace(text, "[^ a-z.?!,]", "")

    # add spaces around punctuation
    text = tf.strings.regex_replace(text, "[.?!,]", r" \0 ")

    # strip whitespace and add [START] and [END] tokens
    text = tf.strings.strip(text)
    text = tf.strings.join(["[START]", text, "[END]"], separator=" ")

    # return the processed text
    return text
The final data utility function is tf_lower_and_split_punct
, which takes in any single sentence as its argument (Line 75). We start by normalizing the sentences and turning them lowercase (Lines 77 and 78).
On Lines 81-84, we strip the sentence of unwanted characters and add spaces around the punctuation we keep. The surrounding whitespace is removed on Line 87, followed by the addition of the start
and end
tokens in the sentence. These tokens help the model understand when to start or end a sequence.
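To see what this standardization produces, you could run the function on a toy sentence. This snippet is only for illustration and is not part of the project scripts; the exact spacing around punctuation may differ slightly.

import tensorflow as tf
from pyimagesearch.dataset import tf_lower_and_split_punct

sample = tf.constant("Hello, World!")
print(tf_lower_and_split_punct(sample).numpy().decode())
# expect something like: [START] hello ,  world ! [END]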
Building the Bahdanau NMT Model
With our dataset utility script out of the way, the focus is now on the neural machine translation model itself. For that, we will hop into the models.py
script in the pyimagesearch
directory.
# import the necessary packages
from tensorflow.keras.layers import AdditiveAttention
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Layer
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import GRU
from tensorflow.keras import Sequential
import tensorflow as tf

class Encoder(Layer):
    def __init__(self, sourceVocabSize, embeddingDim, encUnits,
        **kwargs):
        super().__init__(**kwargs)
        # initialize the source vocab size, embedding dimensions, and
        # the encoder units
        self.sourceVocabSize = sourceVocabSize
        self.embeddingDim = embeddingDim
        self.encUnits = encUnits

    def build(self, inputShape):
        # the embedding layer converts token IDs to embedding vectors
        self.embedding = Embedding(
            input_dim=self.sourceVocabSize,
            output_dim=self.embeddingDim,
            mask_zero=True,
        )

        # the GRU layer processes the embedding vectors sequentially
        self.gru = Bidirectional(
            GRU(
                units=self.encUnits,
                # return the sequence and the state
                return_sequences=True,
                return_state=True,
                recurrent_initializer="glorot_uniform",
            )
        )
The encoder model, attention mechanism, and decoder model are packaged into separate classes for easier accessibility and usage. We start with the Encoder
class in Line 11. The __init__
function (Line 12) takes in the following arguments:
- sourceVocabSize: Defines the source vocabulary size.
- embeddingDim: Defines the embedding space size for the source vocabulary.
- encUnits: Defines the encoder hidden layer dimension.
The only purpose of this function is to create class variables of the arguments using the self
functionality (Lines 17-19).
The build
function takes in the input shape (tf
custom layers require this parameter in the function call, but it is not mandatory to use it inside the function) as its argument (Line 21). This function first creates an embedding layer (Lines 23-27), followed by the GRU
layer, which will sequentially process the embedding vectors (Lines 30-38). The hidden layer dimension is set in the __init__
function provided to the units
argument of the GRU
layer.
    def get_config(self):
        # return the configuration of the encoder layer
        return {
            "sourceVocabSize": self.sourceVocabSize,
            "embeddingDim": self.embeddingDim,
            "encUnits": self.encUnits,
        }

    def call(self, sourceTokens, state=None):
        # pass the source tokens through the embedding layer to get
        # source vectors
        sourceVectors = self.embedding(sourceTokens)

        # create the masks for the source tokens
        sourceMask = self.embedding.compute_mask(sourceTokens)

        # pass the source vectors through the GRU layer
        (encOutput, encFwdState, encBckState) = self.gru(
            inputs=sourceVectors,
            initial_state=state,
            mask=sourceMask
        )

        # return the encoder output, encoder state, and the
        # source mask
        return (encOutput, encFwdState, encBckState, sourceMask)
On Line 40, we have the get_config function, which simply returns the class variables sourceVocabSize, embeddingDim, and encUnits (Lines 42-46).
The call function on Line 48 takes in the following arguments:
- sourceTokens: The token IDs of the source sentence.
- state: The initial state for the GRU layer.
The call
function simply acts as a hub for the usage of embedding
and GRU
layers. We first pass the tokens through the embedding layer to obtain the embedding vectors, and then we create masks for the source tokens (Lines 51-54). Masking helps make our GRU
focus only on the tokens and ignore the padding.
The embedding vectors are then passed through the GRU
layer, and a forward pass consisting of the encoder output, encoder forward state, and encoder backward state (bi-directional) is obtained (Lines 57-61).
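As a quick, hypothetical shape check (again, not part of the project scripts), you could push a batch of random token IDs through the encoder. The shapes below follow from the bidirectional GRU with 512 units: the forward and backward sequences are concatenated, while the two final states are returned separately.

import tensorflow as tf
from pyimagesearch.models import Encoder

encoder = Encoder(sourceVocabSize=15_000, embeddingDim=512, encUnits=512)
dummyTokens = tf.random.uniform((4, 10), maxval=15_000, dtype=tf.int32)

(encOutput, encFwdState, encBckState, sourceMask) = encoder(dummyTokens)
print(encOutput.shape)    # (4, 10, 1024)
print(encFwdState.shape)  # (4, 512)
print(encBckState.shape)  # (4, 512)
print(sourceMask.shape)   # (4, 10)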
class BahdanauAttention(Layer):
    def __init__(self, attnUnits, **kwargs):
        super().__init__(**kwargs)
        # initialize the attention units
        self.attnUnits = attnUnits

    def build(self, inputShape):
        # the dense layers projects the query and the value
        self.denseEncoderAnnotation = Dense(
            units=self.attnUnits,
            use_bias=False,
        )
        self.denseDecoderAnnotation = Dense(
            units=self.attnUnits,
            use_bias=False,
        )

        # build the additive attention layer
        self.attention = AdditiveAttention()
As mentioned before, the Bahdanau attention block is packaged into a separate class (Line 67). On Line 68, the __init__
function takes in the number of attention units as its argument and creates a class variable for it (Line 71).
In the build
function (Line 73), the dense layers for the encoder and decoder annotations, as well as the additive attention layers, are initialized (Lines 75-85).
    def get_config(self):
        # return the configuration of the layer
        return {
            "attnUnits": self.attnUnits,
        }

    def call(self, hiddenStateEnc, hiddenStateDec, mask):
        # grab the source and target mask
        sourceMask = mask[0]
        targetMask = mask[1]

        # pass the query and value through the dense layer
        encoderAnnotation = self.denseEncoderAnnotation(hiddenStateEnc)
        decoderAnnotation = self.denseDecoderAnnotation(hiddenStateDec)

        # apply attention to align the representations
        (contextVector, attentionWeights) = self.attention(
            inputs=[decoderAnnotation, hiddenStateEnc, encoderAnnotation],
            mask=[targetMask, sourceMask],
            return_attention_scores=True
        )

        # return the context vector and the attention weights
        return (contextVector, attentionWeights)
The get_config
function of this class returns the number of attention units (Lines 87-91).
On Line 93, we define the call
function, which takes in the following arguments:
- hiddenStateEnc: The encoder hidden state
- hiddenStateDec: The decoder hidden state
- mask: The source and target vector masks
First, we pass the respective hidden states through the dense layers previously created in the build
function (Lines 99 and 100). Using these annotations, we simply pass them through the attention layer, also specifying the target and input masks (Lines 103-107).
The function returns the context vectors and attention weights obtained from the attention layer (Line 110).
class Decoder(Layer):
    def __init__(self, targetVocabSize, embeddingDim, decUnits, **kwargs):
        super().__init__(**kwargs)
        # initialize the target vocab size, embedding dimension, and
        # the decoder units
        self.targetVocabSize = targetVocabSize
        self.embeddingDim = embeddingDim
        self.decUnits = decUnits

    def get_config(self):
        # return the configuration of the layer
        return {
            "targetVocabSize": self.targetVocabSize,
            "embeddingDim": self.embeddingDim,
            "decUnits": self.decUnits,
        }
With the encoder and attention layer done, we move on to the decoder. As with the previous two, we have packaged the decoder into a separate class (Line 112).
Like the __init__ function for the encoder class, the decoder's __init__ takes in:
- targetVocabSize: The target language vocabulary size.
- embeddingDim: The embedding dimension size for the embedding space used for the target vocabulary.
- decUnits: The decoder hidden layer dimension size.
This function simply creates class variables of the arguments mentioned above (Lines 117-119).
On Line 121, we have the get_config
function, which returns the class variables previously created in the __init__
function.
    def build(self, inputShape):
        # build the embedding layer which converts token IDs to
        # embedding vectors
        self.embedding = Embedding(
            input_dim=self.targetVocabSize,
            output_dim=self.embeddingDim,
            mask_zero=True,
        )

        # build the GRU layer which processes the embedding vectors
        # in a sequential manner
        self.gru = GRU(
            units=self.decUnits,
            return_sequences=True,
            return_state=True,
            recurrent_initializer="glorot_uniform"
        )

        # build the attention layer
        self.attention = BahdanauAttention(self.decUnits)

        # build the final output layer
        self.fwdNeuralNet = Sequential([
            Dense(
                units=self.decUnits,
                activation="tanh",
                use_bias=False,
            ),
            Dense(
                units=self.targetVocabSize,
            ),
        ])
As in the previous build
functions we have encountered in this script, the inputShape
argument needs to be included in the function call (Line 129). Similar to the encoder
, we will first build the embedding space, followed by the GRU
layer (Lines 132-145). The extra addition here is the BahdanauAttention
layer as well as the final feedforward neural network for our outputs (Lines 148-160).
    def call(self, inputs, state=None):
        # grab the target tokens, encoder output, and source mask
        targetTokens = inputs[0]
        encOutput = inputs[1]
        sourceMask = inputs[2]

        # get the target vectors by passing the target tokens through
        # the embedding layer and create the target masks
        targetVectors = self.embedding(targetTokens)
        targetMask = self.embedding.compute_mask(targetTokens)

        # process one step with the GRU
        (decOutput, decState) = self.gru(inputs=targetVectors,
            initial_state=state, mask=targetMask)

        # use the GRU output as the query for the attention over the
        # encoder output
        (contextVector, attentionWeights) = self.attention(
            hiddenStateEnc=encOutput,
            hiddenStateDec=decOutput,
            mask=[sourceMask, targetMask],
        )

        # concatenate the context vector and output of GRU layer
        contextAndGruOutput = tf.concat(
            [contextVector, decOutput], axis=-1)

        # generate final logit predictions
        logits = self.fwdNeuralNet(contextAndGruOutput)

        # return the predicted logits, attention weights, and the
        # decoder state
        return (logits, attentionWeights, decState)
Now it is time to sequentially use all the initialized layers and variables in the call
function on Line 162. This function takes in the following arguments:
- inputs: Contains the target tokens, encoder output, and the source token mask.
- state: To specify the initial state of the decoder layer.
First, we obtain the target token vectors by passing the target tokens through the embedding layer defined for the decoder
(Line 170). The target token mask is computed exactly as we had done for the source tokens in the encoder
(Line 171).
On Lines 174 and 175, we process one step by passing the vectors through the decoder GRU
layer and obtaining the output and state variables.
The attention weights and context vectors are then computed using the attention
layer on Lines 179-183.
The context vector and the decoder GRU output are then concatenated, and the final logits predictions are computed using the feedforward neural network (Lines 186-190).
Building the Loss Function for Our Model
Our input sequences are padded, and the model masks those padded positions, so our loss function needs to respect that masking as well. Let’s move into the loss.py
script inside the pyimagesearch
directory.
# import the necessary packages
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.losses import Loss
import tensorflow as tf

class MaskedLoss(Loss):
    def __init__(self):
        # initialize the name of the loss and the loss function
        self.name = "masked_loss"
        self.loss = SparseCategoricalCrossentropy(from_logits=True,
            reduction="none")

    def __call__(self, yTrue, yPred):
        # calculate the loss for each item in the batch
        loss = self.loss(yTrue, yPred)

        # mask off the losses on padding
        mask = tf.cast(yTrue != 0, tf.float32)
        loss *= mask

        # return the total loss
        return tf.reduce_sum(loss)
The loss is packaged as a class called MaskedLoss
on Line 6. The __init__
function creates the class variables name
and loss
, which are set to masked_loss
and sparse categorical cross-entropy, respectively (Lines 9 and 10). For the latter, we simply import the loss from tensorflow
itself.
On Line 13, we have the __call__
function, which takes in the labels and the predictions. First, we calculate the loss using the sparse categorical cross-entropy loss on Line 15.
Taking into consideration the padding in our sequences, we mask off the losses for the padding by creating a simple conditional variable mask
, which only considers the sequence tokens which are not 0 and multiplies our loss with it (Line 19). This way, we are nullifying the effect padding has on the loss.
Finally, the loss is returned in a reduce_sum
format (Line 22).
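Here is a tiny, hypothetical example (the token IDs and vocabulary size are made up) that shows the masking in action: the two padded positions contribute nothing to the summed loss.

import tensorflow as tf
from pyimagesearch.loss import MaskedLoss

# one target sequence of length 4 whose last two positions are padding (ID 0)
yTrue = tf.constant([[5, 2, 0, 0]], dtype=tf.int64)
yPred = tf.random.normal((1, 4, 10))  # logits over a toy vocabulary of 10

lossFn = MaskedLoss()
print(lossFn(yTrue=yTrue, yPred=yPred).numpy())  # only the first two positions count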
Optimizing Our Training with a Scheduler
To keep training efficient, we schedule the learning rate with a warmup phase followed by a cosine decay, implemented in the schedule.py
script.
# import the necessary packages
from tensorflow.keras.optimizers.schedules import LearningRateSchedule
import tensorflow as tf
import numpy as np

class WarmUpCosine(LearningRateSchedule):
    def __init__(self, lrStart, lrMax, warmupSteps, totalSteps):
        super().__init__()
        self.lrStart = lrStart
        self.lrMax = lrMax
        self.warmupSteps = warmupSteps
        self.totalSteps = totalSteps
        self.pi = tf.constant(np.pi)
We will use a dynamic learning rate, for which we have created a class on Line 6.
The __init__
function takes in the following arguments:
- lrStart: The starting value of our learning rate
- lrMax: The maximum learning rate value
- warmupSteps: The number of “warm-up” steps required for the dynamic LR calculation
- totalSteps: The total number of steps
In this function, the class variables for these arguments are created, along with a pi
variable (Lines 9-13).
    def __call__(self, step):
        # check whether the total number of steps is larger than the
        # warmup steps. If not, then throw a value error
        if self.totalSteps < self.warmupSteps:
            raise ValueError(
                f"Total number of steps {self.totalSteps} must be"
                + f"larger or equal to warmup steps {self.warmupSteps}."
            )

        # a graph that increases to 1 from the initial step to the
        # warmup step, later decays to -1 at the final step mark
        cosAnnealedLr = tf.cos(
            self.pi
            * (tf.cast(step, tf.float32) - self.warmupSteps)
            / tf.cast(self.totalSteps - self.warmupSteps, tf.float32)
        )

        # shift the learning rate and scale it
        learningRate = 0.5 * self.lrMax * (1 + cosAnnealedLr)
Next, we have the __call__
function, which takes the step
number as its argument (Line 15).
On Lines 18-22, we check that the total number of steps is not smaller than the warmup steps, raising a ValueError otherwise.
Next, we set up a cosine curve that rises to 1 at the warmup step mark and then decays to -1 at the final step; shifting and scaling it gives the base learning rate (Lines 26-33).
        # check whether warmup steps is more than 0.
        if self.warmupSteps > 0:
            # throw a value error is max lr is smaller than start lr
            if self.lrMax < self.lrStart:
                raise ValueError(
                    f"lr_start {self.lrStart} must be smaller or"
                    + f"equal to lr_max {self.lrMax}."
                )

            # calculate the slope of the warmup line and build the
            # warmup rate
            slope = (self.lrMax - self.lrStart) / self.warmupSteps
            warmupRate = slope * tf.cast(step, tf.float32) + self.lrStart

            # when the current step is less than warmup steps, get
            # the line graph, when the current step is greater than
            # the warmup steps, get the scaled cos graph.
            learningRate = tf.where(
                step < self.warmupSteps, warmupRate, learningRate
            )

        # return the lr schedule
        return tf.where(
            step > self.totalSteps, 0.0, learningRate,
            name="learning_rate",
        )
On Line 36, we check if warmupSteps is greater than 0. If so, we further check whether the maximum learning rate is smaller than the starting learning rate on Line 38. If it is, we raise a ValueError.
Next, on Lines 46 and 47, we calculate the slope of the warmup line and build the warmup rate.
On Lines 52-54, we calculate the learning rate by checking if the current step is less than the warmup steps. If so, we get the line graph. If the current step is greater, we get the scaled cosine graph.
Finally, on Lines 57-60, we return the learning rate schedule.
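As a rough, illustrative check (the step counts below are arbitrary), you can instantiate the schedule and query it at a few steps: the value ramps linearly from lrStart to lrMax over the warmup steps and then follows the cosine curve back down.

import tensorflow as tf
from pyimagesearch.schedule import WarmUpCosine

schedule = WarmUpCosine(lrStart=1e-4, lrMax=1e-3, warmupSteps=100,
    totalSteps=1000)
for step in [0, 50, 100, 500, 1000]:
    print(step, float(schedule(tf.constant(step))))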
Train Translator
# import the necessary packages
from tensorflow.keras.layers import StringLookup
from tensorflow import keras
import tensorflow as tf
import numpy as np
We start by importing the necessary packages (Lines 2-5).
class TrainTranslator(keras.Model):
    def __init__(self, encoder, decoder, sourceTextProcessor,
        targetTextProcessor, **kwargs):
        super().__init__(**kwargs)
        # initialize the encoder, decoder, source text processor,
        # and the target text processor
        self.encoder = encoder
        self.decoder = decoder
        self.sourceTextProcessor = sourceTextProcessor
        self.targetTextProcessor = targetTextProcessor
We create a class called TrainTranslator
on Line 7, which contains all of the functionalities that will help train our encoder, attention module, and decoder with the model.fit()
API.
Inside the __init__
function, we initialize the encoder, decoder, source text processor, and target text processor (Lines 13-16).
    def _preprocess(self, sourceText, targetText):
        # convert the text to token IDs
        sourceTokens = self.sourceTextProcessor(sourceText)
        targetTokens = self.targetTextProcessor(targetText)

        # return the source and target token IDs
        return (sourceTokens, targetTokens)
Next, we have the _preprocess
function (Line 18), which takes as input the source and the target text. It converts the text to token IDs for both source and the target text on Lines 20 and 21 and then returns them on Line 24.
    def _calculate_loss(self, sourceTokens, targetTokens):
        # encode the input text token IDs
        (encOutput, encFwdState, encBckState, sourceMask) = self.encoder(
            sourceTokens=sourceTokens
        )

        # initialize the decoder's state to the encoder's final state
        decState = tf.concat([encFwdState, encBckState], axis=-1)

        (logits, attentionWeights, decState) = self.decoder(
            inputs=[targetTokens[:, :-1], encOutput, sourceMask],
            state=decState,
        )

        # calculate the batch loss
        yTrue = targetTokens[:, 1:]
        yPred = logits
        batchLoss = self.loss(yTrue=yTrue, yPred=yPred)

        # return the batch loss
        return batchLoss
On Line 26, we define the _calculate_loss
function, which takes in the source and target tokens. We pass the source tokens through our encoder on Line 28. The encoder outputs the following:
- encOutput: The encoder output
- encFwdState: The encoder forward hidden states
- encBckState: The encoder backward hidden states
- sourceMask: The mask tokens of the source
On Line 33, we concatenate the forward and the backward hidden states as suggested by the authors in the paper. Next, on Lines 35-38, we pass the target tokens (offset by one), encoder output, and source mask as the decoder’s input, with the concatenated encoder hidden states as the decoder’s initial state. The decoder then outputs the logits, attention weights, and decoder states.
On Lines 41-43, we use the target tokens and the retrieved logits to calculate the batch loss, which is then returned on Line 46.
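The one-token offset is what implements teacher forcing: the decoder is fed everything except the last target token and is asked to predict everything except the first. With hypothetical token IDs, the shift looks like this:

# [START] w1 w2 [END] pad, written as made-up token IDs
targetTokens = [[1, 7, 9, 2, 0]]
decoderInput = [row[:-1] for row in targetTokens]  # [[1, 7, 9, 2]]
labels = [row[1:] for row in targetTokens]         # [[7, 9, 2, 0]]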
    @tf.function(
        input_signature=[[
            tf.TensorSpec(dtype=tf.string, shape=[None]),
            tf.TensorSpec(dtype=tf.string, shape=[None])
        ]])
    def train_step(self, inputs):
        # grab the source and the target text from the inputs
        (sourceText, targetText) = inputs

        # pre-process the text into token IDs
        (sourceTokens, targetTokens) = self._preprocess(
            sourceText=sourceText,
            targetText=targetText
        )

        # use gradient tape to track the gradients
        with tf.GradientTape() as tape:
            # calculate the batch loss
            loss = self._calculate_loss(
                sourceTokens=sourceTokens,
                targetTokens=targetTokens,
            )

            # normalize the loss
            averageLoss = (
                loss
                / tf.reduce_sum(
                    tf.cast((targetTokens != 0), tf.float32)
                )
            )
        # apply an optimization step on all the trainable variables
        variables = self.trainable_variables
        gradients = tape.gradient(averageLoss, variables)
        self.optimizer.apply_gradients(zip(gradients, variables))

        # return the batch loss
        return {"batch_loss": averageLoss}
We define the train step (Line 53), specifying the input signature of the function on Lines 49-52. The input signature will be required later when we use tf.Module to serve this model for inference.
On Line 55, we grab the source and the target text from the inputs. And next, on Lines 58-61, we pre-process the text into token IDs. Lines 64-69 take care of the gradient tape to track the gradients while calculating the loss.
Finally, on Lines 72-76, we normalize the loss. We then apply an optimization step on all trainable variables (Lines 79-81) and return the normalized loss on Line 84.
    @tf.function(
        input_signature=[[
            tf.TensorSpec(dtype=tf.string, shape=[None]),
            tf.TensorSpec(dtype=tf.string, shape=[None])
        ]])
    def test_step(self, inputs):
        # grab the source and the target text from the inputs
        (sourceText, targetText) = inputs

        # pre-process the text into token IDs
        (sourceTokens, targetTokens) = self._preprocess(
            sourceText=sourceText,
            targetText=targetText
        )

        # calculate the batch loss
        loss = self._calculate_loss(
            sourceTokens=sourceTokens,
            targetTokens=targetTokens,
        )

        # normalize the loss
        averageLoss = (
            loss
            / tf.reduce_sum(
                tf.cast((targetTokens != 0), tf.float32)
            )
        )
        # return the batch loss
        return {"batch_loss": averageLoss}
On Line 91, we create the test step function, which takes in the inputs. We also specify the input signature of the test step (as done in the train step) on Lines 87-90.
Next, on Lines 96-99, we pre-process the text tokens into token IDs. We then calculate the batch loss on Lines 102-105 and normalize the loss on Lines 108-112. Finally, we return the normalized loss on Line 115.
Translator
class Translator(tf.Module):
    def __init__(self, encoder, decoder, sourceTextProcessor,
        targetTextProcessor):
        # initialize the encoder, decoder, source text processor, and
        # target text processor
        self.encoder = encoder
        self.decoder = decoder
        self.sourceTextProcessor = sourceTextProcessor
        self.targetTextProcessor = targetTextProcessor

        # initialize index to string layer
        self.stringFromIndex = StringLookup(
            vocabulary=targetTextProcessor.get_vocabulary(),
            mask_token="",
            invert=True
        )

        # initialize string to index layer
        indexFromString = StringLookup(
            vocabulary=targetTextProcessor.get_vocabulary(),
            mask_token="",
        )

        # generate IDs for mask tokens
        tokenMaskIds = indexFromString(["", "[UNK]", "[START]"]).numpy()
        tokenMask = np.zeros(
            [indexFromString.vocabulary_size()],
            dtype=np.bool
        )
        tokenMask[np.array(tokenMaskIds)] = True

        # initialize the token mask, start token, and end token
        self.tokenMask = tokenMask
        self.startToken = indexFromString(tf.constant("[START]"))
        self.endToken = indexFromString(tf.constant("[END]"))
We create a class called Translator
that houses all the utility functions needed for inference with our trained encoder and decoder.
Inside the __init__
function (Line 118):
- We initialize the encoder, decoder, source text processor, and target text processor (Lines 122-125).
- On Lines 128-132, we initialize the stringFromIndex layer, which maps token indices back to strings.
- We initialize the indexFromString layer (Lines 135-138), which maps strings to token indices.
- We generate the IDs for the mask tokens using the indexFromString layer on Lines 141-146.
- We initialize the token mask, start token, and end token (Lines 149-151).
    def tokens_to_text(self, resultTokens):
        # decode the token from index to string
        resultTextTokens = self.stringFromIndex(resultTokens)

        # format the result text into a human readable format
        resultText = tf.strings.reduce_join(inputs=resultTextTokens,
            axis=1, separator=" ")
        resultText = tf.strings.strip(resultText)

        # return the result text
        return resultText
On Lines 153-163, we create the tokens_to_text
function, which decodes the token from index back into string format using the stringFromIndex
layer.
    def sample(self, logits, temperature):
        # reshape the token mask
        tokenMask = self.tokenMask[tf.newaxis, tf.newaxis, :]

        # set the logits for all masked tokens to -inf, so they are
        # never chosen
        logits = tf.where(
            condition=self.tokenMask,
            x=-np.inf,
            y=logits
        )

        # check if the temperature is set to 0
        if temperature == 0.0:
            # select the index for the maximum probability element
            newTokens = tf.argmax(logits, axis=-1)

        # otherwise, we have set the temperature
        else:
            # sample the index for the element using categorical
            # probability distribution
            logits = tf.squeeze(logits, axis=1)
            newTokens = tf.random.categorical(logits / temperature,
                num_samples=1
            )

        # return the new tokens
        return newTokens
Next, on Lines 165-192, we create the sample
function. On Line 167, we reshape the tokenMask
, and then on Lines 171-175, we set the logits of these masked tokens to -inf
. This is done so that the masked tokens are skipped.
On Line 178, we check if the temperature
parameter is set to zero. If yes, then we select the index of the maximum probability element on Line 180. If not, we sample the element’s index using categorical probability distribution on Lines 186-189.
Finally, we return the newTokens
on Line 192.
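The temperature trick is easy to see in isolation. In this hypothetical snippet (toy logits, unrelated to the model), a low temperature sharpens the distribution toward greedy argmax behavior, while a high temperature flattens it:

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.5]])  # toy scores over a 3-token vocabulary
for temperature in (0.1, 1.0, 5.0):
    probs = tf.nn.softmax(logits / temperature)
    sampled = tf.random.categorical(logits / temperature, num_samples=1)
    print(temperature, probs.numpy().round(2), int(sampled[0, 0]))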
    @tf.function(input_signature=[tf.TensorSpec(dtype=tf.string,
        shape=[None])])
    def translate(self, sourceText, maxLength=50,
        returnAttention=True, temperature=1.0):
        # grab the batch size
        batchSize = tf.shape(sourceText)[0]

        # encode the source text to source tokens and pass them
        # through the encoder
        sourceTokens = self.sourceTextProcessor(sourceText)
        (encOutput, encFwdState, encBckState, sourceMask) = self.encoder(
            sourceTokens=sourceTokens
        )
On Line 196, we create the translate
function which takes in sourceText
, maxLength
, returnAttention
, and temperature
. We also specify the input signature of the function on Lines 194 and 195.
First, we extract the batchSize
from the sourceText
vector on Line 199. Next, on Lines 203-206, we encode the source text into tokens and pass them through the encoder.
        # initialize the decoder state and the new tokens
        decState = tf.concat([encFwdState, encBckState], axis=-1)
        newTokens = tf.fill([batchSize, 1], self.startToken)

        # initialize the result token, attention, and done tensor
        # arrays
        resultTokens = tf.TensorArray(tf.int64, size=1,
            dynamic_size=True)
        attention = tf.TensorArray(tf.float32, size=1,
            dynamic_size=True)
        done = tf.zeros([batchSize, 1], dtype=tf.bool)

        # loop over the maximum sentence length
        for i in tf.range(maxLength):
            # pass the encoded tokens through the decoder
            (logits, attentionWeights, decState) = self.decoder(
                inputs=[newTokens, encOutput, sourceMask],
                state=decState,
            )

            # store the attention weights and sample the new tokens
            attention = attention.write(i, attentionWeights)
            newTokens = self.sample(logits, temperature)

            # if the new token is the end token then set the done
            # flag
            done = done | (newTokens == self.endToken)

            # replace the end token with the padding
            newTokens = tf.where(done, tf.constant(0, dtype=tf.int64),
                newTokens)

            # store the new tokens in the result
            resultTokens = resultTokens.write(i, newTokens)

            # end the loop once done
            if tf.reduce_all(done):
                break

        # convert the list of generated token IDs to a list of strings
        resultTokens = resultTokens.stack()
        resultTokens = tf.squeeze(resultTokens, -1)
        resultTokens = tf.transpose(resultTokens, [1, 0])
        resultText = self.tokens_to_text(resultTokens)

        # check if we have to return the attention weights
        if returnAttention:
            # format the attention weights
            attentionStack = attention.stack()
            attentionStack = tf.squeeze(attentionStack, 2)
            attentionStack = tf.transpose(attentionStack, [1, 0, 2])

            # return the text result and attention weights
            return {"text": resultText, "attention": attentionStack}

        # otherwise, we will just be returning the result text
        else:
            return {"text": resultText}
We initialize the decoder state and the newTokens
(Lines 209 and 210). We also initialize the resultTokens
, attention
vector, and the done
vector on Lines 214, 216, and 218, respectively.
From Line 221, we loop over the maximum sentence length. We first pass the encoded tokens through the decoder on Lines 223-226. Next, we store the attention weights in the attention vector and sample newTokens
on Lines 229 and 230.
If the new token is the end token, we set the done flag on Line 234 and replace the end token with padding (Lines 237 and 238). We store the newTokens in the resultTokens tensor array on Line 241. Once every sequence in the batch is done, we break out of the loop on Lines 244 and 245.
On Lines 248-251, we convert the list of all generated token IDs back to strings using the tokens_to_text
function.
On Line 254, we check if we have to return the attention weights and if yes, we return them with the resultText
on Line 261.
If not, we will return the resultText
only on Line 265.
Training
With all the blocks created, we finally start with train.py
.
# USAGE
# python train.py

# set random seed for reproducibility
import tensorflow as tf
tf.keras.utils.set_random_seed(42)

# import the necessary packages
from pyimagesearch import config
from pyimagesearch.schedule import WarmUpCosine
from pyimagesearch.dataset import load_data
from pyimagesearch.dataset import splitting_dataset
from pyimagesearch.dataset import make_dataset
from pyimagesearch.dataset import tf_lower_and_split_punct
from pyimagesearch.models import Encoder
from pyimagesearch.models import Decoder
from pyimagesearch.translator import TrainTranslator
from pyimagesearch.translator import Translator
from pyimagesearch.loss import MaskedLoss
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
import tensorflow_text as tf_text
import matplotlib.pyplot as plt
import numpy as np
import os

# load data from disk
print(f"[INFO] loading data from {config.DATA_FNAME}...")
(source, target) = load_data(fname=config.DATA_FNAME)

# split the data into training, validation, and test set
(train, val, test) = splitting_dataset(source=source, target=target)

# build the TensorFlow data datasets of the respective data splits
print("[INFO] building TensorFlow Data input pipeline...")
trainDs = make_dataset(splits=train, batchSize=config.BATCH_SIZE,
    train=True)
valDs = make_dataset(splits=val, batchSize=config.BATCH_SIZE,
    train=False)
testDs = make_dataset(splits=test, batchSize=config.BATCH_SIZE,
    train=False)
We import all the necessary packages on Lines 5-26. Next, on Lines 29 and 30, we load the data from the disk and split the data into training, validation, and test datasets on Line 33.
We create the TensorFlow datasets of each of these on Lines 37, 39, and 41, respectively.
# create source text processing layer and adapt on the training
# source sentences
print("[INFO] performing text vectorization...")
sourceTextProcessor = TextVectorization(
    standardize=tf_lower_and_split_punct,
    max_tokens=config.SOURCE_VOCAB_SIZE
)
sourceTextProcessor.adapt(train[0])

# create target text processing layer and adapt on the training
# target sentences
targetTextProcessor = TextVectorization(
    standardize=tf_lower_and_split_punct,
    max_tokens=config.TARGET_VOCAB_SIZE
)
targetTextProcessor.adapt(train[1])
Next, we create the source text processing layer and adapt it on the training source sentences on Lines 47-51. On Lines 55-59, we do the same for target sentences.
# build the encoder and the decoder
print("[INFO] building the encoder and decoder models...")
encoder = Encoder(
    sourceVocabSize=config.SOURCE_VOCAB_SIZE,
    embeddingDim=config.ENCODER_EMBEDDING_DIM,
    encUnits=config.ENCODER_UNITS
)
decoder = Decoder(
    targetVocabSize=config.TARGET_VOCAB_SIZE,
    embeddingDim=config.DECODER_EMBEDDING_DIM,
    decUnits=config.DECODER_UNITS,
)

# build the trainer module
print("[INFO] build the translator trainer model...")
translatorTrainer = TrainTranslator(
    encoder=encoder,
    decoder=decoder,
    sourceTextProcessor=sourceTextProcessor,
    targetTextProcessor=targetTextProcessor,
)
On Lines 63-67, we build the encoder model using the vocab size, embedding dimension, and encoder units. Similarly, we build the decoder model on Lines 68-72 with the corresponding configurations.
Finally, we initialize the trainer module called TrainTranslator
(Line 76). We pass the encoder, decoder, and the source and target text processors to build it on Lines 77-81.
# get the total number of steps for training.
totalSteps = int(trainDs.cardinality() * config.EPOCHS)

# calculate the number of steps for warmup.
warmupEpochPercentage = config.WARMUP_PERCENT
warmupSteps = int(totalSteps * warmupEpochPercentage)

# Initialize the warmupcosine schedule.
scheduledLrs = WarmUpCosine(
    lrStart=config.LR_START,
    lrMax=config.LR_MAX,
    warmupSteps=warmupSteps,
    totalSteps=totalSteps,
)

# configure the loss and optimizer
print("[INFO] compile the translator trainer model...")
translatorTrainer.compile(
    optimizer=Adam(learning_rate=scheduledLrs),
    loss=MaskedLoss(),
)

# build the early stopping callback
earlyStoppingCallback = EarlyStopping(
    monitor="val_batch_loss",
    patience=config.PATIENCE,
    restore_best_weights=True,
)
Next, we define some important parameters and configurations for training. On Line 84, we get the total number of steps for training. On Lines 87 and 88, we calculate the warmup percentage and the warmup steps. On Lines 91-96, we define the warmup cosine schedule.
We compile the translatorTrainer
model on Lines 100-103, using Adam as the optimizer and MaskedLoss as the loss function.
Next, we build the early stopping callback with our configured patience on Lines 106-110.
# train the model
print("[INFO] training the translator model...")
history = translatorTrainer.fit(
    trainDs,
    validation_data=valDs,
    epochs=config.EPOCHS,
    callbacks=[earlyStoppingCallback],
)

# save the loss plot
if not os.path.exists(config.OUTPUT_PATH):
    os.makedirs(config.OUTPUT_PATH)
plt.plot(history.history["batch_loss"], label="batch_loss")
plt.plot(history.history["val_batch_loss"], label="val_batch_loss")
plt.xlabel("EPOCHS")
plt.ylabel("LOSS")
plt.title("Loss Plots")
plt.legend()
plt.savefig(f"{config.OUTPUT_PATH}/loss.png")

# build the translator module
print("[INFO] build the inference translator model...")
translator = Translator(
    encoder=translatorTrainer.encoder,
    decoder=translatorTrainer.decoder,
    sourceTextProcessor=sourceTextProcessor,
    targetTextProcessor=targetTextProcessor,
)

# save the model
print("[INFO] serialize the inference translator to disk...")
tf.saved_model.save(
    obj=translator,
    export_dir="translator",
    signatures={"serving_default": translator.translate}
)
On Lines 113-119, we call the fit API with data, epochs, and callbacks. We examine the loss plot with respect to epochs and save the plot to disk on Lines 122-130.
On Lines 134-139, we build the Translator
model for inference and then serialize and save the model to disk on Lines 143-147. We will be using this serialized model for inference.
Inference
With training complete, we can finally move to our inference script to see how our model is performing as a translator.
# USAGE
# python inference.py -s "input sentence"

# import the necessary packages
import tensorflow_text as tf_text
import tensorflow as tf
import argparse

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-s", "--sentence", required=True,
    help="input english sentence")
args = vars(ap.parse_args())

# convert the input english sentence to a constant tensor
sourceText = tf.constant([args["sentence"]])

# load the translator model from disk
print("[INFO] loading the translator model from disk...")
translator = tf.saved_model.load("translator")

# perform inference and display the result
print("[INFO] translating english sentence to french...")
result = translator.translate(sourceText)
translatedText = result["text"][0].numpy().decode()
print("[INFO] english sentence: {}".format(args["sentence"]))
print("[INFO] french translation: {}".format(translatedText))
On Lines 5-7 of the inference.py
script, we import the necessary packages. On Lines 10-13, we construct the argument parser needed to parse the source sentence.
We first convert the source English sentence to a constant tensor on Line 16. Next, we load the serialized translator from the disk on Line 20.
We use the loaded translator to perform Neural Machine Translation on the source text on Line 24 and then display the results on Lines 26 and 27.
You can also experiment with the Hugging Face 🤗 Space we built to better understand the inference task.
Summary
This was our first tutorial where we discussed attention and used it for Neural Machine Translation. In upcoming tutorials, we will learn how to improve this architecture using Luong’s attention and then how to design an architecture for Machine Translation using just attention.
The ideas and methodologies behind this tutorial are very simple. However, they are pillars upon which the most advanced Deep Learning Architectures, like Transformers, are built.
So, as Professor Turing asks all of us in Figure 4, “Are you paying attention?”
Citation Information
A. R. Gosthipaty and R. Raha. “Neural Machine Translation with Bahdanau’s Attention Using TensorFlow and Keras,” PyImageSearch, P. Chugh, S. Huot, K. Kidriavsteva, and A. Thanki, eds., 2022, https://pyimg.co/kf8ma
@incollection{ARG-RR_2022_Bahdanau,
  author = {Aritra Roy Gosthipaty and Ritwik Raha},
  title = {Neural Machine Translation with {Bahdanau's} Attention Using TensorFlow and Keras},
  booktitle = {PyImageSearch},
  editor = {Puneet Chugh and Susan Huot and Kseniia Kidriavsteva and Abhishek Thanki},
  year = {2022},
  note = {https://pyimg.co/kf8ma},
}
To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), simply enter your email address in the form below!
Download the Source Code and FREE 17-page Resource Guide
Enter your email address below to get a .zip of the code and a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!