Neural Machine Translation with Bahdanau’s Attention Using TensorFlow and Keras
Imagine it is your exam tomorrow, and you are yet to cover a lot of chapters. You pause for a while and then take the smarter option of prioritizing. You pay attention to the chapters that (according to you) have greater weightage than the others.

We do not advise cramming on the last day of the semester, but hey, we have all done it! And unless your assessment of which chapters carry more weight was completely off, you probably scored better than you would have by trying to cover everything.
Taking this analogy of paying attention a little further, today, we apply this mechanism to the task of Neural Machine Translation.
In this tutorial, you will learn how to apply Bahdanau’s attention to the Neural Machine Translation task.
This lesson is the first of a 2-part series on NLP 103:
- Neural Machine Translation with Bahdanau’s Attention Using TensorFlow and Keras (this tutorial)
- Neural Machine Translation with Luong’s Attention Using TensorFlow and Keras
To learn how to apply Bahdanau’s attention to the Neural Machine Translation task, just keep reading.
Neural Machine Translation with Bahdanau’s Attention Using TensorFlow and Keras
In a previous blog post, we covered the mathematical intuition behind Neural Machine Translation. We recommend reading that post first to gain an in-depth understanding of the task.
In this tutorial, we will tackle the problem of translating from a source language (English) to a target language (French) with the help of attention. We have also built an interactive demo for the translation task with the help of the model we train in this tutorial.
Introduction
We have often been astonished by Google Translate. A deep learning model can translate from any language to any other. As you might already have guessed, it is a Neural Machine Translation model.
A Neural Machine Translation model has an encoder and a decoder. The encoder encodes the source sentence into a rich representation. The decoder accepts the encoded representation and decodes it into the target language.
Bahdanau et al., in their academic paper, “Neural Machine Translation by Jointly Learning to Align and Translate,” propose to build the encoder representation each time a word is decoded in the decoder.
This dynamic representation will depend on the parts of the input sentence most relevant to the current decoded word. We attend to the most relevant parts of the input sentence to decode the target sentence.
In this tutorial, we not only explain the concept of attention but also build a model in TensorFlow and Keras. Our task today is to have a model that translates from English to French.
Configuring Your Development Environment
To follow this guide, you need to have tensorflow and tensorflow-text installed on your system.
Luckily, TensorFlow is pip-installable:
$ pip install tensorflow==2.8.0
$ pip install tensorflow-text==2.8.0
Having Problems Configuring Your Development Environment?

All that said, are you:
- Short on time?
- Learning on your employer’s administratively locked system?
- Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
- Ready to run the code right now on your Windows, macOS, or Linux system?
Then join PyImageSearch University today!
Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides that are pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.
And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!
Project Structure
We first review our project directory structure.
Start by accessing the “Downloads” section of this tutorial to retrieve the source code and example images.
From there, take a look at the directory structure:
├── download.sh
├── inference.py
├── output
│   └── loss.png
├── pyimagesearch
│   ├── config.py
│   ├── dataset.py
│   ├── __init__.py
│   ├── loss.py
│   ├── models.py
│   ├── schedule.py
│   └── translator.py
├── requirements.txt
└── train.py
In the pyimagesearch directory, we have:
- config.py: The configuration file for the task.
- dataset.py: The utilities for the dataset pipeline.
- loss.py: Holds the code snippet for the losses needed to train the model.
- models.py: Encoder and Decoder for the translation model.
- schedule.py: The learning rate scheduler for the training pipeline.
- translator.py: The train and inference models.
In the project's root directory, we have four files:
- download.sh: A shell script to download the training data.
- requirements.txt: The Python packages required for this tutorial.
- train.py: The script to train the model.
- inference.py: The inference script.
RNN Encoder-Decoder
In neural machine translation, we have an encoder and a decoder. The encoder builds a rich representation of the source sentence, while the decoder decodes the encoded representation and produces the target sentence.
As the source and the target sentences are sequential data, we can take help from our trusted sequential models. To get a primer on sequential modeling, follow our previous blog post.
While the representation that the encoder builds is quite useful, it does not capture some very important aspects of the source sentence. With a static, fixed encoder representation, we are constrained to use the same summary of the source sentence for every decoded word.
In the next section, we will discuss the encoder, decoder, and attention module. The attention module will help gather the important information from the source sentence while decoding.
Learning to Align and Translate
Encoder
A recurrent neural network takes the present input $x_t$ and the previous hidden state $h_{t-1}$ to model the present hidden state $h_t$. This forms a recurrence chain, which helps RNNs model sequential data.

For the encoder, the authors suggest a Bidirectional RNN. This way, the RNN provides two sets of hidden states: the forward states $\overrightarrow{h}_j$ and the backward states $\overleftarrow{h}_j$. The authors suggest that concatenating the two, $h_j = \left[\overrightarrow{h}_j; \overleftarrow{h}_j\right]$, gives a richer and better representation of the source sentence.
Decoder
Without attention: The decoder uses a Recurrent Neural Network to decode the encoded representation into the target sentence.
The last encoder hidden state theoretically contains information about the entire source sentence. The decoder is fed the previous decoder hidden state, the previously decoded word, and the last encoder hidden state.

The equation below gives the conditional probability that must be maximized for the decoder to translate properly:

$$p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$$

where $s_t$ is the hidden state of the decoder and $c$ is the context vector, which here is simply the last encoder hidden state.
With attention: The only difference between the decoder with and without attention lies in the context vector. With attention in place, the decoder receives a newly built context vector $c_i$ at every decoding step $i$. Each context vector depends on the parts of the source sentence that are most relevant to the word currently being decoded.

The equation for the conditional probability now changes to the following:

$$p(y_i \mid \{y_1, \ldots, y_{i-1}\}, x) = g(y_{i-1}, s_i, c_i)$$

We have the entire set of hidden states (known as annotations) $h_1, \ldots, h_{T_x}$ from the encoder. We need a mechanism to evaluate the importance of each annotation for a decoded word. We could formulate this behavior with hard-coded equations or let another model figure it out. Being of the lazy kind, we delegate the entire workload to another model: our attention module.

The attention layer takes a tuple as input, consisting of an annotation $h_j$ and the previously decoded hidden state $s_{i-1}$. It produces a set of unnormalized importance scores $e_{ij} = a(s_{i-1}, h_j)$. Intuitively, this answers the question, “How important is $h_j$ for decoding $y_i$?”
Since neural networks work best with well-defined distributions, we apply a softmax to the unnormalized importance scores to obtain a proper probability distribution of importance, $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$, as shown in Figure 2.

The unnormalized importance is called the energy matrix.
The normalized importance is the attention weights.
Now that we have the attention weights in hand, we pointwise multiply the weights with the annotations, emphasizing the important annotations and suppressing the unimportant ones. We then sum the weighted annotations to obtain a single feature-rich context vector, $c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$.
We build the context vector for each step of the decoder and attend to specific annotations, as shown in Figure 3.
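To make the mechanism concrete, below is a minimal NumPy sketch of the computation just described. The array names, sizes, and random values are illustrative assumptions rather than code from the paper or from this tutorial's modules.

# a minimal NumPy sketch of attention: energy -> softmax -> context vector
import numpy as np

numAnnotations, hiddenDim = 6, 4                          # hypothetical source length and annotation size
annotations = np.random.randn(numAnnotations, hiddenDim)  # encoder annotations h_1 ... h_Tx
energy = np.random.randn(numAnnotations)                  # unnormalized importance e_ij for one decoder step

# softmax turns the energies into attention weights that sum to 1
weights = np.exp(energy) / np.exp(energy).sum()

# the context vector is the attention-weighted sum of the annotations
context = (weights[:, None] * annotations).sum(axis=0)
print(weights.sum(), context.shape)                       # 1.0 (4,)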
Code Walkthrough
Configuring the Prerequisites
Before we start our implementation, let’s go over the configuration pipeline of our project. For that, we will move on to the config.py script located in the pyimagesearch directory.
# define the data file name
DATA_FNAME = "fra.txt"

# define the batch size
BATCH_SIZE = 512

# define the vocab size for the source and the target
# text vectorization layers
SOURCE_VOCAB_SIZE = 15_000
TARGET_VOCAB_SIZE = 15_000

# define the encoder configurations
ENCODER_EMBEDDING_DIM = 512
ENCODER_UNITS = 512

# define the attention configuration
ATTENTION_UNITS = 512

# define the decoder configurations
DECODER_EMBEDDING_DIM = 512
DECODER_UNITS = 1024

# define the training configurations
EPOCHS = 100
LR_START = 1e-4
LR_MAX = 1e-3
WARMUP_PERCENT = 0.15

# define the patience for early stopping
PATIENCE = 10

# define the output path
OUTPUT_PATH = "output"
On Line 2, we have the data file name referencing the dataset we will use in our project. This dataset has input English sentences and their corresponding French-translated sentences.
The batch size of our data is defined on Line 5. This is followed by defining the source and target vocabulary sizes (Lines 9 and 10). You are free to adjust this to your liking for varying results.
The encoder model of our architecture will use its own embedding space for the source language. We have defined the number of dimensions for that embedding space on Line 13, followed by the encoder hidden state dimension on Line 14.
On Line 17, the number of attention units used is defined.
The decoder configurations are defined, similar to the encoder configurations on Lines 20 and 21.
The training hyperparameters, namely the number of epochs and specifications of the learning rate, are defined on Lines 24-27. Since we will be using the early stopping callback, the patience is set on Line 30.
The final step of the configuration pipeline is setting the output path for our results (Line 33).
Configuring the Dataset
As mentioned earlier, we need a dataset containing source language-target language sentence pairs. To configure and pre-process a dataset like that, we have prepared the dataset.py script situated in the pyimagesearch directory.
# import the necessary packages
import tensorflow_text as tf_text
import tensorflow as tf
import random
# define a module level autotune
_AUTO = tf.data.AUTOTUNE
def load_data(fname):
# open the file with utf-8 encoding
with open(fname, "r", encoding="utf-8") as textFile:
# the source and the target sentence is demarcated with tab
# iterate over each line and split the sentences to get
# the individual source and target sentence pairs
lines = textFile.readlines()
pairs = [line.split("\t")[:-1] for line in lines]
# randomly shuffle the pairs
random.shuffle(pairs)
# collect the source sentences and target sentences into
# respective lists
source = [src for src, _ in pairs]
target = [trgt for _, trgt in pairs]
# return the list of source and target sentences
return (source, target)
We start by defining the AUTOTUNE constant for efficient training on Line 7.
The first function in this script is load_data (Line 9), which takes in the filename as its argument. We use the open functionality to read our dataset file with utf-8 encoding (Line 11).
We loop over each line in the text file (Line 15) and create the pairs based on the premise that each source and target pair is separated by a tab (Line 16).
To avoid any ordering bias, we randomly shuffle the pairs on Line 19. We then store the source and target sentences in separate lists and return them (Lines 23-27).
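As a quick sanity check, you could call the loader directly, as in the hypothetical snippet below; it assumes the project layout shown earlier and that fra.txt has already been downloaded with download.sh.

# hypothetical quick check of the loader (assumes fra.txt is already on disk)
from pyimagesearch.dataset import load_data

(source, target) = load_data("fra.txt")
print(len(source), len(target))      # both lists have the same length
print(source[0], "->", target[0])    # an English sentence and its French counterpart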
def splitting_dataset(source, target):
# calculate the training and validation size
trainSize = int(len(source) * 0.8)
valSize = int(len(source) * 0.1)
# split the inputs into train, val, and test
(trainSource, trainTarget) = (source[: trainSize],
target[: trainSize])
(valSource, valTarget) = (source[trainSize : trainSize + valSize],
target[trainSize : trainSize + valSize])
(testSource, testTarget) = (source[trainSize + valSize :],
target[trainSize + valSize :])
# return the splits
return (
(trainSource, trainTarget),
(valSource, valTarget),
(testSource, testTarget),
)
On Line 29, we have the splitting_dataset function, which takes in the source and target sentences. The next step is to calculate the training and validation dataset sizes (Lines 31 and 32). Using these values, we split the source and target sentences into training, validation, and test sets (Lines 35-47), as illustrated in the sketch below.
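The snippet below is a small, illustrative check of the 80/10/10 split; the ten toy sentence pairs are assumptions made purely for demonstration.

# illustrative check of the 80/10/10 split with ten toy sentence pairs
from pyimagesearch.dataset import splitting_dataset

source = [f"source sentence {i}" for i in range(10)]
target = [f"target sentence {i}" for i in range(10)]
(train, val, test) = splitting_dataset(source, target)
print(len(train[0]), len(val[0]), len(test[0]))   # 8 1 1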
def make_dataset(splits, batchSize, train=False):
# build a TensorFlow dataset from the input and target
(source, target) = splits
dataset = tf.data.Dataset.from_tensor_slices((source, target))
# check if this is the training dataset, if so, shuffle, batch,
# and prefetch it
if train:
dataset = (
dataset
.shuffle(dataset.cardinality().numpy())
.batch(batchSize)
.prefetch(_AUTO)
)
# otherwise, just batch the dataset
else:
dataset = (
dataset
.batch(batchSize)
.prefetch(_AUTO)
)
# return the dataset
return dataset
On Line 49, we have the make_dataset function, which takes in the following arguments:
- splits: One of the three (train, test, and val) dataset splits.
- batchSize: The batch size we want our dataset to have.
- train: A bool variable to indicate if the dataset in question is the training dataset.
We separate the source and target sentences from the current split (Line 51) and build a TensorFlow dataset using tf.data.Dataset.from_tensor_slices on Line 52.
On Lines 56-62, we have an if statement that checks if the train bool is set to true. If the condition is met, we shuffle, batch, and prefetch the dataset. The else condition is for the test and validation sets, where we only batch and prefetch the dataset (Lines 65-73).
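To see the function in action, here is a hypothetical example with a toy split of two sentence pairs; the sentences and batch size are assumptions made only for illustration.

# hypothetical usage of make_dataset with a toy split of two pairs
from pyimagesearch.dataset import make_dataset

toySplit = (["hello .", "thank you ."], ["bonjour .", "merci ."])
toyDs = make_dataset(splits=toySplit, batchSize=2, train=True)
for (sourceBatch, targetBatch) in toyDs.take(1):
    print(sourceBatch.numpy(), targetBatch.numpy())   # one batch of raw string pairs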
def tf_lower_and_split_punct(text):
# split accented characters
text = tf_text.normalize_utf8(text, "NFKD")
text = tf.strings.lower(text)
# keep space, a to z, and selected punctuations
text = tf.strings.regex_replace(text, "[^ a-z.?!,]", "")
# add spaces around punctuation
text = tf.strings.regex_replace(text, "[.?!,]", r" \0 ")
# strip whitespace and add [START] and [END] tokens
text = tf.strings.strip(text)
text = tf.strings.join(["[START]", text, "[END]"], separator=" ")
# return the processed text
return text
The final data utility function is tf_lower_and_split_punct, which takes in any single sentence as its argument (Line 75). We start by normalizing the sentences and turning them lowercase (Lines 77 and 78).
On Lines 81-84, we remove unwanted characters and add spaces around the punctuation we keep. The surrounding whitespace is stripped on Line 87, followed by the addition of the [START] and [END] tokens to the sentence. These tokens help the model understand when to start or end a sequence.
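The hypothetical snippet below shows roughly what the standardization does to a sample sentence; the example sentence is an assumption, and the printed output is only indicative.

# a quick, illustrative run of the standardization function
import tensorflow as tf
from pyimagesearch.dataset import tf_lower_and_split_punct

print(tf_lower_and_split_punct(tf.constant("How are you?")).numpy().decode())
# expected to look along the lines of: [START] how are you ? [END]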
Building the Bahdanau NMT Model
With our dataset utility script out of the way, the focus is now on the neural machine translation model itself. For that, we will hop into the models.py script in the pyimagesearch directory.
# import the necessary packages
from tensorflow.keras.layers import AdditiveAttention
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Layer
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import GRU
from tensorflow.keras import Sequential
import tensorflow as tf
class Encoder(Layer):
def __init__(self, sourceVocabSize, embeddingDim, encUnits,
**kwargs):
super().__init__(**kwargs)
# initialize the source vocab size, embedding dimensions, and
# the encoder units
self.sourceVocabSize = sourceVocabSize
self.embeddingDim = embeddingDim
self.encUnits = encUnits
def build(self, inputShape):
# the embedding layer converts token IDs to embedding vectors
self.embedding = Embedding(
input_dim=self.sourceVocabSize,
output_dim=self.embeddingDim,
mask_zero=True,
)
# the GRU layer processes the embedding vectors sequentially
self.gru = Bidirectional(
GRU(
units=self.encUnits,
# return the sequence and the state
return_sequences=True,
return_state=True,
recurrent_initializer="glorot_uniform",
)
)
The encoder model, attention mechanism, and decoder model are packaged into separate classes for easier accessibility and usage. We start with the Encoder class in Line 11. The __init__ function (Line 12) takes in the following arguments:
- sourceVocabSize: Defines the source vocabulary size.
- embeddingDim: Defines the embedding space size for the source vocabulary.
- encUnits: Defines the encoder hidden layer dimension.
The only purpose of this function is to create class variables of the arguments using the self functionality (Lines 17-19).
The build function takes in the input shape (tf custom layers require this parameter in the function call, but it is not mandatory to use it inside the function) as its argument (Line 21). This function first creates an embedding layer (Lines 23-27), followed by the GRU layer, which will sequentially process the embedding vectors (Lines 30-38). The hidden layer dimension is set in the __init__ function provided to the units argument of the GRU layer.
def get_config(self):
# return the configuration of the encoder layer
return {
"sourceVocabSize": self.sourceVocabSize,
"embeddingDim": self.embeddingDim,
"encUnits": self.encUnits,
}
def call(self, sourceTokens, state=None):
# pass the source tokens through the embedding layer to get
# source vectors
sourceVectors = self.embedding(sourceTokens)
# create the masks for the source tokens
sourceMask = self.embedding.compute_mask(sourceTokens)
# pass the source vectors through the GRU layer
(encOutput, encFwdState, encBckState) = self.gru(
inputs=sourceVectors,
initial_state=state,
mask=sourceMask
)
# return the encoder output, encoder state, and the
# source mask
return (encOutput, encFwdState, encBckState, sourceMask)
On Line 40, we have the get_config function, which simply returns the class variables sourceVocabSize, embeddingDim, and encUnits (Lines 42-46).
The call function on Line 48 takes in the following arguments:
- sourceTokens: The tokenized source sentences.
- state: The initial state for the GRU cell.
The call function simply acts as a hub for the usage of embedding and GRU layers. We first pass the tokens through the embedding layer to obtain the embedding vectors, and then we create masks for the source tokens (Lines 51-54). Masking helps make our GRU focus only on the tokens and ignore the padding.
The embedding vectors are then passed through the GRU layer, and a forward pass consisting of the encoder output, encoder forward state, and encoder backward state (bi-directional) is obtained (Lines 57-61).
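Before moving on, a quick shape check can help build intuition for what the encoder returns. The sketch below is illustrative: the batch size, sequence length, and random token IDs are assumptions, and the import assumes you run it from the project root.

# an illustrative shape check of the encoder (values are assumptions)
import tensorflow as tf
from pyimagesearch.models import Encoder

encoder = Encoder(sourceVocabSize=15_000, embeddingDim=512, encUnits=512)
dummyTokens = tf.random.uniform((4, 10), maxval=15_000, dtype=tf.int32)   # (batch, timesteps)
(encOutput, encFwdState, encBckState, sourceMask) = encoder(sourceTokens=dummyTokens)
print(encOutput.shape)                        # (4, 10, 1024): forward and backward outputs concatenated
print(encFwdState.shape, encBckState.shape)   # (4, 512) and (4, 512)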
class BahdanauAttention(Layer):
def __init__(self, attnUnits, **kwargs):
super().__init__(**kwargs)
# initialize the attention units
self.attnUnits = attnUnits
def build(self, inputShape):
# the dense layers projects the query and the value
self.denseEncoderAnnotation = Dense(
units=self.attnUnits,
use_bias=False,
)
self.denseDecoderAnnotation = Dense(
units=self.attnUnits,
use_bias=False,
)
# build the additive attention layer
self.attention = AdditiveAttention()
As mentioned before, the Bahdanau attention block is packaged into a separate class (Line 67). On Line 68, the __init__ function takes in the number of attention units as its argument and creates a class variable for it (Line 71).
In the build function (Line 73), the dense layers for the encoder and decoder annotations, as well as the additive attention layers, are initialized (Lines 75-85).
def get_config(self):
# return the configuration of the layer
return {
"attnUnits": self.attnUnits,
}
def call(self, hiddenStateEnc, hiddenStateDec, mask):
# grab the source and target mask
sourceMask = mask[0]
targetMask = mask[1]
# pass the query and value through the dense layer
encoderAnnotation = self.denseEncoderAnnotation(hiddenStateEnc)
decoderAnnotation = self.denseDecoderAnnotation(hiddenStateDec)
# apply attention to align the representations
(contextVector, attentionWeights) = self.attention(
inputs=[decoderAnnotation, hiddenStateEnc, encoderAnnotation],
mask=[targetMask, sourceMask],
return_attention_scores=True
)
# return the context vector and the attention weights
return (contextVector, attentionWeights)
The get_config function of this class returns the number of attention units (Lines 87-91).
On Line 93, we define the call function, which takes in the following arguments:
- hiddenStateEnc: The encoder hidden state
- hiddenStateDec: The decoder hidden state
- mask: The source and target vector masks
First, we pass the respective hidden states through the dense layers previously created in the build function (Lines 99 and 100). Using these annotations, we simply pass them through the attention layer, also specifying the target and input masks (Lines 103-107).
The function returns the context vectors and attention weights obtained from the attention layer (Line 110).
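A stand-alone shape check like the hypothetical one below can help verify the masks and shapes; the random tensors merely stand in for the encoder output and the decoder query.

# an illustrative shape check of the attention block (all tensors are stand-ins)
import tensorflow as tf
from pyimagesearch.models import BahdanauAttention

attention = BahdanauAttention(attnUnits=512)
encOutput = tf.random.normal((4, 10, 1024))     # stand-in for the encoder annotations
sourceMask = tf.ones((4, 10), dtype=tf.bool)    # stand-in for the source token mask
decOutput = tf.random.normal((4, 1, 1024))      # a single decoder step acts as the query
targetMask = tf.ones((4, 1), dtype=tf.bool)
(contextVector, attentionWeights) = attention(
    hiddenStateEnc=encOutput,
    hiddenStateDec=decOutput,
    mask=[sourceMask, targetMask],
)
print(contextVector.shape)      # (4, 1, 1024)
print(attentionWeights.shape)   # (4, 1, 10)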
class Decoder(Layer):
def __init__(self, targetVocabSize, embeddingDim, decUnits, **kwargs):
super().__init__(**kwargs)
# initialize the target vocab size, embedding dimension, and
# the decoder units
self.targetVocabSize = targetVocabSize
self.embeddingDim = embeddingDim
self.decUnits = decUnits
def get_config(self):
# return the configuration of the layer
return {
"targetVocabSize": self.targetVocabSize,
"embeddingDim": self.embeddingDim,
"decUnits": self.decUnits,
}
With the encoder and attention layer done, we move on to the decoder. As with the previous two, we have packaged the decoder into a separate class (Line 112).
Like the __init__ function of the encoder class, the decoder's __init__ takes in:
- targetVocabSize: The target language vocabulary size.
- embeddingDim: The embedding dimension of the embedding space used for the target vocabulary.
- decUnits: The decoder hidden layer dimension size.
This function simply creates class variables of the arguments mentioned above (Lines 117-119).
On Line 121, we have the get_config function, which returns the class variables previously created in the __init__ function.
def build(self, inputShape):
# build the embedding layer which converts token IDs to
# embedding vectors
self.embedding = Embedding(
input_dim=self.targetVocabSize,
output_dim=self.embeddingDim,
mask_zero=True,
)
# build the GRU layer which processes the embedding vectors
# in a sequential manner
self.gru = GRU(
units=self.decUnits,
return_sequences=True,
return_state=True,
recurrent_initializer="glorot_uniform"
)
# build the attention layer
self.attention = BahdanauAttention(self.decUnits)
# build the final output layer
self.fwdNeuralNet = Sequential([
Dense(
units=self.decUnits,
activation="tanh",
use_bias=False,
),
Dense(
units=self.targetVocabSize,
),
])
As in the previous build functions we have encountered in this script, the inputShape argument needs to be included in the function call (Line 129). Similar to the encoder, we will first build the embedding space, followed by the GRU layer (Lines 132-145). The extra addition here is the BahdanauAttention layer as well as the final feedforward neural network for our outputs (Lines 148-160).
def call(self, inputs, state=None):
# grab the target tokens, encoder output, and source mask
targetTokens = inputs[0]
encOutput = inputs[1]
sourceMask = inputs[2]
# get the target vectors by passing the target tokens through
# the embedding layer and create the target masks
targetVectors = self.embedding(targetTokens)
targetMask = self.embedding.compute_mask(targetTokens)
# process one step with the GRU
(decOutput, decState) = self.gru(inputs=targetVectors,
initial_state=state, mask=targetMask)
# use the GRU output as the query for the attention over the
# encoder output
(contextVector, attentionWeights) = self.attention(
hiddenStateEnc=encOutput,
hiddenStateDec=decOutput,
mask=[sourceMask, targetMask],
)
# concatenate the context vector and output of GRU layer
contextAndGruOutput = tf.concat(
[contextVector, decOutput], axis=-1)
# generate final logit predictions
logits = self.fwdNeuralNet(contextAndGruOutput)
# return the predicted logits, attention weights, and the
# decoder state
return (logits, attentionWeights, decState)
Now it is time to sequentially use all the initialized layers and variables in the call function on Line 162. This function takes in the following arguments:
- inputs: Contains the target tokens, the encoder output, and the source token mask.
- state: Specifies the initial state of the decoder layer.
First, we obtain the target token vectors by passing the target tokens through the embedding layer defined for the decoder (Line 170). The target token mask is computed exactly as we had done for the source tokens in the encoder (Line 171).
On Lines 174 and 175, we process one step by passing the vectors through the decoder GRU layer and obtaining the output and state variables.
The attention weights and context vectors are then computed using the attention layer on Lines 179-183.
The context vector and the decoder GRU output are then concatenated, and the final logit predictions are computed using the feedforward neural network (Lines 186-190).
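As with the encoder, a hypothetical shape check helps confirm what the decoder produces; every tensor below is a random stand-in rather than a real encoder output or target batch.

# an illustrative shape check of the decoder (all tensors are stand-ins)
import tensorflow as tf
from pyimagesearch.models import Decoder

decoder = Decoder(targetVocabSize=15_000, embeddingDim=512, decUnits=1024)
targetTokens = tf.random.uniform((4, 7), maxval=15_000, dtype=tf.int32)
encOutput = tf.random.normal((4, 10, 1024))    # stand-in for the encoder output
sourceMask = tf.ones((4, 10), dtype=tf.bool)   # stand-in for the source token mask
initialState = tf.random.normal((4, 1024))     # stand-in for the concatenated encoder states
(logits, attentionWeights, decState) = decoder(
    inputs=[targetTokens, encOutput, sourceMask],
    state=initialState,
)
print(logits.shape)             # (4, 7, 15000)
print(attentionWeights.shape)   # (4, 7, 10)
print(decState.shape)           # (4, 1024)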
Building the Loss Function for Our Model
Our input sequences for the model use lots of masking. For that, we need to ensure that our loss function is also appropriate. Let’s move into the loss.py script inside the pyimagesearch directory.
# import the necessary packages
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.losses import Loss
import tensorflow as tf
class MaskedLoss(Loss):
def __init__(self):
# initialize the name of the loss and the loss function
self.name = "masked_loss"
self.loss = SparseCategoricalCrossentropy(from_logits=True,
reduction="none")
def __call__(self, yTrue, yPred):
# calculate the loss for each item in the batch
loss = self.loss(yTrue, yPred)
# mask off the losses on padding
mask = tf.cast(yTrue != 0, tf.float32)
loss *= mask
# return the total loss
return tf.reduce_sum(loss)
The loss is packaged as a class called MaskedLoss on Line 6. The __init__ function creates the class variables name and loss, which are set to masked_loss and sparse categorical cross-entropy, respectively (Lines 9 and 10). For the latter, we simply import the loss from tensorflow itself.
On Line 13, we have the __call__ function, which takes in the labels and the predictions. First, we calculate the loss using the sparse categorical cross-entropy loss on Line 15.
Taking into consideration the padding in our sequences, we mask off the losses for the padding by creating a simple conditional variable mask, which only considers the sequence tokens which are not 0 and multiplies our loss with it (Line 19). This way, we are nullifying the effect padding has on the loss.
Finally, the loss is summed over the batch with reduce_sum and returned (Line 22).
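The tiny example below is a hedged illustration of how the masking behaves; the token IDs, vocabulary size, and random logits are all made up.

# a tiny, illustrative check of the masked loss (all values are made up)
import tensorflow as tf
from pyimagesearch.loss import MaskedLoss

yTrue = tf.constant([[2, 5, 0, 0]])         # a padded target sequence (0 = padding)
yPred = tf.random.normal((1, 4, 10))        # logits over a hypothetical 10-token vocabulary
print(MaskedLoss()(yTrue, yPred).numpy())   # only the two non-padding positions contribute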
Optimizing Our Training with a Scheduler
To ensure that the efficiency of our training is at its peak, we have created the schedule.py script.
# import the necessary packages
from tensorflow.keras.optimizers.schedules import LearningRateSchedule
import tensorflow as tf
import numpy as np
class WarmUpCosine(LearningRateSchedule):
def __init__(self, lrStart, lrMax, warmupSteps, totalSteps):
super().__init__()
self.lrStart = lrStart
self.lrMax = lrMax
self.warmupSteps = warmupSteps
self.totalSteps = totalSteps
self.pi = tf.constant(np.pi)
We will use a dynamic learning rate, for which we have created a class on Line 6.
The __init__ function takes in the following arguments:
- lrStart: The starting value of our learning rate
- lrMax: The maximum learning rate value
- warmupSteps: The number of “warm-up” steps required for the dynamic LR calculation
- totalSteps: The total number of steps
In this function, the class variables for these arguments are created, along with a pi variable (Lines 9-13).
def __call__(self, step):
# check whether the total number of steps is larger than the
# warmup steps. If not, then throw a value error
if self.totalSteps < self.warmupSteps:
raise ValueError(
f"Total number of steps {self.totalSteps} must be"
+ f" larger than or equal to warmup steps {self.warmupSteps}."
)
# a graph that increases to 1 from the initial step to the
# warmup step, later decays to -1 at the final step mark
cosAnnealedLr = tf.cos(
self.pi
* (tf.cast(step, tf.float32) - self.warmupSteps)
/ tf.cast(self.totalSteps - self.warmupSteps, tf.float32)
)
# shift the learning rate and scale it
learningRate = 0.5 * self.lrMax * (1 + cosAnnealedLr)
Next, we have the __call__ function, which takes the step number as its argument (Line 15).
On Lines 18-22, we check whether the total number of steps is smaller than the warmup steps, and raise a ValueError if it is.
Next, we set up a cosine curve that rises to 1 at the warmup step and then decays to -1 at the final step, which keeps our LR value dynamic (Lines 26-33).
# check whether warmup steps is more than 0.
if self.warmupSteps > 0:
# throw a value error if max lr is smaller than start lr
if self.lrMax < self.lrStart:
raise ValueError(
f"lr_start {self.lrStart} must be smaller than or"
+ f" equal to lr_max {self.lrMax}."
)
# calculate the slope of the warmup line and build the
# warmup rate
slope = (self.lrMax - self.lrStart) / self.warmupSteps
warmupRate = slope * tf.cast(step, tf.float32) + self.lrStart
# when the current step is less than the warmup steps, get
# the line graph; when the current step is greater than
# the warmup steps, get the scaled cosine graph
learningRate = tf.where(
step < self.warmupSteps, warmupRate, learningRate
)
# return the lr schedule
return tf.where(
step > self.totalSteps, 0.0, learningRate,
name="learning_rate",
)
On Line 36, we check if warmupSteps is greater than 0. If so, we further check whether the maximum learning rate is smaller than the starting learning rate on Line 38. If it is, we raise a ValueError.
Next, on Lines 46 and 47, we calculate the slope of the warmup line and build the warmup rate.
On Lines 52-54, we calculate the learning rate by checking whether the current step is less than the warmup steps. If it is, we use the linear warmup value; otherwise, we use the scaled cosine value.
Finally, on Lines 57-60, we return the learning rate schedule.
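To get a feel for the schedule, you could evaluate it at a few steps, as in the hypothetical example below; the step counts and learning-rate bounds are arbitrary choices for illustration.

# an illustrative look at the schedule (step counts and bounds are hypothetical)
from pyimagesearch.schedule import WarmUpCosine

schedule = WarmUpCosine(lrStart=1e-4, lrMax=1e-3, warmupSteps=15, totalSteps=100)
for step in [0, 5, 15, 50, 100]:
    print(step, float(schedule(step)))   # linear ramp up to step 15, then cosine decay to 0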
Train Translator
# import the necessary packages
from tensorflow.keras.layers import StringLookup
from tensorflow import keras
import tensorflow as tf
import numpy as np
We start by importing the necessary packages (Lines 2-5).
class TrainTranslator(keras.Model):
def __init__(self, encoder, decoder, sourceTextProcessor,
targetTextProcessor, **kwargs):
super().__init__(**kwargs)
# initialize the encoder, decoder, source text processor,
# and the target text processor
self.encoder = encoder
self.decoder = decoder
self.sourceTextProcessor = sourceTextProcessor
self.targetTextProcessor = targetTextProcessor
We create a class called TrainTranslator on Line 7, which contains all of the functionalities that will help train our encoder, attention module, and decoder with the model.fit() API.
Inside the __init__ function, we initialize the encoder, decoder, source text processor, and target text processor (Lines 13-16).
def _preprocess(self, sourceText, targetText):
# convert the text to token IDs
sourceTokens = self.sourceTextProcessor(sourceText)
targetTokens = self.targetTextProcessor(targetText)
# return the source and target token IDs
return (sourceTokens, targetTokens)
Next, we have the _preprocess function (Line 18), which takes as input the source and the target text. It converts the text to token IDs for both source and the target text on Lines 20 and 21 and then returns them on Line 24.
def _calculate_loss(self, sourceTokens, targetTokens):
# encode the input text token IDs
(encOutput, encFwdState, encBckState, sourceMask) = self.encoder(
sourceTokens=sourceTokens
)
# initialize the decoder's state to the encoder's final state
decState = tf.concat([encFwdState, encBckState], axis=-1)
(logits, attentionWeights, decState) = self.decoder(
inputs=[targetTokens[:, :-1], encOutput, sourceMask],
state=decState,
)
# calculate the batch loss
yTrue = targetTokens[:, 1:]
yPred = logits
batchLoss = self.loss(yTrue=yTrue, yPred=yPred)
# return the batch loss
return batchLoss
On Line 26, we define the _calculate_loss function, which takes in the source and target tokens. We pass the source tokens through our encoder on Line 28. The encoder outputs the following:
encOutput: The encoder outputencFwdState: The encoder forward hidden statesencBckState: The encoder backward hidden statessourceMask: The mask tokens of the source
On Line 33, we concatenate the forward and the backward hidden states as suggested by the authors in the paper. Next, on Lines 35-38, we pass the target tokens (offset by one), encoder output, and source mask as the decoder’s input, with the concatenated encoder hidden states as the decoder’s initial state. The decoder then outputs the logits, attention weights, and decoder states.
On Lines 41-43, we use the target tokens and the retrieved logits to calculate the batch loss, which is then returned on Line 46.
@tf.function(
input_signature=[[
tf.TensorSpec(dtype=tf.string, shape=[None]),
tf.TensorSpec(dtype=tf.string, shape=[None])
]])
def train_step(self, inputs):
# grab the source and the target text from the inputs
(sourceText, targetText) = inputs
# pre-process the text into token IDs
(sourceTokens, targetTokens) = self._preprocess(
sourceText=sourceText,
targetText=targetText
)
# use gradient tape to track the gradients
with tf.GradientTape() as tape:
# calculate the batch loss
loss = self._calculate_loss(
sourceTokens=sourceTokens,
targetTokens=targetTokens,
)
# normalize the loss
averageLoss = (
loss / tf.reduce_sum(
tf.cast((targetTokens != 0), tf.float32)
)
)
# apply an optimization step on all the trainable variables
variables = self.trainable_variables
gradients = tape.gradient(averageLoss, variables)
self.optimizer.apply_gradients(zip(gradients, variables))
# return the batch loss
return {"batch_loss": averageLoss}
We define the train step (Line 53), specifying the input signature of the function on Lines 49-52. The input signature will be required later when we use tf.Module to serve this model for inference.
On Line 55, we grab the source and the target text from the inputs. And next, on Lines 58-61, we pre-process the text into token IDs. Lines 64-69 take care of the gradient tape to track the gradients while calculating the loss.
Finally, on Lines 72-76, we normalize the loss. We then apply an optimization step on all trainable variables (Lines 79-81) and return the normalized loss on Line 84.
@tf.function(
input_signature=[[
tf.TensorSpec(dtype=tf.string, shape=[None]),
tf.TensorSpec(dtype=tf.string, shape=[None])
]])
def test_step(self, inputs):
# grab the source and the target text from the inputs
(sourceText, targetText) = inputs
# pre-process the text into token IDs
(sourceTokens, targetTokens) = self._preprocess(
sourceText=sourceText,
targetText=targetText
)
# calculate the batch loss
loss = self._calculate_loss(
sourceTokens=sourceTokens,
targetTokens=targetTokens,
)
# normalize the loss
averageLoss = (
loss / tf.reduce_sum(
tf.cast((targetTokens != 0), tf.float32)
)
)
# return the batch loss
return {"batch_loss": averageLoss}
On Line 91, we create the test step function, which takes in the inputs. We also specify the input signature of the test step (as done in the train step) on Lines 87-90.
Next, on Lines 96-99, we pre-process the text tokens into token IDs. We then calculate the batch loss on Lines 102-105 and normalize the loss on Lines 108-112. Finally, we return the normalized loss on Line 115.
Translator
class Translator(tf.Module):
def __init__(self, encoder, decoder, sourceTextProcessor,
targetTextProcessor):
# initialize the encoder, decoder, source text processor, and
# target text processor
self.encoder = encoder
self.decoder = decoder
self.sourceTextProcessor = sourceTextProcessor
self.targetTextProcessor = targetTextProcessor
# initialize index to string layer
self.stringFromIndex = StringLookup(
vocabulary=targetTextProcessor.get_vocabulary(),
mask_token="",
invert=True
)
# initialize string to index layer
indexFromString = StringLookup(
vocabulary=targetTextProcessor.get_vocabulary(),
mask_token="",
)
# generate IDs for mask tokens
tokenMaskIds = indexFromString(["", "[UNK]", "[START]"]).numpy()
tokenMask = np.zeros(
[indexFromString.vocabulary_size()],
dtype=bool
)
tokenMask[np.array(tokenMaskIds)] = True
# initialize the token mask, start token, and end token
self.tokenMask = tokenMask
self.startToken = indexFromString(tf.constant("[START]"))
self.endToken = indexFromString(tf.constant("[END]"))
We create a class called Translator that houses all the utility functions needed for inference with our trained encoder and decoder.
Inside the __init__ function (Line 118):
- We initialize the encoder, decoder, source text processor, and target text processor (Lines 122-125).
- On Lines 128-132, we initialize the stringFromIndex layer, which converts token indices back into strings.
- We initialize the indexFromString layer (Lines 135-138), which maps strings to their token indices.
- We generate the IDs for the mask tokens using the indexFromString layer on Lines 141-146.
- We initialize the token mask, start token, and end token (Lines 149-151).
def tokens_to_text(self, resultTokens):
# decode the token from index to string
resultTextTokens = self.stringFromIndex(resultTokens)
# format the result text into a human readable format
resultText = tf.strings.reduce_join(inputs=resultTextTokens,
axis=1, separator=" ")
resultText = tf.strings.strip(resultText)
# return the result text
return resultText
On Lines 153-163, we create the tokens_to_text function, which decodes the token from index back into string format using the stringFromIndex layer.
def sample(self, logits, temperature):
# reshape the token mask
tokenMask = self.tokenMask[tf.newaxis, tf.newaxis, :]
# set the logits for all masked tokens to -inf, so they are
# never chosen
logits = tf.where(
condition=self.tokenMask,
x=-np.inf,
y=logits
)
# check if the temperature is set to 0
if temperature == 0.0:
# select the index for the maximum probability element
newTokens = tf.argmax(logits, axis=-1)
# otherwise, we have set the temperature
else:
# sample the index for the element using categorical
# probability distribution
logits = tf.squeeze(logits, axis=1)
newTokens = tf.random.categorical(logits / temperature,
num_samples=1
)
# return the new tokens
return newTokens
Next, on Lines 165-192, we create the sample function. On Line 167, we reshape the tokenMask, and then on Lines 171-175, we set the logits of these masked tokens to -inf. This is done so that the masked tokens are skipped.
On Line 178, we check if the temperature parameter is set to zero. If yes, then we select the index of the maximum probability element on Line 180. If not, we sample the element’s index using categorical probability distribution on Lines 186-189.
Finally, we return the newTokens on Line 192.
@tf.function(input_signature=[tf.TensorSpec(dtype=tf.string,
shape=[None])])
def translate(self, sourceText, maxLength=50, returnAttention=True,
temperature=1.0):
# grab the batch size
batchSize = tf.shape(sourceText)[0]
# encode the source text to source tokens and pass them
# through the encoder
sourceTokens = self.sourceTextProcessor(sourceText)
(encOutput, encFwdState, encBckState, sourceMask) = self.encoder(
sourceTokens=sourceTokens
)
On Line 196, we create the translate function which takes in sourceText, maxLength, returnAttention, and temperature. We also specify the input signature of the function on Lines 194 and 195.
First, we extract the batchSize from the sourceText vector on Line 199. Next, on Lines 203-206, we encode the source text into tokens and pass them through the encoder.
# initialize the decoder state and the new tokens
decState = tf.concat([encFwdState, encBckState], axis=-1)
newTokens = tf.fill([batchSize, 1], self.startToken)
# initialize the result token, attention, and done tensor
# arrays
resultTokens = tf.TensorArray(tf.int64, size=1,
dynamic_size=True)
attention = tf.TensorArray(tf.float32, size=1,
dynamic_size=True)
done = tf.zeros([batchSize, 1], dtype=tf.bool)
# loop over the maximum sentence length
for i in tf.range(maxLength):
# pass the encoded tokens through the decoder
(logits, attentionWeights, decState) = self.decoder(
inputs=[newTokens, encOutput, sourceMask],
state=decState,
)
# store the attention weights and sample the new tokens
attention = attention.write(i, attentionWeights)
newTokens = self.sample(logits, temperature)
# if the new token is the end token then set the done
# flag
done = done | (newTokens == self.endToken)
# replace the end token with the padding
newTokens = tf.where(done, tf.constant(0, dtype=tf.int64),
newTokens)
# store the new tokens in the result
resultTokens = resultTokens.write(i, newTokens)
# end the loop once done
if tf.reduce_all(done):
break
# convert the list of generated token IDs to a list of strings
resultTokens = resultTokens.stack()
resultTokens = tf.squeeze(resultTokens, -1)
resultTokens = tf.transpose(resultTokens, [1, 0])
resultText = self.tokens_to_text(resultTokens)
# check if we have to return the attention weights
if returnAttention:
# format the attention weights
attentionStack = attention.stack()
attentionStack = tf.squeeze(attentionStack, 2)
attentionStack = tf.transpose(attentionStack, [1, 0, 2])
# return the text result and attention weights
return {"text": resultText, "attention": attentionStack}
# otherwise, we will just be returning the result text
else:
return {"text": resultText}
We initialize the decoder state and the newTokens (Lines 209 and 210). We also initialize the resultTokens, attention vector, and the done vector on Lines 214, 216, and 218, respectively.
From Line 221, we loop over the maximum sentence length. We first pass the encoded tokens through the decoder on Lines 223-226. Next, we store the attention weights in the attention vector and sample newTokens on Lines 229 and 230.
If the new token is the end token, we set the done flag on Line 234 and replace the end token with padding (Lines 237 and 238). We store the newTokens in the resultTokens array on Line 241. Once every sentence in the batch is done, we break out of the loop on Lines 244 and 245.
On Lines 248-251, we convert the list of all generated token IDs back to strings using the tokens_to_text layer.
On Line 254, we check if we have to return the attention weights and if yes, we return them with the resultText on Line 261.
If not, we will return the resultText only on Line 265.
Training
With all the blocks created, we finally start with train.py.
# USAGE
# python train.py
# set random seed for reproducibility
import tensorflow as tf
tf.keras.utils.set_random_seed(42)
# import the necessary packages
from pyimagesearch import config
from pyimagesearch.schedule import WarmUpCosine
from pyimagesearch.dataset import load_data
from pyimagesearch.dataset import splitting_dataset
from pyimagesearch.dataset import make_dataset
from pyimagesearch.dataset import tf_lower_and_split_punct
from pyimagesearch.models import Encoder
from pyimagesearch.models import Decoder
from pyimagesearch.translator import TrainTranslator
from pyimagesearch.translator import Translator
from pyimagesearch.loss import MaskedLoss
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
import tensorflow_text as tf_text
import matplotlib.pyplot as plt
import numpy as np
import os
# load data from disk
print(f"[INFO] loading data from {config.DATA_FNAME}...")
(source, target) = load_data(fname=config.DATA_FNAME)
# split the data into training, validation, and test set
(train, val, test) = splitting_dataset(source=source, target=target)
# build the TensorFlow data datasets of the respective data splits
print("[INFO] building TensorFlow Data input pipeline...")
trainDs = make_dataset(splits=train, batchSize=config.BATCH_SIZE,
train=True)
valDs = make_dataset(splits=val, batchSize=config.BATCH_SIZE,
train=False)
testDs = make_dataset(splits=test, batchSize=config.BATCH_SIZE,
train=False)
We import all the necessary packages on Lines 5-26. Next, on Lines 29 and 30, we load the data from the disk and split the data into training, validation, and test datasets on Line 33.
We create the TensorFlow datasets of each of these on Lines 37, 39, and 41, respectively.
# create source text processing layer and adapt on the training
# source sentences
print("[INFO] performing text vectorization...")
sourceTextProcessor = TextVectorization(
standardize=tf_lower_and_split_punct,
max_tokens=config.SOURCE_VOCAB_SIZE
)
sourceTextProcessor.adapt(train[0])
# create target text processing layer and adapt on the training
# target sentences
targetTextProcessor = TextVectorization(
standardize=tf_lower_and_split_punct,
max_tokens=config.TARGET_VOCAB_SIZE
)
targetTextProcessor.adapt(train[1])
Next, we create the source text processing layer and adapt it on the training source sentences on Lines 47-51. On Lines 55-59, we do the same for target sentences.
# build the encoder and the decoder
print("[INFO] building the encoder and decoder models...")
encoder = Encoder(
sourceVocabSize=config.SOURCE_VOCAB_SIZE,
embeddingDim=config.ENCODER_EMBEDDING_DIM,
encUnits=config.ENCODER_UNITS
)
decoder = Decoder(
targetVocabSize=config.TARGET_VOCAB_SIZE,
embeddingDim=config.DECODER_EMBEDDING_DIM,
decUnits=config.DECODER_UNITS,
)
# build the trainer module
print("[INFO] build the translator trainer model...")
translatorTrainer = TrainTranslator(
encoder=encoder,
decoder=decoder,
sourceTextProcessor=sourceTextProcessor,
targetTextProcessor=targetTextProcessor,
)
On Lines 63-67, we build the encoder model using the vocab size, embedding dimension, and encoder units. Similarly, we build the decoder model on Lines 68-72 with the corresponding configurations.
Finally, we initialize the trainer module called TrainTranslator (Line 76). We pass the encoder, decoder, and the source and target text processors to build it on Lines 77-81.
# get the total number of steps for training.
totalSteps = int(trainDs.cardinality() * config.EPOCHS)
# calculate the number of steps for warmup.
warmupEpochPercentage = config.WARMUP_PERCENT
warmupSteps = int(totalSteps * warmupEpochPercentage)
# Initialize the warmupcosine schedule.
scheduledLrs = WarmUpCosine(
lrStart=config.LR_START,
lrMax=config.LR_MAX,
warmupSteps=warmupSteps,
totalSteps=totalSteps,
)
# configure the loss and optimizer
print("[INFO] compile the translator trainer model...")
translatorTrainer.compile(
optimizer=Adam(learning_rate=scheduledLrs),
loss=MaskedLoss(),
)
# build the early stopping callback
earlyStoppingCallback = EarlyStopping(
monitor="val_batch_loss",
patience=config.PATIENCE,
restore_best_weights=True,
)
Next, we define some important parameters and configurations for training. On Line 84, we get the total number of steps for training. On Lines 87 and 88, we calculate the warmup percentage and the warmup steps. On Lines 91-96, we define the warmup cosine schedule.
We compile the translatorTrainer model on Lines 100-103 using Adam as the optimizer and MaskedLoss as the loss function.
Next, we build the early stopping callback with our configured patience on Lines 106-110.
# train the model
print("[INFO] training the translator model...")
history = translatorTrainer.fit(
trainDs,
validation_data=valDs,
epochs=config.EPOCHS,
callbacks=[earlyStoppingCallback],
)
# save the loss plot
if not os.path.exists(config.OUTPUT_PATH):
os.makedirs(config.OUTPUT_PATH)
plt.plot(history.history["batch_loss"], label="batch_loss")
plt.plot(history.history["val_batch_loss"], label="val_batch_loss")
plt.xlabel("EPOCHS")
plt.ylabel("LOSS")
plt.title("Loss Plots")
plt.legend()
plt.savefig(f"{config.OUTPUT_PATH}/loss.png")
# build the translator module
print("[INFO] build the inference translator model...")
translator = Translator(
encoder=translatorTrainer.encoder,
decoder=translatorTrainer.decoder,
sourceTextProcessor=sourceTextProcessor,
targetTextProcessor=targetTextProcessor,
)
# save the model
print("[INFO] serialize the inference translator to disk...")
tf.saved_model.save(
obj=translator,
export_dir="translator",
signatures={"serving_default": translator.translate}
)
On Lines 113-119, we call the fit API with the data, epochs, and callbacks. We then plot the training and validation losses against the epochs and save the plot to disk on Lines 122-130.
On Lines 134-139, we build the Translator model for inference and then serialize and save the model to disk on Lines 143-147. We will be using this serialized model for inference.
Inference
With training complete, we can finally move to our inference script to see how our model is performing as a translator.
# USAGE
# python inference.py -s "input sentence"
# import the necessary packages
import tensorflow_text as tf_text
import tensorflow as tf
import argparse
# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-s", "--sentence", required=True,
help="input english sentence")
args = vars(ap.parse_args())
# convert the input english sentence to a constant tensor
sourceText = tf.constant([args["sentence"]])
# load the translator model from disk
print("[INFO] loading the translator model from disk...")
translator = tf.saved_model.load("translator")
# perform inference and display the result
print("[INFO] translating english sentence to french...")
result = translator.translate(sourceText)
translatedText = result["text"][0].numpy().decode()
print("[INFO] english sentence: {}".format(args["sentence"]))
print("[INFO] french translation: {}".format(translatedText))
On Lines 5-7 of the inference.py script, we import the necessary packages. On Lines 10-13, we construct the argument parser needed to parse the source sentence.
We first convert the source English sentence to a constant tensor on Line 16. Next, we load the serialized translator from the disk on Line 20.
We use the loaded translator to perform Neural Machine Translation on the sourceText on Line 24 and then display the results on Lines 26 and 27.
You can experiment with the Hugging Face 🤗 space we built for this tutorial to better understand the inference task.
Summary
This was our first tutorial discussing attention and using it for Neural Machine Translation. In upcoming tutorials, we will learn how to improve this architecture using Luong's attention and then how to design an architecture for Machine Translation using attention alone.
The ideas and methodologies behind this tutorial are very simple. However, they are pillars upon which the most advanced Deep Learning Architectures, like Transformers, are built.
So, as Professor Turing asks all of us in Figure 4, “Are you paying attention?”

Citation Information
A. R. Gosthipaty and R. Raha. “Neural Machine Translation with Bahdanau’s Attention Using TensorFlow and Keras,” PyImageSearch, P. Chugh, S. Huot, K. Kidriavsteva, and A. Thanki, eds., 2022, https://pyimg.co/kf8ma
@incollection{ARG-RR_2022_Bahdanau,
author = {Aritra Roy Gosthipaty and Ritwik Raha},
title = {Neural Machine Translation with {Bahdanau's} Attention Using TensorFlow and Keras},
booktitle = {PyImageSearch},
editor = {Puneet Chugh and Susan Huot and Kseniia Kidriavsteva and Abhishek Thanki},
year = {2022},
note = {https://pyimg.co/kf8ma},
}