DEV Community

Olorundara Akojede (dvrvsimi)


The Magic of Attention: How Transformers Improved Generative AI

Table of Contents.

  1. Preamble
  2. Encoder-Decoder Architecture
  3. Limitations of the Traditional Encoder-Decoder
  4. "Attention is all you need"
  5. Conclusion


Preamble.

Generative AI is the new buzzword in the world of AI. Big enterprises are looking to incorporate generative features into their solutions, and AI engineers are working harder than ever to train models that generate content in ways that were once inconceivable.

Watch this video of Sundar Pichai (Chief Executive Officer at Google). It compiles all the times he said "AI" and "generative AI" during his keynote speech at Google I/O 2023:

Generative AI refers to a branch of artificial intelligence that focuses on generating new content based on patterns and examples from existing data. This content may take the form of a captivating story in text, an image of a landscape from the Paleolithic era, or even audio of what Mozart would sound like in a different genre like jazz.

Generative AI involves training a model on large datasets, enabling it to produce near-original content that expands on the patterns it has learned. In this article, I will talk about the technologies on which generative AI models are built and how transformers have improved generative AI over the years, so stay glued!


Readers should have a good understanding of:

  • Machine learning and

  • Artificial intelligence.

Encoder-Decoder Architecture.

To properly communicate with AI models, it is important to make them understand the information that is being conveyed, and raw human language alone would not suffice. This is why the encoder-decoder architecture was developed. It is a neural network sequence-to-sequence architecture that was designed for machine translation, text summarization, question answering, and other machine learning use cases.

Just as its name suggests, it has two networks: the encoder and the decoder. These networks serve as the model's gateways for input and output: the encoder receives the input and the decoder produces the output.

In the encoder, an input sequence in natural language is converted into its corresponding vector representation. This vector representation attempts to capture all the relevant bits from the input sequence (or prompt).

This vector representation is then fed into the decoder network which generates an output after a series of internal processes.

a visual representation of the Encoder-Decoder architecture

The Encoder Network.

For an encoder with a Recurrent Neural Network (RNN) internal architecture, an input sequence like The man is going to the bank must first be tokenized. Tokenization splits the natural-language text into discrete units (tokens) and maps them to integer IDs that the model can process. The RNN then reads the tokens one at a time, updating its hidden state at each step, until the whole input sequence has been consumed.

As a rule of thumb for common subword tokenizers, one token averages about 4 characters of English text, so the example above (28 characters including spaces) would come to roughly 7 tokens.

These tokens are then passed into an embedding layer, where each token is converted into a vector. The encoder combines these embeddings into a single vector representation and passes it on to the decoder, in some designs through a feedforward layer.
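The tokenization and embedding steps can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: it uses a hypothetical word-level tokenizer (real models usually use subword tokenizers) and a randomly initialised embedding table instead of a trained one. The helper names `build_vocab`, `tokenize`, `make_embedding_table`, and `embed` are made up for this example.

```python
import random

def build_vocab(sentences):
    """Map every distinct lowercase word to an integer ID (hypothetical word-level scheme)."""
    vocab = {"<unk>": 0}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def tokenize(sentence, vocab):
    """Convert a sentence into a list of token IDs; unknown words map to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]

def make_embedding_table(vocab_size, dim, seed=0):
    """Create a random (untrained) embedding table: one vector of length `dim` per token ID."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Look up the embedding vector for each token ID."""
    return [table[i] for i in token_ids]

sentence = "The man is going to the bank"
vocab = build_vocab([sentence])
token_ids = tokenize(sentence, vocab)
table = make_embedding_table(len(vocab), dim=4)
vectors = embed(token_ids, table)

print(token_ids)     # [1, 2, 3, 4, 5, 1, 6]
print(len(vectors))  # 7 vectors, one per token
```

Note that both occurrences of "the" share the same ID and therefore the same embedding vector. In a trained model the table values are learned rather than random.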

The Decoder Network.

The encoder and decoder can be built on different architectures and more complex blocks, although using the same architecture for both is common.

The decoder has its own input sequence, which is also tokenized and embedded. Introducing this sequence of tokens to the decoder triggers it to predict the next token based on the contextual understanding provided by the encoder. The first prediction, like every later one, is output through a softmax output layer.
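To make the softmax output layer concrete, here is a minimal sketch of how raw decoder scores (logits) become a probability distribution over the vocabulary. The five-word vocabulary and the logit values are made up for illustration; a real decoder would score tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (shifted by the max for numerical stability)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["der", "mann", "geht", "zur", "bank"]  # hypothetical tiny target vocabulary
logits = [2.0, 0.5, 0.1, -1.0, 0.3]            # made-up decoder scores for the first output token

probs = softmax(logits)
best = max(range(len(vocab)), key=lambda i: probs[i])

print(vocab[best])  # der
```

The token with the largest logit also gets the largest probability. Softmax does not change the ranking; it only rescales the scores so that they can be read as probabilities.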

After the first token is generated, the decoder repeats this prediction process, feeding each predicted token back in as input. Generation starts from a special <start> token given to the decoder and stops when the decoder predicts a special <end> token.

The final sequence of tokens is detokenized back into natural language. In a language translation use case, the output generated would be: Der Mann geht zur Bank for a German target language.

encoder-decoder architecture showing the important blocks

Training on Encoder-Decoder Architecture.

Training an encoder-decoder model is more complicated than training regular predictive models. It requires a collection of input/output pairs (for translation, source sentences paired with reference translations) for the model to imitate.

Likewise, during training the decoder needs to be fed the correct previous token from the reference translation rather than the token it generated itself. This technique is called teacher forcing, and it is good practice only when you have a credible ground truth.
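Teacher forcing is easier to see with the data laid out. The sketch below builds the (input, target) pairs a decoder would be trained on for one reference translation. It assumes the common convention of feeding a `<start>` marker first and expecting an `<end>` marker as the final target. The helper name `teacher_forcing_pairs` is hypothetical.

```python
def teacher_forcing_pairs(target_tokens, start="<start>", end="<end>"):
    """Pair each ground-truth token fed to the decoder with the token it should predict next."""
    decoder_inputs = [start] + target_tokens   # reference sequence shifted right by one
    decoder_targets = target_tokens + [end]    # what the decoder must output at each step
    return list(zip(decoder_inputs, decoder_targets))

reference = ["Der", "Mann", "geht", "zur", "Bank"]
pairs = teacher_forcing_pairs(reference)

for fed, expected in pairs:
    print(f"decoder is fed {fed!r:10} and should predict {expected!r}")
```

At every step the decoder sees the correct previous token from the reference, even if it would have predicted something else on its own, so one early mistake does not derail the rest of the training sequence.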

At each step, the softmax layer gives a probability for every token in the vocabulary, and the decoder network uses these probabilities to generate the next token. There are 2 common algorithms for “choosing” the next token in NLP:

  • Greedy search: This algorithm chooses, at every step, the token with the highest conditional probability from the vocabulary as the next generated token. Take a look at the image below. Can you tell what sentence the decoder generated? Note that the red saturation decreases as the probability decreases.

an example of greedy search

If your answer was the last global war is abbreviated as WWII, you are correct. Greedy search is easy to implement, but it does not always generate optimal results. An approach that often does better is the beam search algorithm.

  • Beam search: Instead of committing to the single most probable token at each step, the algorithm keeps a fixed number of the most probable partial sequences (the beam width), extends each of them, and finally picks the sequence with the highest overall probability. The example above would therefore be chosen from a pool of candidate sequences such as the war is last abbreviated global WWII as, last war abbreviated is the WWII as global, etc. This approach costs more computation than greedy search, but it often produces better output because it weighs whole sequences rather than single tokens.
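The difference between the two strategies shows up even on a tiny example. The sketch below runs both over a made-up table of conditional probabilities (not taken from any trained model) that is deliberately built so the most probable first token does not lead to the most probable sequence. The names `TABLE`, `next_token_probs`, `greedy_search`, and `beam_search` are hypothetical.

```python
import math

END = "<end>"

# Made-up conditional probabilities P(next token | tokens so far); not from a trained model.
TABLE = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"war": 0.55, "bank": 0.45},
    ("a",): {"war": 0.9, "bank": 0.1},
}

def next_token_probs(prefix):
    """Return the toy distribution for this prefix; prefixes not in the table can only end."""
    return TABLE.get(tuple(prefix), {END: 1.0})

def greedy_search(max_len=10):
    """Pick the single most probable token at every step."""
    seq, log_prob = [], 0.0
    for _ in range(max_len):
        probs = next_token_probs(seq)
        token = max(probs, key=probs.get)
        log_prob += math.log(probs[token])
        if token == END:
            break
        seq.append(token)
    return seq, math.exp(log_prob)

def beam_search(beam_width=2, max_len=10):
    """Keep the `beam_width` most probable partial sequences at every step."""
    beams = [([], 0.0, False)]  # (tokens, log probability, finished?)
    for _ in range(max_len):
        if all(done for _, _, done in beams):
            break
        candidates = []
        for seq, log_prob, done in beams:
            if done:
                candidates.append((seq, log_prob, True))
                continue
            for token, p in next_token_probs(seq).items():
                if token == END:
                    candidates.append((seq, log_prob + math.log(p), True))
                else:
                    candidates.append((seq + [token], log_prob + math.log(p), False))
        # Keep only the best `beam_width` candidates for the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best_seq, best_log_prob, _ = beams[0]
    return best_seq, math.exp(best_log_prob)

greedy_seq, greedy_p = greedy_search()
beam_seq, beam_p = beam_search(beam_width=2)

print(greedy_seq, round(greedy_p, 2))  # ['the', 'war'] 0.33
print(beam_seq, round(beam_p, 2))      # ['a', 'war'] 0.36
```

Greedy search commits to "the" because it is the likelier first token, while the beam keeps "a" alive long enough to discover that "a war" is the more probable sequence overall. Log probabilities are summed instead of multiplying raw probabilities, a common way to avoid numerical underflow on long sequences.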

The encoder-decoder architecture is great because the input sequence and the generated output can be of varying lengths. This is very useful in image/video captioning as well as question-answering use cases. However, this architecture has a bottleneck that has made its original form largely obsolete over the years.

Limitations of the Traditional Encoder-Decoder.

When the encoder converts an input sequence into a vector, it compresses all the contextual information into that single fixed-size vector. This poses a problem when the input sequence is too long. The encoder struggles to keep all the relevant bits, and the decoder, which sees only that one vector, may miss relevant information regardless of whether the generated output is short or not. This may lead to inaccuracies in the generated output.
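The bottleneck can be seen in a toy encoder. The sketch below is a minimal tanh RNN in plain Python with random, untrained weights. The hidden size of 3 and the random embeddings are arbitrary assumptions made for illustration. Whether the input has 7 tokens or 700, the decoder receives a vector of exactly the same size.

```python
import math
import random

def rnn_encode(embeddings, hidden_size=3, seed=0):
    """Run a minimal tanh RNN over the embeddings and return only the final hidden state."""
    rng = random.Random(seed)
    input_size = len(embeddings[0])
    # Random (untrained) weights, for illustration only.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
    h = [0.0] * hidden_size
    for x in embeddings:
        # The new hidden state mixes the current token with the previous hidden state.
        h = [
            math.tanh(
                sum(w_in[i][j] * x[j] for j in range(input_size))
                + sum(w_h[i][k] * h[k] for k in range(hidden_size))
            )
            for i in range(hidden_size)
        ]
    return h

def random_embeddings(num_tokens, dim=4, seed=1):
    """Stand-in for real token embeddings."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]

short_context = rnn_encode(random_embeddings(7))
long_context = rnn_encode(random_embeddings(700))

print(len(short_context), len(long_context))  # 3 3
```

Every extra token has to be squeezed into the same few numbers, which is why information from early in a long sequence tends to get lost.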

malfunctioning bot

How was this problem tackled without jeopardizing the context of the sequence?

money heist meme

"Attention is all you need"

This is the title of a paper published in 2017 by Vaswani et al. This groundbreaking paper introduced the Transformer model, a novel architecture that revolutionized the field of NLP and became the foundation of the popular Large Language Models (LLMs) that are around today (GPT, PaLM, Bard, etc.). The paper proposes a neural network architecture built entirely on attention mechanisms instead of the traditional RNNs. Click here if you wish to read the paper.

A transformer can be summarized as an encoder-decoder model with an attention mechanism. The image below is from the paper. Note how the attention layers are built into both the encoder and the decoder.

Flow diagram of transformer model from the paper "Attention is all you need"

Overview of Attention Mechanisms.

The attention mechanism is built to focus on the most important parts of the input sequence rather than weighing all of it equally. Rather than building a single context vector out of the last hidden state of the encoder, the attention mechanism creates shortcuts between the context vector and the entire input sequence.

The weights on these shortcuts vary for each output element, so each output step gets its own context vector. Hence, the model learns the alignment of the input sequence with the target output by noting the emphasized tokens.
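A single attention step can be sketched as follows: score every encoder state against the current decoder state, turn the scores into weights with softmax, and build the context vector as the weighted sum of the encoder states. This sketch assumes dot-product scoring, which is only one of several scoring functions in use, and the 2-dimensional state values are made up so that the result is easy to read. The helper name `attend` is hypothetical.

```python
import math

def softmax(scores):
    """Normalise scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(decoder_state, encoder_states):
    """Return attention weights and the context vector for one output step."""
    scores = [dot(decoder_state, h) for h in encoder_states]  # one score per input token
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: weighted sum of all encoder states.
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]
    return weights, context

tokens = ["the", "man", "is", "going", "to", "the", "bank"]
# Made-up 2-dimensional encoder hidden states, one per input token.
encoder_states = [
    [0.1, 0.0], [0.9, 0.2], [0.0, 0.1], [0.3, 0.8],
    [0.1, 0.1], [0.1, 0.0], [0.2, 1.0],
]
decoder_state = [0.0, 3.0]  # hypothetical decoder state for the current output step

weights, context = attend(decoder_state, encoder_states)
most_attended = tokens[max(range(len(tokens)), key=lambda i: weights[i])]

print(most_attended)  # bank
```

Because the weights are recomputed for every output step, the decoder can look at a different part of the input each time instead of relying on one fixed summary.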

"But how does the model know where to channel its attention?"
It calculates a score, known as an alignment score, which quantifies how much attention should be given to each input. Look at the heatmap below from the Neural Machine Translation by Jointly Learning to Align and Translate paper, which shows how attention works in a translation use case.


For the previous input sequence, The man is going to the bank, the translation should not pose any problem for regular encoder-decoder models. But what if the sequence is longer and has more context?

Take a new input sequence like The man is going to the bank to fish. For regular encoder-decoder models, the generated output in the target language may not align with the contextual meaning of the source language because "bank" now exists with more than one possible translation.

While spotting this distinction is an easy feat for humans, it may be hard for the traditional encoder-decoder. Hence, it may produce an output like Der Mann geht zur Bank, um zu fischen instead of Der Mann geht zum Flussufer, um zu fischen. The latter is more accurate and would make more sense to a German speaker because *Flussufer means riverbank.
*Another translation could be "Ufer", which means "shore".

In the above instance, "bank" and "fish" would carry the heaviest weights in an encoder-decoder with an attention mechanism, so the model can use "fish" to pick the right sense of "bank".

In application, attention layers need to be integrated with the regular encoder-decoder architecture and these layers exist in various types which include:

  • Generalized attention layer
  • Self-attention layer
  • Multi-head attention layer

To know more about layer types, check this article


Conclusion.

The advent of attention mechanisms has revolutionized generative AI, enabling machines to better understand us and to generate complex sequences with a precision that sometimes bewilders humans. Applications across machine translation, question answering, text summarization, and more have benefitted from attention's ability to capture contextual relationships.

As we look to the future, combining attention mechanisms with other architectural innovations holds immense potential for handling even more challenging tasks. Generative AI has reached its best milestone yet, and it will continue to get better with more attention-driven applications as machines keep surpassing previous landmarks. It is the responsibility of humans to shape this trajectory for the betterment of life.

If you enjoyed this article, I would appreciate it if you left a reaction or a comment. Know someone else who would find this article insightful? Shares are very much welcome too! I am on Twitter @dvrvsimi and Medium @daraakojede01

