DEV Community

Md. Mahamudur Rahman
Demystifying Transformers Architecture in a Simpler Way

Introduction

In the world of Natural Language Processing (NLP), the "Attention Is All You Need" research paper introduced an influential architecture known as Transformers. Published in 2017 by researchers from Google Brain, Google Research, and the University of Toronto, this paper presented a groundbreaking method for teaching computers to understand and generate human language. You can read the Transformers paper here. In this blog post, we will break down the key steps of the Transformers architecture, making them accessible and engaging for readers of all ages.

[Figure: the Transformers architecture diagram]

Let's Dive In

Step 1: Word Representation
We begin with a sentence or a sequence of words, such as "I love playing soccer." In order to process these words, we represent each one as a number or a vector. Think of it as giving each word a unique code that the computer can understand.

Step 2: Word Table
Next, we arrange these word representations in a table-like structure. Each word gets its own row in the table, and each column represents a different aspect of the word, such as its meaning or position. This organized setup helps the computer keep track of important details about each word.
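The word table from Steps 1 and 2 can be sketched in a few lines of NumPy. In a real model the vectors are learned during training; here random numbers stand in for them, purely to show the shape of the table (the sentence and the embedding size `d_model` are illustrative choices, not values from the paper).

```python
import numpy as np

# Our example sentence, one entry per word.
sentence = ["I", "love", "playing", "soccer"]

# In a real model these vectors are learned during training;
# random numbers stand in for them here, just to show the structure.
rng = np.random.default_rng(0)
d_model = 4  # number of columns (features) per word
word_table = rng.normal(size=(len(sentence), d_model))

print(word_table.shape)  # one row per word, one column per feature
```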

Step 3: Query, Key, and Value Matrices
Now we introduce three special matrices: Query (Q), Key (K), and Value (V). These matrices are created by multiplying the word table by three learned matrices called weight matrices. It's like mixing the word representations together in different ways to form these special matrices.
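A minimal sketch of Step 3, again with random numbers standing in for the learned weight matrices (the sizes are illustrative assumptions, not the dimensions used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 4
word_table = rng.normal(size=(4, d_model))  # 4 words, d_model features each

# Three learned weight matrices (random here, purely for illustration).
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = word_table @ W_q  # Query: what each word is looking for
K = word_table @ W_k  # Key: what each word offers to others
V = word_table @ W_v  # Value: the information each word carries
```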

Step 4: Attention Scores
To understand the relationships between words, we calculate "attention scores" for each word. Imagine each word trying to pay attention to other words based on their relevance. We achieve this by multiplying the Query matrix with the transpose (flipped version) of the Key matrix. It's like measuring how much attention one word should give to another. (In the original paper, these scores are also divided by the square root of the key dimension so they don't grow too large.)
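Step 4 is a single matrix multiplication. In this sketch, Q and K are random placeholders for the matrices built in Step 3:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 4))  # one query vector per word
K = rng.normal(size=(4, 4))  # one key vector per word

# Entry [i, j] measures how strongly word i attends to word j.
scores = Q @ K.T
print(scores.shape)  # every word scored against every other word
```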

Step 5: Measures of Similarity
These attention scores act as measures of similarity between words. We want words that are related to have higher attention scores. This helps the computer identify which words are important for understanding the meaning of a sentence.

Step 6: Making Attention Scores User-Friendly
To make the attention scores easier to work with, we use a mathematical function called Softmax. This function ensures that each word's attention scores add up to 1 and emphasizes the more important words. It's like adjusting the focus of the spotlight to highlight the most relevant words.
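Softmax itself is only a couple of lines. The scores below are made-up numbers, chosen just to show that larger scores come out with larger weights and each row sums to 1:

```python
import numpy as np

def softmax(x):
    # Subtract the row-wise max first for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Made-up raw attention scores for one word attending to three words.
scores = np.array([[2.0, 1.0, 0.1]])
weights = softmax(scores)
print(weights)        # the largest score gets the largest weight
print(weights.sum())  # each row sums to 1.0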

Step 7: Combining Attention Scores and Value Matrix
Here comes the exciting part! We combine the attention scores with the Value matrix. By multiplying the attention scores with the Value matrix, we get what we call the "attention output." It's like taking the important information from each word and putting it together to form a complete understanding.
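Step 7 in miniature, with hypothetical attention weights and value vectors (hand-picked numbers, not learned ones) so the blending is easy to follow: each output row is a weighted mix of the value rows.

```python
import numpy as np

# Hypothetical attention weights for 3 words (rows already sum to 1).
weights = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4]])

# Hypothetical value vectors, one row per word.
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

attention_output = weights @ V  # each row blends the value vectors
print(attention_output[0])      # word 1's output: 0.7*V[0] + 0.2*V[1] + 0.1*V[2]
```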

Step 8: Attention Outputs for Every Word
We repeat this process for every word in the sentence. Each word generates its own attention output, representing the focused and relevant information specific to that word. It's like each word gets its own unique spotlight moment.

Step 9: Utilizing Attention Outputs
Finally, we can use these attention outputs for various purposes, such as translation or summarization. We can also perform additional calculations using the attention outputs to obtain a final result, customized to the task at hand.
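The steps above can be collected into one small function. This is a sketch of scaled dot-product attention only, not the full Transformer (which adds multiple heads, positional encodings, feed-forward layers, and more); the inputs are random placeholders:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Steps 4-8 in one place (a sketch, not the full Transformer)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # Step 4, with the paper's scaling
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)  # Step 6: Softmax
    return weights @ V                           # Step 7: attention output

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # one attention output per word (Step 8)
```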

Conclusion

In simpler terms, the Transformers architecture introduced in the "Attention Is All You Need" paper involves representing words as numbers, arranging them in a table-like structure, calculating attention scores to understand relationships, and combining attention outputs for a holistic understanding of language. This approach has revolutionized how computers process and generate human language, opening up exciting possibilities in the field of NLP. With Transformers, computers can now grasp the intricacies of language and communicate with us in a more human-like manner. It gets even more exciting when we apply the Transformers architecture to Computer Vision and many other modalities and tasks. As an example, you can read this blog post to see how it works.
