Imagine your brain as a supercomputer, constantly flooded with data—sights, sounds, thoughts. Every second, new information bombards us, yet somehow, we avoid being overwhelmed. Our brains don’t process every detail; they focus on what matters and filter the rest. Deep learning models, especially transformer-based ones like GPT and BERT, try to mimic this focus. But in a digital world, how do they know what’s important and what to ignore?
Why Do Transformers Need an Attention Mask?
Imagine sitting in a bar, talking with a friend while other people chat loudly around you. You don’t listen to every conversation. You focus only on your friend’s words, even though your ears capture the surrounding chatter. Your brain creates an ‘attention mask,’ filtering out everything irrelevant.
Transformers have a similar challenge and this is where attention masks step in. An attention mask is a tool that tells the model which parts of the input are relevant (ones to “pay attention to”) and which parts should be ignored (masked out). It’s like a set of invisible markers that highlight where to focus and what to skip.
💁🏾 Can you show a clear example that highlights how attention works in a sentence?
🧏🏾♂️ Sure, check this out
How Attention Masks Work in Practice
Consider the following sentence:
"The cat sat q1230jiqowe on 3rk30k1 the mat 1231"
Here is how our model will represent the data:
Sentence: [The, cat, sat, q1230jiqowe, on, 3rk30k1, the, mat, 1231]
Attention Mask: [1, 1, 1, 0, 1, 0, 1, 1, 0]
In this mask:
- 1s indicate the model should focus on these tokens because they form the actual sentence.
- 0s indicate noise or irrelevant tokens that the model should ignore.
You probably filtered out the irrelevant characters automatically as you read and focused on the valuable information. That’s exactly what the attention mask lets the model do.
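To make this concrete, here is a minimal sketch in plain NumPy (the attention scores are made up purely for illustration) of what the model does with those 1s and 0s internally: positions where the mask is 0 get a large negative score added before the softmax, so they end up with essentially zero attention weight.

```python
# Minimal sketch: applying a 0/1 attention mask to one token's attention scores.
# The scores are invented for illustration; only the masking step matters here.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Raw attention scores toward the 9 tokens of the example sentence (made up)
scores = np.array([2.1, 1.7, 0.3, 1.5, 0.8, 1.2, 0.4, 1.9, 1.0])
mask   = np.array([1,   1,   1,   0,   1,   0,   1,   1,   0])

# Push masked positions toward -infinity so softmax gives them ~0 weight
masked_scores = scores + (1 - mask) * -1e9

weights = softmax(masked_scores)
print(weights.round(3))  # the noise tokens ("q1230jiqowe", "3rk30k1", "1231") get ~0
```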
In transformers, data is fed into the model in the form of a sequence (like a sentence split into tokens). Not all tokens in the sequence are useful; some may be padding tokens added to create equal-length inputs for batch processing. Attention masks act as a filter to block out these irrelevant tokens.
💁🏾 Pad…
🧏🏾♂️ Yes I know, let me explain
Well, models have a particular way of processing data. All the inputs in a batch need to have the same length before the model can process them, and the maximum length a model accepts is fixed when it is designed and trained; it’s usually called the maximum sequence length (or context length).
In practice, if a model has a maximum sequence length of 10, for example, it looks like this:
Sentence 1: The cat sat on the mat
→ [The, cat, sat, on, the, mat, [PAD], [PAD], [PAD], [PAD]]
Sentence 2: The dog ate the fish and ran to the room before I could realize
→ [The, dog, ate, the, fish, and, ran, to, the, room] (truncated to the first 10 tokens)
This helps the model expect a fixed amount of information at a time and avoid excessive data intake, a bit like a speed limiter. That could be the topic of another article, but for now let’s stick to attention masks.
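In practice you rarely build this padding by hand. With the Hugging Face transformers library, for instance, the tokenizer pads (or truncates) each sentence and returns the matching attention mask for you. Here is a rough sketch, assuming transformers is installed and using bert-base-uncased purely as an example checkpoint:

```python
# Rough sketch with the Hugging Face `transformers` library (assumed installed).
# The tokenizer pads/truncates both sentences to max_length and builds the
# attention_mask automatically: 1 for real tokens, 0 for [PAD].
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentences = [
    "The cat sat on the mat",
    "The dog ate the fish and ran to the room before I could realize",
]

encoded = tokenizer(
    sentences,
    padding="max_length",  # pad every sequence up to max_length
    truncation=True,       # cut off anything longer than max_length
    max_length=10,
    return_tensors="pt",
)

print(encoded["input_ids"])       # token ids, with the pad id where [PAD] was inserted
print(encoded["attention_mask"])  # 1s for real tokens, 0s for padding
```

Note that BERT’s tokenizer also inserts special tokens such as [CLS] and [SEP], so the exact token counts differ slightly from the hand-written example above, but the principle stays the same: 1 for real tokens, 0 for padding.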
Why Attention Masks Matter
Without attention masks, transformers would process all parts of an input indiscriminately. In practice, this would make the model prone to errors, focusing on irrelevant data and potentially "hallucinating" patterns that don’t exist.
By focusing on key information and ignoring unnecessary parts, attention masks keep models efficient, accurate, and grounded in relevant data.
🤷🏾 I think I get the point now, but how does this help anyone in real-world scenarios?
💁🏾♂️ Well there are many applications but take this one…
Practical Impact of Attention Masks
Think of a voice-controlled virtual assistant that responds to commands like, "Play my favorite song." Often, the audio data is noisy, with background sounds, pauses, or even other conversations nearby. Without an attention mask, the assistant might focus on everything in the audio stream, including background noises and other voices. This could lead to misinterpreting the command or even responding to unrelated words.
For example, if someone says:
"Uh, Alexa, can you uhmm play my favorite song? (kids talking in the background)"
Without an attention mask, the assistant might process every single word, including "uh," “uhmm,” the kids talking in the background, and other irrelevant sounds. This can make it slower to respond or even trigger the wrong action.
With an attention mask, the assistant zeros in on the actual command ("can you play my favorite song?") and filters out the rest. This helps it respond quickly, accurately, and without being thrown off by background noise, providing a much smoother user experience.
iPhone users can probably relate, given how Siri behaves most of the time.
Brief Note on Other Mask Types
In addition to attention masks, there are several other types of masks that play important roles in transformer models:
- Padding Masks: These masks indicate which tokens in a sequence are padding tokens (usually represented as 0 or a special token). Padding is used to ensure all input sequences in a batch are of equal length. Padding masks help the model ignore these irrelevant tokens during processing, much like attention masks.
- Segment Masks: In tasks like question-answering or sentence-pair classification, segment masks distinguish between different segments of input. For instance, in a question-answer pair, one segment might represent the question while the other represents the context. This helps the model understand how to treat different parts of the input relative to one another.
- Subword Masks: In models that utilize subword tokenization (like BERT), these masks help identify which parts of the input correspond to actual subwords and which are merely padding or irrelevant. This ensures that the model focuses on meaningful linguistic units.
- Future Masks: In autoregressive models like GPT, future masks prevent the model from attending to future tokens in the sequence during training. This ensures that predictions for the next token are based solely on past tokens, maintaining the causal nature of the model (a small code sketch follows this list).
- Token Type IDs: While not a mask in the strict sense, token type IDs indicate the type of token in a sequence. They can be useful for differentiating between multiple sentences or parts of text in tasks that require understanding of context. They are sometimes used interchangeably with segment mask IDs, something I realized while working with a BERT question-answering model.
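To make the future (causal) mask concrete, here is a minimal PyTorch sketch (assuming torch is installed; the attention scores are made up). It builds the triangular mask that autoregressive models like GPT rely on and applies it the same way the padding mask was applied earlier:

```python
# Minimal sketch of a causal ("future") mask in PyTorch (torch assumed installed).
# Row i is a query position; a 1 means that key position may be attended to.
import torch

seq_len = 5

# Lower-triangular matrix: position i can see positions 0..i, never the future
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]])

# Applied exactly like the padding mask: forbidden positions get -inf before softmax
scores = torch.randn(seq_len, seq_len)                 # made-up attention scores
scores = scores.masked_fill(~causal_mask, float("-inf"))
weights = torch.softmax(scores, dim=-1)                # future positions now get 0 weight
```

In practice, padding masks and causal masks are combined: a position is attended to only if it is both a real token and not in the future.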
Closing Recap
In summary, attention masks are a crucial component of transformer models, enabling them to focus on relevant information while filtering out distractions. Just as our brains prioritize significant data amidst a flood of sensory input, attention masks guide models to pay attention to important tokens and ignore irrelevant ones.
- Information Filtering: Just like you filter out background noise when having a conversation, attention masks help models zero in on relevant input, ensuring accurate processing.
- Practical Applications: The impact of attention masks is clear in real-world scenarios, such as voice-controlled assistants, where the ability to focus on user commands amidst background chatter is vital for delivering a seamless user experience.
- Integration with Other Masks: Attention masks work in harmony with other types of masks, such as padding masks, segment masks, and future masks, all of which contribute to the overall effectiveness of transformer architectures.
By understanding how attention masks function, we can appreciate the sophistication behind models like GPT and BERT, which mimic human cognitive abilities to process and prioritize information. As the field of deep learning continues to evolve, mastering these concepts will empower developers and researchers to build more efficient and accurate AI systems.
🙆🏾♀️ Is it finished already? I still wanted to learn some things and ask questions 🙁
🙋🏾♂️ Don’t worry, drop your questions in the comment section or even DM me; I’ll do my best to answer
Further Reading
If you’re interested in exploring more about attention masks and transformer models, consider the following resources:
- “Attention is All You Need”: the seminal paper by Vaswani et al. that introduced the transformer architecture. It provides a deep dive into the attention mechanism and its applications in natural language processing.
- The Illustrated Transformer: a visually engaging explanation of how transformers work, including attention mechanisms, by Jay Alammar. A fantastic resource for beginners looking to grasp the foundational concepts.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: the original BERT paper by Devlin et al., which explains how attention masks are utilized in BERT and their role in pre-training and fine-tuning.
- A Beginner’s Guide to Understanding Transformers in NLP: an accessible article that breaks down the transformer architecture and its components, including attention masks, in a clear and concise manner.
- Understanding the Attention Mechanism in Transformers: a detailed blog post that delves into the different types of attention mechanisms used in transformers, providing examples and practical insights.
These resources will help you deepen your understanding of attention masks and transformer architectures, enhancing your knowledge and skills in deep learning and natural language processing.
About Me
Hi there! I'm Juvet Manga, a passionate young machine learning engineer specializing in developing cutting-edge AI models for mobile applications. With a focus on deep learning and natural language processing, I strive to bridge the gap between complex technology and everyday understanding.
Currently, I’m working on an exciting project involving transformer models as a member of the startup Mapossa. My goal is to make AI accessible and comprehensible for everyone, whether you're a seasoned developer, a curious business exec, or just starting your journey in tech.
In addition to my technical work, I love sharing knowledge through writing and presentations, aiming to simplify advanced concepts for a broader audience. When I'm not coding, you can find me playing games (Legend of Zelda is my favorite😍) or exploring the latest AI research. Let’s connect and explore the fascinating world of AI together !
-> LinkedIn: Juvet Manga
-> X: juvet_manga