Transformers are very important to modern society. At one point I thought transformers had been sent by the devil to transform the world into chaos. At another I thought they would transform it into a utopia. Either way, the power they possess is alluring. Where does it come from? What is attention? How does it produce such amazing feats?
What is Attention?
The attention mechanism is effectively a way to extract features from input data. Just as convolutional layers extract features from an image, attention mechanisms extract features from language, or from anything else you feed into them. In a Transformer, attention compares a sequence to itself, for example to predict the next word, but the same mechanism can compare other things as well.
Attention Heads as Information Channels
To understand the intuition behind a transformer, think of it as setting up information channels between different parts of a sentence. If you have ever studied convolutional neural networks, a transformer is like a CNN on steroids. A CNN slides a kernel across the image and extracts features from one small square at a time. An attention head instead compares every word in the sentence with every other word and scores how strongly they relate. Each strong score works like an open channel between two parts of the sentence, so that information can get through. The other difference is that there are multiple attention heads, each looking at the whole sentence and each setting up its own channels between different parts.
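Here is a minimal sketch of the "channels" idea in plain Python. It is a toy under stated assumptions, not how a real transformer is built: the token vectors are made-up numbers, and each head simply reads a different slice of the features, where a trained model would use learned projections instead. The names softmax and head_channels are hypothetical helpers written for this example.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_channels(tokens, start, stop):
    """Attention weights for one toy head that only reads features start:stop.

    Row i holds how strongly token i "listens" to every token j.
    """
    sliced = [vec[start:stop] for vec in tokens]
    scale = math.sqrt(stop - start)
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in sliced])
        for q in sliced
    ]

# Hypothetical 3-token sentence with 4 features per token (made-up numbers).
tokens = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 0.0, 2.0, 0.0],
]

# Two heads, each reading a different half of the features (an assumption
# made for illustration; real heads use learned projections).
head_a = head_channels(tokens, 0, 2)
head_b = head_channels(tokens, 2, 4)

for name, weights in (("head A", head_a), ("head B", head_b)):
    print(name)
    for row in weights:
        print("  ", [round(w, 2) for w in row])
```

Each printed row is one token's set of channels. In this toy data the first token listens mostly to the third token in head A, but mostly to the second token in head B, which is the sense in which different heads open different channels over the same sentence.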
Key, Query, and Value Intuition
I like to think of things in terms of y = mx + b. A neural network applies this relationship recursively, with each m containing more y = mx + b, but the basic idea is that m is calculated to capture the relationship between y and x. Query, Key, and Value can be thought of, intuitively, in the same way: the query is compared against the key to work out a set of weights, which play the role of m, and those weights are then applied to the value, which plays the role of x. Roughly, output = weights(Query, Key) * Value. For example, in a transformer encoder for language models, the query, key, and value are all derived from the same input sentence, because you are trying to see how the sentence relates to itself. This is why it's called self-attention: the query, key, and value all come from the same "self", which is the language input.
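To make the Query, Key, and Value roles concrete, here is a small sketch of scaled dot-product attention, output = softmax(Q K^T / sqrt(d_k)) V, written in plain Python so it runs without any libraries. The input x and the projection matrices w_q, w_k, and w_v are made-up numbers chosen for illustration; in a trained model the projections are learned, and real code would use an array library rather than nested lists.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    """Turn a list of raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    keys_t = [list(col) for col in zip(*keys)]  # transpose K
    scores = matmul(queries, keys_t)            # one score per (query, key) pair
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, values), weights

# Hypothetical input: 3 tokens, 2 features each.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Made-up projection matrices; in a trained model these are learned.
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[0.0, 1.0], [1.0, 0.0]]
w_v = [[2.0, 0.0], [0.0, 3.0]]

# Self-attention: Q, K, and V are all projections of the same input x.
output, weights = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))

for row in weights:
    print([round(w, 2) for w in row])
```

In the y = mx + b picture, the weights matrix is the m: it is computed from the query and key, and each output row is a weighted average of the value rows.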
The power of attention does not stop at self-attention, however. You can take the query from one sequence and the key and value from another, to see how two inputs relate and to extract features across two different sequences. This is usually called cross-attention. Effectively, attention is a way to extract features from sequences. The only problem is that setting up that many channels is costly. Because every position is compared with every other position, the cost grows with the square of the sequence length. This is why new ways to extract features that use or resemble the attention mechanism are being created. It is difficult to do so well, however, because the power of attention comes from the fact that there is always a way for information to flow: without a "channel" set up between two positions, information cannot pass between them. This is why we use attention to set up multiple channels for each word, though in the future we will probably get better at predicting which channels to listen to.
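To get a feel for the cost, here is a back-of-the-envelope sketch that counts how many query-key scores a single attention layer has to compute. The function score_count is a hypothetical helper written for this post, and it ignores everything except the number of pairwise comparisons.

```python
def score_count(num_queries, num_keys, num_heads=1):
    """Number of query-key scores one attention layer computes."""
    return num_queries * num_keys * num_heads

# Self-attention: the sequence is compared with itself, so doubling the
# length quadruples the number of scores.
for n in (1_000, 2_000, 4_000):
    print(n, score_count(n, n))

# Comparing two different sequences: 1,000 queries against a 50-token source.
print(score_count(1_000, 50))
```

Going from 1,000 to 4,000 tokens multiplies the number of scores by 16, and that is before multiplying by the number of heads and layers.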