BERT is a very recent introduction that has provided us the-state-of-art results in many NLP tasks Published by researchers at Google AI language; its main outline is creating a bi-directional training of a transformer. When trained bi-directionally, the model has a deeper sense of the context of the language.
What is the necessity of BERT?
The most challenging aspect of any NLP problem is the deficit of training data. Despite the availability of enormous training data in general if we want to categorize the training set into task-specific datasets, we need to segregate them into many sub-categories. After this segregation, we end up with only a few hundred thousand odd human-labeled training datasets. Unfortunately, the Deep Learning based NLP tasks; require a much higher amount and can only have a significant accuracy or represent improvement only when trained with an enormous dataset.
To overcome this shortcoming, researchers and scholars have come up with techniques for general-purpose language representation with the help of unannotated text available on the web (pre-training). These models can then be fine-tuned for smaller task-specific datasets. This method significantly increases the accuracy when compared to the model trained on a small dataset for task-specific from the scratch, thus drastically reducing the training data that we usually require.
Why is BERT so powerful?
The BERT rather than predicting the next word in the sequence masks a finite percent of words within the sequence randomly and tries to predict them. This is termed Masked LM (MLM). The reason why the BERT model is powerful is because instead of predicting the subsequent word in the sequence, the model accounts for all words in the sequence and comes up with a deeper understanding of the context. To explain it more clearly, the usage of words in any language for that matter depends entirely on the context:
Let’s say we have two sentences: he is an avid book reader, and he went to book movie tickets along with his friend. In the two sentences, the word book incorporates a different meaning based on the context in which the word is used. BERT analyses and learns this contextual meaning to differentiate between the same words used in different locations.
Hope this was helpful.