bhagvan kommadi

BERT Architecture

BERT radically changed the NLP landscape when it was introduced by Google. With BERT, a single model pre-trained on a large unlabeled dataset can be fine-tuned to achieve state-of-the-art results on 11 different NLP tasks. Models such as Google's Transformer-XL, OpenAI's GPT-2, XLNet, ERNIE 2.0, and RoBERTa have since built on or followed BERT.

BERT stands for Bidirectional Encoder Representations from Transformers. BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context.
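
To make "jointly conditioning on both left and right context" concrete, here is a minimal sketch (assuming the Hugging Face `transformers` package with a PyTorch backend is installed) that asks the pre-trained `bert-base-uncased` masked-language model to fill in a hidden token; the prediction depends on the words on both sides of the mask:

```python
from transformers import pipeline

# Load the publicly released BERT Base checkpoint with its masked-LM head.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The [MASK] token can only be recovered by reading both the left and right context.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```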

The BERT architecture is based on the Transformer. The two available variants are listed below (a short sketch for verifying these figures follows the list):

  • BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
  • BERT Large: 24 layers (transformer blocks), 16 attention heads, and 340 million parameters
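
As a rough check on the figures above, the sketch below (again assuming `transformers` and PyTorch) loads both checkpoints and reads the layer, head, and parameter counts from their configurations. Note that `BertModel` excludes the pre-training heads, so the totals come out slightly under the headline 110M/340M numbers:

```python
from transformers import BertModel

for name in ("bert-base-uncased", "bert-large-uncased"):
    model = BertModel.from_pretrained(name)
    config = model.config
    # Count every trainable parameter in the loaded encoder.
    total_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {config.num_hidden_layers} layers, "
          f"{config.num_attention_heads} attention heads, "
          f"{total_params / 1e6:.0f}M parameters")
```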

BERT's Transformer layers are encoder-only blocks.
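
The sketch below is a simplified, self-contained PyTorch version of one such encoder block, using BERT Base dimensions (hidden size 768, 12 heads, feed-forward size 3072). It is not the exact Hugging Face implementation, but it shows the structure: multi-head self-attention without a causal mask, followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12, ff_size=3072, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(hidden_size, num_heads,
                                               dropout=dropout, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden_size, ff_size),
            nn.GELU(),
            nn.Linear(ff_size, hidden_size),
        )
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention: every token attends to every other token (no causal mask),
        # which is what makes the learned representation bidirectional.
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward network applied to each token independently.
        x = self.norm2(x + self.dropout(self.feed_forward(x)))
        return x

# BERT Base stacks 12 of these blocks.
block = EncoderBlock()
tokens = torch.randn(1, 8, 768)   # (batch, sequence length, hidden size)
print(block(tokens).shape)        # torch.Size([1, 8, 768])
```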
