Since the seminal paper “Attention Is All You Need” of Vaswani et al, transformer models have become by far the state of the art in NLP technology. With applications ranging from NER, text classification, question answering, or text generation, the applications of this amazing technology are limitless.
More specifically, BERT — which stands for Bidirectional Encoder Representations from Transformers — leverages the transformer architecture in a novel way. For example, BERT analyses both sides of the sentence with a randomly masked word to make a prediction. In addition to predicting the masked token, BERT predicts the sequence of the sentences by adding a classification token [CLS] at the beginning of the first sentence and tries to predict if the second sentence follows the first one by adding a separation token [SEP] between the two sentences.