Why are there so many tokenization methods in Transformers?

James Briggs

HuggingFace's transformers library is the de facto standard for NLP - used by practitioners worldwide, it's powerful, flexible, and easy to use. It achieves this through a fairly large (and complex) codebase, which has resulted in the question:

"Why are there so many tokenization methods in HuggingFace transformers?"

Tokenization is the process of encoding a string of text into transformer-readable token ID integers. In this video we cover five different methods for doing this - do they all produce the same output, or is there a difference between them?
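As a rough sketch of what those different tokenization calls look like in practice (the checkpoint and example text here are illustrative choices, not taken from the video):

```python
from transformers import AutoTokenizer

# Load a pretrained tokenizer (BERT is just an illustrative choice).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "hello world!"

# 1. Calling the tokenizer directly returns input IDs plus attention mask.
print(tokenizer(text)["input_ids"])

# 2. encode() returns just the list of token IDs.
print(tokenizer.encode(text))

# 3. encode_plus() returns the full dict of encodings.
print(tokenizer.encode_plus(text)["input_ids"])

# 4. batch_encode_plus() handles a list of strings at once.
print(tokenizer.batch_encode_plus([text])["input_ids"])

# 5. tokenize() followed by convert_tokens_to_ids(); note that this path
#    does NOT add special tokens like [CLS]/[SEP], so its output differs.
tokens = tokenizer.tokenize(text)
print(tokenizer.convert_tokens_to_ids(tokens))
```

Methods 1-4 produce the same token IDs for a single string (with special tokens added), while the two-step route in method 5 skips the special tokens - one concrete way the outputs can differ.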

📙 Check out the Medium article, or if you don't have a Medium membership, here's a free access link.

I also made an NLP with Transformers course; here's 70% off if you're interested!
