Phrase:
“I wouldn’t like to go to the gym today because I am sick”
might be split into tokens (depending on the model) like:
["I", "would", "n", "'", "t", "like", "to", "go", "to", "the", "gym", "today", "because", "I", "am", "sick"]
Why tokenization?
Tokens are more meaningful than single characters.
There are fewer unique tokens than unique words, because common subwords are shared across many words. The “ing” token, for example, is quite common in English. This makes the model more efficient.
Tokens help with unknown words: a made-up word like “bananing” can still be built from familiar subword tokens such as “banan” and “ing” (see the sketch below).
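To make the unknown-word point concrete, here is a toy greedy longest-match subword tokenizer over a hypothetical hand-picked vocabulary. Real tokenizers (BPE, WordPiece) learn their vocabularies from data, but the fallback behaviour is the same idea.

```python
# Toy subword tokenizer: greedy longest-prefix match against a hypothetical vocabulary.
VOCAB = {"banana", "banan", "ing", "b", "a", "n", "i", "g"}  # hand-picked for illustration

def subword_tokenize(word: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible vocabulary entry starting at position i.
        for end in range(len(word), i, -1):
            piece = word[i:end]
            if piece in VOCAB:
                tokens.append(piece)
                i = end
                break
        else:
            # Unreachable here because every single character is in VOCAB;
            # real tokenizers fall back to bytes or an <unk> token instead.
            tokens.append("<unk>")
            i += 1
    return tokens

print(subword_tokenize("bananing"))  # ['banan', 'ing'] — an unseen word built from known pieces
```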
Vocabulary
The set of all tokens a model can work with is the model’s vocabulary.
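Continuing the earlier sketch (again assuming the transformers library), the vocabulary is just a finite mapping from token strings to integer ids, and it can be inspected directly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

vocab = tokenizer.get_vocab()  # dict mapping each token string to its integer id
print(len(vocab))              # 50257 for GPT-2
print(vocab.get("ing"))        # id of the "ing" token, if present in this vocabulary
```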
Sources:
AI Engineering by Chip Huyen (O’Reilly)