January 18, 2025

Out-of-Vocabulary (OOV)

A language model will inevitably encounter words that never appeared in its training data, yet it still has to process them. How exactly does it handle them?

Unknown Word Tokens

Some models use a special token, for example <unk>, to represent OOV words. Every word outside the vocabulary is mapped to this one token, so the model knows that it is facing an unknown word, but not which word it was.
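To make this concrete, here is a minimal sketch of a word-level lookup that falls back to <unk>. The vocabulary, the ids, and the function name `encode` are made up for illustration and do not come from any particular model.

```python
# Toy word-level vocabulary; the words and ids are illustrative assumptions.
UNK = "<unk>"
vocab = {UNK: 0, "the": 1, "cat": 2, "sat": 3}

def encode(sentence, vocab):
    """Map each whitespace-separated word to its id, falling back to <unk>."""
    unk_id = vocab[UNK]
    return [vocab.get(word, unk_id) for word in sentence.lower().split()]

ids = encode("The cat sat quietly", vocab)
print(ids)  # [1, 2, 3, 0]  ("quietly" is not in the vocabulary)
```

Note that any two unknown words end up with the same id, which is exactly the information loss that subword tokenization tries to avoid.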

Subword Tokenization

Many modern language models break words into smaller units called subwords, using algorithms such as Byte Pair Encoding (BPE) or WordPiece. This lets a model represent a previously unseen word as a sequence of known subwords and infer its meaning from those pieces.
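As a rough illustration of how an unseen word can be covered by known pieces, the sketch below splits a word by greedy longest-match against a toy subword set. Both the subword set and the greedy strategy are assumptions chosen for simplicity; real tokenizers also mark word-internal pieces (for example with a prefix), which is left out here.

```python
def split_into_subwords(word, subwords, unk="<unk>"):
    """Greedy longest-match split; returns [unk] if the word cannot be covered."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a known subword.
        while end > start and word[start:end] not in subwords:
            end -= 1
        if end == start:
            # No known subword starts at this position.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical subword vocabulary for the example.
subwords = {"un", "break", "able", "token", "ize", "r", "s"}

print(split_into_subwords("unbreakable", subwords))  # ['un', 'break', 'able']
print(split_into_subwords("tokenizers", subwords))   # ['token', 'ize', 'r', 's']
print(split_into_subwords("xyz", subwords))          # ['<unk>']
```

Neither "unbreakable" nor "tokenizers" is in the set as a whole word, yet both can be expressed with pieces the model already knows.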

Methods of Subword Tokenization

Subword tokenization looks awesome, but how is a sentence actually split into subwords?

Byte Pair Encoding

This method first registers every character contained in the given text as a subword in a dictionary. It then repeatedly merges the most frequent pair of adjacent subwords and registers the result as a new subword, until the dictionary reaches the desired size.
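The merge loop can be sketched in a few lines. This is a simplified version under some assumptions: the text is already split into words, merges never cross word boundaries, and the stopping rule is a fixed number of merges. The names `learn_bpe` and `merge_pair` are my own.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(words, num_merges):
    """Return the learned subword set and the ordered list of merges."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    vocab = {ch for symbols in corpus for ch in symbols}
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Apply the merge everywhere before counting again.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return vocab, merges

vocab, merges = learn_bpe(["hug", "hug", "hug", "pug", "hugs"], num_merges=2)
print(merges)  # [('u', 'g'), ('h', 'ug')]
```

In this toy corpus the pair ("u", "g") occurs 5 times, so it is merged first; after that, ("h", "ug") is the most frequent pair and "hug" becomes a subword of its own.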

WordPiece

It is similar to BPE, but instead of always merging the most frequent pair, it chooses the merge that most increases the likelihood of the training data.
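One commonly described way to approximate this criterion is to score each pair by its count divided by the product of the counts of its two parts, so that a pair made of rare parts can beat a pair that is merely frequent. The sketch below assumes that formulation; it is not taken from an official WordPiece implementation, and it omits the prefix that real WordPiece vocabularies put on word-internal pieces. The function name `score_pairs` and the toy corpus are illustrative.

```python
from collections import Counter

def score_pairs(corpus):
    """corpus maps a tuple of symbols to its frequency; returns (pair_counts, scores)."""
    symbol_counts = Counter()
    pair_counts = Counter()
    for symbols, freq in corpus.items():
        for symbol in symbols:
            symbol_counts[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
    # Assumed score: count(pair) / (count(first) * count(second)).
    scores = {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }
    return pair_counts, scores

# Toy corpus: word (as characters) -> frequency.
corpus = {
    tuple("hug"): 10,
    tuple("pug"): 5,
    tuple("pun"): 12,
    tuple("bun"): 4,
    tuple("hugs"): 5,
}

pair_counts, scores = score_pairs(corpus)
print(max(pair_counts, key=pair_counts.get))  # ('u', 'g')  what BPE would merge
print(max(scores, key=scores.get))            # ('g', 's')  what this score prefers
```

Here ("u", "g") is the most frequent pair, but "u" and "g" are common on their own, so its score is low. The pair ("g", "s") is rarer, yet "s" almost never appears without "g", and that gives it the highest score.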