8. Subword tokenization

Explanations and visualisations:

 

Why is text segmentation not trivial?

[Figure: data types]

Source: Khan Academy

 

[Figure: NLP flow]

 

The problem of out-of-vocabulary (OOV) words
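A minimal sketch of the OOV problem, assuming a word-level vocabulary collected from a tiny made-up training text. The function name `encode_words` and the `<unk>` symbol are illustrative choices, not taken from any particular library. Every word not seen in training collapses to the same unknown symbol, so its identity is lost.

```python
# Toy word-level vocabulary built from a tiny "training" text (illustrative data).
train_text = "the cat sat on the mat"
vocab = set(train_text.split())

def encode_words(text, vocab, unk="<unk>"):
    """Keep each whitespace-separated word if it is known, else emit the unknown symbol."""
    return [w if w in vocab else unk for w in text.split()]

# "cats" and "mats" never occurred in training, so both become <unk>.
encoded = encode_words("the cats sat on the mats", vocab)
print(encoded)
```

With subword units, "cats" could instead be represented as known pieces such as "cat" + "s".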

 

The space of possible subword splits

For a word of length 6:

[Figure: possible splits]
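One way to make the size of this space concrete, as a sketch: a word of length n has n - 1 positions between characters, and each position is either a cut or not, which gives 2^(n-1) segmentations into contiguous non-empty pieces (32 for n = 6). The helper `all_splits` and the example word are assumptions made for illustration.

```python
from itertools import combinations

def all_splits(word):
    """Return every way to cut `word` into contiguous, non-empty pieces."""
    n = len(word)
    splits = []
    # Choose any subset of the n - 1 positions between characters as cut points.
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            splits.append([word[a:b] for a, b in zip(bounds, bounds[1:])])
    return splits

splits = all_splits("tokens")   # a word of length 6
print(len(splits))              # 2 ** (6 - 1) = 32
print(splits[:3])
```

The count doubles with every extra character, which is why a tokenizer needs a rule (or a model) for choosing among the splits rather than enumerating them.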

 

 

Compression algorithms

[Figure: subword options]
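As an illustration of the compression idea, here is a minimal sketch of byte-pair-encoding-style training, assuming that this is the kind of algorithm meant here: start from single characters and repeatedly merge the most frequent adjacent pair into a new symbol. The function `learn_bpe` and the word counts are made up for the example; real implementations add end-of-word markers and much faster bookkeeping.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} dictionary."""
    # Every word starts as a sequence of single characters.
    corpus = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair counted wins
        merges.append(best)
        # Rewrite every word, replacing the chosen pair by one merged symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus

# Illustrative word frequencies (not real data).
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe(word_counts, num_merges=4)
print(merges)
print(corpus)
```

Each merge adds one entry to the vocabulary and shortens the encoded text, so the number of merges directly controls the vocabulary size.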

 

 

Probability models

[Figure: data representation]
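A sketch of the probabilistic view, assuming a unigram-style model in which each subword has its own probability and a segmentation is scored by the product of its pieces. The function `best_segmentation` and the scores in `probs` are invented for illustration (they are not normalised and not estimated from data). A Viterbi-style dynamic programme finds the highest-scoring split without enumerating all of them.

```python
import math

def best_segmentation(word, probs):
    """Find the split of `word` that maximises the product of piece probabilities."""
    n = len(word)
    best_score = [-math.inf] * (n + 1)  # best log-probability of word[:end]
    back = [0] * (n + 1)                # start index of the last piece in that best split
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best_score[start] > -math.inf:
                score = best_score[start] + math.log(probs[piece])
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        return None  # the word cannot be built from the known pieces
    # Follow the back-pointers from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = back[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

# Made-up scores: single characters are allowed but unlikely.
probs = {c: 0.01 for c in "unlocked"}
probs.update({"un": 0.1, "lock": 0.05, "ed": 0.1, "locked": 0.001})

segmentation = best_segmentation("unlocked", probs)
print(segmentation)
```

Unlike the merge-based approach, the same word can receive different segmentations if the probabilities change, because the choice is made by scoring whole splits.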

 

Training vs. applying a tokenizer
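The distinction can be sketched with two separate functions. This is a deliberately simple stand-in, not a specific published algorithm: "training" looks at a corpus once and fixes a vocabulary (here, all characters plus the most frequent longer substrings), and "applying" segments new text with that fixed vocabulary using greedy longest match. The names `train_vocab` and `apply_vocab`, the parameters `max_len` and `size`, and the word list are all assumptions for the example.

```python
from collections import Counter

def train_vocab(words, max_len=4, size=12):
    """Training: build the vocabulary once from a corpus."""
    counts = Counter()
    chars = set()
    for w in words:
        chars.update(w)
        # Count every substring of length 2..max_len.
        for i in range(len(w)):
            for j in range(i + 2, min(i + max_len, len(w)) + 1):
                counts[w[i:j]] += 1
    frequent = [s for s, _ in counts.most_common(size)]
    return chars | set(frequent)

def apply_vocab(word, vocab, unk="<unk>"):
    """Applying: segment a word with a fixed vocabulary (greedy longest match)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here, not even a single character.
            pieces.append(unk)
            i += 1
    return pieces

vocab = train_vocab(["walking", "talking", "walked", "talked", "jumping"])

# "jumped" was not in the training words, but its pieces are known.
print(apply_vocab("jumped", vocab))
```

Applying never changes the vocabulary: whatever was fixed at training time is all the tokenizer can use later.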

 

The trade-off between data (=text) size and vocabulary size
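A small sketch of the two extremes, assuming an artificial word list generated from six stems and five endings (purely illustrative, not real text). Character-level tokenization gives a small vocabulary but a long token sequence; word-level tokenization gives a short sequence but a larger vocabulary. Subword tokenization sits between the two.

```python
# Artificial corpus: every stem combined with every ending (illustrative only).
stems = ["walk", "talk", "jump", "play", "look", "open"]
endings = ["", "s", "ed", "ing", "er"]
words = [stem + ending for stem in stems for ending in endings]

def stats(tokens):
    """Return (vocabulary size, text length in tokens)."""
    return len(set(tokens)), len(tokens)

char_tokens = [c for w in words for c in w]  # one token per character
word_tokens = list(words)                    # one token per word

char_vocab, char_len = stats(char_tokens)
word_vocab, word_len = stats(word_tokens)

print("characters:", char_vocab, "types,", char_len, "tokens")
print("words:     ", word_vocab, "types,", word_len, "tokens")
```

On real text the gap is far larger: the character inventory stays small, while the number of distinct words keeps growing as more text is added.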

 

Practical issues