
8. Subword tokenization

Explanations and visualisations:

 

Why is text segmentation not trivial?

Figure: data types

Source: Khan Academy

 

Figure: NLP flow

 

The problem of out-of-vocabulary (OOV) words
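
The OOV problem can be illustrated with a minimal sketch. The example below assumes a word-level vocabulary built from a tiny training text; the function name `encode_words` and the `<unk>` placeholder are hypothetical choices for illustration, not part of the original material. Any word that was not seen during training collapses into the same unknown token, so its identity is lost.

```python
def encode_words(text, vocab, unk="<unk>"):
    """Map each whitespace-separated word to itself if known, else to `unk`."""
    return [word if word in vocab else unk for word in text.split()]

# Hypothetical toy training text; the vocabulary is the set of its words.
train_text = "the cat sat on the mat"
vocab = set(train_text.split())

# "dog" never appeared in training, so it is out of vocabulary.
encoded = encode_words("the dog sat", vocab)
print(encoded)  # ['the', '<unk>', 'sat']
```

Splitting words into smaller units (characters or subwords) avoids this loss, because an unseen word can still be composed from known pieces.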

 

Compression algorithms

Figure: subword options
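
As one possible illustration of a compression-based approach, the sketch below implements a simplified form of byte-pair encoding (BPE). The source does not name a specific algorithm, so choosing BPE here is an assumption. The sketch is also simplified: it treats the whole text as one character sequence, whereas practical implementations usually work on word frequencies and do not merge across word boundaries. The function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most frequent adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every left-to-right occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe(text, num_merges):
    """Start from characters and apply up to `num_merges` merges."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges

text = "aaabdaaabac"
tokens, merges = bpe(text, num_merges=1)
print(tokens)  # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
print(merges)  # [('a', 'a')]
```

Each merge adds one symbol to the vocabulary and shortens the token sequence (here from 11 tokens to 9), which is a small-scale view of the trade-off between text size and vocabulary size.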

 

 

Probability models

Figure: data representation

 

The trade-off between data (=text) size and vocabulary size

 

Practical tips