
3. Evaluation, data splits

Explanations and formulas:


1. Data splits




2. Comparing two sets of labels (confusion matrix)
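
As a minimal sketch of this comparison, assuming labels are plain strings and that precision, recall and F1 are the scores read off the matrix (the function names `confusion_matrix` and `precision_recall_f1` and the toy labels are hypothetical, chosen only for illustration):

```python
from collections import Counter

def confusion_matrix(reference, predicted):
    """Count how often each (true label, predicted label) pair occurs."""
    if len(reference) != len(predicted):
        raise ValueError("label lists must have the same length")
    return Counter(zip(reference, predicted))

def precision_recall_f1(matrix, label):
    """Per-class scores derived from the confusion matrix counts."""
    tp = matrix[(label, label)]
    fp = sum(n for (true, pred), n in matrix.items() if pred == label and true != label)
    fn = sum(n for (true, pred), n in matrix.items() if true == label and pred != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy data (hypothetical): order does not matter here, only the paired counts.
reference = ["pos", "pos", "neg", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "neg", "pos", "pos"]

matrix = confusion_matrix(reference, predicted)
precision, recall, f1 = precision_recall_f1(matrix, "pos")
accuracy = sum(n for (true, pred), n in matrix.items() if true == pred) / len(reference)
```

The diagonal cells of the matrix (true label equals predicted label) hold the correct decisions; all other cells are errors of a specific kind.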


3. Comparing two sequences of labels

Here we have to take the order of the labels into account. The true sequence of labels is called the reference.
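
One common way to compare a predicted sequence with the reference while respecting order is the edit (Levenshtein) distance. The sketch below assumes this is the kind of measure meant here; the function name `edit_distance` and the example tags are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of substitutions, insertions and deletions
    needed to turn the hypothesis sequence into the reference."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_label in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_label in enumerate(hypothesis, start=1):
            cost = 0 if ref_label == hyp_label else 1
            curr.append(min(prev[j] + 1,          # reference label missing from hypothesis
                            curr[j - 1] + 1,      # extra label in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

# Toy example (hypothetical): the hypothesis is missing the VERB label.
reference = ["DET", "NOUN", "VERB", "DET", "NOUN"]
hypothesis = ["DET", "NOUN", "DET", "NOUN"]

distance = edit_distance(reference, hypothesis)
error_rate = distance / len(reference)  # errors per reference label

# Same labels, different order: as sets they are identical, as sequences they differ.
swapped = edit_distance(["A", "B"], ["B", "A"])
```

Dividing the distance by the length of the reference gives an error rate that is comparable across sequences of different lengths.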




4. Comparing a sequence of labels to a model

We can evaluate a sequence against a model. For this, we rely on quantities from information theory, specifically entropy, cross-entropy and perplexity.

These quantities are used as measures of the quality of generated text, but also (and much more often!) as loss functions in training.


E.g. high vs. low entropy:
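
A small sketch of the contrast, assuming entropy is measured in bits (log base 2) and using two made-up distributions over four outcomes:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: H(p) = -sum_x p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical distributions over four possible next labels.
uniform = [0.25, 0.25, 0.25, 0.25]  # every outcome equally likely: maximal uncertainty
peaked = [0.97, 0.01, 0.01, 0.01]   # one outcome almost certain: little uncertainty

high = entropy(uniform)  # 2 bits, the maximum for four outcomes
low = entropy(peaked)    # well below 1 bit
```

The more evenly the probability mass is spread, the higher the entropy; a distribution that puts all its mass on one outcome has entropy 0.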



Example: log likelihood, cross-entropy, perplexity
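
The sketch below works through the three quantities for one short sequence. It assumes base-2 logarithms, a per-token average for cross-entropy, and hypothetical model probabilities (`token_probs`) for the tokens of the sequence; it is an illustration, not the formulas from the original slides.

```python
import math

def log_likelihood(token_probs):
    """Sum of log2 probabilities the model assigns to each token."""
    return sum(math.log2(p) for p in token_probs)

def cross_entropy(token_probs):
    """Average negative log2 probability per token, in bits."""
    return -log_likelihood(token_probs) / len(token_probs)

def perplexity(token_probs):
    """2 raised to the cross-entropy: the model's effective branching factor."""
    return 2 ** cross_entropy(token_probs)

# Hypothetical probabilities a model assigns to the four tokens of a sequence.
token_probs = [0.5, 0.25, 0.25, 0.5]

ll = log_likelihood(token_probs)  # -1 - 2 - 2 - 1 = -6
ce = cross_entropy(token_probs)   # 6 / 4 = 1.5 bits per token
ppl = perplexity(token_probs)     # 2 ** 1.5, roughly 2.83
```

A better model assigns higher probabilities to the observed tokens, which raises the log likelihood and lowers both cross-entropy and perplexity.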



5. Loss vs. performance

All these comparisons give us a measure of our model's error, that is, of how well our model corresponds to the real (true) phenomenon that we are modelling. We measure the error in different contexts and for different purposes:

  1. On the train set -> loss: the training error, used for setting model parameters (weights).
  2. On the dev set -> no standard term, but it can be thought of as interim performance (sometimes called validation): the error measured for setting hyperparameters, e.g. the weight of a component in a processing pipeline, the learning rate for weight updates, the training duration, etc.
  3. On the test set -> performance. Importantly, this is only an estimate of the performance!

Only the last point is evaluation.
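
The three sets can be produced by shuffling the data once and cutting it into disjoint parts. This is a minimal sketch: the function name `split_data`, the 80/10/10 proportions and the fixed seed are assumptions for illustration, not a prescribed recipe.

```python
import random

def split_data(items, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into disjoint train/dev/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]                # touched only once, for the final estimate
    dev = shuffled[n_test:n_test + n_dev]   # used for setting hyperparameters
    train = shuffled[n_test + n_dev:]       # used for setting model parameters
    return train, dev, test

# 100 hypothetical item ids.
train, dev, test = split_data(range(100))
```

Fixing the seed makes the split reproducible, so that scores obtained in different runs refer to the same test items.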


6. The baseline

When evaluating an NLP system, we want to know whether it performs better than another system. There is no point in reporting scores without a comparison. If no other system exists, we compare our system to a simple solution that does not involve learning. This simple solution is called the baseline. An example is the majority class baseline: all test items are put in a single class, the one most frequently seen in the training set.
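
The majority class baseline described above can be sketched in a few lines. The function name `majority_class_baseline` and the toy labels are hypothetical, and accuracy is assumed as the reported score.

```python
from collections import Counter

def majority_class_baseline(train_labels, test_labels):
    """Predict the most frequent training label for every test item
    and return that label together with the resulting accuracy."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    predictions = [majority_label] * len(test_labels)
    correct = sum(pred == gold for pred, gold in zip(predictions, test_labels))
    return majority_label, correct / len(test_labels)

# Toy labels (hypothetical).
train_labels = ["neg", "neg", "neg", "pos", "pos"]
test_labels = ["neg", "pos", "neg", "neg"]

label, baseline_accuracy = majority_class_baseline(train_labels, test_labels)
```

Note that the majority label is taken from the training set, not from the test set; a system is only worth reporting if its test score is clearly above this number.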


Common mistakes in evaluation