In the previous four blog posts, I covered the most important steps of the NLP pipeline: data collection, text preprocessing, feature extraction, and modeling. In this blog post, I will go over the last, and most important, stage: model evaluation. Model evaluation is vital to understanding how effective a model is before it is deployed.
There are numerous methods to evaluate a model’s performance. Some techniques depend largely on the specific task at hand. For example, the standard for machine translation is BLEU (BiLingual Evaluation Understudy). BLEU essentially evaluates how closely a machine-translated piece of text resembles a professional human translation. Similarly, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) evaluates how similar a machine summarization of text is to a human summarization. Both metrics work by counting overlapping n-grams (contiguous sequences of words or characters) between two pieces of text. The more effective the model, the more overlap the algorithm finds.
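To make the n-gram overlap idea concrete, here is a minimal sketch using NLTK’s BLEU implementation; the reference and candidate sentences are invented purely for illustration.

```python
# A minimal sketch of n-gram overlap scoring with NLTK's sentence-level BLEU.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat sat on the mat".split()]        # human translation(s), tokenized
candidate = "the cat is sitting on the mat".split()   # machine translation, tokenized

# BLEU counts how many 1- to 4-grams in the candidate also appear in the reference;
# smoothing avoids a zero score when some n-gram orders have no overlap at all.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU score: {score:.3f}")
```

The more of the candidate’s n-grams show up in the reference, the closer the score gets to 1.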
However, there are also more general evaluation metrics. The most fundamental of these are loss and accuracy. Accuracy is fairly straightforward: it is the fraction of predictions the model gets right. Loss, on the other hand, measures how far the model’s predictions are from the true answers. On the surface, it appears that it is always ideal for a model to have high accuracy and low loss; however, it is much more nuanced than that. Most of this nuance comes from which dataset these two metrics are being calculated on. In general, there are three datasets throughout the whole process: training, validation, and testing.
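As a quick illustration, here is a small sketch that computes accuracy and binary cross-entropy loss by hand with NumPy; the labels and predicted probabilities are made-up numbers, not from a real model.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])               # true class labels
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1])     # model's predicted P(class = 1)

# Accuracy: fraction of thresholded predictions that match the labels.
y_pred = (y_prob >= 0.5).astype(int)
accuracy = (y_pred == y_true).mean()

# Binary cross-entropy loss: heavily penalizes confident predictions that are wrong.
loss = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(f"accuracy = {accuracy:.2f}, loss = {loss:.3f}")
```

Note that accuracy only sees whether each prediction landed on the right side of the threshold, while loss also rewards the model for being confident when it is correct.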
The training dataset is the dataset that the model learns from. The model iterates over this dataset for a certain number of rounds, technically referred to as epochs. You cannot take the loss and accuracy on this set at face value. This is due to overfitting, a common issue where a model has essentially memorized the training data. The telltale sign of overfitting is very low loss and very high accuracy on the training data. While those numbers look encouraging, they can be deceptive: they suggest the model is performing fantastically when in reality it is simply biased toward the data in the training set and will perform very poorly on new data. From the training metrics alone, it is impossible to know whether the model is overfitted or truly performing well, so you must refer to the loss and accuracy that come from the validation set.
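Here is a hedged sketch of how a gap between training and validation accuracy exposes overfitting; the synthetic dataset, model choice, and split sizes are arbitrary and only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for real features and labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained decision tree can memorize the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)   # often near 1.0: looks fantastic...
val_acc = model.score(X_val, y_val)         # ...but the validation score tells the real story
print(f"train accuracy = {train_acc:.2f}, validation accuracy = {val_acc:.2f}")
```

A large gap between the two numbers is exactly the deceptive situation described above.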
The validation set is the dataset used to fine-tune the model. After checking the loss and accuracy on the validation set, you can go back, change some hyperparameters, train again, and evaluate the validation loss and accuracy with these new hyperparameters. This process is repeated until you are satisfied with the result. It is important that the validation set is independent of the training set and not too closely related to it. Otherwise, the hyperparameter tuning will have the same effect as overfitting, and the testing data may reveal poor results.
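The loop described above might look roughly like the following sketch, which uses a three-way split, tries a few candidate hyperparameter values, picks the one with the best validation accuracy, and only then touches the test set; the model, grid, and data are illustrative assumptions, not a prescription.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data split into train / validation / test.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_depth, best_val_acc = None, 0.0
for depth in [2, 4, 8, 16]:                      # candidate hyperparameter values
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)          # evaluate each candidate on the validation set
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

# The test set is touched exactly once, with the chosen hyperparameters.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print(f"best max_depth = {best_depth}, test accuracy = {final_model.score(X_test, y_test):.2f}")
```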
The testing data is the final set of data that the model will see. It is the ultimate test of how effective the model is. It is absolutely imperative that the testing data is diverse and completely unseen by the model. It must be brand new to the model to ensure unbiased and accurate results. Assuming these prerequisites are met, very low loss and very high accuracy indicate a robust model that will perform well.
All in all, there are specific algorithms such as BLEU and ROUGE designed to evaluate certain NLP tasks. However, general metrics such as loss and accuracy are also important to consider, even if they come with more nuances, such as requiring multiple datasets in order to eliminate bias.