In the previous blog post, I covered feature extraction, the second step in the NLP Pipeline. In this post, I will go over modeling, the most complex stage in the NLP process.
The first step in modeling is choosing between a traditional machine learning algorithm and deep learning, a determination that depends largely on the size of the dataset. From there, you must pick the specific algorithm or deep learning architecture for the NLP task.
Traditional Machine Learning Algorithms
Traditional machine learning algorithms are most effective for smaller datasets. They include Naive Bayes, decision trees, random forests, support vector machines (SVMs), k-nearest neighbors, linear regression, logistic regression, principal component analysis, gradient boosting, and linear discriminant analysis. I will focus on Naive Bayes, random forests, and SVMs, as they are the most popular.
Naive Bayes
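Naive Bayes applies Bayes' theorem under the simplifying assumption that features (for text, typically word counts) are independent of one another given the class. Despite that unrealistic assumption, it is fast to train and works well as a baseline for text classification. Below is a minimal sketch; the toy data and the use of scikit-learn's MultinomialNB with CountVectorizer are assumptions for illustration, not a prescribed setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy spam-detection data (hypothetical, for illustration only)
texts = [
    "win a free prize now", "claim your free reward",
    "meeting moved to friday", "lunch at noon tomorrow",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# Bag-of-words counts, as produced in the feature extraction step
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# MultinomialNB estimates per-class word probabilities and combines
# them with Bayes' theorem, treating words as conditionally independent
model = MultinomialNB()
model.fit(X, labels)

print(model.predict(vectorizer.transform(["free prize friday"])))
```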
Random Forest
The random forest is built entirely from decision trees. Decision trees are hierarchical structures that repeatedly split data into subsets to reach a final decision. The splitting process goes as follows: select a feature from the data, set a threshold value for that feature, then create one branch for the data that meets the threshold and another for the data that does not. This process repeats, with the algorithm choosing a different feature to split each branch further, until a stopping criterion such as the maximum depth is reached and a final prediction is made at the leaf. A minimal sketch of this splitting behavior follows.
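To make the splits concrete, here is a small sketch that fits a shallow tree to bag-of-words features and prints the learned thresholds. The toy sentiment data, the CountVectorizer features, and the max_depth value are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy sentiment data (hypothetical, for illustration only)
texts = [
    "I loved this movie", "great acting and story",
    "terrible plot", "I hated the ending",
    "what a great film", "boring and terrible",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

# Bag-of-words counts from the feature extraction step
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# A shallow tree: each internal node tests whether a word count
# exceeds a threshold, creating one branch per outcome
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, labels)

# Show the learned splits (feature thresholds and leaf predictions)
print(export_text(tree, feature_names=list(vectorizer.get_feature_names_out())))
```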
Decision trees are prone to overfitting, an issue where the model performs extremely well on the training data but poorly on unseen data; essentially, the model has memorized the training data rather than learning patterns that generalize. The random forest addresses this by combining many decision trees, each trained on a different random subset of the training data and a different randomly selected subset of features. This creates diversity among the trees in the forest. The final output is decided either by averaging the trees' outputs or by a majority vote. The reduced risk of overfitting is a major benefit of random forests, and they are also less affected by outliers and noise in the data. The main disadvantage is that, because a forest contains many trees, it can be more expensive in both computation and memory.
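In code, moving from a single tree to a forest is a small change. The sketch below reuses the same kind of bag-of-words features; the parameter values (n_estimators=100, max_features="sqrt") are assumptions for illustration, not recommended settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Same toy sentiment data as above (hypothetical)
texts = [
    "I loved this movie", "great acting and story",
    "terrible plot", "I hated the ending",
    "what a great film", "boring and terrible",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# 100 trees, each trained on a bootstrap sample of the rows and
# considering only a random subset of features at each split
forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features="sqrt",  # random feature subset per split
    bootstrap=True,       # sample training rows with replacement
    random_state=0,
)
forest.fit(X, labels)

# Classification: the forest takes a majority vote across its trees
new_docs = vectorizer.transform(["great story", "boring film"])
print(forest.predict(new_docs))
```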