In the previous blog post, I covered feature extraction, the second step in the NLP Pipeline. In this post, I will go over modeling, the most complex stage in the NLP process.
The first step in modeling is choosing between a traditional machine learning algorithm and deep learning, a determination that depends largely on the size of the dataset. From there, you must pick the specific algorithm or deep learning architecture for the NLP task.
Traditional Machine Learning Algorithms
Traditional machine learning algorithms are most effective for smaller datasets. They include Naive Bayes, decision trees, random forests, support vector machines (SVMs), k-nearest neighbors, linear regression, logistic regression, principal component analysis, gradient boosting, and linear discriminant analysis. I will focus on Naive Bayes, random forests, and SVMs, as they are the most popular.
Naive Bayes
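Naive Bayes applies Bayes' theorem under the simplifying assumption that features (for text, typically word counts) are independent of one another given the class. Despite that unrealistic assumption, it is fast to train and works well as a baseline for text classification. Below is a minimal sketch; the toy data and the use of scikit-learn's MultinomialNB with CountVectorizer are assumptions for illustration, not a prescribed setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy spam-detection data (hypothetical, for illustration only)
texts = [
    "win a free prize now", "claim your free reward",
    "meeting moved to friday", "lunch at noon tomorrow",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# Bag-of-words counts, as produced in the feature extraction step
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# MultinomialNB estimates per-class word probabilities and combines
# them with Bayes' theorem, treating words as conditionally independent
model = MultinomialNB()
model.fit(X, labels)

print(model.predict(vectorizer.transform(["free prize friday"])))
```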
Random Forest
The random forest is built entirely from decision trees. Decision trees are hierarchical structures that repeatedly split data into subsets to reach a final decision. The splitting process goes as follows: select a feature from the data, set a threshold value for that feature, then create one branch for the data that meets the threshold and another for the data that does not. This process repeats, with the algorithm choosing a different feature to split each branch further, until a stopping criterion such as the maximum depth is reached and a final prediction is made at the leaf. A minimal sketch of this splitting behavior follows.
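To make the splits concrete, here is a small sketch that fits a shallow tree to bag-of-words features and prints the learned thresholds. The toy sentiment data, the CountVectorizer features, and the max_depth value are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy sentiment data (hypothetical, for illustration only)
texts = [
    "I loved this movie", "great acting and story",
    "terrible plot", "I hated the ending",
    "what a great film", "boring and terrible",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

# Bag-of-words counts from the feature extraction step
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# A shallow tree: each internal node tests whether a word count
# exceeds a threshold, creating one branch per outcome
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, labels)

# Show the learned splits (feature thresholds and leaf predictions)
print(export_text(tree, feature_names=list(vectorizer.get_feature_names_out())))
```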
Decision trees are prone to overfitting, an issue where the model performs extremely well on the training data but poorly on unseen data; essentially, the model has memorized the training data rather than learning patterns that generalize. The random forest addresses this by combining many decision trees, each trained on a different random subset of the training data and a different randomly selected subset of features. This creates diversity among the trees in the forest. The final output is decided either by averaging the trees' outputs or by a majority vote. The reduced risk of overfitting is a major benefit of random forests, and they are also less affected by outliers and noise in the data. The main disadvantage is that, because a forest contains many trees, it can be more expensive in both computation and memory.
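In code, moving from a single tree to a forest is a small change. The sketch below reuses the same kind of bag-of-words features; the parameter values (n_estimators=100, max_features="sqrt") are assumptions for illustration, not recommended settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Same toy sentiment data as above (hypothetical)
texts = [
    "I loved this movie", "great acting and story",
    "terrible plot", "I hated the ending",
    "what a great film", "boring and terrible",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# 100 trees, each trained on a bootstrap sample of the rows and
# considering only a random subset of features at each split
forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features="sqrt",  # random feature subset per split
    bootstrap=True,       # sample training rows with replacement
    random_state=0,
)
forest.fit(X, labels)

# Classification: the forest takes a majority vote across its trees
new_docs = vectorizer.transform(["great story", "boring film"])
print(forest.predict(new_docs))
```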