NLP Pipeline – Part 2: Feature Extraction

In the previous blog post, I covered text preprocessing as the first step in the NLP pipeline. In this post, I will cover the next step: feature extraction. Feature extraction, a form of data manipulation, is the last step before building the actual NLP model. It aims to convert each data point into a numerical representation the model can understand. There are multiple approaches to feature extraction, the most popular being BoW (Bag of Words), TF-IDF (Term Frequency – Inverse Document Frequency), and Word Embeddings. Each strategy produces a different output and changes the overall outcome of the model. In this post, I will cover these three approaches.

1. BoW

Bag of Words (BoW) is the simplest approach. It treats the data as a “bag” of words, discarding grammar and word order and focusing solely on the frequency of words in the text. First, it goes through the text and builds a list of unique words, known as the vocabulary. Then, it counts the number of times each vocabulary word appears in the data. For example, applying BoW to the preprocessed text sample from Part 1 would output:
 
Vocabulary:
 [“natural”, “language”, “process”, “nlp”, “interdisciplinary”, “subfield”, “linguist”, “computer”, “science”, “artificial”, “intelligence”, “concern”, “interaction”, “human”]
 
Final output:
[1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1]
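
To make this concrete, here is a minimal sketch of BoW using scikit-learn’s CountVectorizer. The library choice and toy corpus are my own assumptions; the post doesn’t tie the example to any particular implementation:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the preprocessed text sample from Part 1
corpus = [
    "natural language process nlp interdisciplinary subfield "
    "linguist computer science artificial intelligence "
    "concern interaction computer human language"
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)  # sparse matrix of word counts

# Note: CountVectorizer sorts its vocabulary alphabetically, so the
# column order will differ from the hand-worked list above
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                    # frequency of each vocabulary word
```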
 
The main benefit of BoW is its simplicity and ease of use. However, it fails to capture word order and relationships between words, and it scales poorly with large vocabularies. BoW is a good choice for information retrieval in search engines and for simple text analysis such as document comparison. For more complex tasks, a different strategy is more apt.
 

2. TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) builds on BoW to create a numerical representation of words. On top of raw frequency, TF-IDF factors each word’s importance into the final values. It does this by considering how often each word appears across different documents: if a word shows up in only one document rather than in many, TF-IDF considers that word highly important to that one document. The exact numerical output depends on the specific implementation and parameters, but it is always some combination of frequency and importance.
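
As a rough sketch, one common formulation scores each term t in document d as tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is how many documents contain t. The example below uses scikit-learn’s TfidfVectorizer (my library choice; its default weighting adds smoothing and normalization, so exact values vary by implementation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy documents; "nlp" appears in only one of them,
# so it receives a relatively high weight in that document
docs = [
    "natural language process nlp",
    "computer science artificial intelligence",
    "human computer interaction",
]

vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(docs)

# Each row is a document; higher values mark words that are
# frequent in that document but rare across the collection
print(vectorizer.get_feature_names_out())
print(scores.toarray().round(2))
```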
 
The main benefit of TF-IDF is that while it remains fairly simple, it produces more meaningful output than BoW. However, like BoW, TF-IDF disregards word order and semantic relationships. TF-IDF works best for tasks such as text summarization and keyword extraction.
 

3. Word Embeddings

Word Embeddings are dense vector representations of words in a high-dimensional space, created using deep learning. Unlike BoW and TF-IDF, Word Embeddings aim to capture the relationships between words: if two words have similar meanings, their vectors will be numerically close, placing them near each other in the high-dimensional space. Word Embeddings are typically the go-to strategy for modern language models, as they are the best at capturing writing patterns.
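
As an illustration, here is a minimal sketch of training embeddings with gensim’s Word2Vec. The library, corpus, and parameters are my own assumptions, and a corpus this small would not produce meaningful vectors in practice:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real training
# requires far more data for the vectors to be useful)
sentences = [
    ["natural", "language", "process"],
    ["computer", "science", "artificial", "intelligence"],
    ["human", "computer", "interaction"],
]

# vector_size sets the embedding dimensionality (100 here,
# in line with the 100-200 range mentioned below)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

vector = model.wv["language"]                    # the 100-dimensional vector
similar = model.wv.most_similar("computer", topn=3)  # nearest words in the space
print(vector.shape, similar)
```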
 
One obvious benefit of Word Embeddings over the other approaches is the ability to capture deep relationships between words. In addition, with a large vocabulary, BoW and TF-IDF run into computational and memory issues, since their vector dimensionality depends directly on the vocabulary size, which is often in the tens of thousands. The dimensionality of Word Embedding vectors, by contrast, is typically between 100 and 200. The main disadvantage of Word Embeddings is that, since they are powered by deep learning, they require a large dataset to create those vectors.
 
 
Overall, Word Embeddings are easily the most complex and modern approach. They are part of more recent advancements in deep learning, while BoW and TF-IDF are more traditional. However, there are scenarios where such a complex representation isn’t necessary, and BoW or TF-IDF is more appropriate.
 
In the next post, I will discuss Modeling, one of the most nuanced parts of the NLP pipeline. 
 
 
