NLP Pipeline – Part 1: Data Collection and Text Preprocessing

Transforming raw text into a natural language processing (NLP) model that can translate between numerous languages or generate paragraphs of human-sounding text involves a series of complex stages known collectively as the NLP pipeline. These stages include data collection, text preprocessing, feature extraction, modeling, and model evaluation.

In this post, I’ll briefly cover data collection and go through the steps involved in text preprocessing.

Data collection is a crucial part of building any machine learning model; without data, there is no model. First, developers have to collect a large amount of text, as modern language models are trained on billions if not trillions of words. They often scrape the internet for data or use existing datasets, and the specific sources they choose depend largely on the application. If the language model is expected to translate between languages, for instance, religious texts may be a good source of data, as they have often already been translated into many different languages. After this, the text must be labeled or annotated; the specific annotations again depend on the purpose of the language model. For example, if the model is meant to perform sentiment analysis, a human would rate how positive or negative a given piece of text is to create a label. This tedious, time-consuming work is often outsourced to poorer countries in Africa, Asia, and South America, where labor costs are significantly lower, though some developers choose to create labels in-house.

After collecting data, it’s necessary to prepare it before using it to train a language model. This preparation is called preprocessing. In NLP, preprocessing includes tokenization, normalization (such as lowercasing and punctuation removal), stop-word removal, and stemming or lemmatization; the cleaned tokens are later converted into numbers through vectorization, which I’ll cover alongside feature extraction.

To better understand what preprocessing does, we can use the following text as a sample to go over each preprocessing step: “Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language.”

1. Tokenization

Tokenization, the first step of preprocessing, is conceptually simple: it breaks the text down into smaller units called tokens. The size of each unit depends on the specific task the language model performs. Here is an overview of what different types of tokenization might output for our earlier text sample.

Word Tokenization: [“Natural”, “language”, “processing”, “(”, “NLP”, “)”, “is”, “an”, “interdisciplinary”, “subfield”, “of”, “linguistics”, “,”, “computer”, “science”, “,”, “and”, “artificial”, “intelligence”, “concerned”, “with”, “the”, “interactions”, “between”, “computers”, “and”, “human”, “language”, “.”]

Subword Tokenization: [“Natur”, “al”, “language”, “process”, “ing”, “(“, “NL”, “P”, “)”, “is”, “an”, “inter”, “discipl”, “inary”, “sub”, “field”, “of”, “lingu”, “istics”, “,”, “computer”, “science”, “,”, “and”, “artificial”, “intell”, “ig”, “ence”, “concern”, “ed”, “with”, “the”, “inter”, “actions”, “between”, “computers”, “and”, “human”, “language”, “.”]

Character Tokenization: [“N”, “a”, “t”, “u”, “r”, “a”, “l”, ” “, “l”, “a”, “n”, “g”, “u”, “a”, “g”, “e”, ” “, “p”, “r”, “o”, “c”, “e”, “s”, “s”, “i”, “n”, “g”, ” “, “(“, “N”, “L”, “P”, “)”, ” “, “i”, “s”, ” “, “a”, “n”, ” “, “i”, “n”, “t”, “e”, “r”, “d”, “i”, “s”, “c”, “i”, “p”, “l”, “i”, “n”, “a”, “r”, “y”, ” “, “s”, “u”, “b”, “f”, “i”, “e”, “l”, “d”, ” “, “o”, “f”, ” “, “l”, “i”, “n”, “g”, “u”, “i”, “s”, “t”, “i”, “c”, “s”, “,”, ” “, “c”, “o”, “m”, “p”, “u”, “t”, “e”, “r”, ” “, “s”, “c”, “i”, “e”, “n”, “c”, “e”, “,”, ” “, “a”, “n”, “d”, ” “, “a”, “r”, “t”, “i”, “f”, “i”, “c”, “i”, “a”, “l”, ” “, “i”, “n”, “t”, “e”, “l”, “l”, “i”, “g”, “e”, “n”, “c”, “e”, ” “, “c”, “o”, “n”, “c”, “e”, “r”, “n”, “e”, “d”, ” “, “w”, “i”, “t”, “h”, ” “, “t”, “h”, “e”, ” “, “i”, “n”, “t”, “e”, “r”, “a”, “c”, “t”, “i”, “o”, “n”, “s”, ” “, “b”, “e”, “t”, “w”, “e”, “e”, “n”, ” “, “c”, “o”, “m”, “p”, “u”, “t”, “e”, “r”, “s”, ” “, “a”, “n”, “d”, ” “, “h”, “u”, “m”, “a”, “n”, ” “, “l”, “a”, “n”, “g”, “u”, “a”, “g”, “e”, “.”]

Moving forward with the other preprocessing steps, we’ll use the word tokenization example, as it is the most commonly used form.
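To make this concrete, here is a minimal sketch of word tokenization using NLTK, a popular Python NLP library (this assumes NLTK is installed via pip and that its tokenizer data has been downloaded; exact token boundaries vary by tokenizer):

```python
# Minimal word-tokenization sketch using NLTK (assumes `pip install nltk`).
import nltk

nltk.download("punkt", quiet=True)  # tokenizer data; newer NLTK versions may need "punkt_tab"

from nltk.tokenize import word_tokenize

text = (
    "Natural language processing (NLP) is an interdisciplinary subfield "
    "of linguistics, computer science, and artificial intelligence concerned "
    "with the interactions between computers and human language."
)

tokens = word_tokenize(text)
print(tokens)
# ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'an', ...]
```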

2. Lowercasing

Lowercasing is a form of normalization, a process that aims to keep every word consistent throughout the text. Lowercasing, as its name suggests, turns every uppercase character in the text into its lowercase form, so that, for example, “The” and “the” are treated as the same word.

Output: [“natural”, “language”, “processing”, “(“, “nlp”, “)”, “is”, “an”, “interdisciplinary”, “subfield”, “of”, “linguistics”, “,”, “computer”, “science”, “,”, “and”, “artificial”, “intelligence”, “concerned”, “with”, “the”, “interactions”, “between”, “computers”, “and”, “human”, “language”, “.”] 
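In Python, this step is a one-liner over the token list (a sketch continuing from the tokens above, shortened for brevity):

```python
# Lowercase every token with str.lower().
tokens = ["Natural", "language", "processing", "(", "NLP", ")", "is", "an"]
tokens = [tok.lower() for tok in tokens]
print(tokens)  # ['natural', 'language', 'processing', '(', 'nlp', ')', 'is', 'an']
```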

3. Punctuation Removal

Punctuation removal, another form of normalization, is pretty straightforward and can occur either before or after lowercasing.

Output: [“natural”, “language”, “processing”, “nlp”, “is”, “an”, “interdisciplinary”, “subfield”, “of”, “linguistics”, “computer”, “science”, “and”, “artificial”, “intelligence”, “concerned”, “with”, “the”, “interactions”, “between”, “computers”, “and”, “human”, “language”]
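One simple way to implement it is to drop any token made up entirely of punctuation characters (a sketch using Python’s built-in string module):

```python
import string

tokens = ["natural", "language", "processing", "(", "nlp", ")", "is", "an"]
# Drop tokens that consist only of punctuation characters.
tokens = [tok for tok in tokens if not all(ch in string.punctuation for ch in tok)]
print(tokens)  # ['natural', 'language', 'processing', 'nlp', 'is', 'an']
```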

4. Stop-word Removal

Stop-words are words like “is”, “the”, and “a” that serve a grammatical function but carry little semantic meaning on their own. For many NLP tasks they add noise rather than signal during training, so it’s common to remove them, along with any spaces or empty slots left over from tokenization and normalization.

Output: [“natural”, “language”, “processing”, “nlp”, “interdisciplinary”, “subfield”, “linguistics”, “computer”, “science”, “artificial”, “intelligence”, “concerned”, “interactions”, “computers”, “human”, “language”]
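NLTK ships ready-made stop-word lists for many languages, so this step reduces to a simple filter (a sketch assuming the stopwords corpus has been downloaded):

```python
import nltk

nltk.download("stopwords", quiet=True)  # NLTK's built-in stop-word lists

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["natural", "language", "processing", "nlp", "is", "an",
          "interdisciplinary", "subfield", "of", "linguistics"]
# Keep non-empty tokens that are not stop-words.
tokens = [tok for tok in tokens if tok and tok not in stop_words]
print(tokens)
# ['natural', 'language', 'processing', 'nlp', 'interdisciplinary', 'subfield', 'linguistics']
```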

5. Stemming/Lemmatization

Stemming reduces each word to its base root so that different tenses and inflected forms of the same word map to a single form the model can recognize.

Output: [“natur”, “languag”, “process”, “nlp”, “interdisciplinari”, “subfield”, “linguist”, “comput”, “scienc”, “artifici”, “intellig”, “concern”, “interact”, “comput”, “human”, “languag”]

As you can see, stemming often doesn’t produce real words. Lemmatization is an alternative that instead maps each word to its dictionary form, known as its lemma.

Output: [“natural”, “language”, “process”, “nlp”, “interdisciplinary”, “subfield”, “linguist”, “computer”, “science”, “artificial”, “intelligence”, “concern”, “interaction”, “computer”, “human”, “language”]
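Both techniques are available in NLTK. Here is a short sketch contrasting the Porter stemmer with the WordNet lemmatizer; note that the lemmatizer treats words as nouns by default, so verb forms like “processing” only reduce to “process” when you pass pos="v":

```python
import nltk

nltk.download("wordnet", quiet=True)  # dictionary data used by the lemmatizer

from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = ["natural", "language", "processing", "interactions", "computers"]

stemmer = PorterStemmer()
print([stemmer.stem(tok) for tok in tokens])
# ['natur', 'languag', 'process', 'interact', 'comput']

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(tok) for tok in tokens])  # noun part of speech by default
# ['natural', 'language', 'processing', 'interaction', 'computer']

print(lemmatizer.lemmatize("processing", pos="v"))  # 'process'
```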

Once preprocessing is completed, the next stage in the NLP pipeline is Feature Extraction. In the next post, I will go over feature extraction steps, including vocabulary creation, the nuances associated with different vectorization approaches, and feature engineering.

 
