
When it comes to making any machine-learning model, the first step is collecting data. Data is the backbone of the model and the key determining factor in how effective it will be: issues in the data will surface as issues in the model’s output. Large language models (LLMs) are no exception, so running out of high-quality language data could slow the progress of Natural Language Processing (NLP).
It’s important to define exactly what “high-quality” language data is. In reality, there is no clear line between “high” and “low” quality language data, but the general idea is that high-quality language data is “good writing”. Academic writing, Wikipedia articles, peer-reviewed scientific papers, and work from professional writers would all be considered “good writing” and therefore high-quality language data. On the other hand, most posts and comments on social media would be considered “low-quality”. Most developers are looking for high-quality language data, as they want their LLMs to replicate that caliber of writing.
The issue is that developers are forever chasing more and more data in hopes of making their models more accurate. Recent studies show that the size of training datasets is growing by more than 50% per year, and the effect is a looming problem of data scarcity. Some estimates suggest that the available stock of high-quality language data could be exhausted as soon as 2026, with low-quality language data following between 2030 and 2050. Considering how crucial data is to building these models, this poses a serious problem: without available language data, developers won’t be able to train LLMs at all, which could halt the advancement of NLP.
So, what can we do about it?
One potential solution is transforming low-quality language data into high-quality data. It should come as no surprise that there is far more low-quality data than high-quality data, so a way to upgrade low-quality data could be game-changing. Kalyan Veeramachaneni, a research scientist at MIT, proposed a framework along these lines in a recent paper, R&R: Metric-guided Adversarial Sentence Generation. Another potential solution is simply training models on smaller datasets. Of course, it’s not as easy as it sounds, but some researchers argue that more data isn’t necessarily better. Finding ways to make LLMs more data-efficient, so they don’t have to rely on enormous datasets, could defuse the looming issue of data scarcity.
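To make the quality distinction concrete, here is a minimal sketch of how a corpus might be filtered by a crude quality score. This is purely illustrative and is not the R&R method from the paper; the features (word length, vocabulary diversity, sentence length) and the weights and threshold are all assumptions chosen for demonstration.

```python
# Illustrative heuristic quality filter -- NOT the R&R framework.
# All features, weights, and thresholds below are arbitrary assumptions.

def quality_score(text: str) -> float:
    """Score text on simple proxies for 'good writing' (0.0 to 1.0)."""
    words = text.split()
    if not words:
        return 0.0
    # Longer words loosely correlate with a formal register.
    avg_word_len = sum(len(w) for w in words) / len(words)
    # Vocabulary diversity: ratio of unique words to total words.
    type_token_ratio = len({w.lower() for w in words}) / len(words)
    # Very short sentences loosely correlate with informal writing.
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    avg_sent_len = len(words) / max(len(sentences), 1)
    # Weighted combination; weights are arbitrary for illustration.
    return (0.4 * min(avg_word_len / 6, 1.0)
            + 0.3 * type_token_ratio
            + 0.3 * min(avg_sent_len / 20, 1.0))

def filter_corpus(texts, threshold=0.6):
    """Keep only documents scoring above the quality threshold."""
    return [t for t in texts if quality_score(t) >= threshold]
```

In practice, production pipelines use far stronger signals (e.g., classifiers trained to distinguish curated text from random web text), but the shape of the process, score then threshold, is the same.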
Personally, I think that as the need for ever-larger datasets to power LLMs grows, so does the concern about the computing power required to process them. This raises the question: is less more when it comes to the datasets used to train LLMs?
Let me know your thoughts!