We could run out of data to train AI language programs
The thing is, the types of data commonly used to train language models could be used up in the near future, possibly as early as 2026, according to a not-yet-peer-reviewed paper by researchers at Epoch, an AI research and forecasting organization. The problem stems from the fact that, as researchers build more powerful models with greater capabilities, they have to find more texts to train them. Teven Le Scao, a researcher at AI firm Hugging Face who was not involved in Epoch’s work, said large language model researchers are increasingly concerned that they will run out of this kind of data.
The problem partly stems from the fact that AI language researchers filter the data they use to train models into two categories: high-quality and low-quality. Pablo Villalobos, a staff researcher at Epoch and lead author of the paper, said the line between the two types can be blurry, but text in the former category is considered better written and is often produced by professional writers.
Data in the low-quality category includes texts such as social media posts or comments on sites like 4chan, and it far exceeds the amount of data considered high-quality. Researchers usually train models only on high-quality data because that’s the kind of language they want the models to reproduce. This approach has yielded impressive results for large language models such as GPT-3.
According to Swabha Swayamdipta, a University of Southern California machine learning professor who specializes in data set quality, one way to overcome these data limitations is to reevaluate what is defined as “low” and “high” quality. If the lack of data pushes AI researchers to incorporate more diverse datasets into the training process, that would be a “net positive” for language models, says Swayamdipta.
Researchers can also find ways to extend the life of the data used to train language models. Currently, large language models are trained on the same data only once, due to performance and cost constraints. But it is possible to train a model multiple times on the same data, says Swayamdipta.
Some researchers believe that bigger doesn’t equal better anyway when it comes to language models. Percy Liang, a professor of computer science at Stanford University, says there is evidence that making models more efficient can improve their performance, rather than simply increasing their size.
“We have seen how smaller models trained on higher quality data can perform better than larger models trained on lower quality data,” he explains.