A new study audits the quality of Wikipedia data for low-resource and multilingual Natural Language Processing (NLP) tasks. The researchers found significant quality issues, including script and language contamination, bot-generated content, and template articles, especially in non-English editions. Filtering out this low-quality data improved language model performance in several scenarios, particularly for lower-quality language editions, suggesting a need for quality-aware best practices in NLP dataset curation.
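One of the issues the study flags, script contamination, can be filtered with a simple heuristic. The sketch below (function names and the 20% threshold are illustrative assumptions, not taken from the study) drops articles dominated by characters from a script other than the one expected for that language edition:

```python
import unicodedata

def script_contamination(text: str, expected: str = "LATIN") -> float:
    """Fraction of alphabetic characters whose Unicode character name
    does not start with the expected script (e.g. "LATIN", "CYRILLIC")."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    foreign = sum(1 for c in letters
                  if not unicodedata.name(c, "").startswith(expected))
    return foreign / len(letters)

def keep_article(text: str, expected: str = "LATIN",
                 max_foreign: float = 0.2) -> bool:
    # Illustrative threshold: drop articles where more than 20% of
    # letters come from an unexpected script.
    return script_contamination(text, expected) <= max_foreign
```

For example, `keep_article("Hello world")` passes, while a Latin-edition article consisting mostly of Cyrillic text would be filtered out. Real pipelines would likely combine this with bot- and template-detection heuristics, which the study also discusses.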
IMPACT Highlights the need for careful data curation in NLP, especially for low-resource languages, to improve model performance.