Publication

Unlocking the potential of Arabic NLP: High-quality dataset and preprocessing tool for Arabic large language models

Attiah, Ameera
Hantash, Jana
Abstract
Arabic is one of the most widely spoken languages, yet it remains technologically underserved in Natural Language Processing (NLP) because of data sparsity, high morphological richness, and limited domain-specific corpora, particularly in academic and educational contexts. This project addresses two critical gaps in Arabic NLP: the scarcity of high-quality, domain-specific Arabic datasets for low-resource Large Language Models (LLMs), and the lack of automated frameworks tailored to the complexities of the Arabic language. To bridge these gaps, we developed a curated academic dataset that captures formal Arabic usage across disciplines and is aimed at enhancing the training and evaluation of Arabic LLMs. In parallel, we built a robust, modular framework for large-scale Arabic data preprocessing. The framework automates advanced linguistic refinement stages, including deep normalization, morphological transformation, diacritization, distributed deduplication across multiple GPUs, and semantic scoring using LLM-based annotators. By integrating data from Common Crawl with additional Arabic sources such as books and journals, and by applying data-centric AI techniques and morphological analysis, the framework ensures high linguistic and semantic coherence. The output is a high-quality, academic-specific Arabic dataset that was validated through intrinsic evaluations (grammar correctness, lexical diversity, readability, and topic coherence) and extrinsic evaluations on downstream tasks. These outcomes validate the framework's effectiveness and its potential to accelerate the development of Arabic AI systems. The project supports Saudi Vision 2030 by advancing Arabic AI, and it aligns with SDG 4 through academic resource accessibility and with SDG 9 through scalable NLP tools.
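To make the normalization and deduplication stages mentioned in the abstract more concrete, the following is a minimal, single-machine sketch in Python. It is not the authors' framework: the abstract does not specify its normalization rules or its GPU-based deduplication method, so the function names (normalize_arabic, deduplicate) and the specific rules here (removing tatweel, stripping diacritics, unifying alef variants, exact-match hashing) are illustrative assumptions based on common Arabic preprocessing practice. Diacritics are stripped only to build a comparison key; the original documents are returned unchanged.

```python
import hashlib
import re
import unicodedata

# Assumption: these rules are common Arabic normalization choices, not the
# framework's actual rules, which the abstract does not specify.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")  # tanween, short vowels, shadda, sukun, dagger alef
_TATWEEL = "\u0640"  # elongation character used for visual stretching
_ALEF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
})


def normalize_arabic(text: str) -> str:
    """Return a normalized form of `text` suitable as a comparison key."""
    text = unicodedata.normalize("NFKC", text)  # fold presentation forms
    text = text.replace(_TATWEEL, "")
    text = _DIACRITICS.sub("", text)
    text = text.translate(_ALEF_VARIANTS)
    return " ".join(text.split())  # collapse runs of whitespace


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized text.

    This is exact-match deduplication on one machine; the distributed,
    multi-GPU deduplication described in the abstract is not reproduced here.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        key = hashlib.sha256(normalize_arabic(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Under these assumptions, two documents that differ only in diacritics, tatweel, alef spelling, or spacing map to the same key, so only the first is kept. Near-duplicate detection (for example, MinHash) would need a different keying scheme.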