This paper introduces a new benchmark for machine transliteration between Tajik and Farsi, developing a unique parallel corpus from diverse sources. The study compares six model architectures, including rule-based systems, LSTMs, Transformers, and pre-trained multilingual models. Results show that byte-level and character-level models, particularly ByT5, significantly outperform subword-based models like mT5 for this language pair.
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT Highlights that byte-level and character-level models can outperform subword tokenization on Tajik-Farsi transliteration.
RANK_REASON This is a research paper presenting a new benchmark and comparative study of machine learning models for a specific NLP task.