The absence of diacritics in Arabic texts is one of the most important challenges facing
automatic Arabic language processing. When reading, an Arabic reader can infer the correct
diacritics of words, while computers need algorithms to restore the diacritization based on
knowledge at different levels. Diacritization here covers all the diacritics (damma, fatha, kasra,
sukun), in addition to shadda and tanween.
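To make the scope of these marks concrete, the following minimal Python sketch maps each diacritic named above to its Unicode code point in the Arabic block. The dictionary name and the two helper functions are illustrative assumptions, not part of any system surveyed here; the sketch only shows how undiacritized input arises and how the marks can be enumerated.

```python
# Unicode code points of the Arabic diacritics named in the text.
# DIACRITICS, strip_diacritics and list_diacritics are hypothetical names
# introduced for illustration only.
DIACRITICS = {
    "\u064E": "fatha",
    "\u064F": "damma",
    "\u0650": "kasra",
    "\u0652": "sukun",
    "\u0651": "shadda",
    "\u064B": "tanween (fathatan)",
    "\u064C": "tanween (dammatan)",
    "\u064D": "tanween (kasratan)",
}


def strip_diacritics(text: str) -> str:
    """Remove the diacritics above, yielding the undiacritized form."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def list_diacritics(text: str) -> list[str]:
    """Return the names of the diacritics found in text, in order."""
    return [DIACRITICS[ch] for ch in text if ch in DIACRITICS]


# Kaf, ta and ba, each followed by a fatha (the verb "kataba").
word = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_diacritics(word)
marks = list_diacritics(word)
```

A diacritizer has to solve the inverse of `strip_diacritics`: given only `bare`, it must choose the correct marks among several valid readings.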
Some diacritization methods are based on the linguistic processing of texts, while others
are statistical and rely on text corpora. Some systems integrate the two
methodologies in hybrid approaches.
In this paper we present a comprehensive study of different methods that have been adopted in
these diacritization systems. In addition, we review the various corpora that have been used
for testing and evaluation, then suggest the specifications of the Arabic corpus needed for
diacritization systems, and the standards that the evaluation process must take into
consideration. The main objective is to develop an action plan for the construction of an
automatic diacritizer of Arabic texts under the auspices of ALECSO, with the participation of
many research entities from different countries.