Pre-training Transformer-based models such as BERT and ELECTRA on a collection of Arabic corpora, as demonstrated by AraBERT and AraELECTRA, yields impressive results on downstream tasks. However, pre-training Transformer-based language models is computationally expensive, especially for large-scale models. Recently, Funnel Transformer has addressed the sequential redundancy inside the Transformer architecture by compressing the sequence of hidden states, leading to a significant reduction in pre-training cost. This paper empirically studies the performance and efficiency of building an Arabic language model with the Funnel Transformer architecture and the ELECTRA objective. We find that our model achieves state-of-the-art results on several Arabic downstream tasks despite using fewer computational resources than other BERT-based models.
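The compression idea mentioned above can be illustrated with a minimal sketch: adjacent hidden states are pooled with stride 2, halving the sequence length (and hence the attention cost) in deeper blocks. The vectors below are toy values, and mean pooling stands in for the model's actual pooling operation.

```python
# Hedged sketch of Funnel-Transformer-style sequence compression:
# mean-pool consecutive windows of hidden states, halving sequence length.
# Toy dimensions and values; not the actual model.

def pool_hidden_states(hidden, stride=2):
    """Mean-pool consecutive windows of `stride` hidden-state vectors."""
    pooled = []
    for i in range(0, len(hidden), stride):
        window = hidden[i:i + stride]
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window)
                       for d in range(dim)])
    return pooled

seq = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # 4 tokens, hidden dim 2
print(pool_hidden_states(seq))  # → [[2.0, 3.0], [6.0, 7.0]] (length 4 -> 2)
```

Since self-attention cost grows quadratically in sequence length, each such pooling step roughly quarters the attention cost of the following block.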
The emergence of multi-task learning (MTL) models in recent years has helped push the state of the art in Natural Language Understanding (NLU). We strongly believe that many NLU problems in Arabic are especially poised to reap the benefits of such models. To this end we propose the Arabic Language Understanding Evaluation Benchmark (ALUE), based on 8 carefully selected and previously published tasks. For five of these, we provide new privately held evaluation datasets to ensure the fairness and validity of our benchmark. We also provide a diagnostic dataset to help researchers probe the inner workings of their models. Our initial experiments show that MTL models outperform their singly trained counterparts on most tasks. But in order to entice participation from the wider community, we stick to publishing singly trained baselines only. Nonetheless, our analysis reveals that there is plenty of room for improvement in Arabic NLU. We hope that ALUE will play a part in helping our community realize some of these improvements. Interested researchers are invited to submit their results to our online, publicly accessible leaderboard.
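The multi-task setup such a benchmark encourages can be sketched in miniature: one shared encoder feeds several task-specific heads, so the tasks learn from a common representation. Every component below is an invented stand-in, not an actual MTL model.

```python
# Toy sketch (all components invented) of a shared-encoder MTL architecture:
# one "encoder" produces features consumed by several task-specific heads.

def shared_encoder(text):
    # stand-in "representation": character count and word count
    return (len(text), len(text.split()))

task_heads = {
    "length_class": lambda feats: "long" if feats[0] > 10 else "short",
    "word_count":   lambda feats: feats[1],
}

def predict(task, text):
    return task_heads[task](shared_encoder(text))

print(predict("word_count", "one two three"))   # → 3
print(predict("length_class", "hi"))            # → short
```

In a real MTL model the shared encoder is a pretrained Transformer and the heads are small trained layers; the sharing is what lets improvements on one task transfer to the others.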
Advances in English language representation enabled a more sample-efficient pre-training task through Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA), which, instead of training a model to recover masked tokens, trains a discriminator model to distinguish true input tokens from corrupted tokens that were replaced by a generator network. On the other hand, current Arabic language representation approaches rely only on pretraining via masked language modeling. In this paper, we develop an Arabic language representation model, which we name AraELECTRA. Our model is pretrained using the replaced-token-detection objective on large Arabic text corpora. We evaluate our model on multiple Arabic NLP tasks, including reading comprehension, sentiment analysis, and named-entity recognition, and we show that AraELECTRA outperforms current state-of-the-art Arabic language representation models, given the same pretraining data and even a smaller model size.
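The replaced-token-detection setup described above can be sketched as follows: a toy "generator" corrupts some input tokens, and the training target for the discriminator is a binary label per position (1 = replaced, 0 = original). The token values and the random-replacement rule are illustrative only; a real generator is a small masked language model.

```python
import random

# Hedged sketch of ELECTRA-style replaced token detection:
# corrupt a fraction of tokens and emit per-token binary labels that a
# discriminator would be trained to predict.

def corrupt(tokens, vocab, mask_prob=0.3, seed=0):
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            repl = rng.choice([v for v in vocab if v != tok])  # generator's sample
            corrupted.append(repl)
            labels.append(1)   # discriminator must flag this position as replaced
        else:
            corrupted.append(tok)
            labels.append(0)   # token kept; label "original"
    return corrupted, labels

vocab = ["kitab", "qalam", "bayt", "madrasa"]
tokens = ["kitab", "bayt", "qalam", "madrasa", "bayt"]
corrupted, labels = corrupt(tokens, vocab)
# every position labelled 1 was actually changed, and vice versa
assert all((c != t) == (l == 1) for c, t, l in zip(corrupted, tokens, labels))
```

Because the discriminator receives a training signal at every position rather than only at masked positions, this objective is more sample-efficient than masked language modeling.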
Enabling empathetic behavior in Arabic dialogue agents is an important aspect of building human-like conversational models. While Arabic Natural Language Processing has seen significant advances in Natural Language Understanding (NLU) with language models such as AraBERT, Natural Language Generation (NLG) remains a challenge. The shortcomings of NLG encoder-decoder models are primarily due to the lack of Arabic datasets suitable for training NLG models such as conversational agents. To overcome this issue, we propose a transformer-based encoder-decoder initialized with AraBERT parameters. By initializing the weights of the encoder and decoder with AraBERT pre-trained weights, our model was able to leverage knowledge transfer and boost performance in response generation. To enable empathy in our conversational model, we train it using the ArabicEmpatheticDialogues dataset and achieve high performance in empathetic response generation. Specifically, our model achieved a low perplexity value of 17.0 and an increase of 5 BLEU points over the previous state-of-the-art model. In addition, our proposed model was rated highly by 85 human evaluators, validating its strong capability to exhibit empathy while generating relevant and fluent responses in open-domain settings.
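The warm-start idea above can be sketched abstractly: parameters whose names exist in the pretrained checkpoint are copied into both the encoder and the decoder, while decoder-only parameters such as cross-attention, which have no pretrained counterpart, are freshly initialized. This is an illustrative sketch with hypothetical parameter names, not the authors' code.

```python
import random

# Illustrative sketch (not the paper's implementation) of warm-starting an
# encoder-decoder from a single pretrained checkpoint: shared parameter names
# are copied; decoder-only parameters (e.g. cross-attention) are random-init.

pretrained = {"embed": [0.1, 0.2], "layer0.attn": [0.3], "layer0.ffn": [0.4]}

def init_seq2seq(pretrained, decoder_extra=("layer0.cross_attn",), seed=0):
    rng = random.Random(seed)
    encoder = dict(pretrained)                       # encoder: full copy
    decoder = dict(pretrained)                       # decoder: shared weights copied
    for name in decoder_extra:                       # cross-attention has no
        decoder[name] = [rng.uniform(-0.02, 0.02)]   # pretrained counterpart
    return encoder, decoder

enc, dec = init_seq2seq(pretrained)
assert enc == pretrained
assert "layer0.cross_attn" in dec and all(k in dec for k in pretrained)
```

Only the small set of freshly initialized parameters has to be learned from scratch, which is why this recipe helps when NLG training data is scarce.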
The present study aimed to detect the degree to which Arabic language teachers practice creative thinking skills in the Directorate of Education for the North Eastern Badia region. The study's sample consisted of 200 Arabic language teachers of the sixth and seventh grades. To achieve the objectives of the study, the researcher used a questionnaire composed of 63 items. The results of the study showed that the degree to which Arabic language teachers practiced the development of students' creative thinking skills was moderate on the instrument's total score, as well as in the fields of freedom of expression, a positive perspective towards creativity, teaching methods, methods of evaluation, the class environment, and creativity stimulation. The results also pointed to the absence of statistically significant differences in the degree to which Arabic language teachers in the Directorate of Education for the North Eastern Badia region practiced the development of creative thinking skills depending on the variables of gender, experience, and qualification across all fields of the study. Accordingly, the study concluded with a number of related recommendations.
The study aimed at investigating the linguistic performance of teachers of the Arabic language and its relation to their attitudes towards teaching. The sample of the study consisted of 40 Arabic teachers from the public schools in the Northeastern Badia Directorate of Education. To achieve the purpose of the study, the analytical descriptive approach was used. The instruments of the study were an observation card and a scale of attitudes towards teaching. The results of the study showed that the linguistic performance of Arabic teachers and their attitudes toward teaching were both at a medium level, and that there was a strong correlation between their linguistic performance and their attitudes toward teaching.
This study aimed at analyzing the degree to which linguistic performance is represented in Arabic language curricula, as embodied in the four language skills (listening, conversation, reading, and writing), against the intended learning outcomes stated in the objectives, in order to judge the appropriateness of the content of the Arabic language curriculum for the predetermined objectives. The analysis yielded the following representation percentages of the content across all grades: listening comprehension, 80.75%; writing skills, 84.3%; conversation skills, 91.25%; reading comprehension, 92.8%. At the level of individual grades, the representation percentages of the content for all skills were: first grade, 89.5%; second grade, 89.125%; third grade, 87.875%; fourth grade, 87.375%; fifth grade, 85.5%; sixth grade, 84.375%. The study concluded with a number of recommendations.
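A representation percentage of the kind reported above is a simple coverage ratio: the share of predetermined objectives that the analyzed content actually represents. The counts in the sketch below are invented for illustration; only the formula reflects the study's measure.

```python
# Hedged sketch of a content-analysis coverage percentage:
# 100 * (objectives represented in the content) / (total objectives).
# The counts are invented; the scale matches the figures reported above.

def coverage_percent(covered, total):
    return round(100 * covered / total, 2)

print(coverage_percent(323, 400))  # → 80.75
```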
In this paper, we introduce an algorithm for grouping Arabic documents in order to build an ontology and its word list. We executed the algorithm on five ontologies using Java, processing the documents to obtain 338,667 words together with their weights corresponding to each ontology. The algorithm proved its efficiency in optimizing the performance of the classifiers (SVM, NB) tested in this study, compared with previous classifier results for the Arabic language.
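The per-ontology word weights mentioned above can be illustrated with a standard TF-IDF weighting over a toy corpus. This is a generic sketch of that kind of word-weight table, not the paper's actual algorithm, and the documents and class names are invented.

```python
import math

# Illustrative sketch (not the paper's algorithm): weight each word per
# class/ontology with TF-IDF, the kind of word-weight table that can feed
# SVM or Naive Bayes classifiers.

def tfidf_weights(docs_by_class):
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    df = {}                                   # document frequency per word
    for docs in docs_by_class.values():
        for doc in docs:
            for w in set(doc):
                df[w] = df.get(w, 0) + 1
    weights = {}
    for cls, docs in docs_by_class.items():
        tf = {}                               # term frequency within the class
        for doc in docs:
            for w in doc:
                tf[w] = tf.get(w, 0) + 1
        weights[cls] = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return weights

docs = {"sport":   [["kura", "malaab"], ["kura", "hadaf"]],
        "science": [["tajriba", "malaab"]]}
w = tfidf_weights(docs)
# "kura" occurs only in sport documents, so it carries positive weight there
assert w["sport"]["kura"] > 0
```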
The absence of diacritization in Arabic texts is one of the most important challenges facing automatic Arabic language processing. When reading, an Arabic reader can infer the correct diacritics of words, while computers need algorithms that restore the diacritization based on knowledge at different levels. Diacritization here includes all the diacritics (damma, fatha, kasra, sukun), in addition to shadda and tanween. Some diacritization methods are based on the linguistic processing of texts, while other methods are statistical, relying on textual corpora. Some systems integrate the two methodologies in hybrid approaches.
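The corpus-based statistical approach can be sketched at its simplest: restore diacritics by choosing, for each bare word, its most frequent diacritized form in a training corpus. The three-word corpus below is a toy; real systems add morphological and contextual levels of knowledge, as the hybrid approaches above do.

```python
from collections import Counter

# Minimal sketch of unigram statistical diacritization: pick the most
# frequent diacritized form seen in a corpus for each undiacritized word.

DIACRITICS = "\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652"  # tanween, fatha, damma, kasra, shadda, sukun

def strip(word):
    """Remove diacritic marks, leaving the bare consonantal skeleton."""
    return "".join(ch for ch in word if ch not in DIACRITICS)

def train(diacritized_corpus):
    counts = {}
    for word in diacritized_corpus:
        counts.setdefault(strip(word), Counter())[word] += 1
    return counts

def diacritize(bare_word, counts):
    if bare_word in counts:
        return counts[bare_word].most_common(1)[0][0]
    return bare_word  # unseen word: left undiacritized

corpus = ["كَتَبَ", "كَتَبَ", "كُتُب"]   # kataba (x2), kutub
model = train(corpus)
assert diacritize("كتب", model) == "كَتَبَ"  # majority form wins
```

The ambiguity this toy resolves by raw frequency (كتب as "kataba" vs. "kutub") is exactly what context-aware linguistic processing handles better, which motivates the hybrid systems surveyed here.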
In this paper we present a comprehensive study of the different methods that have been adopted in these diacritization systems. In addition, we review the various corpora that have been used for testing and evaluation, then suggest the specifications of the Arabic corpus needed for diacritization systems and the standards that the evaluation process must take into consideration. The main objective is to develop an action plan for the construction of an automatic diacritizer of Arabic texts under the auspices of ALECSO, with the participation of many research entities from different countries.
In this paper we present a web-based Interactive Arabic Dictionary developed at HIAST (Higher Institute for Applied Sciences and Technology). Users can search online for any Arabic word. The system provides the different meanings with example sentences and multimedia illustrations, in addition to other related information such as associated words, semantic domains, expressions, linguistic avails, common mistakes, and morphological, syntactic, and semantic information. The dictionary can be enriched collaboratively by expert users with new words, new meanings for existing entries, or other morphological, syntactic, and semantic related information.
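The kind of entry record such a dictionary serves can be sketched as a nested structure of senses with their examples and related information. The field names and content below are assumptions for illustration, not HIAST's actual schema.

```python
# Illustrative sketch (hypothetical schema, not HIAST's) of a dictionary
# entry: a word with senses, each carrying examples and related information.

entry = {
    "word": "قلم",
    "senses": [
        {"meaning": "pen",
         "examples": ["كتبت بالقلم"],
         "semantic_domain": "writing tools",
         "associated_words": ["ورقة", "حبر"]},
    ],
}

def lookup(lexicon, word):
    """Return the meanings of all senses of `word` in the lexicon."""
    return [s["meaning"] for e in lexicon if e["word"] == word
            for s in e["senses"]]

assert lookup([entry], "قلم") == ["pen"]
```

Collaborative enrichment, as described above, amounts to appending new entries or new sense records to such structures under expert review.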