Do you want to publish a course? Click here

In this paper, we propose a definition and taxonomy of various types of non-standard textual content -- generally referred to as noise'' -- in Natural Language Processing (NLP). While data pre-processing is undoubtedly important in NLP, especially wh en dealing with user-generated content, a broader understanding of different sources of noise and how to deal with them is an aspect that has been largely neglected. We provide a comprehensive list of potential sources of noise, categorise and describe them, and show the impact of a subset of standard pre-processing strategies on different tasks. Our main goal is to raise awareness of non-standard content -- which should not always be considered as noise'' -- and of the need for careful, task-dependent pre-processing. This is an alternative to blanket, all-encompassing solutions generally applied by researchers through standard'' pre-processing pipelines. The intention is for this categorisation to serve as a point of reference to support NLP researchers in devising strategies to clean, normalise or embrace non-standard content.
The advancement of the web and information technology has contributed to the rapid growth of digital libraries and automatic machine translation tools which easily translate texts from one language into another. These have increased the content acces sible in different languages, which results in easily performing translated plagiarism, which are referred to as cross-language plagiarism''. Recognition of plagiarism among texts in different languages is more challenging than identifying plagiarism within a corpus written in the same language. This paper proposes a new technique for enhancing English-Arabic cross-language plagiarism detection at the sentence level. This technique is based on semantic and syntactic feature extraction using word order, word embedding and word alignment with multilingual encoders. Those features, and their combination with different machine learning (ML) algorithms, are then used in order to aid the task of classifying sentences as either plagiarized or non-plagiarized. The proposed approach has been deployed and assessed using datasets presented at SemEval-2017. Analysis of experimental data demonstrates that utilizing extracted features and their combinations with various ML classifiers achieves promising results.
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا