Research papers, master's and doctoral theses about English

The University of Edinburgh's English-German and English-Hausa Submissions to the WMT21 News Translation Task

243 - Association for Computational Linguistics 2021 Article

This paper presents the University of Edinburgh's constrained submissions of English-German and English-Hausa systems to the WMT 2021 shared task on news translation. We build En-De systems in three stages: corpus filtering, back-translation, and fine-tuning. For En-Ha we use an iterative back-translation approach on top of pre-trained En-De models and investigate vocabulary embedding mapping.

university of edinburgh, edinburgh constrained submissions, english-german and english-hausa

PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation

256 - Association for Computational Linguistics 2021 Article

We introduce a high-quality and large-scale Vietnamese-English parallel dataset of 3.02M sentence pairs, which is 2.9M pairs larger than the benchmark Vietnamese-English machine translation corpus IWSLT15. We conduct experiments comparing strong neural baselines and well-known automatic translation engines on our dataset and find that, in both automatic and human evaluations, the best performance is obtained by fine-tuning the pre-trained sequence-to-sequence denoising auto-encoder mBART. To the best of our knowledge, this is the first large-scale Vietnamese-English machine translation study. We hope our publicly available dataset and study can serve as a starting point for future research and applications on Vietnamese-English machine translation. We release our dataset at: https://github.com/VinAIResearch/PhoMT

vietnamese-english machine translation, benchmark vietnamese-english machine

To What Extent Does Lexical Normalization Help English-as-a-Second Language Learners to Read Noisy English Texts?

222 - Association for Computational Linguistics 2021 Article

How difficult is it for English-as-a-second-language (ESL) learners to read noisy English texts? Do ESL learners need lexical normalization to read noisy English texts? These questions may also affect community formation on social networking sites where differences can be attributed to ESL learners and native English speakers. However, few studies have addressed these questions. To this end, we built highly accurate readability assessors to evaluate the readability of texts for ESL learners. We then applied these assessors to noisy English texts to further assess the readability of the texts. The experimental results showed that although intermediate-level ESL learners can read most noisy English texts in the first place, lexical normalization significantly improves the readability of noisy English texts for ESL learners.

noisy english texts, read noisy english, noisy english

Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies

115 - Association for Computational Linguistics 2021 Article

Gender is widely discussed in the context of language tasks and when examining the stereotypes propagated by language models. However, current discussions primarily treat gender as binary, which can perpetuate harms such as the cyclical erasure of non-binary gender identities. These harms are driven by model and dataset biases, which are consequences of the non-recognition and lack of understanding of non-binary genders in society. In this paper, we explain the complexity of gender and language around it, and survey non-binary persons to understand harms associated with the treatment of gender as binary in English language technologies. We also detail how current language representations (e.g., GloVe, BERT) capture and perpetuate these harms and related challenges that need to be acknowledged and addressed for representations to equitably encode gender information.

gender exclusivity, english language technologies

Detection of Puffery on the English Wikipedia

226 - Association for Computational Linguistics 2021 Article

On Wikipedia, an online crowdsourced encyclopedia, volunteers enforce the encyclopedia's editorial policies. Wikipedia's policy on maintaining a neutral point of view has inspired recent research on bias detection, including "weasel words" and "hedges". Yet to date, little work has been done on identifying "puffery," phrases that are overly positive without a verifiable source. We demonstrate that collecting training data for this task requires some care, and construct a dataset by combining Wikipedia editorial annotations and information retrieval techniques. We compare several approaches to predicting puffery, and achieve an F1 score of 0.963 by incorporating citation features into a RoBERTa model. Finally, we demonstrate how to integrate our model with Wikipedia's public infrastructure to give back to the Wikipedia editor community.

english wikipedia, wikipedia

BERT Cannot Align Characters

155 - Association for Computational Linguistics 2021 Article

In previous work, it has been shown that BERT can adequately align cross-lingual sentences on the word level. Here we investigate whether BERT can also operate as a char-level aligner. The languages examined are English, Fake English, German and Greek. We show that the closer two languages are, the better BERT can align them on the character level. BERT indeed works well in English to Fake English alignment, but this does not generalize to natural languages to the same extent. Nevertheless, the proximity of two languages does seem to be a factor. English is more related to German than to Greek and this is reflected in how well BERT aligns them; English to German is better than English to Greek. We examine multiple setups and show that the similarity matrices for natural languages show weaker relations the further apart two languages are.
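
The abstract refers to similarity matrices between the units of two sequences. As a minimal sketch of that idea, assuming cosine similarity and a greedy per-row argmax (the abstract does not say how alignments are extracted), the following builds such a matrix from toy vectors. The vectors and the function names are hypothetical stand-ins, not BERT outputs or the authors' code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(src_vecs, tgt_vecs):
    """Rows index source units (e.g. characters), columns index target units."""
    return [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]


def argmax_align(sim):
    """Assumed alignment rule: each source unit picks its most similar target unit."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


# Toy 2-dimensional "embeddings"; real ones would come from a model such as BERT.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.1, 0.9], [0.9, 0.1], [0.7, 0.7]]

sim = similarity_matrix(src, tgt)
alignment = argmax_align(sim)  # alignment[i] is the target index chosen for source unit i
print(alignment)
```

With these toy vectors each source row has one clearly dominant column; a flatter matrix, with no dominant column per row, would correspond to the weaker relations the abstract reports for more distant languages.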

detecting sarcasm, fake english

Monolingual and Cross-Lingual Acceptability Judgments with the Italian CoLA corpus

143 - Association for Computational Linguistics 2021 Article

The development of automated approaches to linguistic acceptability has been greatly fostered by the availability of the English CoLA corpus, which has also been included in the widely used GLUE benchmark. However, this kind of research for languages other than English, as well as the analysis of cross-lingual approaches, has been hindered by the lack of resources with a comparable size in other languages. We have therefore developed the ItaCoLA corpus, containing almost 10,000 sentences with acceptability judgments, which has been created following the same approach and the same steps as the English one. In this paper we describe the corpus creation, we detail its content, and we present the first experiments on this new resource. We compare in-domain and out-of-domain classification, and perform a specific evaluation of nine linguistic phenomena. We also present the first cross-lingual experiments, aimed at assessing whether multilingual transformer-based approaches can benefit from using sentences in two languages during fine-tuning.

italian cola corpus, english cola corpus, italian cola

Supervised Identification of Participant Slots in Contracts

371 - Association for Computational Linguistics 2021 Article

This paper presents a technique for the identification of participant slots in English language contracts. Taking inspiration from unsupervised slot extraction techniques, the system presented here uses a supervised approach to identify terms used to refer to a genre-specific slot in novel contracts. We evaluate the system in multiple feature configurations to demonstrate that the best performing system in both genres of contracts omits the exact mention form from consideration (even though such mention forms are often the name of the slot under consideration) and is instead based solely on the dependency label and parent; in other words, a more reliable quantification of a party's role in a contract is found in what they do rather than what they are named.

identification of participant, participant slots, english language contracts

Comprehensive Punctuation Restoration for English and Polish

303 - Association for Computational Linguistics 2021 Article

Punctuation restoration is a fundamental requirement for the readability of text derived from Automatic Speech Recognition (ASR) systems. Most contemporary solutions are limited to predicting only a few of the most frequently occurring marks, such as periods, commas, and question marks, and only one per word. However, in written language, we deal with a much larger number of punctuation characters (such as parentheses, hyphens, etc.), and their combinations (like a parenthesis followed by a dot). Such comprehensive punctuation cannot always be unambiguously reduced to a basic set of the most frequently occurring marks. In this work, we evaluate several methods in the comprehensive punctuation reconstruction task. We conduct experiments on parallel corpora of two different languages, English and Polish, which have a relatively simple and a relatively complex morphology, respectively. We also investigate the influence of building a model on comprehensive punctuation on the quality of the basic punctuation restoration task.
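
To illustrate the difference between basic and comprehensive punctuation described above, here is a minimal sketch that turns punctuated text into (word, label) pairs, where the label holds every mark that follows the word, and then reduces a label to a basic set. The function names, the choice of `.`, `,` and `?` as the basic set, and the lowercasing are assumptions made for illustration; this is not the method evaluated in the paper.

```python
import string

PUNCT = set(string.punctuation)
BASIC = {".", ",", "?"}  # assumed "basic" set: period, comma, question mark


def to_labels(text):
    """Split text on whitespace and pair each word with its trailing punctuation.

    Leading marks such as "(" are dropped, and tokens made only of punctuation
    are skipped, so this sketch covers trailing marks only.
    """
    pairs = []
    for token in text.split():
        end = len(token)
        while end > 0 and token[end - 1] in PUNCT:
            end -= 1
        word = token[:end].lstrip(string.punctuation).lower()
        if word:
            pairs.append((word, token[end:]))
    return pairs


def reduce_to_basic(label):
    """Keep only the last basic mark in a label, or "" if there is none."""
    for ch in reversed(label):
        if ch in BASIC:
            return ch
    return ""


pairs = to_labels("Yes (really).")
print(pairs)
print([(word, reduce_to_basic(label)) for word, label in pairs])
```

The label `).` on the second word is an instance of the combinations the abstract mentions: a one-mark-per-word scheme over the basic set keeps the period but loses the closing parenthesis.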

conversation summary dataset, punctuation restoration, english and polish, automatic speech recognition

Adam Mickiewicz University's English-Hausa Submissions to the WMT 2021 News Translation Task

137 - Association for Computational Linguistics 2021 Article

This paper presents the Adam Mickiewicz University's (AMU) submissions to the WMT 2021 News Translation Task. The submissions focus on the English↔Hausa translation directions, which is a low-resource translation scenario between distant languages. Our approach involves thorough data cleaning, transfer learning using a high-resource language pair, iterative training, and utilization of monolingual data via back-translation. We experiment with NMT and PB-SMT approaches alike, using the base Transformer architecture for all of the NMT models while utilizing PB-SMT systems as comparable baseline solutions.

adam mickiewicz university, mickiewicz university english-hausa, university english-hausa submissions
