Character-based PCFG Induction for Modeling the Syntactic Acquisition of Morphologically Rich Languages

Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor

Unsupervised PCFG induction models, which build syntactic structures from raw text, can be used to evaluate the extent to which syntactic knowledge can be acquired from distributional information alone. However, many state-of-the-art PCFG induction models are word-based, meaning that they cannot directly inspect functional affixes, which may provide crucial information for syntactic acquisition in child learners. This work first introduces a neural PCFG induction model that allows a clean ablation of the influence of subword information in grammar induction. Experiments on child-directed speech demonstrate first that the incorporation of subword information results in more accurate grammars with categories that word-based induction models have difficulty finding, and second that this effect is amplified in morphologically richer languages that rely on functional affixes to express grammatical relations. A subsequent evaluation on multilingual treebanks shows that the model with subword information achieves state-of-the-art results on many languages, further supporting a distributional model of syntactic acquisition.

References used

https://aclanthology.org/

Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages

Association for Computational Linguistics, 2021 (article)

Machine translation has seen rapid progress with the advent of Transformer-based models. These models have no explicit linguistic structure built into them, yet they may still implicitly learn structured relationships by attending to relevant tokens. We hypothesize that this structural learning could be made more robust by explicitly endowing Transformers with a structural bias, and we investigate two methods for building in such a bias. One method, the TP-Transformer, augments the traditional Transformer architecture to include an additional component to represent structure. The second method imbues structure at the data level by segmenting the data with morphological tokenization. We test these methods on translating from English into morphologically rich languages, Turkish and Inuktitut, and consider both automatic metrics and human evaluations. We find that each of these two approaches allows the network to achieve better performance, but this improvement is dependent on the size of the dataset. In sum, structural encoding methods make Transformers more sample-efficient, enabling them to perform better from smaller amounts of data.

Keywords: biases for improving; morphologically rich languages; improving transformers

A Corpus-based Syntactic Analysis of Two-termed Unlike Coordination

Association for Computational Linguistics, 2021 (article)

Coordination is a phenomenon of language that conjoins two or more terms or phrases using a coordinating conjunction. Although coordination has been explored extensively in the linguistics literature, the rules and constraints that govern its structure are still largely elusive and widely debated amongst linguists. This paper presents a study of two-termed unlike coordinations in particular, where the two conjuncts of the coordination phrase form valid constituents but have distinct categories. We conducted a syntactic analysis of the phrasal categories that can be conjoined in such unlike coordinations through a computational corpus-based approach, utilizing the Corpus of Contemporary American English (COCA) as the main data source, as well as the Penn Treebank (PTB). The results show that the two conjuncts within unlike coordinations display different properties based on their position, supporting an antisymmetric view of the structure of coordination. This research provides new data and perspectives through the use of statistical techniques that can help shape future theories and models of coordination.

Keywords: two-termed unlike coordination; corpus-based syntactic analysis; unlike coordinations

Unsupervised Chunking as Syntactic Structure Induction with a Knowledge-Transfer Approach

Association for Computational Linguistics, 2021 (article)

In this paper, we address unsupervised chunking as a new task of syntactic structure induction, which is helpful for understanding the linguistic structures of human languages as well as processing low-resource languages. We propose a knowledge-transfer approach that heuristically induces chunk labels from state-of-the-art unsupervised parsing models; a hierarchical recurrent neural network (HRNN) learns from such induced chunk labels to smooth out the noise of the heuristics. Experiments show that our approach largely bridges the gap between supervised and unsupervised chunking.

Keywords: syntactic structure induction; syntactic structure; structure induction

MasakhaNER: Named Entity Recognition for African Languages

Association for Computational Linguistics, 2021 (article)

We take a step towards addressing the under-representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. Finally, we release the data, code, and models to inspire future research on African NLP.

Keywords: current English datasets

Glyph Enhanced Chinese Character Pre-Training for Lexical Sememe Prediction

Association for Computational Linguistics, 2021 (article)

Sememes are defined as the atomic units to describe the semantic meaning of concepts. Due to the difficulty of manually annotating sememes and the inconsistency of annotations between experts, the lexical sememe prediction task has been proposed. However, previous methods heavily rely on word or character embeddings, and ignore the fine-grained information. In this paper, we propose a novel pre-training method which is designed to better incorporate the internal information of Chinese characters. The Glyph enhanced Chinese Character representation (GCC) is used to assist sememe prediction. We experiment and evaluate our model on HowNet, which is a famous sememe knowledge base. The experimental results show that our method outperforms existing non-external information models.

Keywords: lexical sememe prediction; enhanced chinese character; glyph enhanced chinese
