Do you want to publish a course? Click here

Field Embedding: A Unified Grain-Based Framework for Word Representation

التضمين الميداني: إطار موحد في الحبوب القائم على تمثيل Word

503   0   0   0.0 ( 0 )
 Publication date 2021
and research's language is English
 Created by Shamra Editor




Ask ChatGPT about the research

Word representations empowered with additional linguistic information have been widely studied and proved to outperform traditional embeddings. Current methods mainly focus on learning embeddings for words while embeddings of linguistic information (referred to as grain embeddings) are discarded after the learning. This work proposes a framework field embedding to jointly learn both word and grain embeddings by incorporating morphological, phonetic, and syntactical linguistic fields. The framework leverages an innovative fine-grained pipeline that integrates multiple linguistic fields and produces high-quality grain sequences for learning supreme word representations. A novel algorithm is also designed to learn embeddings for words and grains by capturing information that is contained within each field and that is shared across them. Experimental results of lexical tasks and downstream natural language processing tasks illustrate that our framework can learn better word embeddings and grain embeddings. Qualitative evaluations show grain embeddings effectively capture the semantic information.

References used
https://aclanthology.org/
rate research

Read More

Abstract Pre-trained language representation models (PLMs) cannot well capture factual knowledge from text. In contrast, knowledge embedding (KE) methods can effectively represent the relational facts in knowledge graphs (KGs) with informative entity embeddings, but conventional KE models cannot take full advantage of the abundant textual information. In this paper, we propose a unified model for Knowledge Embedding and Pre-trained LanguagERepresentation (KEPLER), which can not only better integrate factual knowledge into PLMs but also produce effective text-enhanced KE with the strong PLMs. In KEPLER, we encode textual entity descriptions with a PLM as their embeddings, and then jointly optimize the KE and language modeling objectives. Experimental results show that KEPLER achieves state-of-the-art performances on various NLP tasks, and also works remarkably well as an inductive KE model on KG link prediction. Furthermore, for pre-training and evaluating KEPLER, we construct Wikidata5M1 , a large-scale KG dataset with aligned entity descriptions, and benchmark state-of-the-art KE methods on it. It shall serve as a new KE benchmark and facilitate the research on large KG, inductive KE, and KG with text. The source code can be obtained from https://github.com/THU-KEG/KEPLER.
Various temporal knowledge graph (KG) completion models have been proposed in the recent literature. The models usually contain two parts, a temporal embedding layer and a score function derived from existing static KG modeling approaches. Since the approaches differ along several dimensions, including different score functions and training strategies, the individual contributions of different temporal embedding techniques to model performance are not always clear. In this work, we systematically study six temporal embedding approaches and empirically quantify their performance across a wide range of configurations with about 3000 experiments and 13159 GPU hours. We classify the temporal embeddings into two classes: (1) timestamp embeddings and (2) time-dependent entity embeddings. Despite the common belief that the latter is more expressive, an extensive experimental study shows that timestamp embeddings can achieve on-par or even better performance with significantly fewer parameters. Moreover, we find that when trained appropriately, the relative performance differences between various temporal embeddings often shrink and sometimes even reverse when compared to prior results. For example, TTransE (CITATION), one of the first temporal KG models, can outperform more recent architectures on ICEWS datasets. To foster further research, we provide the first unified open-source framework for temporal KG completion models with full composability, where temporal embeddings, score functions, loss functions, regularizers, and the explicit modeling of reciprocal relations can be combined arbitrarily.
Toxic Spans Detection(TSD) task is defined as highlighting spans that make a text toxic. Many works have been done to classify a given comment or document as toxic or non-toxic. However, none of those proposed models work at the token level. In this paper, we propose a self-attention-based bidirectional gated recurrent unit(BiGRU) with a multi-embedding representation of the tokens. Our proposed model enriches the representation by a combination of GPT-2, GloVe, and RoBERTa embeddings, which led to promising results. Experimental results show that our proposed approach is very effective in detecting span tokens.
Neural language models, including transformer-based models, that are pre-trained on very large corpora became a common way to represent text in various tasks, including recognition of textual semantic relations, e.g. Cross-document Structure Theory. Pre-trained models are usually fine tuned to downstream tasks and the obtained vectors are used as an input for deep neural classifiers. No linguistic knowledge obtained from resources and tools is utilised. In this paper we compare such universal approaches with a combination of rich graph-based linguistically motivated sentence representation and a typical neural network classifier applied to a task of recognition of CST relation in Polish. The representation describes selected levels of the sentence structure including description of lexical meanings on the basis of the wordnet (plWordNet) synsets and connected SUMO concepts. The obtained results show that in the case of difficult relations and medium size training corpus semantically enriched text representation leads to significantly better results.
In this work, we describe our efforts in improving the variety of language generated from a rule-based NLG system for automated journalism. We present two approaches: one based on inserting completely new words into sentences generated from templates , and another based on replacing words with synonyms. Our initial results from a human evaluation conducted in English indicate that these approaches successfully improve the variety of the language without significantly modifying sentence meaning. We also present variations of the methods applicable to low-resource languages, simulated here using Finnish, where cross-lingual aligned embeddings are harnessed to make use of linguistic resources in a high-resource language. A human evaluation indicates that while proposed methods show potential in the low-resource case, additional work is needed to improve their performance.

suggested questions

comments
Fetching comments Fetching comments
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا