Word Embeddings: A Survey

161 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Felipe Almeida

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Felipe Almeida - Geraldo Xexeo

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

This work lists and describes the main recent strategies for building fixed-length, dense and distributed representations for words, based on the distributional hypothesis. These representations are now commonly called word embeddings and, in addition to encoding surprisingly good syntactic and semantic information, have been proven useful as extra features in many downstream NLP tasks.

قيم البحث

اقرأ أيضاً

Learning Gender-Neutral Word Embeddings

104 - Jieyu Zhao , Yichao Zhou , Zeyu Li 2018

Word embedding models have become a fundamental component in a wide range of Natural Language Processing (NLP) applications. However, embeddings trained on human-generated corpora have been demonstrated to inherit strong gender stereotypes that refle ct social constructs. To address this concern, in this paper, we propose a novel training procedure for learning gender-neutral word embeddings. Our approach aims to preserve gender information in certain dimensions of word vectors while compelling other dimensions to be free of gender influence. Based on the proposed method, we generate a Gender-Neutral variant of GloVe (GN-GloVe). Quantitative and qualitative experiments demonstrate that GN-GloVe successfully isolates gender information without sacrificing the functionality of the embedding model.

الحساب واللغة التعلم الآلي التعلم الالي

Humor in Word Embeddings: Cockamamie Gobbledegook for Nincompoops

75 - Limor Gultchin 2019

While humor is often thought to be beyond the reach of Natural Language Processing, we show that several aspects of single-word humor correlate with simple linear directions in Word Embeddings. In particular: (a) the word vectors capture multiple asp ects discussed in humor theories from various disciplines; (b) each individuals sense of humor can be represented by a vector, which can predict differences in peoples senses of humor on new, unrated, words; and (c) upon clustering humor ratings of multiple demographic groups, different humor preferences emerge across the different groups. Humor ratings are taken from the work of Engelthaler and Hills (2017) as well as from an original crowdsourcing study of 120,000 words. Our dataset further includes annotations for the theoretically-motivated humor features we identify.

الحساب واللغة التعلم الآلي التعلم الالي

Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings

132 - Ahmed El-Kishky , Frank Xu , Aston Zhang 2019

Traditionally, many text-mining tasks treat individual word-tokens as the finest meaningful semantic granularity. However, in many languages and specialized corpora, words are composed by concatenating semantically meaningful subword structures. Word -level analysis cannot leverage the semantic information present in such subword structures. With regard to word embedding techniques, this leads to not only poor embeddings for infrequent words in long-tailed text corpora but also weak capabilities for handling out-of-vocabulary words. In this paper we propose MorphMine for unsupervised morpheme segmentation. MorphMine applies a parsimony criterion to hierarchically segment words into the fewest number of morphemes at each level of the hierarchy. This leads to longer shared morphemes at each level of segmentation. Experiments show that MorphMine segments words in a variety of languages into human-verified morphemes. Additionally, we experimentally demonstrate that utilizing MorphMine morphemes to enrich word embeddings consistently improves embedding quality on a variety of of embedding evaluations and a downstream language modeling task.

الحساب واللغة التعلم الآلي التعلم الالي

comp-syn: Perceptually Grounded Word Embeddings with Color

131 - Bhargav Srinivasa Desikan , Tasker Hull , Ethan O. Nadler 2020

Popular approaches to natural language processing create word embeddings based on textual co-occurrence patterns, but often ignore embodied, sensory aspects of language. Here, we introduce the Python package comp-syn, which provides grounded word emb eddings based on the perceptually uniform color distributions of Google Image search results. We demonstrate that comp-syn significantly enriches models of distributional semantics. In particular, we show that (1) comp-syn predicts human judgments of word concreteness with greater accuracy and in a more interpretable fashion than word2vec using low-dimensional word-color embeddings, and (2) comp-syn performs comparably to word2vec on a metaphorical vs. literal word-pair classification task. comp-syn is open-source on PyPi and is compatible with mainstream machine-learning Python packages. Our package release includes word-color embeddings for over 40,000 English words, each associated with crowd-sourced word concreteness judgments.

الحساب واللغة التعلم الآلي الشبكات الاجتماعية والمعلومات

Evaluating the Underlying Gender Bias in Contextualized Word Embeddings

189 - Christine Basta , Marta R. Costa-juss`a , Noe Casas 2019

Gender bias is highly impacting natural language processing applications. Word embeddings have clearly been proven both to keep and amplify gender biases that are present in current data sources. Recently, contextualized word embeddings have enhanced previous word embedding techniques by computing word vector representations dependent on the sentence they appear in. In this paper, we study the impact of this conceptual change in the word embedding computation in relation with gender bias. Our analysis includes different measures previously applied in the literature to standard word embeddings. Our findings suggest that contextualized word embeddings are less biased than standard ones even when the latter are debiased.

الحساب واللغة التعلم الآلي