New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Denoising Word Embeddings by Averaging in a Shared Space

Denoising Word Embeddings عن طريق المتوسط في مساحة مشتركة

619 0 0 0.0 ( 0 )

Download Cite

Added by Association for Computation Linguistics مقالة

Publication date 2021

fields Artificial Intelligence

and research's language is English

Created by Shamra Editor

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We introduce a new approach for smoothing and improving the quality of word embeddings. We consider a method of fusing word embeddings that were trained on the same corpus but with different initializations. We project all the models to a shared vector space using an efficient implementation of the Generalized Procrustes Analysis (GPA) procedure, previously used in multilingual word translation. Our word representation demonstrates consistent improvements over the raw models as well as their simplistic average, on a range of tasks. As the new representations are more stable and reliable, there is a noticeable improvement in rare word evaluations.

References used

https://aclanthology.org/

rate research

Statistically Significant Detection of Semantic Shifts using Contextual Word Embeddings

284 - Association for Computation Linguistics 2021 مقالة

Detecting lexical semantic change in smaller data sets, e.g. in historical linguistics and digital humanities, is challenging due to a lack of statistical power. This issue is exacerbated by non-contextual embedding models that produce one embedding per word and, therefore, mask the variability present in the data. In this article, we propose an approach to estimate semantic shift by combining contextual word embeddings with permutation-based statistical tests. We use the false discovery rate procedure to address the large number of hypothesis tests being conducted simultaneously. We demonstrate the performance of this approach in simulation where it achieves consistently high precision by suppressing false positives. We additionally analyze real-world data from SemEval-2020 Task 1 and the Liverpool FC subreddit corpus. We show that by taking sample variation into account, we can improve the robustness of individual semantic shift estimates without degrading overall performance.

statistically significant detection significant detection statistically significant الكشف ذات دلالة إحصائية الكشف عن كبير ذات دلالة إحصائية صناعة حمض الفوسفور المزيد..

Query2Prod2Vec: Grounded Word Embeddings for eCommerce

273 - Association for Computation Linguistics 2021 مقالة

We present Query2Prod2Vec, a model that grounds lexical representations for product search in product embeddings: in our model, meaning is a mapping between words and a latent space of products in a digital shop. We leverage shopping sessions to lear n the underlying space and use merchandising annotations to build lexical analogies for evaluation: our experiments show that our model is more accurate than known techniques from the NLP and IR literature. Finally, we stress the importance of data efficiency for product search outside of retail giants, and highlight how Query2Prod2Vec fits with practical constraints faced by most practitioners.

grounded word embeddings grounded word كلمة تضيحية كلمة كلمة أساسية صناعة حمض الفوسفور

Profiling of Intertextuality in Latin Literature Using Word Embeddings

675 - Association for Computation Linguistics 2021 مقالة

Identifying intertextual relationships between authors is of central importance to the study of literature. We report an empirical analysis of intertextuality in classical Latin literature using word embedding models. To enable quantitative evaluatio n of intertextual search methods, we curate a new dataset of 945 known parallels drawn from traditional scholarship on Latin epic poetry. We train an optimized word2vec model on a large corpus of lemmatized Latin, which achieves state-of-the-art performance for synonym detection and outperforms a widely used lexical method for intertextual search. We then demonstrate that training embeddings on very small corpora can capture salient aspects of literary style and apply this approach to replicate a previous intertextual study of the Roman historian Livy, which relied on hand-crafted stylometric features. Our results advance the development of core computational resources for a major premodern language and highlight a productive avenue for cross-disciplinary collaboration between the study of literature and NLP.

latin literature classical latin literature profiling of intertextuality الأدب اللاتيني الأدب اللاتيني الكلاسيكي تنميط intertextulity. صناعة حمض الفوسفور المزيد..

NPVec1: Word Embeddings for Nepali - Construction and Evaluation

308 - Association for Computation Linguistics 2021 مقالة

Word Embedding maps words to vectors of real numbers. It is derived from a large corpus and is known to capture semantic knowledge from the corpus. Word Embedding is a critical component of many state-of-the-art Deep Learning techniques. However, gen erating good Word Embeddings is a special challenge for low-resource languages such as Nepali due to the unavailability of large text corpus. In this paper, we present NPVec1 which consists of 25 state-of-art Word Embeddings for Nepali that we have derived from a large corpus using Glove, Word2Vec, FastText, and BERT. We further provide intrinsic and extrinsic evaluations of these Embeddings using well established metrics and methods. These models are trained using 279 million word tokens and are the largest Embeddings ever trained for Nepali language. Furthermore, we have made these Embeddings publicly available to accelerate the development of Natural Language Processing (NLP) applications in Nepali.

ينطبق على تقطير المعرفة embeddings embeddings. صناعة حمض الفوسفور

OSCaR: Orthogonal Subspace Correction and Rectification of Biases in Word Embeddings

296 - Association for Computation Linguistics 2021 مقالة

Language representations are known to carry stereotypical biases and, as a result, lead to biased predictions in downstream tasks. While existing methods are effective at mitigating biases by linear projection, such methods are too aggressive: they n ot only remove bias, but also erase valuable information from word embeddings. We develop new measures for evaluating specific information retention that demonstrate the tradeoff between bias removal and information retention. To address this challenge, we propose OSCaR (Orthogonal Subspace Correction and Rectification), a bias-mitigating method that focuses on disentangling biased associations between concepts instead of removing concepts wholesale. Our experiments on gender biases show that OSCaR is a well-balanced approach that ensures that semantic information is retained in the embeddings and bias is also effectively mitigated.

orthogonal subspace correction orthogonal subspace subspace correction تصحيح الفضاء الفرعي المتعامد الفضاء الفرعي المتعامد تصحيح الفرعية صناعة حمض الفوسفور المزيد..

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Denoising Word Embeddings by Averaging in a Shared Space

Denoising Word Embeddings عن طريق المتوسط في مساحة مشتركة

Ask ChatGPT about the research

Read More

suggested questions