Product quantization (PQ) is a widely used technique for ad-hoc retrieval. Recent studies propose supervised PQ, where the embedding and quantization models are jointly trained with supervised learning. However, an appropriate formulation of the joint training objective has been lacking, so the improvements over previous unsupervised baselines are limited in practice. In this work, we propose Matching-oriented Product Quantization (MoPQ), in which a novel objective, the Multinoulli Contrastive Loss (MCL), is formulated. By minimizing MCL, we maximize the matching probability of a query and its ground-truth key, which yields optimal retrieval accuracy. Given that the exact computation of MCL is intractable because it requires a vast number of contrastive samples, we further propose Differentiable Cross-device Sampling (DCS), which significantly augments the contrastive samples for a precise approximation of MCL. We conduct extensive experimental studies on four real-world datasets, whose results verify the effectiveness of MoPQ. The code is available at https://github.com/microsoft/MoPQ.
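Since the abstract only names the objective, the following is a minimal sketch of a contrastive matching loss over product-quantized keys, assuming in-batch negatives and a straight-through estimator; the function names, tensor shapes, and temperature are illustrative assumptions, not the released code.

```python
# Hedged sketch: contrastive matching over product-quantized keys.
import torch
import torch.nn.functional as F

def quantize(keys, codebooks):
    # keys: (B, M, sub) sub-vectors; codebooks: (M, K, sub) learnable codewords
    dists = torch.cdist(keys.transpose(0, 1), codebooks)   # (M, B, K)
    codes = dists.argmin(dim=-1)                           # (M, B) hard assignments
    picked = torch.stack([codebooks[m][codes[m]] for m in range(codebooks.size(0))])
    picked = picked.transpose(0, 1)                        # (B, M, sub)
    # Straight-through estimator: forward uses codewords, gradients flow to keys.
    return keys + (picked - keys).detach()

def matching_loss(queries, keys, codebooks, tau=0.05):
    # In-batch negatives only; MoPQ's cross-device sampling would enlarge
    # this negative pool across devices for a tighter approximation.
    B = keys.size(0)
    q = F.normalize(queries, dim=-1)                       # (B, D), D = M * sub
    k = F.normalize(quantize(keys, codebooks).reshape(B, -1), dim=-1)
    logits = q @ k.t() / tau                               # (B, B) similarity matrix
    return F.cross_entropy(logits, torch.arange(B))        # match query i to key i
```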
Word embeddings are essential to neural network models for various natural language processing tasks. Since an embedding matrix is usually of considerable size, it must be compressed effectively before a model containing it can be deployed on edge devices. A previous study proposed GroupReduce, a block-wise low-rank approximation method for word embeddings. Although its structure is effective, the properties behind the concept of block-wise word-embedding compression were not sufficiently explored. Motivated by this, we improve GroupReduce in terms of word weighting and structuring. For word weighting, we propose a simple yet effective method inspired by term frequency-inverse document frequency, as well as a novel differentiable method. Based on these, we construct a discriminative word-embedding compression algorithm. In our experiments, we demonstrate that the proposed algorithm finds word weights more effectively than its competitors in most cases. In addition, we show that the proposed algorithm can act as a framework through successful cooperation with quantization.
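As one way to picture block-wise compression, here is a hedged sketch in which words are grouped by an importance weight (e.g. a TF-IDF-style score) and each block is approximated by a truncated SVD whose rank grows with the block's importance; the rank schedule and function names are assumptions for illustration, not GroupReduce's actual procedure.

```python
# Illustrative block-wise low-rank embedding compression.
import numpy as np

def compress_blockwise(emb, weights, n_blocks=4, base_rank=8):
    # emb: (V, D) embedding matrix; weights: (V,) per-word importance scores.
    # More important blocks receive a higher SVD rank (assumed schedule).
    order = np.argsort(-weights)                   # most important words first
    approx = np.empty_like(emb)
    for i, idx in enumerate(np.array_split(order, n_blocks)):
        rank = base_rank * (n_blocks - i)          # heavier words -> higher rank
        U, S, Vt = np.linalg.svd(emb[idx], full_matrices=False)
        r = min(rank, len(S))                      # guard against tiny blocks
        approx[idx] = (U[:, :r] * S[:r]) @ Vt[:r]  # rank-r reconstruction
    return approx
```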
In this work, we analyze the performance and properties of cross-lingual word embedding models created by mapping-based alignment methods. We use several measures of corpus and embedding similarity to predict the BLI scores of cross-lingual embedding mappings over three types of corpora, three embedding methods, and 55 language pairs. Our experimental results corroborate that the amount of common content in the training corpora, rather than mere size, is what matters. This manifests in two ways: i) despite the smaller corpus sizes, using only the comparable parts of Wikipedia to train the monolingual embedding spaces to be mapped is often more effective than relying on the entire contents of Wikipedia; ii) the smaller, and consequently less diversified, Spanish Wikipedia almost always works much better as a training corpus for bilingual mappings than the ubiquitously used English Wikipedia.
In cross-lingual language models, representations for many different languages live in the same space. Here, we investigate the linguistic and non-linguistic factors affecting sentence-level alignment in cross-lingual pretrained language models for 101 languages and 5,050 language pairs. Using BERT-based LaBSE and BiLSTM-based LASER as our models, and the Bible as our corpus, we compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance, as well as four intrinsic measures of vector space alignment and isomorphism. We then examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics. Our analyses show that word-order agreement and agreement in morphological complexity are two of the strongest linguistic predictors of cross-linguality. We also find that in-family training data is a stronger predictor than language-specific training data across the board. We verify some of our linguistic findings by examining the effect of morphological segmentation on English-Inuktitut alignment, as well as the effect of word-order agreement on isomorphism for 66 zero-shot language pairs from a different corpus. We make the data and code for our experiments publicly available.
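As a concrete picture of the task-based measure, the following sketch computes bitext retrieval accuracy (precision at 1) from two sets of sentence embeddings via cosine similarity; this is a standard protocol assumed for illustration, not necessarily the paper's exact setup.

```python
# Bitext retrieval P@1 between aligned sentence-embedding sets.
import numpy as np

def retrieval_accuracy(src, tgt):
    # src, tgt: (N, D) embeddings, where row i of src translates row i of tgt.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)          # nearest target per source
    return float((nearest == np.arange(len(src))).mean())
```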
The adoption of Transformer-based models in natural language processing (NLP) has led to great success at the cost of a massive number of parameters. However, due to deployment constraints on edge devices, there has been rising interest in compressing these models to improve their inference time and memory footprint. This paper presents a novel loss objective for compressing token embeddings in Transformer-based models by leveraging an AutoEncoder architecture. More specifically, we emphasize the importance of the direction of the compressed embeddings with respect to the original uncompressed embeddings. The proposed method is task-agnostic and does not require further language-modeling pre-training. Our method significantly outperforms the commonly used SVD-based matrix-factorization approach in terms of initial language model perplexity. Moreover, we evaluate our approach on the SQuAD v1.1 dataset and several downstream tasks from the GLUE benchmark, where we also outperform the baseline in most scenarios. Our code is public.
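To make the direction-of-embedding idea concrete, here is a hedged sketch of an autoencoder whose loss mixes a reconstruction term with a cosine term penalizing angular deviation from the original embeddings; the layer sizes, code dimension, and mixing weight are illustrative assumptions, not the paper's configuration.

```python
# Direction-aware autoencoder compression of token embeddings (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingAE(nn.Module):
    def __init__(self, dim=768, code=128):
        super().__init__()
        self.enc = nn.Linear(dim, code)   # compress to a small code
        self.dec = nn.Linear(code, dim)   # reconstruct the original space

    def forward(self, x):
        return self.dec(self.enc(x))

def direction_aware_loss(model, x, alpha=0.5):
    # alpha balances magnitude fidelity against angular fidelity (assumed value).
    rec = model(x)
    mse = F.mse_loss(rec, x)
    # The cosine term rewards preserving each embedding's *direction*.
    cos = 1.0 - F.cosine_similarity(rec, x, dim=-1).mean()
    return mse + alpha * cos
```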
For children, a system trained on a large corpus of adult speakers performed worse than a system trained on a much smaller corpus of children's speech, owing to the acoustic mismatch between training and testing data. To capture more acoustic variability, we trained a shared system on mixed data from adults and children. The shared system yields the best EER for children with no degradation for adults. Thus, a single system trained on mixed data is applicable to speaker verification for both adults and children.
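For reference, a minimal sketch of how the equal error rate (EER) reported above is typically computed, as the operating point where the false-acceptance and false-rejection rates meet; the input format is an assumption.

```python
# EER from verification scores (1 = same speaker, 0 = impostor).
import numpy as np

def eer(scores, labels):
    # scores, labels: 1-D numpy arrays of equal length.
    thresholds = np.sort(scores)
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))   # threshold where FAR and FRR cross
    return (far[i] + frr[i]) / 2.0
```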
Sarcasm detection is of great importance in understanding people's true sentiments and opinions, as much online feedback, including reviews and social media comments, is sarcastic. Several studies have already been conducted in this field, but most address sarcasm analysis in English; far fewer address Arabic, owing to the challenges of the Arabic language. In this paper, we propose a new approach for improving Arabic sarcasm detection that combines data augmentation, contextual word embeddings, and a random forest model. In the shared task on sarcasm and sentiment detection in Arabic, our approach achieved 0.5189 on F1-sarcastic, the official metric, using the shared ArSarcasmV2 dataset (Abu Farha et al., 2021).
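A hedged sketch of the described pipeline, with contextual sentence embeddings fed to a random forest; the embedding source, hyperparameters, and the omission of the augmentation step are assumptions, since the abstract does not specify them.

```python
# Random forest over contextual sentence embeddings (sketch).
from sklearn.ensemble import RandomForestClassifier

def train_sarcasm_classifier(embeddings, labels):
    # embeddings: (N, D) contextual sentence vectors, e.g. from an Arabic
    # BERT variant (assumed); labels: 1 = sarcastic, 0 = not sarcastic.
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    clf.fit(embeddings, labels)
    return clf

# Usage: preds = train_sarcasm_classifier(X_train, y_train).predict(X_test)
```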
This study is based on a formulation of the concept of inclusion (tadmin), whereby one word carries the meaning of another word and takes on its name. It is so called because the second meaning is not stated explicitly but is conveyed through the other word, so that a single word carries the meanings of two. Inclusion is one of the most beautiful chapters of Arabic: whoever contemplates it learns, from it and through it, what the particles of meaning reveal, and sees the distinctive case of a noun or particle operating alongside a verb. The formulation draws on Al-Khasa'is by Ibn Jinni and Mughni Al-Labib by Ibn Hisham Al-Ansari. The study proceeds by two main methods. The first is to understand what alternates and interchanges among the prepositions in Arabic, in order to know precisely the basis on which this art rests. The second is to understand the meanings of verbs when one verb includes the meaning of another, as follows: - inclusion of the meaning of a transitive verb in an intransitive one; - inclusion of the meaning of an intransitive verb in a transitive one; - inclusion between transitive verbs.
This paper deals with conversational implicature in discourse. It aims to investigate the degree of pragmatic awareness of implicature among third-year students of English at Albaath University.
The algorithm to be presented in this thesis aims to protect textual documents and records that are exchanged or published on the Internet. The algorithm that was built uses digital watermarking technology to achieve this.
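Since the thesis's algorithm is not detailed in this abstract, the following is a generic illustration of one common text-watermarking idea, hiding bits in zero-width Unicode characters; it stands in for, and is not, the thesis's actual method.

```python
# Generic zero-width-character text watermark (illustrative only).
ZERO, ONE = "\u200b", "\u200c"   # zero-width space / zero-width non-joiner

def embed_watermark(text, bits):
    # Append one invisible mark per word until the bit string is exhausted.
    words = text.split(" ")
    marks = [ZERO if b == "0" else ONE for b in bits]
    out = [w + (marks[i] if i < len(marks) else "") for i, w in enumerate(words)]
    return " ".join(out)

def extract_watermark(text):
    # Recover the bit string by scanning for the invisible characters.
    return "".join("0" if c == ZERO else "1" for c in text if c in (ZERO, ONE))
```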