ﻻ يوجد ملخص باللغة العربية
Idiomatic expressions like `out of the woods and `up the ante present a range of difficulties for natural language processing applications. We present work on the annotation and extraction of what we term potentially idiomatic expressions (PIEs), a subclass of multiword expressions covering both literal and non-literal uses of idiomatic expressions. Existing corpora of PIEs are small and have limited coverage of different PIE types, which hampers research. To further progress on the extraction and disambiguation of potentially idiomatic expressions, larger corpora of PIEs are required. In addition, larger corpora are a potential source for valuable linguistic insights into idiomatic expressions and their variability. We propose automatic tools to facilitate the building of larger PIE corpora, by investigating the feasibility of using dictionary-based extraction of PIEs as a pre-extraction tool for English. We do this by assessing the reliability and coverage of idiom dictionaries, the annotation of a PIE corpus, and the automatic extraction of PIEs from a large corpus. Results show that combinations of dictionaries are a reliable source of idiomatic expressions, that PIEs can be annotated with a high reliability (0.74-0.91 Fleiss Kappa), and that parse-based PIE extraction yields highly accurate performance (88% F1-score). Combining complementary PIE extraction methods increases reliability further, to over 92% F1-score. Moreover, the extraction method presented here could be extended to other types of multiword expressions and to other languages, given that sufficient NLP tools are available.
Idiomatic expressions have always been a bottleneck for language comprehension and natural language understanding, specifically for tasks like Machine Translation(MT). MT systems predominantly produce literal translations of idiomatic expressions as
The performance of relation extraction models has increased considerably with the rise of neural networks. However, a key issue of neural relation extraction is robustness: the models do not scale well to long sentences with multiple entities and rel
In this paper, we present a novel approach to machine reading comprehension for the MS-MARCO dataset. Unlike the SQuAD dataset that aims to answer a question with exact text spans in a passage, the MS-MARCO dataset defines the task as answering a que
We study a new application for text generation -- idiomatic sentence generation -- which aims to transfer literal phrases in sentences into their idiomatic counterparts. Inspired by psycholinguistic theories of idiom use in ones native language, we p
Distant supervision (DS) aims to generate large-scale heuristic labeling corpus, which is widely used for neural relation extraction currently. However, it heavily suffers from noisy labeling and long-tail distributions problem. Many advanced approac