
Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data



Added by Association for Computational Linguistics (article)

Publication date 2021

Field: Artificial Intelligence

Research language: English

Created by Shamra Editor



Most available semantic parsing datasets, comprising pairs of natural utterances and logical forms, were collected solely for the purpose of training and evaluating natural language understanding systems. As a result, they do not contain any of the richness and variety of naturally occurring utterances, where humans ask about data they need or are curious about. In this work, we release SEDE, a dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website. We show that these pairs contain a variety of real-world challenges that have rarely been reflected so far in any other semantic parsing dataset, propose an evaluation metric based on comparison of partial query clauses that is more suitable for real-world queries, and conduct experiments with strong baselines, showing a large gap between performance on SEDE and performance on other common datasets.
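The abstract does not spell out the proposed metric, so the following is only a rough sketch of what "comparison of partial query clauses" could look like, not the authors' implementation. It assumes flat queries without subqueries, a hand-picked keyword list, and comma-separated clause items; the names `split_clauses` and `clause_f1` are hypothetical.

```python
import re

# Assumed, simplified list of clause keywords; real SQL has many more.
CLAUSE_KEYWORDS = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY", "LIMIT"]

# One capturing group so re.split keeps the matched keywords in the result.
_SPLIT_RE = re.compile(
    r"\b(" + "|".join(k.replace(" ", r"\s+") for k in CLAUSE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def split_clauses(sql):
    """Map each clause keyword to a set of normalized, comma-separated items.

    Toy splitter: it ignores subqueries and string literals (assumption).
    """
    parts = _SPLIT_RE.split(sql.strip().rstrip(";"))
    # parts looks like [prefix, keyword1, body1, keyword2, body2, ...]
    clauses = {}
    for keyword, body in zip(parts[1::2], parts[2::2]):
        key = " ".join(keyword.upper().split())
        items = {" ".join(item.lower().split()) for item in body.split(",") if item.strip()}
        clauses.setdefault(key, set()).update(items)
    return clauses


def clause_f1(predicted, gold):
    """Average per-clause F1 over every clause present in either query."""
    pred_clauses, gold_clauses = split_clauses(predicted), split_clauses(gold)
    scores = []
    for key in set(pred_clauses) | set(gold_clauses):
        pred_items = pred_clauses.get(key, set())
        gold_items = gold_clauses.get(key, set())
        overlap = len(pred_items & gold_items)
        precision = overlap / len(pred_items) if pred_items else 0.0
        recall = overlap / len(gold_items) if gold_items else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


gold = "SELECT Id, Title FROM Posts WHERE Score > 10 ORDER BY Score DESC"
pred = "select title, id from posts where score > 10"
print(clause_f1(pred, gold))  # 0.75: three clauses match, ORDER BY is missing
```

Unlike exact string match, a score of this kind gives partial credit when a prediction gets most clauses right, which is the motivation the abstract gives for a clause-level comparison on long real-world queries.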

References used

https://aclanthology.org/


AraFacts: The First Large Arabic Dataset of Naturally Occurring Claims

Association for Computational Linguistics, 2021 (article)

We introduce AraFacts, the first large Arabic dataset of naturally occurring claims collected from 5 Arabic fact-checking websites, e.g., Fatabyyano and Misbar, and covering claims since 2016. Our dataset consists of 6,121 claims along with their factual labels and additional metadata, such as fact-checking article content, topical category, and links to posts or Web pages spreading the claim. Since the data is obtained from various fact-checking websites, we standardize the original claim labels to provide a unified label rating for all claims. Moreover, we provide revealing dataset statistics and motivate its use by suggesting possible research applications. The dataset is made publicly available for the research community.


DuoRAT: Towards Simpler Text-to-SQL Models

Association for Computational Linguistics, 2021 (article)

Recent neural text-to-SQL models can effectively translate natural language questions to corresponding SQL queries on unseen databases. Working mostly on the Spider dataset, researchers have proposed increasingly sophisticated solutions to the problem. Contrary to this trend, in this paper we focus on simplifications. We begin by building DuoRAT, a re-implementation of the state-of-the-art RAT-SQL model that, unlike RAT-SQL, uses only relation-aware or vanilla transformers as the building blocks. We perform several ablation experiments using DuoRAT as the baseline model. Our experiments confirm the usefulness of some techniques and point out the redundancy of others, including structural SQL features and features that link the question with the schema.


CodRED: A Cross-Document Relation Extraction Dataset for Acquiring Knowledge in the Wild

Association for Computational Linguistics, 2021 (article)

Existing relation extraction (RE) methods typically focus on extracting relational facts between entity pairs within single sentences or documents. However, a large quantity of relational facts in knowledge bases can only be inferred across documents in practice. In this work, we present the problem of cross-document RE, making an initial step towards knowledge acquisition in the wild. To facilitate the research, we construct the first human-annotated cross-document RE dataset, CodRED. Compared to existing RE datasets, CodRED presents two key challenges: given two entities, (1) it requires finding the relevant documents that can provide clues for identifying their relations; (2) it requires reasoning over multiple documents to extract the relational facts. We conduct comprehensive experiments to show that CodRED is challenging for existing RE methods, including strong BERT-based models.


AutoChart: A Dataset for Chart-to-Text Generation Task

Association for Computational Linguistics, 2021 (article)

The analytical description of charts is an exciting and important research area with many applications in academia and industry. Yet, this challenging task has received limited attention from the computational linguistics research community. This paper proposes AutoChart, a large dataset for the analytical description of charts, which aims to encourage more research into this important area. Specifically, we offer a novel framework that generates the charts and their analytical description automatically. We conducted extensive human and machine evaluation on the generated charts and descriptions and demonstrate that the generated texts are informative, coherent, and relevant to the corresponding charts.


Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset

Association for Computational Linguistics, 2021 (article)

Hateful memes pose a unique challenge for current machine learning systems because their message is derived from both text and visual modalities. To this effect, Facebook released the Hateful Memes Challenge, a dataset of memes with pre-extracted text captions, but it is unclear whether these synthetic examples generalize to memes 'in the wild'. In this paper, we collect hateful and non-hateful memes from Pinterest to evaluate out-of-sample performance on models pre-trained on the Facebook dataset. We find that memes 'in the wild' differ in two key aspects: 1) captions must be extracted via OCR, injecting noise and diminishing performance of multimodal models, and 2) memes are more diverse than 'traditional memes', including screenshots of conversations or text on a plain background. This paper thus serves as a reality check for the current benchmark of hateful meme detection and its applicability for detecting real-world hate.

