CoDesc: A Large Code-Description Parallel Dataset

75 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Rifat Shahriyar

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Masum Hasan - Tanveer Muttaqueen - Abdullah Al Ishtiaq

الحساب واللغة الذكاء الاصطناعي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Translation between natural language and source code can help software development by enabling developers to comprehend, ideate, search, and write computer programs in natural language. Despite growing interest from the industry and the research community, this task is often difficult due to the lack of large standard datasets suitable for training deep neural models, standard noise removal methods, and evaluation benchmarks. This leaves researchers to collect new small-scale datasets, resulting in inconsistencies across published works. In this study, we present CoDesc -- a large parallel dataset composed of 4.2 million Java methods and natural language descriptions. With extensive analysis, we identify and remove prevailing noise patterns from the dataset. We demonstrate the proficiency of CoDesc in two complementary tasks for code-description pairs: code summarization and code search. We show that the dataset helps improve code search by up to 22% and achieves the new state-of-the-art in code summarization. Furthermore, we show CoDescs effectiveness in pre-training--fine-tuning setup, opening possibilities in building pretrained language models for Java. To facilitate future research, we release the dataset, a data processing tool, and a benchmark at url{https://github.com/csebuetnlp/CoDesc}.

قيم البحث

107 - Hongjin Qian , Xiaohe Li , Hanxun Zhong 2020

Natural language dialogue systems raise great attention recently. As many dialogue models are data-driven, high-quality datasets are essential to these systems. In this paper, we introduce Pchatbot, a large-scale dialogue dataset that contains two su bsets collected from Weibo and Judicial forums respectively. To adapt the raw dataset to dialogue systems, we elaborately normalize the raw dataset via processes such as anonymization, deduplication, segmentation, and filtering. The scale of Pchatbot is significantly larger than existing Chinese datasets, which might benefit the data-driven models. Besides, current dialogue datasets for personalized chatbot usually contain several persona sentences or attributes. Different from existing datasets, Pchatbot provides anonymized user IDs and timestamps for both posts and responses. This enables the development of personalized dialogue models that directly learn implicit user personality from the users dialogue history. Our preliminary experimental study benchmarks several state-of-the-art dialogue models to provide a comparison for future work. The dataset can be publicly accessed at Github.

الحساب واللغة الذكاء الاصطناعي

A Practical Chinese Dependency Parser Based on A Large-scale Dataset

287 - Shuai Zhang , Lijie Wang , Ke Sun 2020

Dependency parsing is a longstanding natural language processing task, with its outputs crucial to various downstream tasks. Recently, neural network based (NN-based) dependency parsing has achieved significant progress and obtained the state-of-the- art results. As we all know, NN-based approaches require massive amounts of labeled training data, which is very expensive because it requires human annotation by experts. Thus few industrial-oriented dependency parser tools are publicly available. In this report, we present Baidu Dependency Parser (DDParser), a new Chinese dependency parser trained on a large-scale manually labeled dataset called Baidu Chinese Treebank (DuCTB). DuCTB consists of about one million annotated sentences from multiple sources including search logs, Chinese newswire, various forum discourses, and conversation programs. DDParser is extended on the graph-based biaffine parser to accommodate to the characteristics of Chinese dataset. We conduct experiments on two test sets: the standard test set with the same distribution as the training set and the random test set sampled from other sources, and the labeled attachment scores (LAS) of them are 92.9% and 86.9% respectively. DDParser achieves the state-of-the-art results, and is released at https://github.com/baidu/DDParser.

الحساب واللغة الذكاء الاصطناعي

TabFact: A Large-scale Dataset for Table-based Fact Verification

84 - Wenhu Chen , Hongmin Wang , Jianshu Chen 2019

The problem of verifying whether a textual hypothesis holds based on the given evidence, also known as fact verification, plays an important role in the study of natural language understanding and semantic representation. However, existing studies ar e mainly restricted to dealing with unstructured evidence (e.g., natural language sentences and documents, news, etc), while verification under structured evidence, such as tables, graphs, and databases, remains under-explored. This paper specifically aims to study the fact verification given semi-structured data as evidence. To this end, we construct a large-scale dataset called TabFact with 16k Wikipedia tables as the evidence for 118k human-annotated natural language statements, which are labeled as either ENTAILED or REFUTED. TabFact is challenging since it involves both soft linguistic reasoning and hard symbolic reasoning. To address these reasoning challenges, we design two different models: Table-BERT and Latent Program Algorithm (LPA). Table-BERT leverages the state-of-the-art pre-trained language model to encode the linearized tables and statements into continuous vectors for verification. LPA parses statements into programs and executes them against the tables to obtain the returned binary value for verification. Both methods achieve similar accuracy but still lag far behind human performance. We also perform a comprehensive analysis to demonstrate great future opportunities. The data and code of the dataset are provided in url{https://github.com/wenhuchen/Table-Fact-Checking}.

الحساب واللغة الذكاء الاصطناعي

ClarQ: A large-scale and diverse dataset for Clarification Question Generation

343 - Vaibhav Kumar , Alan W. black 2020

Question answering and conversational systems are often baffled and need help clarifying certain ambiguities. However, limitations of existing datasets hinder the development of large-scale models capable of generating and utilising clarification que stions. In order to overcome these limitations, we devise a novel bootstrapping framework (based on self-supervision) that assists in the creation of a diverse, large-scale dataset of clarification questions based on post-comment tuples extracted from stackexchange. The framework utilises a neural network based architecture for classifying clarification questions. It is a two-step method where the first aims to increase the precision of the classifier and second aims to increase its recall. We quantitatively demonstrate the utility of the newly created dataset by applying it to the downstream task of question-answering. The final dataset, ClarQ, consists of ~2M examples distributed across 173 domains of stackexchange. We release this dataset in order to foster research into the field of clarification question generation with the larger goal of enhancing dialog and question answering systems.

الحساب واللغة الذكاء الاصطناعي التعلم الآلي

MedDG: A Large-scale Medical Consultation Dataset for Building Medical Dialogue System

126 - Wenge Liu , Jianheng Tang , Jinghui Qin 2020

Developing conversational agents to interact with patients and provide primary clinical advice has attracted increasing attention due to its huge application potential, especially in the time of COVID-19 Pandemic. However, the training of end-to-end neural-based medical dialogue system is restricted by an insufficient quantity of medical dialogue corpus. In this work, we make the first attempt to build and release a large-scale high-quality Medical Dialogue dataset related to 12 types of common Gastrointestinal diseases named MedDG, with more than 17K conversations collected from the online health consultation community. Five different categories of entities, including diseases, symptoms, attributes, tests, and medicines, are annotated in each conversation of MedDG as additional labels. To push forward the future research on building expert-sensitive medical dialogue system, we proposes two kinds of medical dialogue tasks based on MedDG dataset. One is the next entity prediction and the other is the doctor response generation. To acquire a clear comprehension on these two medical dialogue tasks, we implement several state-of-the-art benchmarks, as well as design two dialogue models with a further consideration on the predicted entities. Experimental results show that the pre-train language models and other baselines struggle on both tasks with poor performance in our dataset, and the response quality can be enhanced with the help of auxiliary entity information. From human evaluation, the simple retrieval model outperforms several state-of-the-art generative models, indicating that there still remains a large room for improvement on generating medically meaningful responses.

الحساب واللغة الذكاء الاصطناعي