A Practical Chinese Dependency Parser Based on A Large-scale Dataset


Published by Shuai Zhang

Publication date: 2020

Research field: Informatics Engineering

Research language: English

Authors: Shuai Zhang - Lijie Wang - Ke Sun

Computation and Language, Artificial Intelligence


Dependency parsing is a longstanding natural language processing task whose outputs are crucial to various downstream tasks. Recently, neural network based (NN-based) dependency parsing has made significant progress and obtained state-of-the-art results. NN-based approaches require massive amounts of labeled training data, which is very expensive to produce because it requires annotation by human experts. As a result, few industry-oriented dependency parsing tools are publicly available. In this report, we present Baidu Dependency Parser (DDParser), a new Chinese dependency parser trained on a large-scale manually labeled dataset called Baidu Chinese Treebank (DuCTB). DuCTB consists of about one million annotated sentences from multiple sources, including search logs, Chinese newswire, various forum discourses, and conversation programs. DDParser extends the graph-based biaffine parser to accommodate the characteristics of the Chinese dataset. We conduct experiments on two test sets: the standard test set, which has the same distribution as the training set, and the random test set, which is sampled from other sources. The labeled attachment scores (LAS) on these two sets are 92.9% and 86.9%, respectively. DDParser achieves state-of-the-art results and is released at https://github.com/baidu/DDParser.
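To make the reported metric concrete, the following is a minimal sketch of how a labeled attachment score (LAS) and its unlabeled counterpart (UAS) are commonly computed. It is not taken from the DDParser code base: the function name `attachment_scores`, the `(head, label)` representation, and the example labels are assumptions made for illustration, and evaluation details such as punctuation handling are ignored.

```python
from typing import List, Tuple

# One token is represented as (head index, relation label); head 0 means root.
# This representation is an assumption for illustration only.
Arc = Tuple[int, str]


def attachment_scores(gold: List[List[Arc]], pred: List[List[Arc]]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions over all tokens in all sentences."""
    total = uas_hits = las_hits = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences must have the same length")
        for (g_head, g_rel), (p_head, p_rel) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                uas_hits += 1          # correct head: counts toward UAS
                if g_rel == p_rel:
                    las_hits += 1      # correct head and label: counts toward LAS
    if total == 0:
        raise ValueError("no tokens to score")
    return uas_hits / total, las_hits / total


# A three-token toy sentence with illustrative labels.
gold = [[(2, "SBV"), (0, "HED"), (2, "VOB")]]
pred = [[(2, "SBV"), (0, "HED"), (2, "ADV")]]  # last label is wrong
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.3f} LAS={las:.3f}")
```

In this toy case every head is correct but one label is not, so UAS is higher than LAS; a token only counts toward LAS when both its head and its relation label match the gold annotation.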


Yida Wang, Pei Ke, Yinhe Zheng (2020)

The advancements of neural dialogue generation models show promising results on modeling short-text conversations. However, training such models usually needs a large-scale high-quality dialogue corpus, which is hard to access. In this paper, we present a large-scale cleaned Chinese conversation dataset, LCCC, which contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of our dataset is ensured by a rigorous data cleaning pipeline, which is built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs. We also release pre-training dialogue models which are trained on LCCC-base and LCCC-large respectively. The cleaned dataset and the pre-training models will facilitate the research of short-text conversation modeling. All the models and datasets are available at https://github.com/thu-coai/CDial-GPT.

Computation and Language

BSTC: A Large-Scale Chinese-English Speech Translation Dataset

Ruiqing Zhang, Xiyang Wang, Chuanqiang Zhang (2021)

This paper presents BSTC (Baidu Speech Translation Corpus), a large-scale Chinese-English speech translation dataset. This dataset is constructed from a collection of licensed videos of talks or lectures, including about 68 hours of Mandarin data, their manual transcripts and translations into English, as well as automated transcripts produced by an automatic speech recognition (ASR) model. We have further asked three experienced interpreters to simultaneously interpret the testing talks in a mock conference setting. This corpus is expected to promote the research of automatic simultaneous translation as well as the development of practical systems. We have organized simultaneous translation tasks and used this corpus to evaluate automatic simultaneous translation systems.

Computation and Language

TabFact: A Large-scale Dataset for Table-based Fact Verification

Wenhu Chen, Hongmin Wang, Jianshu Chen (2019)

The problem of verifying whether a textual hypothesis holds based on the given evidence, also known as fact verification, plays an important role in the study of natural language understanding and semantic representation. However, existing studies are mainly restricted to dealing with unstructured evidence (e.g., natural language sentences and documents, news, etc.), while verification under structured evidence, such as tables, graphs, and databases, remains under-explored. This paper specifically aims to study fact verification given semi-structured data as evidence. To this end, we construct a large-scale dataset called TabFact with 16k Wikipedia tables as the evidence for 118k human-annotated natural language statements, which are labeled as either ENTAILED or REFUTED. TabFact is challenging since it involves both soft linguistic reasoning and hard symbolic reasoning. To address these reasoning challenges, we design two different models: Table-BERT and Latent Program Algorithm (LPA). Table-BERT leverages the state-of-the-art pre-trained language model to encode the linearized tables and statements into continuous vectors for verification. LPA parses statements into programs and executes them against the tables to obtain the returned binary value for verification. Both methods achieve similar accuracy but still lag far behind human performance. We also perform a comprehensive analysis to demonstrate great future opportunities. The data and code of the dataset are provided at https://github.com/wenhuchen/Table-Fact-Checking.
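The two ideas named in this abstract, linearizing a table into text and executing a program against a table, can be illustrated with a small sketch. Nothing below is taken from the TabFact code: the `linearize` template, the `count_eq` operation, and the toy table are hypothetical, and a real system would feed the linearized text to a language model or search over many candidate programs.

```python
from typing import Dict, List

# A table is modeled as a list of rows, each mapping column name to cell text.
Table = List[Dict[str, str]]


def linearize(table: Table) -> str:
    """Flatten a table row by row into one string (a hypothetical template)."""
    rows = []
    for i, row in enumerate(table, start=1):
        cells = " ; ".join(f"{col} is {val}" for col, val in row.items())
        rows.append(f"row {i} : {cells}")
    return " . ".join(rows)


def count_eq(table: Table, column: str, value: str, expected: int) -> bool:
    """A one-step 'program': does `value` appear in `column` exactly `expected` times?"""
    return sum(1 for row in table if row[column] == value) == expected


table = [
    {"player": "A", "team": "red"},
    {"player": "B", "team": "blue"},
    {"player": "C", "team": "red"},
]

# Text form that a sequence model could read alongside the statement.
text = linearize(table)

# Statement: "two players are on the red team", expressed as a program.
verdict = "ENTAILED" if count_eq(table, "team", "red", 2) else "REFUTED"
print(text)
print(verdict)
```

The program route returns a binary value that maps directly onto the two labels, while the linearized route leaves the decision to a learned classifier.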

Computation and Language, Artificial Intelligence

The JDDC Corpus: A Large-Scale Multi-Turn Chinese Dialogue Dataset for E-commerce Customer Service

Meng Chen, Ruixue Liu, Lei Shen (2019)

Human conversations are complicated, and building a human-like dialogue agent is an extremely challenging task. With the rapid development of deep learning techniques, data-driven models, which need a huge amount of real conversation data, have become more and more prevalent. In this paper, we construct a large-scale real-scenario Chinese E-commerce conversation corpus, JDDC, with more than 1 million multi-turn dialogues, 20 million utterances, and 150 million words. The dataset reflects several characteristics of human-human conversations, e.g., goal-driven behavior and long-term dependency among the context. It also covers various dialogue types, including task-oriented, chitchat, and question-answering. Extra intent information and three well-annotated challenge sets are also provided. We then evaluate several retrieval-based and generative models to provide basic benchmark performance on the JDDC corpus. We hope JDDC can serve as an effective testbed and benefit the development of fundamental research in dialogue tasks.

Computation and Language, Artificial Intelligence, Information Retrieval

A Graph-based Model for Joint Chinese Word Segmentation and Dependency Parsing

Hang Yan, Xipeng Qiu, Xuanjing Huang (2019)

Chinese word segmentation and dependency parsing are two fundamental tasks for Chinese natural language processing. Dependency parsing is defined at the word level, so word segmentation is a precondition of dependency parsing. As a result, dependency parsing suffers from error propagation and cannot directly make use of character-level pre-trained language models (such as BERT). In this paper, we propose a graph-based model to integrate Chinese word segmentation and dependency parsing. Unlike previous transition-based joint models, our proposed model is more concise, which results in less feature engineering effort. Our graph-based joint model achieves better performance than previous joint models and state-of-the-art results in both Chinese word segmentation and dependency parsing. Besides, when BERT is combined, our model can substantially reduce the performance gap of dependency parsing between joint models and gold-segmented word-based models. Our code is publicly available at https://github.com/fastnlp/JointCwsParser.
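The abstract does not spell out how segmentation and parsing are merged, so the following is only a sketch of one way to express both as a single character-level tree. It assumes that every non-final character of a word attaches to the next character with a hypothetical intra-word label `app`, and that the final character carries the word's original arc. The function names and the toy sentence are illustrative and are not taken from the released code.

```python
from typing import List, Tuple


def to_char_level(words: List[str], heads: List[int], rels: List[str],
                  intra_label: str = "app") -> List[Tuple[int, str]]:
    """Convert word-level arcs (1-based heads, 0 = root) to character-level arcs."""
    # 1-based position of the last character of each word.
    ends, pos = [], 0
    for word in words:
        pos += len(word)
        ends.append(pos)

    char_arcs = []
    for word, head, rel, end in zip(words, heads, rels, ends):
        start = end - len(word) + 1
        # Non-final characters point to the following character inside the word.
        for c in range(start, end):
            char_arcs.append((c + 1, intra_label))
        # The final character inherits the word's arc, re-targeted to the
        # final character of the head word.
        head_char = 0 if head == 0 else ends[head - 1]
        char_arcs.append((head_char, rel))
    return char_arcs


def recover_words(chars: str, char_arcs: List[Tuple[int, str]],
                  intra_label: str = "app") -> List[str]:
    """Read the segmentation back: a word ends at every non-intra-word arc."""
    words, current = [], ""
    for ch, (_, rel) in zip(chars, char_arcs):
        current += ch
        if rel != intra_label:
            words.append(current)
            current = ""
    return words


words = ["我", "喜欢", "北京"]          # toy sentence: "I like Beijing"
heads = [2, 0, 2]
rels = ["SBV", "HED", "VOB"]            # illustrative labels
arcs = to_char_level(words, heads, rels)
print(arcs)
print(recover_words("".join(words), arcs))
```

Under this encoding a single character-level parser yields both outputs at once, which is why a character-level pre-trained model can be plugged in without a separate segmentation step.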

Computation and Language, Artificial Intelligence