Many specialized domains remain untouched by deep learning, as large labeled datasets require expensive expert annotators. We address this bottleneck within the legal domain by introducing the Contract Understanding Atticus Dataset (CUAD), a new dataset for legal contract review. CUAD was created with dozens of legal experts from The Atticus Project and consists of over 13,000 annotations. The task is to highlight salient portions of a contract that are important for a human to review. We find that Transformer models have nascent performance, but that this performance is strongly influenced by model design and training dataset size. Despite these promising results, there is still substantial room for improvement. As one of the only large, specialized NLP benchmarks annotated by experts, CUAD can serve as a challenging research benchmark for the broader NLP community.
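To make the task format concrete, here is a minimal sketch that frames clause extraction as SQuAD-style extractive question answering, the setup used for the CUAD baselines. The checkpoint name, contract snippet, and question are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: CUAD-style clause extraction framed as extractive QA.
# The checkpoint is an assumption; any SQuAD-style QA model would do.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

contract = (
    "This Agreement shall commence on January 1, 2021 and shall continue "
    "for a period of two (2) years unless terminated earlier."
)

# CUAD turns each label category (e.g., "Expiration Date") into a question
# and asks the model to highlight the supporting span in the contract.
result = qa(question="When does the contract expire?", context=contract)
print(result["answer"], result["score"])
```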
Knowledge graphs have proven effective for modeling structured information and conceptual knowledge, especially in the medical domain. However, the lack of high-quality annotated corpora remains a crucial obstacle to advancing research and applications in this area. To accelerate research on domain-specific knowledge graphs in the medical domain, we introduce DiaKG, a high-quality Chinese dataset for diabetes knowledge graph construction, which contains 22,050 entities and 6,890 relations in total. We implement recent representative methods for Named Entity Recognition and Relation Extraction as benchmarks to evaluate the proposed dataset thoroughly. Empirical results show that DiaKG is challenging for most existing methods, and further analysis is conducted to discuss future research directions for improvement. We hope the release of this dataset can assist the construction of diabetes knowledge graphs and facilitate AI-based applications.
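As an illustration of how NER benchmarks like DiaKG are typically scored, the sketch below evaluates BIO-tagged predictions at the entity level with seqeval; the entity types and tag sequences are toy examples, not taken from the dataset.

```python
# Sketch: entity-level NER evaluation in the BIO scheme, as commonly used
# for benchmarks like DiaKG. Entity types and sequences are toy examples.
from seqeval.metrics import classification_report, f1_score

y_true = [["B-Drug", "I-Drug", "O", "B-Disease", "I-Disease", "O"]]
y_pred = [["B-Drug", "I-Drug", "O", "B-Disease", "O", "O"]]

# seqeval scores at the entity level: a predicted entity counts only if
# both its span and type match exactly, so partial matches lower F1.
print(f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```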
We present a Chinese judicial reading comprehension (CJRC) dataset which contains approximately 10K documents and almost 50K questions with answers. The documents come from judgment documents, and the questions are annotated by legal experts. The CJRC dataset can help researchers extract case elements via reading comprehension technology. Element extraction is an important task in the legal field; however, it is difficult to predefine the element types completely due to the diversity of document types and causes of action. By contrast, machine reading comprehension technology can quickly extract elements by answering various questions about a long document. We build two strong baseline models based on BERT and BiDAF. The experimental results show that there is still substantial room for improvement compared to human annotators.
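The BERT baseline described above reduces element extraction to span prediction. The following sketch shows the core extraction step, assuming the generic bert-base-chinese checkpoint; in practice the question-answering head would first be fine-tuned on the CJRC training split, and the question and passage here are made up.

```python
# Sketch of the BERT span-extraction baseline. "bert-base-chinese" is the
# generic pretrained checkpoint; its QA head is randomly initialized and
# would first be fine-tuned on the CJRC training data.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-chinese")

question = "被告人的姓名是什么？"  # "What is the defendant's name?"
passage = "经审理查明，被告人张某于2018年3月在本市实施了盗窃行为。"
inputs = tokenizer(question, passage, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Every token is scored as a potential answer start/end; the predicted
# span is the argmax pair (greedy decoding, kept simple for the sketch).
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))
```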
With the rapid development of NLP research, leaderboards have emerged as one tool to track the performance of various systems on various NLP tasks. They are effective in this goal to some extent, but generally present a rather simplistic one-dimensional view of the submitted systems, communicated only through holistic accuracy numbers. In this paper, we present a new conceptualization and implementation of NLP evaluation: ExplainaBoard, which, in addition to inheriting the functionality of the standard leaderboard, allows researchers to (i) diagnose strengths and weaknesses of a single system (e.g., what is the best-performing system bad at?), (ii) interpret relationships between multiple systems (e.g., where does system A outperform system B? What if we combine systems A, B, and C?), and (iii) examine prediction results closely (e.g., what are common errors made by multiple systems, or in what contexts do particular errors occur?). So far, ExplainaBoard covers more than 400 systems, 50 datasets, 40 languages, and 12 tasks. ExplainaBoard is continually updated and was recently upgraded to support (1) multilingual multi-task benchmarks, (2) meta-evaluation, and (3) a more complex task: machine translation. We have not only released an online platform at http://explainaboard.nlpedia.ai/ but also made our evaluation tool available as an API under the MIT License on GitHub (https://github.com/neulab/explainaBoard) and PyPI (https://pypi.org/project/interpret-eval/), which allows users to conveniently assess their models offline. We additionally release all output files from systems that we have run or collected to motivate output-driven research in the future.
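The kind of fine-grained diagnosis ExplainaBoard enables can be illustrated with a small, self-contained sketch: bucketing test examples by an attribute (here, input length) and reporting per-bucket accuracy instead of one holistic number. This does not use the tool's actual API; the data and bucket boundaries are invented for illustration.

```python
# Sketch of fine-grained, bucketed evaluation in the spirit of
# ExplainaBoard: per-bucket accuracy by input length instead of a single
# holistic score. Data and bucket edges are illustrative assumptions.
from collections import defaultdict

examples = [  # (input_text, gold_label, predicted_label)
    ("short text", "pos", "pos"),
    ("a somewhat longer input sentence", "neg", "pos"),
    ("an even longer input sentence for the last bucket", "neg", "neg"),
]

buckets = defaultdict(list)
for text, gold, pred in examples:
    n = len(text.split())
    name = "short (<4)" if n < 4 else "medium (4-7)" if n < 8 else "long (8+)"
    buckets[name].append(gold == pred)

for name, hits in sorted(buckets.items()):
    print(f"{name}: accuracy {sum(hits) / len(hits):.2f} on {len(hits)} examples")
```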
The understanding of an offense is subjective, and people may have different opinions about the offensiveness of a comment. Moreover, offenses and hate speech may occur through sarcasm, which hides the real intention of the comment and makes the annotators' decisions more difficult. Therefore, providing a well-structured annotation process is crucial to a better understanding of hate speech and offensive language phenomena, as well as to better performance for machine learning classifiers. In this paper, we describe a corpus annotation process proposed by a linguist, a hate speech specialist, and machine learning engineers to support the identification of hate speech and offensive language on social media. In addition, we provide the first robust dataset of this kind for the Brazilian Portuguese language. The corpus was collected from Instagram posts of political personalities and manually annotated, comprising 7,000 documents annotated according to three different layers: a binary classification (offensive versus non-offensive language), the level of offense (highly, moderately, and slightly offensive messages), and the identification of the target of the discriminatory content (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). Each comment was annotated by three different annotators, and high inter-annotator agreement was achieved. The proposed annotation approach is also language- and domain-independent; nevertheless, it is currently customized for Brazilian Portuguese.
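Since each comment was labeled by three annotators, agreement is naturally measured with a chance-corrected statistic such as Fleiss' kappa. The sketch below computes it with statsmodels on toy labels for the binary layer; the ratings are illustrative, not drawn from the corpus.

```python
# Sketch: inter-annotator agreement via Fleiss' kappa for three annotators
# on the binary offensive / non-offensive layer. Toy labels only.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = comments, columns = annotators; 1 = offensive, 0 = non-offensive.
ratings = np.array([
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
])

# aggregate_raters converts per-rater labels into per-category counts,
# the subjects-by-categories table that Fleiss' kappa expects.
table, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa: {fleiss_kappa(table):.3f}")
```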
In this paper, we introduce the Chinese AI and Law challenge dataset (CAIL2018), the first large-scale Chinese legal dataset for judgment prediction. CAIL2018 contains more than 2.6 million criminal cases published by the Supreme People's Court of China, several times more than the datasets used in existing works on judgment prediction. Moreover, the annotations of judgment results are more detailed and richer, consisting of applicable law articles, charges, and prison terms, which are expected to be inferred from the fact descriptions of cases. For comparison, we implement several conventional text classification baselines for judgment prediction, and experimental results show that it is still challenging for current models to predict the judgment results of legal cases, especially prison terms. To help researchers make improvements on legal judgment prediction, both the dataset and the baselines will be released after the CAIL competition (http://cail.cipsc.org.cn/).
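As an example of the kind of conventional text-classification baseline mentioned above, the sketch below trains a TF-IDF plus logistic-regression charge classifier; the two-case "dataset" and the character n-gram choice are illustrative assumptions, not the paper's exact baselines.

```python
# Sketch of a conventional text-classification baseline for charge
# prediction, in the spirit of the CAIL2018 baselines. The toy "dataset"
# is illustrative; real fact descriptions are long Chinese documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

facts = [
    "被告人持刀抢劫被害人财物",  # fact description of a robbery case
    "被告人盗窃他人手机一部",    # fact description of a theft case
]
charges = ["robbery", "theft"]

# Character n-grams sidestep Chinese word segmentation for this sketch.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(facts, charges)
print(clf.predict(["被告人抢劫财物"]))
```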