Manually creating datasets with human annotators is a laborious task that can lead to biased and inhomogeneous labels. We propose a flexible, semi-automatic framework for labeling data for relation extraction. Furthermore, we provide a dataset of preprocessed sentences from the requirements engineering domain, together with both automatically created and hand-crafted labels. In our case study, we compare the human and automatic labels and show that there is substantial overlap between the two annotations.
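To illustrate how such an overlap can be quantified (the abstract does not name a metric; raw agreement and Cohen's kappa are common choices, and the relation labels below are hypothetical), a minimal sketch:

```python
# Minimal sketch: quantifying overlap between human and automatic labels.
# The label values and the choice of Cohen's kappa are illustrative
# assumptions; the paper does not specify its agreement measure.
from sklearn.metrics import cohen_kappa_score

human     = ["uses", "none", "refines", "uses", "none"]   # hypothetical
automatic = ["uses", "none", "refines", "none", "none"]   # hypothetical

overlap = sum(h == a for h, a in zip(human, automatic)) / len(human)
kappa = cohen_kappa_score(human, automatic)  # chance-corrected agreement
print(f"raw overlap: {overlap:.2f}, Cohen's kappa: {kappa:.2f}")
```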
Service composition aims at achieving a business goal by composing existing service-based applications or components. The response time of a service is crucial, especially in time-critical business environments, and is often stated as a clause in service level agreements between service providers and service users. To meet the guaranteed response time requirement of a composite service, it is important to select a feasible set of component services such that their response times collectively satisfy the response time requirement of the composite service. In this work, we use BPEL, a modeling language for specifying Web services, extend it with timing parameters, and equip it with a formal semantics. We then propose a fully automated approach to synthesize the response time requirement of component services modeled in BPEL, in the form of a constraint on the local response times. The synthesized requirement guarantees the satisfaction of the global response time requirement, statically or dynamically. We implemented our approach in a tool, Selamat, and performed several experiments to evaluate the validity of our approach.
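As a minimal sketch of what a synthesized local constraint looks like in the simplest case, a purely sequential flow, consider the following (the component names, the global bound, and the Python formulation are illustrative assumptions; the Selamat tool itself handles the full BPEL semantics):

```python
# Minimal sketch of response-time requirement synthesis for a *sequential*
# composition: the synthesized constraint is that the local response times
# sum to at most the global bound. Illustrative only.
def synthesize_sequential_constraint(components, global_bound):
    """Return a predicate over local response times that guarantees the
    global response-time requirement for a sequential flow."""
    def satisfied(local_times):
        assert set(local_times) == set(components)
        return sum(local_times[c] for c in components) <= global_bound
    return satisfied

# Hypothetical composite service with a 2.0 s global requirement.
check = synthesize_sequential_constraint(
    ["checkout", "payment", "shipping"], global_bound=2.0)
print(check({"checkout": 0.5, "payment": 0.8, "shipping": 0.6}))  # True
print(check({"checkout": 0.5, "payment": 1.2, "shipping": 0.6}))  # False
```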
Logs have been widely adopted in software system development and maintenance because of the rich system runtime information they contain. In recent years, the increasing size and complexity of software have led to rapid growth in log volume. To handle these large volumes of logs efficiently and effectively, a line of research focuses on intelligent log analytics powered by AI (artificial intelligence) techniques. However, only a small fraction of these techniques have reached successful deployment in industry, owing to the lack of public log datasets and of benchmarks built on them. To fill this significant gap between academia and industry and to facilitate more research on AI-powered log analytics, we have collected and organized loghub, a large collection of log datasets. In particular, loghub provides 17 real-world log datasets collected from a wide range of systems, including distributed systems, supercomputers, operating systems, mobile systems, server applications, and standalone software. In this paper, we summarize the statistics of these datasets, introduce some practical log usage scenarios, and present a case study on anomaly detection to demonstrate how loghub facilitates research and practice in this field. At the time of writing, loghub datasets have been downloaded over 15,000 times by more than 380 organizations from both industry and academia.
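A common first step in such anomaly detection studies is grouping raw log lines into sessions. The sketch below assumes the HDFS dataset from loghub, whose lines carry block identifiers of the form blk_<number> that group events into per-block sessions; it is illustrative preprocessing code, not anything loghub itself ships:

```python
# Minimal sketch: grouping raw HDFS log lines into per-block sessions,
# a typical preprocessing step for log-based anomaly detection.
# The file name and line format are assumptions based on the public
# HDFS dataset.
import re
from collections import defaultdict

block_id = re.compile(r"blk_-?\d+")
sessions = defaultdict(list)

with open("HDFS.log", encoding="utf-8", errors="ignore") as f:
    for line in f:
        for blk in block_id.findall(line):
            sessions[blk].append(line.rstrip())  # one event sequence per block

print(f"{len(sessions)} block-level sessions extracted")
```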
Relation extraction (RE) plays an important role in extracting knowledge from unstructured text but requires a large labeled corpus. To reduce the expensive annotation effort, semi-supervised learning aims to leverage both labeled and unlabeled data. In this paper, we review and compare three typical methods in semi-supervised RE with deep learning or meta-learning: self-ensembling, which enforces consistent predictions under perturbations but may suffer from insufficient supervision; self-training, which iteratively generates pseudo labels and retrains itself on the enlarged labeled set; and dual learning, which leverages a primal task and a dual task to give mutual feedback. Mean-teacher (Tarvainen and Valpola, 2017), LST (Li et al., 2019), and DualRE (Lin et al., 2019) are discussed as representatives that alleviate the respective weaknesses of these three methods.
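As a concrete instance of the self-ensembling family, here is a minimal mean-teacher sketch in PyTorch: the teacher is an exponential moving average (EMA) of the student, and a consistency loss ties their predictions on perturbed unlabeled inputs. The linear model, the noise perturbation, and the hyperparameters are placeholders, not the setup of any of the cited papers:

```python
# Mean-teacher sketch (self-ensembling): the teacher tracks an EMA of the
# student's weights; a consistency loss penalizes disagreement between
# student predictions on noisy inputs and teacher predictions on clean ones.
import copy
import torch
import torch.nn.functional as F

student = torch.nn.Linear(16, 4)   # stand-in relation classifier
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)        # teacher is never updated by gradients

def consistency_loss(x_unlabeled):
    noisy = x_unlabeled + 0.1 * torch.randn_like(x_unlabeled)
    return F.mse_loss(student(noisy).softmax(-1),
                      teacher(x_unlabeled).softmax(-1))

def ema_update(alpha=0.99):
    # teacher <- alpha * teacher + (1 - alpha) * student, after each step
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(alpha).add_(s.detach(), alpha=1 - alpha)
```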
In this paper we revisit the idea of pseudo-labeling in the context of semi-supervised learning, where a learning algorithm has access to a small set of labeled samples and a large set of unlabeled samples. Pseudo-labeling works by assigning pseudo-labels to samples in the unlabeled set using a model trained on the combination of the labeled samples and any previously pseudo-labeled samples, and by iteratively repeating this process in a self-training cycle. Current methods seem to have abandoned this approach in favor of consistency regularization methods that train models under a combination of different styles of self-supervised losses on the unlabeled samples and standard supervised losses on the labeled samples. We empirically demonstrate that pseudo-labeling can in fact be competitive with the state of the art, while being more resilient to out-of-distribution samples in the unlabeled set. We identify two key factors that allow pseudo-labeling to achieve such remarkable results: (1) applying curriculum learning principles and (2) avoiding concept drift by restarting model parameters before each self-training cycle. We obtain 94.91% accuracy on CIFAR-10 using only 4,000 labeled samples, and 68.87% top-1 accuracy on Imagenet-ILSVRC using only 10% of the labeled samples. The code is available at https://github.com/uvavision/Curriculum-Labeling
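These two factors translate into a short training loop; the sketch below assumes a generic scikit-learn-style classifier factory (`make_model`, the data arrays, and the percentile schedule are illustrative assumptions, not the paper's implementation):

```python
# Sketch of curriculum pseudo-labeling: each cycle (1) admits unlabeled
# samples by a confidence percentile that relaxes over cycles (curriculum)
# and (2) retrains from freshly initialized parameters (avoids concept drift).
import numpy as np

def curriculum_labeling(make_model, X_lab, y_lab, X_unlab, cycles=5):
    X_train, y_train = X_lab, y_lab
    for cycle in range(cycles):
        model = make_model()                    # restart parameters each cycle
        model.fit(X_train, y_train)
        probs = model.predict_proba(X_unlab)
        conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)
        # Curriculum: admit the top (cycle+1)/cycles fraction by confidence.
        threshold = np.percentile(conf, 100 * (1 - (cycle + 1) / cycles))
        keep = conf >= threshold
        X_train = np.vstack([X_lab, X_unlab[keep]])
        y_train = np.concatenate([y_lab, pseudo[keep]])
    return make_model().fit(X_train, y_train)
```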
To reduce the human effort of obtaining large-scale annotations, Semi-Supervised Relation Extraction methods aim to leverage unlabeled data in addition to learning from limited samples. Existing self-training methods suffer from the gradual drift problem, in which noisy pseudo labels on unlabeled data are incorporated during training. To alleviate the noise in pseudo labels, we propose a method called MetaSRE, in which a Relation Label Generation Network assesses the quality of pseudo labels by (meta-)learning from the successful and failed attempts of a Relation Classification Network as an additional meta-objective. To reduce the influence of noisy pseudo labels, MetaSRE adopts a pseudo label selection and exploitation scheme that assesses pseudo label quality on unlabeled samples and exploits only high-quality pseudo labels in a self-training fashion to incrementally augment labeled samples for both robustness and accuracy. Experimental results on two public datasets demonstrate the effectiveness of the proposed approach.
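The selection-and-exploitation scheme can be sketched as follows, with the meta-learned quality assessment abstracted into a scoring function (`classifier` and `quality_score` are hypothetical stand-ins, not the MetaSRE implementation):

```python
# Sketch of pseudo-label selection and exploitation: score each pseudo
# label, keep only the top-quality fraction, and add it to the labeled
# set for the next self-training round.
def select_and_exploit(classifier, quality_score, labeled, unlabeled,
                       keep_ratio=0.5):
    pseudo = [(x, classifier.predict(x)) for x in unlabeled]
    # Assess pseudo-label quality (MetaSRE meta-learns this assessment).
    scored = sorted(pseudo, key=lambda p: quality_score(*p), reverse=True)
    high_quality = scored[:int(keep_ratio * len(scored))]
    return labeled + high_quality   # incrementally augmented labeled set
```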