أوراق بحثية, رسائل ماجستير ودكتوراه منشورة من قبل Lemao Liu

Rethinking Negative Sampling for Unlabeled Entity Problem in Named Entity Recognition

345 - Yangming Li , Lemao Liu , Shuming Shi 2021

In many situations (e.g., distant supervision), unlabeled entity problem seriously degrades the performances of named entity recognition (NER) models. Recently, this issue has been well addressed by a notable approach based on negative sampling. In t his work, we perform two studies along this direction. Firstly, we analyze why negative sampling succeeds both theoretically and empirically. Based on the observation that named entities are highly sparse in datasets, we show a theoretical guarantee that, for a long sentence, the probability of containing no unlabeled entities in sampled negatives is high. Missampling tests on synthetic datasets have verified our guarantee in practice. Secondly, to mine hard negatives and further reduce missampling rates, we propose a weighted and adaptive sampling distribution for negative sampling. Experiments on synthetic datasets and well-annotated datasets show that our method significantly improves negative sampling in robustness and effectiveness. We also have achieved new state-of-the-art results on real-world datasets.

الحساب واللغة

GWLAN: General Word-Level AutocompletioN for Computer-Aided Translation

105 - Huayang Li , Lemao Liu , Guoping Huang 2021

Computer-aided translation (CAT), the use of software to assist a human translator in the translation process, has been proven to be useful in enhancing the productivity of human translators. Autocompletion, which suggests translation results accordi ng to the text pieces provided by human translators, is a core function of CAT. There are two limitations in previous research in this line. First, most research works on this topic focus on sentence-level autocompletion (i.e., generating the whole translation as a sentence based on human input), but word-level autocompletion is under-explored so far. Second, almost no public benchmarks are available for the autocompletion task of CAT. This might be among the reasons why research progress in CAT is much slower compared to automatic MT. In this paper, we propose the task of general word-level autocompletion (GWLAN) from a real-world CAT scenario, and construct the first public benchmark to facilitate research in this topic. In addition, we propose an effective method for GWLAN and compare it with several strong baselines. Experiments demonstrate that our proposed method can give significantly more accurate predictions than the baseline methods on our benchmark datasets.

الحساب واللغة

TranSmart: A Practical Interactive Machine Translation System

158 - Guoping Huang , Lemao Liu , Xing Wang 2021

Automatic machine translation is super efficient to produce translations yet their quality is not guaranteed. This technique report introduces TranSmart, a practical human-machine interactive translation system that is able to trade off translation q uality and efficiency. Compared to existing publicly available interactive translation systems, TranSmart supports three key features, word-level autocompletion, sentence-level autocompletion and translation memory. By word-level and sentence-level autocompletion, TranSmart allows users to interactively translate words in their own manners rather than the strict manner from left to right. In addition, TranSmart has the potential to avoid similar translation mistakes by using translated sentences in history as its memory. This report presents major functions of TranSmart, algorithms for achieving these functions, how to use the TranSmart APIs, and evaluation results of some key functions. TranSmart is publicly available at its homepage (https://transmart.qq.com).

الحساب واللغة

Neural Sequence Segmentation as Determining the Leftmost Segments

309 - Yangming Li , Lemao Liu , Kaisheng Yao 2021

Prior methods to text segmentation are mostly at token level. Despite the adequacy, this nature limits their full potential to capture the long-term dependencies among segments. In this work, we propose a novel framework that incrementally segments n atural language sentences at segment level. For every step in segmentation, it recognizes the leftmost segment of the remaining sequence. Implementations involve LSTM-minus technique to construct the phrase representations and recurrent neural networks (RNN) to model the iterations of determining the leftmost segments. We have conducted extensive experiments on syntactic chunking and Chinese part-of-speech (POS) tagging across 3 datasets, demonstrating that our methods have significantly outperformed previous all baselines and achieved new state-of-the-art results. Moreover, qualitative analysis and the study on segmenting long-length sentences verify its effectiveness in modeling long-term dependencies.

الحساب واللغة

TexSmart: A Text Understanding System for Fine-Grained NER and Enhanced Semantic Analysis

89 - Haisong Zhang , Lemao Liu , Haiyun Jiang 2020

This technique report introduces TexSmart, a text understanding system that supports fine-grained named entity recognition (NER) and enhanced semantic analysis functionalities. Compared to most previous publicly available text understanding systems a nd tools, TexSmart holds some unique features. First, the NER function of TexSmart supports over 1,000 entity types, while most other public tools typically support several to (at most) dozens of entity types. Second, TexSmart introduces new semantic analysis functions like semantic expansion and deep semantic representation, that are absent in most previous systems. Third, a spectrum of algorithms (from very fast algorithms to those that are relatively slow but more accurate) are implemented for one function in TexSmart, to fulfill the requirements of different academic and industrial applications. The adoption of unsupervised or weakly-supervised algorithms is especially emphasized, with the goal of easily updating our models to include fresh data with less human annotation efforts. The main contents of this report include major functions of TexSmart, algorithms for achieving these functions, how to use the TexSmart toolkit and Web APIs, and evaluation results of some key algorithms.

الحساب واللغة

Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition

136 - Yangming Li , Lemao Liu , Shuming Shi 2020

In many scenarios, named entity recognition (NER) models severely suffer from unlabeled entity problem, where the entities of a sentence may not be fully annotated. Through empirical studies performed on synthetic datasets, we find two causes of perf ormance degradation. One is the reduction of annotated entities and the other is treating unlabeled entities as negative instances. The first cause has less impact than the second one and can be mitigated by adopting pretraining language models. The second cause seriously misguides a model in training and greatly affects its performances. Based on the above observations, we propose a general approach, which can almost eliminate the misguidance brought by unlabeled entities. The key idea is to use negative sampling that, to a large extent, avoids training NER models with unlabeled entities. Experiments on synthetic datasets and real-world datasets show that our model is robust to unlabeled entity problem and surpasses prior baselines. On well-annotated datasets, our model is competitive with the state-of-the-art method.

الحساب واللغة

Segmenting Natural Language Sentences via Lexical Unit Analysis

160 - Yangming Li , Lemao Liu , Shuming Shi 2020

In this work, we present Lexical Unit Analysis (LUA), a framework for general sequence segmentation tasks. Given a natural language sentence, LUA scores all the valid segmentation candidates and utilizes dynamic programming (DP) to extract the maximu m scoring one. LUA enjoys a number of appealing properties such as inherently guaranteeing the predicted segmentation to be valid and facilitating globally optimal training and inference. Besides, the practical time complexity of LUA can be reduced to linear time, which is very efficient. We have conducted extensive experiments on 5 tasks, including syntactic chunking, named entity recognition (NER), slot filling, Chinese word segmentation, and Chinese part-of-speech (POS) tagging, across 15 datasets. Our models have achieved the state-of-the-art performances on 13 of them. The results also show that the F1 score of identifying long-length segments is notably improved.

الحساب واللغة

On the Branching Bias of Syntax Extracted from Pre-trained Language Models

109 - Huayang Li , Lemao Liu , Guoping Huang 2020

Many efforts have been devoted to extracting constituency trees from pre-trained language models, often proceeding in two stages: feature definition and parsing. However, this kind of methods may suffer from the branching bias issue, which will infla te the performances on languages with the same branch it biases to. In this work, we propose quantitatively measuring the branching bias by comparing the performance gap on a language and its reversed language, which is agnostic to both language models and extracting methods. Furthermore, we analyze the impacts of three factors on the branching bias, namely parsing algorithms, feature definitions, and language models. Experiments show that several existing works exhibit branching biases, and some implementations of these three factors can introduce the branching bias.

الحساب واللغة

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد