Research papers and master's and doctoral theses published by Tong Li

MFE-NER: Multi-feature Fusion Embedding for Chinese Named Entity Recognition

79 - Jiatong Li, Kui Meng 2021

Pre-trained language models have led Named Entity Recognition (NER) into a new era, but additional knowledge is needed to improve their performance on specific problems. In Chinese NER, character substitution is a complicated linguistic phenomenon. Some Chinese characters are quite similar because they share components or have similar pronunciations. People replace characters in a named entity with similar characters to generate a new collocation that still refers to the same object. This has become even more common in the Internet age and is often used to avoid Internet censorship or just for fun. Such character substitution is unfriendly to pre-trained language models because the new collocations occur only occasionally. As a result, it leads to unrecognized entities or recognition errors in the NER task. In this paper, we propose a new method, Multi-Feature Fusion Embedding for Chinese Named Entity Recognition (MFE-NER), to strengthen the language pattern of Chinese and handle the character substitution problem in Chinese Named Entity Recognition. MFE fuses semantic, glyph, and phonetic features together. In the glyph domain, we disassemble Chinese characters into components to denote structural features, so that characters with similar structures have close representations in the embedding space. Meanwhile, an improved phonetic system is also proposed in our work, making it reasonable to calculate phonetic similarity among Chinese characters. Experiments demonstrate that our method improves the overall performance of Chinese NER and performs especially well in informal language environments.
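The glyph idea in the abstract above (characters that share components should end up close together) can be illustrated with a minimal sketch. This is not the MFE-NER embedding itself: the `COMPONENTS` table, the bag-of-components vectors, and the cosine measure are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical, hand-written decompositions used only for this illustration.
COMPONENTS = {
    "河": ["氵", "可"],
    "何": ["亻", "可"],
    "山": ["山"],
}

def glyph_vector(char):
    """Represent a character as a bag of its components."""
    return Counter(COMPONENTS[char])

def cosine(a, b):
    """Cosine similarity between two bag-of-components vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Characters sharing a component score higher than unrelated ones.
shared = cosine(glyph_vector("河"), glyph_vector("何"))
unrelated = cosine(glyph_vector("河"), glyph_vector("山"))
```

In a fused embedding, a vector like this would sit alongside semantic and phonetic features; how the paper combines them is not stated in the abstract.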

Computation and Language

CINS: Comprehensive Instruction for Few-shot Learning in Task-oriented Dialog Systems

126 - Fei Mi, Yitong Li, Yasheng Wang 2021

As the labeling cost for different modules in task-oriented dialog (ToD) systems is high, a major challenge in practice is to learn different tasks with the least amount of labeled data. Recently, prompting methods over pre-trained language models (PLMs) have shown promising results for few-shot learning in ToD. To better utilize the power of PLMs, this paper proposes Comprehensive Instruction (CINS), which exploits PLMs with extra task-specific instructions. We design a schema (definition, constraint, prompt) of instructions and their customized realizations for three important downstream tasks in ToD, i.e., intent classification, dialog state tracking, and natural language generation. A sequence-to-sequence model (T5) is adopted to solve these three tasks in a unified framework. Extensive experiments are conducted on these ToD tasks in realistic few-shot learning scenarios with small validation data. Empirical results demonstrate that the proposed CINS approach consistently improves over techniques that fine-tune PLMs with raw input or short prompts.
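The (definition, constraint, prompt) schema described above can be sketched as a small helper that assembles a text-to-text input. The function name, field labels, ordering, and example wording are assumptions for illustration; the abstract does not give the exact template used with T5.

```python
def build_instruction(utterance, definition, constraint, prompt):
    """Join the three schema parts and the user input into one source string.

    The field labels and their order are hypothetical, not the paper's template.
    """
    parts = [
        f"definition: {definition}",
        f"constraint: {constraint}",
        f"input: {utterance}",
        f"prompt: {prompt}",
    ]
    return " ".join(parts)

# Example for intent classification with made-up wording.
source = build_instruction(
    utterance="Book me a table for two tonight.",
    definition="Intent is the goal the user wants to achieve.",
    constraint="Choose one of: book_restaurant, find_hotel, play_music.",
    prompt="The intent of the input is",
)
```

A string like `source` would then be fed to a sequence-to-sequence model, with the target being the label text.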

Computation and Language, Machine Learning

Uncertainty-Aware Balancing for Multilingual and Multi-Domain Neural Machine Translation Training

123 - Minghao Wu, Yitong Li, Meng Zhang 2021

Learning a multilingual and multi-domain translation model is challenging because heterogeneous and imbalanced data make the model converge inconsistently over different corpora in the real world. One common practice is to adjust the share of each corpus in training, so that the learning process is balanced and low-resource cases can benefit from high-resource ones. However, automatic balancing methods usually depend on intra- and inter-dataset characteristics, which are usually unknown or require human priors. In this work, we propose an approach, MultiUAT, that dynamically adjusts the training data usage based on the model's uncertainty on a small set of trusted clean data for multi-corpus machine translation. We experiment with two classes of uncertainty measures on multilingual (16 languages with 4 settings) and multi-domain settings (4 for in-domain and 2 for out-of-domain on English-German translation) and demonstrate that our approach, MultiUAT, substantially outperforms its baselines, including both static and dynamic strategies. We analyze the cross-domain transfer and show the deficiency of static and similarity-based methods.
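One way to picture "dynamically adjusting training data usage based on uncertainty" is to turn per-corpus uncertainty scores into sampling probabilities. The sketch below is an assumption, not MultiUAT's actual update rule: it uses a temperature-scaled softmax, and the corpus names and scores are invented.

```python
import math

def sampling_probabilities(uncertainties, temperature=1.0):
    """Map per-corpus uncertainty scores to sampling probabilities.

    Assumption for illustration: corpora on which the model is more
    uncertain are sampled more often, via a softmax over the scores.
    """
    names = list(uncertainties)
    scaled = [uncertainties[name] / temperature for name in names]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return {name: w / total for name, w in zip(names, weights)}

# Invented scores, e.g. measured on a small trusted development set.
probs = sampling_probabilities({"high_resource": 0.8, "low_resource": 2.1})
```

In a training loop, the scores would be re-measured periodically and the probabilities recomputed before drawing the next batches.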

Computation and Language

Self-Supervised Detection of Contextual Synonyms in a Multi-Class Setting: Phenotype Annotation Use Case

85 - Jingqing Zhang, Luis Bolanos, Tong Li 2021

Contextualised word embeddings are a powerful tool for detecting contextual synonyms. However, most current state-of-the-art (SOTA) deep learning concept extraction methods remain supervised and underexploit the potential of the context. In this paper, we propose a self-supervised pre-training approach that is able to detect contextual synonyms of concepts by being trained on data created by shallow matching. We apply our methodology in the sparse multi-class setting (over 15,000 concepts) to extract phenotype information from electronic health records. We further investigate data augmentation techniques to address the problem of class sparsity. Our approach achieves a new SOTA for unsupervised phenotype concept annotation on clinical text on F1 and Recall, outperforming the previous SOTA with a gain of up to 4.5 and 4.0 absolute points, respectively. After fine-tuning with as little as 20% of the labelled data, we also outperform BioBERT and ClinicalBERT. The extrinsic evaluation on three ICU benchmarks also shows the benefit of using the phenotypes annotated by our model as features.

Computation and Language

Leveraging Documentation to Test Deep Learning Library Functions

336 - Danning Xie, Yitong Li, Mijung Kim 2021

It is integral to test API functions of widely used deep learning (DL) libraries. The effectiveness of such testing requires DL-specific input constraints of these API functions. Such constraints enable the generation of valid inputs, i.e., inputs that follow these DL-specific constraints, to explore deeply and test the core functionality of API functions. Existing fuzzers have no knowledge of such constraints, and existing constraint extraction techniques are ineffective for extracting DL-specific input constraints. To fill this gap, we design and implement a document-guided fuzzing technique, D2C, for API functions of DL libraries. D2C leverages sequential pattern mining to generate rules for extracting DL-specific constraints from API documents and uses these constraints to guide the fuzzing to generate valid inputs automatically. D2C also generates inputs that violate these constraints to test the input validity checking code. In addition, D2C uses the constraints to generate boundary inputs to detect more bugs. Our evaluation of three popular DL libraries (TensorFlow, PyTorch, and MXNet) shows that D2C's accuracy in extracting input constraints is 83.3% to 90.0%. D2C detects 121 bugs, while a baseline fuzzer without input constraints detects only 68 bugs. Most (89) of the 121 bugs were previously unknown, 54 of which have been fixed or confirmed by developers after we reported them. In addition, D2C detects 38 inconsistencies within documents, including 28 that were fixed or confirmed after we reported them.
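The three kinds of generated inputs mentioned above (valid, constraint-violating, and boundary) can be sketched for a single integer-range constraint. The `IntRange` type and the helper names are hypothetical and are not part of D2C; real DL constraints also cover dtypes, shapes, and structure, which this sketch ignores.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class IntRange:
    """Hypothetical constraint: an integer argument must lie in [low, high]."""
    low: int
    high: int

def valid_input(constraint, rng):
    """Draw a value that satisfies the constraint."""
    return rng.randint(constraint.low, constraint.high)

def boundary_inputs(constraint):
    """Return the edge values of the allowed range."""
    return [constraint.low, constraint.high]

def violating_input(constraint, rng):
    """Draw a value just outside the allowed range, below or above it."""
    offset = rng.randint(1, 10)
    if rng.random() < 0.5:
        return constraint.low - offset
    return constraint.high + offset

rng = random.Random(0)
kernel_size = IntRange(low=1, high=8)  # invented example constraint
candidates = [valid_input(kernel_size, rng), *boundary_inputs(kernel_size),
              violating_input(kernel_size, rng)]
```

Valid and boundary values exercise the core functionality, while violating values probe the input validity checks.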

Software Engineering

Real-complex transition driven by quasiperiodicity: a new universality class beyond $\mathcal{PT}$ symmetric one

121 - Tong Liu, Xu Xia 2021

We study a one-dimensional lattice model subject to non-Hermitian quasiperiodic potentials. First, we rigorously demonstrate that there exists an interesting dual mapping relation between $|a|<1$ and $|a|>1$ with regard to the potential tuning parameter $a$. The localization property for $|a|<1$ can be directly mapped to that for $|a|>1$; the analytical expression of the mobility edge for $|a|>1$ is therefore obtained from the spectral properties for $|a|<1$. More impressively, we prove rigorously that even when the phase $\theta \neq 0$ in the quasiperiodic potential, so that the model becomes non-$\mathcal{PT}$ symmetric, there still exists a new type of real-complex transition driven by non-Hermitian disorder, which belongs to a new universality class beyond the $\mathcal{PT}$ symmetric class.

Disordered Systems and Neural Networks, Quantum Gases

Principal Gradient Direction and Confidence Reservoir Sampling for Continual Learning

59 - Zhiyi Chen, Tong Lin 2021

Task-free online continual learning aims to alleviate catastrophic forgetting of the learner on a non-iid data stream. Experience Replay (ER) is a SOTA continual learning method, which is broadly used as the backbone algorithm for other replay-based methods. However, the training strategy of ER is too simple to take full advantage of replayed examples, and its reservoir sampling strategy is also suboptimal. In this work, we propose a general proximal gradient framework in which ER can be viewed as a special case. We further propose two improvements accordingly: Principal Gradient Direction (PGD) and Confidence Reservoir Sampling (CRS). In Principal Gradient Direction, we optimize a target gradient that not only represents the major contribution of past gradients, but also retains the new knowledge of the current gradient. We then present Confidence Reservoir Sampling for maintaining a more informative memory buffer based on a margin-based metric that measures the value of stored examples. Experiments substantiate the effectiveness of both our improvements, and our new algorithm consistently boosts the performance of MIR-replay, a SOTA ER-based method: our algorithm increases the average accuracy by up to 7.9% and reduces forgetting by up to 15.4% on four datasets.
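For context, the plain reservoir sampling that the abstract calls suboptimal can be sketched as follows. This is the standard uniform algorithm, not the proposed Confidence Reservoir Sampling; the function name and the integer stream are illustrative.

```python
import random

def reservoir_update(buffer, capacity, item, n_seen, rng):
    """Standard reservoir sampling step for a replay buffer.

    n_seen is the number of stream items observed so far, counting `item`.
    After every step, each observed item is in the buffer with equal
    probability capacity / n_seen (once n_seen exceeds capacity).
    """
    if len(buffer) < capacity:
        buffer.append(item)
        return
    slot = rng.randrange(n_seen)  # uniform over 0 .. n_seen - 1
    if slot < capacity:
        buffer[slot] = item  # overwrite a uniformly chosen stored example

rng = random.Random(0)
memory = []
for n_seen, example in enumerate(range(1000), start=1):
    reservoir_update(memory, 10, example, n_seen, rng)
```

Because replacement is uniform, this baseline ignores how informative a stored example is, which is the gap a value-based criterion such as the margin metric mentioned above is meant to address.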

Machine Learning

Fine-Grained Element Identification in Complaint Text of Internet Fraud

325 - Tong Liu, Siyuan Wang, Jingchao Fu 2021

Existing systems for handling online complaints provide a final decision without explanation. We propose to analyse the complaint text of internet fraud in a fine-grained manner. Considering that the complaint text includes multiple clauses with various functions, we propose to identify the role of each clause and classify the clauses into different types of fraud element. We construct a large labeled dataset originating from a real finance service platform. We build an element identification model on top of BERT and propose two additional modules that utilize the context of the complaint text for better element label classification, namely, a global context encoder and a label refiner. Experimental results show the effectiveness of our model.

Computation and Language

Connecting Primordial Black Hole to boosted sub-GeV Dark Matter through neutrino

116 - Wei Chao, Tong Li, Jiajun Liao 2021

The exploration of alternative dark matter (DM) candidates beyond WIMPs has motivated primordial black holes (PBHs) and sub-GeV DM particles in the Milky Way. Neutrinos from PBH evaporation at the present time act as a novel medium, boosting sub-GeV DM and leaving signatures in terrestrial experiments. We explore DM boosted by the neutrino flux from PBH evaporation (PBH$\nu$BDM) so as to connect macroscopic PBHs to sub-GeV DM particles. We consider this PBH$\nu$BDM scenario to interpret the XENON1T keV excess. Projected bounds on the sub-GeV DM-electron scattering cross section and on the fraction of DM composed of PBHs, $f_{\rm PBH}$, are derived for future experiments.

High Energy Physics - Phenomenology, High Energy Physics - Experiment

Online Bootstrap Inference For Policy Evaluation in Reinforcement Learning

282 - Pratik Ramprasad, Yuantong Li, Zhuoran Yang 2021

The recent emergence of reinforcement learning has created a demand for robust statistical inference methods for the parameter estimates computed using these algorithms. Existing methods for statistical inference in online learning are restricted to settings involving independently sampled observations, while existing statistical inference methods in reinforcement learning (RL) are limited to the batch setting. The online bootstrap is a flexible and efficient approach for statistical inference in linear stochastic approximation algorithms, but its efficacy in settings involving Markov noise, such as RL, has yet to be explored. In this paper, we study the use of the online bootstrap method for statistical inference in RL. In particular, we focus on the temporal difference (TD) learning and Gradient TD (GTD) learning algorithms, which are themselves special instances of linear stochastic approximation under Markov noise. The method is shown to be distributionally consistent for statistical inference in policy evaluation, and numerical experiments are included to demonstrate the effectiveness of this algorithm at statistical inference tasks across a range of real RL environments.
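A rough sketch of the idea, under stated assumptions: alongside an ordinary linear TD(0) estimate, several perturbed copies are updated on the same transitions, each with its own random multiplier on the step, and their spread is used to gauge uncertainty. The two-state sticky chain, the exponential multipliers (mean 1, variance 1), the decaying step size, and all names are choices made for illustration; the paper's exact procedure (for example any iterate averaging) is not given in the abstract.

```python
import random

def td_step(theta, phi, phi_next, reward, gamma, alpha, weight=1.0):
    """One linear TD(0) update; `weight` randomly rescales the step."""
    value = sum(t * x for t, x in zip(theta, phi))
    value_next = sum(t * x for t, x in zip(theta, phi_next))
    delta = reward + gamma * value_next - value  # temporal-difference error
    return [t + alpha * weight * delta * x for t, x in zip(theta, phi)]

def run_online_bootstrap(n_steps=2000, n_boot=50, gamma=0.9, alpha=0.05, seed=0):
    """Run TD(0) plus n_boot perturbed copies on a toy two-state Markov chain."""
    rng = random.Random(seed)
    features = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # one-hot features
    rewards = {0: 0.0, 1: 1.0}                 # reward for leaving each state
    theta = [0.0, 0.0]
    boots = [[0.0, 0.0] for _ in range(n_boot)]
    state = 0
    for t in range(1, n_steps + 1):
        # Sticky chain: stay with probability 0.8, so samples are dependent.
        next_state = state if rng.random() < 0.8 else 1 - state
        step = alpha / t ** 0.5
        args = (features[state], features[next_state], rewards[state], gamma, step)
        theta = td_step(theta, *args)
        # Each copy sees the same transition but its own random multiplier.
        boots = [td_step(b, *args, weight=rng.expovariate(1.0)) for b in boots]
        state = next_state
    return theta, boots

def percentile_interval(values, level=0.95):
    """Empirical percentile interval from a list of bootstrap estimates."""
    ordered = sorted(values)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    return ordered[int(tail * n)], ordered[min(n - 1, int((1.0 - tail) * n))]

theta, boots = run_online_bootstrap()
interval_state0 = percentile_interval([b[0] for b in boots])
```

The interval here is only illustrative of the mechanics; whether such intervals have correct coverage under Markov noise is exactly what the paper studies.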

Machine Learning, Artificial Intelligence, Machine Learning
