Sentence Segmentation for Classical Chinese Based on LSTM with Radical Embedding

Published by: Xu Han

Publication date: 2018

Research field: Informatics Engineering

Paper language: English

Authors: Xu Han, Hongsu Wang, Sanqian Zhang

Computation and Language, Machine Learning

In this paper, we develop a sub-character-level feature embedding called radical embedding and apply it to an LSTM model for sentence segmentation of pre-modern Chinese texts. The dataset includes over 150 classical Chinese books from three different dynasties and covers a range of literary styles. The LSTM-CRF model is a state-of-the-art method for the sequence labeling problem. Our new model adds a radical embedding component, which leads to improved performance. Experimental results on the aforementioned Chinese books demonstrate better accuracy than earlier methods on sentence segmentation, especially on Tang epitaph texts.
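
As a rough illustration of the setup described in the abstract, the sketch below frames sentence segmentation as per-character labeling and pairs each character with its radical as an extra feature. The punctuation set, the binary label scheme, and the tiny `RADICALS` table are assumptions made for this example, not details taken from the paper; a real system would map the characters and radicals to learned embeddings and feed them to an LSTM-CRF tagger.

```python
# Assumed punctuation marks that count as segment boundaries (illustrative only).
PUNCT = set("。，；：？！、")

# Hypothetical, hand-written radical table for a few characters.
RADICALS = {"江": "氵", "河": "氵", "松": "木", "柏": "木"}


def to_labels(punctuated):
    """Strip punctuation and label each character 1 if a break follows it, else 0."""
    chars, labels = [], []
    for ch in punctuated:
        if ch in PUNCT:
            if labels:  # ignore punctuation at the very start
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return chars, labels


def with_radicals(chars):
    """Pair each character with its radical, or "<unk>" if it is not in the table."""
    return [(ch, RADICALS.get(ch, "<unk>")) for ch in chars]


chars, labels = to_labels("學而時習之，不亦說乎。")
print("".join(chars))   # unpunctuated input the tagger would see
print(labels)           # gold labels recovered from the punctuation
print(with_radicals(list("江河松柏")))
```

Training pairs produced this way let a tagger learn where breaks belong in unpunctuated text; the radical channel gives it a signal shared across characters that have the same component.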

Chao-Lin Liu, Chang-Ting Chu, Wei-Ting Chang (2020)

We consider three major text sources about the Tang Dynasty of China in our experiments that aim to segment text written in classical Chinese. These corpora include a collection of Tang Tomb Biographies, the New Tang Book, and the Old Tang Book. We show that it is possible to achieve satisfactory segmentation results with the deep learning approach. More interestingly, we found that some of the relative superiority that we observed among different designs of experiments may be explainable. The relative relevance among the training corpora provides hints/explanation for the observed differences in segmentation results that were achieved when we employed different combinations of corpora to train the classifiers.

Computation and Language, Machine Learning

Bidirectional LSTM-CRF Attention-based Model for Chinese Word Segmentation

Chen Jin, Zhuangwei Shi, Weihua Li (2021)

Chinese word segmentation (CWS) is the foundation of Chinese natural language processing (NLP). The quality of word segmentation directly affects downstream NLP tasks. Recently, with the resurgence of artificial intelligence, the Long Short-Term Memory (LSTM) neural network, which readily models sequences, has been widely used in various NLP tasks and performs well. The attention mechanism is an ingenious method for solving the memory compression problem in LSTM. Furthermore, inspired by the powerful abilities of bidirectional LSTM models for modeling sequences and of the CRF model for decoding, we propose a Bidirectional LSTM-CRF Attention-based Model in this paper. Experiments on the PKU and MSRA benchmark datasets show that our model performs better than baseline methods built on other neural networks.
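
To make the CRF decoding step mentioned above more concrete, here is a minimal Viterbi sketch in plain Python. It is not the authors' implementation: the two-tag scheme (`B` for word-initial, `I` for word-internal), the emission scores (standing in for BiLSTM outputs), and the transition scores are all toy values assumed for illustration.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag path.

    emissions:   list of {tag: score}, one dict per character
    transitions: {(prev_tag, cur_tag): score}
    """
    best = [{t: emissions[0][t] for t in tags}]
    back = []
    for em in emissions[1:]:
        scores, ptrs = {}, {}
        for cur in tags:
            # Best previous tag for reaching `cur` at this position.
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + em[cur]
            ptrs[cur] = prev
        best.append(scores)
        back.append(ptrs)
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]


TAGS = ("B", "I")
# Toy scores: per position, "B" looks best twice in a row, but B->B is penalized.
EMISSIONS = [{"B": 2, "I": 0}, {"B": 2, "I": 1}, {"B": 0, "I": 2}]
TRANSITIONS = {("B", "B"): -3, ("B", "I"): 0, ("I", "B"): 0, ("I", "I"): 0}

print(viterbi(EMISSIONS, TRANSITIONS, TAGS))
```

With these toy numbers, picking the best tag at each position independently would give `B, B, I`, whereas the decoder returns `B, I, I` because the transition penalty is taken into account over the whole sequence. That global view is the reason to place a CRF layer on top of per-character scores.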

Machine Learning

Glyph-aware Embedding of Chinese Characters

Falcon Z. Dai, Zheng Cai (2017)

Given the advantage and recent success of English character-level and subword-unit models in several NLP tasks, we consider the equivalent modeling problem for Chinese. Chinese script is logographic and many Chinese logograms are composed of common substructures that provide semantic, phonetic and syntactic hints. In this work, we propose to explicitly incorporate the visual appearance of a character's glyph in its representation, resulting in a novel glyph-aware embedding of Chinese characters. Inspired by the success of convolutional neural networks in computer vision, we use them to incorporate the spatio-structural patterns of Chinese glyphs as rendered in raw pixels. In the context of two basic Chinese NLP tasks, language modeling and word segmentation, the model learns to represent each character's task-relevant semantic and syntactic information in the character-level embedding.

Computation and Language, Machine Learning

Unsupervised Neural Word Segmentation for Chinese via Segmental Language Modeling

Zhiqing Sun, Zhi-Hong Deng (2018)

Traditional approaches to unsupervised Chinese word segmentation (CWS) can be roughly classified into discriminative and generative models. The former use carefully designed goodness measures for candidate segmentations, while the latter focus on finding the segmentation with the highest generative probability. However, while there is a trivial way to extend the discriminative models into a neural version by using neural language models, extending the generative ones is non-trivial. In this paper, we propose segmental language models (SLMs) for CWS. Our approach explicitly focuses on the segmental nature of Chinese while preserving several properties of language models. In SLMs, a context encoder encodes the previous context and a segment decoder generates each segment incrementally. As far as we know, we are the first to propose a neural model for unsupervised CWS, and we achieve performance competitive with state-of-the-art statistical models on four different datasets from the SIGHAN 2005 bakeoff.
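
The generative view described above, finding the segmentation with the highest probability, can be sketched as a small dynamic program. This is only an illustration under strong simplifying assumptions: the scorer `toy_logprob` and its `LOGP` table are hypothetical and context-free, whereas an SLM scores each segment with a neural decoder conditioned on the preceding context.

```python
import math

# Hypothetical segment log-probabilities; unseen segments get a heavy penalty.
LOGP = {"自然": -2.0, "语言": -2.0, "处理": -2.0, "自然语言": -3.0}


def toy_logprob(segment):
    """Context-free stand-in for a segment decoder's log-probability."""
    return LOGP.get(segment, -10.0 * len(segment))


def best_segmentation(text, seg_logprob, max_len=4):
    """Find the segmentation maximizing the sum of segment log-probabilities."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + seg_logprob(text[start:end])
            if score > best[end]:
                best[end], back[end] = score, start
    # Recover the segments by walking the back-pointers.
    segments, end = [], n
    while end > 0:
        segments.append(text[back[end]:end])
        end = back[end]
    return segments[::-1]


print(best_segmentation("自然语言处理", toy_logprob))
```

The `max_len` cap bounds the segment length considered at each position, which keeps the search linear in the text length for a fixed cap.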

Computation and Language, Machine Learning

Using Sentence-Level LSTM Language Models for Script Inference

Karl Pichotta, Raymond J. Mooney (2016)

There is a small but growing body of research on statistical scripts, models of event sequences that allow probabilistic inference of implicit events from documents. These systems operate on structured verb-argument events produced by an NLP pipeline. We compare these systems with recent Recurrent Neural Net models that directly operate on raw tokens to predict sentences, finding the latter to be roughly comparable to the former in terms of predicting missing events in documents.

Computation and Language