Glyph-aware Embedding of Chinese Characters

Published by Falcon Dai

Publication date: 2017

Research field: Informatics Engineering

Language: English

Authors: Falcon Z. Dai, Zheng Cai

Computation and Language, Machine Learning

Given the advantages and recent success of English character-level and subword-unit models in several NLP tasks, we consider the equivalent modeling problem for Chinese. Chinese script is logographic, and many Chinese logograms are composed of common substructures that provide semantic, phonetic and syntactic hints. In this work, we propose to explicitly incorporate the visual appearance of a character's glyph in its representation, resulting in a novel glyph-aware embedding of Chinese characters. Inspired by the success of convolutional neural networks in computer vision, we use them to incorporate the spatio-structural patterns of Chinese glyphs as rendered in raw pixels. In the context of two basic Chinese NLP tasks, language modeling and word segmentation, the model learns to represent each character's task-relevant semantic and syntactic information in the character-level embedding.
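To make the idea of a glyph-aware embedding concrete, here is a minimal sketch of mapping a glyph bitmap to a vector with one convolutional layer. It is an illustration under stated assumptions, not the authors' architecture: the 24x24 hand-drawn bitmap stands in for a rendered glyph, the names `conv2d_valid` and `glyph_embedding` are hypothetical, the layer sizes are arbitrary, and the weights are random rather than learned from a language-modeling or segmentation objective.

```python
import numpy as np


def conv2d_valid(image, kernels):
    """Cross-correlate a 2-D image with a bank of kernels (no padding, stride 1)."""
    n_kernels, kh, kw = kernels.shape
    h, w = image.shape
    out = np.empty((n_kernels, h - kh + 1, w - kw + 1))
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            patch = image[i:i + kh, j:j + kw]
            # One response per kernel at this spatial position.
            out[:, i, j] = np.tensordot(kernels, patch, axes=([1, 2], [0, 1]))
    return out


def glyph_embedding(bitmap, kernels, projection):
    """Convolution, ReLU, global max pooling, then a linear projection."""
    feature_maps = np.maximum(conv2d_valid(bitmap, kernels), 0.0)
    pooled = feature_maps.max(axis=(1, 2))  # one value per kernel
    return projection @ pooled


# Assumed sizes: 8 kernels of 3x3 pixels, 16-dimensional output embedding.
rng = np.random.default_rng(0)
kernels = rng.normal(size=(8, 3, 3))
projection = rng.normal(size=(16, 8))

# Toy bitmap standing in for a rendered glyph: a cross shape, roughly like 十.
bitmap = np.zeros((24, 24))
bitmap[11:13, 4:20] = 1.0  # horizontal stroke
bitmap[4:20, 11:13] = 1.0  # vertical stroke

embedding = glyph_embedding(bitmap, kernels, projection)
print(embedding.shape)
```

In a trained model the kernels and projection would be updated by backpropagation from the downstream task, so that characters sharing visual substructures can end up with related embeddings.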

Zijun Sun, Xiaoya Li, Xiaofei Sun - 2021

Recent pretraining models in Chinese neglect two important aspects specific to the Chinese language: glyph and pinyin, which carry significant syntactic and semantic information for language understanding. In this work, we propose ChineseBERT, which incorporates both the glyph and pinyin information of Chinese characters into language model pretraining. The glyph embedding is obtained from different fonts of a Chinese character and can capture character semantics from visual features, while the pinyin embedding characterizes the pronunciation of Chinese characters, which handles the highly prevalent heteronym phenomenon in Chinese (the same character has different pronunciations with different meanings). Pretrained on a large-scale unlabeled Chinese corpus, the proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps. The proposed model achieves new state-of-the-art performance on a wide range of Chinese NLP tasks, including machine reading comprehension, natural language inference, text classification, and sentence pair matching, and competitive performance in named entity recognition. Code and pretrained models are publicly available at https://github.com/ShannonAI/ChineseBert.

Computation and Language

Rank-frequency relation for Chinese characters

W.B. Deng, A.E. Allahverdyan, B. Li - 2013

We show that Zipf's law for Chinese characters holds perfectly for sufficiently short texts (a few thousand different characters). The scenario of its validity is similar to that of Zipf's law for words in short English texts. For long Chinese texts (or for mixtures of short Chinese texts), rank-frequency relations for Chinese characters display a two-layer, hierarchic structure that combines a Zipfian power-law regime for frequent characters (first layer) with an exponential-like regime for less frequent characters (second layer). For these two layers we provide different (though related) theoretical descriptions that include the range of low-frequency characters (hapax legomena). The comparative analysis of rank-frequency relations for Chinese characters versus English words illustrates the extent to which characters play the same role for Chinese writers as words do for those writing in alphabetical systems.

Computation and Language, Data Analysis, Statistics and Probability
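The rank-frequency relation discussed in the abstract above can be examined on any text with a few lines of code. The sketch below counts character frequencies, sorts them by rank, and estimates a Zipf exponent with an ordinary least-squares fit of log frequency against log rank. The fitting procedure is an assumption chosen for simplicity and is not necessarily the one used in the paper; the function names `rank_frequency` and `zipf_exponent` are hypothetical, and only whitespace is filtered out, so punctuation is counted like any other character.

```python
import math
from collections import Counter


def rank_frequency(text):
    """Return character frequencies sorted from most to least frequent."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return sorted(counts.values(), reverse=True)


def zipf_exponent(frequencies):
    """Estimate s in f(r) ~ C / r**s by least squares on log f versus log r.

    Needs at least two ranks, otherwise the slope is undefined.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the slope is negative for a decaying law


def hapax_count(frequencies):
    """Number of characters that occur exactly once (hapax legomena)."""
    return sum(1 for freq in frequencies if freq == 1)


# Tiny usage example on a toy string; real analyses need full texts.
sample = "的的的是是不"
freqs = rank_frequency(sample)
print(freqs, hapax_count(freqs), zipf_exponent(freqs))
```

A single exponent is only meaningful where the power-law regime holds; for long texts, the two-layer structure described above suggests fitting the frequent characters separately from the low-frequency tail.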

Squared English Word: A Method of Generating Glyph to Use Super Characters for Sentiment Analysis

Baohua Sun, Lin Yang, Catherine Chi - 2019

The Super Characters method addresses sentiment analysis problems by first converting the input text into images and then applying 2D-CNN models to classify the sentiment. It achieves state-of-the-art performance on many benchmark datasets. However, it is not as straightforward to apply to Latin languages as to Asian languages. Because the 2D-CNN model is designed to recognize two-dimensional images, it is better if the inputs are in the form of glyphs. In this paper, we propose the SEW (Squared English Word) method, which generates a squared glyph for each English word by drawing Super Characters images of each English word at the alphabet level, combines the squared glyphs into a whole Super Characters image at the sentence level, and then applies the CNN model to classify the sentiment within the sentence. We applied the SEW method to a Wikipedia dataset and obtained a 2.1% accuracy gain compared to the original Super Characters method. For multi-modal data with both structured tabular data and unstructured natural language text, the modified SEW method integrates the data into a single image and classifies sentiment with one unified CNN model.

Computation and Language

Sentence Segmentation for Classical Chinese Based on LSTM with Radical Embedding

Xu Han, Hongsu Wang, Sanqian Zhang - 2018

In this paper, we develop a lower-than-character-level feature embedding called radical embedding, and apply it to an LSTM model for sentence segmentation of premodern Chinese texts. The dataset includes over 150 classical Chinese books from 3 different dynasties and contains different literary styles. The LSTM-CRF model is a state-of-the-art method for the sequence labeling problem. Our new model adds a radical embedding component, which leads to improved performance. Experimental results on the aforementioned Chinese books demonstrate better accuracy than earlier methods on sentence segmentation, especially on Tang Epitaph texts.

Computation and Language, Machine Learning

Context Aware Document Embedding

Zhaocheng Zhu, Junfeng Hu - 2017

Recently, doc2vec has achieved excellent results in different tasks. In this paper, we present a context-aware variant of doc2vec. We introduce a novel weight-estimating mechanism that uses deep neural networks to generate a weight for each word occurrence according to its contribution in the context. Our context-aware model can achieve results similar to doc2vec initialized by Wikipedia-trained vectors, while being much more efficient and free from a heavy external corpus. Analysis of the context-aware weights shows they are a kind of enhanced IDF weights that capture sub-topic level keywords in documents. They might result from deep neural networks that learn hidden representations with the least entropy.

Computation and Language