The hierarchy of classical Chinese poetry has been broadly acknowledged by a number of studies in Chinese literature. However, quantitative investigations of the evolutionary linkages of classical Chinese poetry remain limited. The primary goal of this study is to provide quantitative evidence of these evolutionary linkages, with emphasis on character usage, among different period genres of classical Chinese poetry. Specifically, various statistical analyses are performed to find and compare the patterns of character usage in the poems of nine period genres: shi jing, chu ci, Han shi, Jin shi, Tang shi, Song shi, Yuan shi, Ming shi, and Qing shi. The results indicate that each of the nine period genres has a unique pattern of character usage, with certain Chinese characters preferentially used in the poems of a particular period genre. The analysis of the general pattern of character preference implies a decreasing trend, along the timeline of the dynastic types of classical Chinese poetry, in the use of Chinese characters that rarely occur in modern Chinese literature. The phylogenetic analysis based on the distance matrix indicates that the evolutionary linkages of the different types of classical Chinese poetry are congruent with their chronological order, suggesting that character frequencies contain phylogenetic information useful for inferring evolutionary linkages among various types of classical Chinese poetry. The estimated phylogenetic tree identifies four groups: (shi jing, chu ci), (Han shi, Jin shi), (Tang shi, Song shi, Yuan shi), and (Ming shi, Qing shi). The statistical analyses conducted in this study can be generalized to data sets of Chinese literature more broadly, providing quantitative insights into their evolutionary linkages.
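As a rough illustration of the kind of analysis described above, the following minimal Python sketch builds character-frequency profiles for a few toy period genres, computes a pairwise distance matrix, and clusters the genres into a tree. Cosine distance, average-linkage clustering, and the toy corpora are assumptions made here for illustration, not the paper's actual choices.

```python
from collections import Counter

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy corpora keyed by period genre; real input would be the full poem collections.
corpora = {
    "shi_jing": ["关关雎鸠", "在河之洲"],
    "tang_shi": ["床前明月光", "疑是地上霜"],
    "song_shi": ["不识庐山真面目", "只缘身在此山中"],
}

# Character-frequency profile of each genre over a shared vocabulary.
vocab = sorted({ch for poems in corpora.values() for line in poems for ch in line})
profiles = []
for poems in corpora.values():
    counts = Counter(ch for line in poems for ch in line)
    total = sum(counts.values())
    profiles.append([counts[ch] / total for ch in vocab])
profiles = np.array(profiles)

# Pairwise distances between genres (cosine distance is an assumption).
dist = pdist(profiles, metric="cosine")
print(squareform(dist))

# Average-linkage clustering as a rough stand-in for the phylogenetic estimate.
tree = linkage(dist, method="average")
order = dendrogram(tree, labels=list(corpora.keys()), no_plot=True)["ivl"]
print(order)  # genres listed in leaf order of the estimated tree
```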
Poetry is one of the most important art forms of human language. Recently, many studies have focused on incorporating linguistic features of poetry, such as style and sentiment, into understanding or generation systems. However, little attention has been paid to understanding or evaluating the semantics of poetry. Therefore, we propose a novel task to assess a model's semantic understanding of poetry by poem matching. Specifically, this task requires the model to select one line of Chinese classical poetry among four candidates according to the modern Chinese translation of a line of poetry. To construct this dataset, we first obtain a set of parallel data of Chinese classical poetry and modern Chinese translations. Then, for each line, we retrieve similar lines from a poetry corpus to serve as negative choices. We name the dataset Chinese Classical Poetry Matching Dataset (CCPM) and release it at https://github.com/THUNLP-AIPoet/CCPM. We hope this dataset can further advance research on incorporating deep semantics into the understanding and generation of Chinese classical poetry. We also preliminarily run two variants of BERT on this dataset as baselines.
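A minimal sketch of the negative-choice retrieval step might look as follows; character n-gram TF-IDF similarity, the toy corpus, and the variable names are assumptions for illustration, since the abstract does not state which retrieval method is used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy poetry corpus and one gold line; a real pipeline would use the full corpus.
corpus = ["床前明月光", "疑是地上霜", "举头望明月", "低头思故乡", "海上生明月"]
gold_line = "床前明月光"

# Character n-gram TF-IDF representation of every line in the corpus.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
matrix = vectorizer.fit_transform(corpus)

# Retrieve the 4 most similar lines, drop the gold line itself, keep 3 distractors.
nn = NearestNeighbors(n_neighbors=4, metric="cosine").fit(matrix)
_, idx = nn.kneighbors(vectorizer.transform([gold_line]))
negatives = [corpus[i] for i in idx[0] if corpus[i] != gold_line][:3]
print(negatives)
```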
Recently, many works have tried to improve the performance of Chinese named entity recognition (NER) using word lexicons. As a representative, Lattice-LSTM (Zhang and Yang, 2018) has achieved new benchmark results on several public Chinese NER datasets. However, Lattice-LSTM has a complex model architecture, which limits its application in many industrial areas where real-time NER responses are needed. In this work, we propose a simple but effective method for incorporating the word lexicon into the character representations. This method avoids designing a complicated sequence modeling architecture; for any neural NER model, it requires only a subtle adjustment of the character representation layer to introduce the lexicon information. Experimental studies on four benchmark Chinese NER datasets show that our method achieves inference speeds up to 6.15 times faster than those of state-of-the-art methods, along with better performance. The experimental results also show that the proposed method can easily be combined with pre-trained models such as BERT.
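A simplified, hypothetical sketch of attaching lexicon information to character representations is shown below; the toy lexicon, embedding dimensions, and mean pooling are illustrative assumptions and do not reproduce the paper's exact grouping and weighting scheme.

```python
import numpy as np

# Toy lexicon mapping words to row indices of a word-embedding table.
lexicon = {"南京": 0, "南京市": 1, "市长": 2, "长江": 3, "长江大桥": 4, "大桥": 5}
word_emb = np.random.rand(len(lexicon), 8)   # toy word embeddings
char_emb = np.random.rand(21000, 8)          # toy character embeddings

def augment(sentence):
    """Return per-character vectors = [char embedding ; pooled lexicon embedding]."""
    augmented = []
    for i, ch in enumerate(sentence):
        # Lexicon words in the sentence that cover position i.
        matched = [
            word_emb[lexicon[sentence[s:e]]]
            for s in range(len(sentence))
            for e in range(s + 1, len(sentence) + 1)
            if sentence[s:e] in lexicon and s <= i < e
        ]
        pooled = np.mean(matched, axis=0) if matched else np.zeros(8)
        augmented.append(np.concatenate([char_emb[ord(ch) % 21000], pooled]))
    return np.stack(augmented)

print(augment("南京市长江大桥").shape)  # (7, 16): 7 characters, 16-dim augmented vectors
```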
Recent studies in sequence-to-sequence learning demonstrate that the RNN encoder-decoder structure can successfully generate Chinese poetry. However, existing methods can only generate poetry from a given first line or a user-specified theme. In this paper, we propose a three-stage multi-modal Chinese poetry generation approach. Given a picture, the first line, the title, and the remaining lines of the poem are generated successively in three stages. According to the characteristics of Chinese poems, we propose a hierarchy-attention seq2seq model that effectively captures character, phrase, and sentence information between contexts and improves the symmetry delivered in poems. In addition, a latent Dirichlet allocation (LDA) model is utilized for title generation, improving the relevance between the whole poem and its title. Compared with strong baselines, experimental results based on both machine evaluations and human judgments demonstrate the effectiveness of our approach.
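The LDA step alone might be sketched as follows, assuming a gensim topic model, a toy poem corpus, and a simple "top words of the dominant topic" rule for title candidates; none of these details come from the paper itself.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy poem corpus, tokenized at the character level.
poems = [
    list("床前明月光疑是地上霜"),
    list("举头望明月低头思故乡"),
    list("春眠不觉晓处处闻啼鸟"),
]
dictionary = Dictionary(poems)
bow = [dictionary.doc2bow(p) for p in poems]
lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10, random_state=0)

# Dominant topic of a newly generated poem, then its top words as title candidates.
generated = list("举头望山月")
topic_id = max(lda.get_document_topics(dictionary.doc2bow(generated)),
               key=lambda t: t[1])[0]
title_candidates = [w for w, _ in lda.show_topic(topic_id, topn=4)]
print(title_candidates)
```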
The formation of sentences is a highly structured and history-dependent process. The probability of using a specific word in a sentence strongly depends on the history of word usage earlier in that sentence. We study a simple history-dependent model of text generation which assumes that, on average, the sample space of word usage reduces as a sentence is formed. We first show that the model explains the approximate Zipf law found in word frequencies as a direct consequence of sample-space reduction. We then empirically quantify the amount of sample-space reduction in the sentences of ten famous English books by analyzing the corresponding word-transition tables, which capture which words can follow any given word in a text. We find a highly nested structure in these transition tables and show that this 'nestedness' is tightly related to the power-law exponents of the observed word frequency distributions. With the proposed model it is possible to see that the nestedness of a text can be the origin of the observed scaling exponent, and that deviations from the exact Zipf law can be explained by book-by-book variations in the degree of nestedness. On a theoretical level, we show that in the case of weak nesting, Zipf's law breaks down in a fast transition. Unlike previous attempts to understand Zipf's law in language, the sample-space reducing model is not based on assumptions of multiplicative, preferential, or self-organised critical mechanisms behind language formation, but simply uses the empirically quantifiable nestedness parameter to understand the statistics of word frequencies.
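A minimal simulation of such a sample-space reducing process, with an arbitrary vocabulary size and number of sentences chosen here for illustration, could look like this; it only shows that visit frequencies of the nested states approach the 1/rank behavior of Zipf's law.

```python
import random
from collections import Counter

random.seed(0)
N = 1000           # vocabulary size: number of nested "sample-space" states
sentences = 20000  # number of restarts of the process

visits = Counter()
for _ in range(sentences):
    state = random.randint(1, N)   # start a sentence anywhere in the vocabulary
    visits[state] += 1
    while state > 1:
        state = random.randint(1, state - 1)  # the sample space shrinks at each step
        visits[state] += 1

# Rank-frequency check: the visit count of state r scales roughly as 1/r.
for r in (1, 2, 10, 100):
    print(r, visits[r], round(visits[1] / r))
```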
This report summarizes the results of a short-term student research project on the usage of Swedish Wikipedia. It attempts to answer the following question: to what extent (and why) do people from non-English language communities use the English Wikipedia instead of the one in their local language? Article access time series and article edit time series from major Wikipedias, including Swedish Wikipedia, are analyzed with various tools.
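As one possible way to obtain such access time series, the sketch below pulls daily pageview counts from the public Wikimedia Pageviews REST API and compares Swedish and English Wikipedia traffic for the same topic; the article title, date range, and ratio computation are illustrative assumptions, not necessarily the tools used in the project.

```python
import requests

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def daily_views(project, article, start="20230101", end="20230131"):
    """Daily pageview counts for one article on one Wikipedia project."""
    url = f"{BASE}/{project}/all-access/user/{article}/daily/{start}/{end}"
    resp = requests.get(url, headers={"User-Agent": "wiki-usage-study"})
    return [item["views"] for item in resp.json()["items"]]

# Compare access to the same topic on the Swedish and English Wikipedias.
sv = daily_views("sv.wikipedia", "Stockholm")
en = daily_views("en.wikipedia", "Stockholm")
print(sum(sv), sum(en), sum(en) / max(sum(sv), 1))  # rough EN/SV access ratio
```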