We show that Zipf's law for Chinese characters holds perfectly for sufficiently short texts (a few thousand different characters). The scenario of its validity is similar to that of Zipf's law for words in short English texts. For long Chinese texts (or for mixtures of short Chinese texts), rank-frequency relations for Chinese characters display a two-layer, hierarchic structure that combines a Zipfian power-law regime for frequent characters (first layer) with an exponential-like regime for less frequent characters (second layer). For these two layers we provide different (though related) theoretical descriptions that include the range of low-frequency characters (hapax legomena). The comparative analysis of rank-frequency relations for Chinese characters versus English words illustrates the extent to which characters play the same role for Chinese writers as words do for those writing within alphabetical systems.
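As a rough illustration of the rank-frequency relation discussed above, the following minimal sketch counts characters in a text and estimates a Zipf exponent by an ordinary least-squares fit in log-log space. The function names `rank_frequency` and `zipf_exponent` are hypothetical, and a plain log-log fit is an assumption made here for brevity; it is not necessarily the estimation procedure used in the paper.

```python
import math
from collections import Counter


def rank_frequency(text):
    """Return character counts sorted from most to least frequent.

    Whitespace is ignored; every other character counts as one token.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    return sorted(counts.values(), reverse=True)


def zipf_exponent(freqs):
    """Estimate gamma in f(r) ~ r**(-gamma) by least squares on log-log data.

    `freqs` must hold at least two positive counts ordered by rank.
    """
    if len(freqs) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is negative for a decaying law; report its magnitude.
    return -sxy / sxx


if __name__ == "__main__":
    # Synthetic counts proportional to 1/rank, so the exponent is close to 1.
    print(zipf_exponent([1200, 600, 400, 300, 240, 200]))
```

On a long text, fitting only the most frequent ranks would separate the power-law layer from the exponential-like tail; where to place that cut is left open here.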
Given the advantages and recent success of English character-level and subword-unit models in several NLP tasks, we consider the equivalent modeling problem for Chinese. Chinese script is logographic, and many Chinese logograms are composed of common substructures that provide semantic, phonetic and syntactic hints. In this work, we propose to explicitly incorporate the visual appearance of a character's glyph in its representation, resulting in a novel glyph-aware embedding of Chinese characters. Inspired by the success of convolutional neural networks in computer vision, we use them to incorporate the spatio-structural patterns of Chinese glyphs as rendered in raw pixels. In the context of two basic Chinese NLP tasks, language modeling and word segmentation, the model learns to represent each character's task-relevant semantic and syntactic information in the character-level embedding.
Complex systems often comprise many kinds of components which vary over many orders of magnitude in size: populations of cities in countries, individual and corporate wealth in economies, species abundance in ecologies, word frequency in natural language, and node degree in complex networks. Comparisons of component size distributions for two complex systems (or for a system with itself at two different time points) generally employ information-theoretic instruments, such as Jensen-Shannon divergence. We argue that these methods lack transparency and adjustability, and should not be applied when component probabilities are non-sensible or are problematic to estimate. Here, we introduce 'allotaxonometry' along with 'rank-turbulence divergence', a tunable instrument for comparing any two (Zipfian) ranked lists of components. We analytically develop our rank-based divergence in a series of steps, and then establish a rank-based allotaxonograph which pairs a map-like histogram for rank-rank pairs with an ordered list of components according to divergence contribution. We explore the performance of rank-turbulence divergence for a series of distinct settings, including language use on Twitter and in books, species abundance, baby name popularity, market capitalization, performance in sports, mortality causes, and job titles. We provide a series of supplementary flipbooks which demonstrate the tunability and storytelling power of rank-based allotaxonometry.
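For orientation, the conventional information-theoretic baseline mentioned above, Jensen-Shannon divergence, can be sketched as follows for two systems given as count dictionaries. This is not the rank-turbulence divergence introduced in the paper; the function name `jensen_shannon_divergence` and the dictionary input format are assumptions made for illustration. Note that the sketch must turn counts into probabilities, which is the estimation step the abstract flags as potentially problematic.

```python
import math


def jensen_shannon_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence (base 2) between two count dictionaries.

    Returns a value in [0, 1]: 0 for identical distributions,
    1 for distributions with no shared component types.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both systems need a positive total count")
    divergence = 0.0
    for component in set(counts_a) | set(counts_b):
        p = counts_a.get(component, 0) / total_a
        q = counts_b.get(component, 0) / total_b
        m = 0.5 * (p + q)  # mixture probability of this component
        # Terms with zero probability contribute nothing (0 * log 0 = 0).
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence


if __name__ == "__main__":
    books = {"the": 50, "of": 30, "whale": 20}
    tweets = {"the": 40, "lol": 35, "of": 25}
    print(jensen_shannon_divergence(books, tweets))
```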
Named Entity Recognition and Relation Extraction for Chinese literature text are regarded as highly difficult problems, partially because of the lack of tagged datasets. In this paper, we build a discourse-level dataset from hundreds of Chinese literature articles to support these tasks. To build a high-quality dataset, we propose two tagging methods to solve the problem of data inconsistency: a heuristic tagging method and a machine-auxiliary tagging method. Based on this corpus, we also run experiments with several widely used models. The experimental results not only show the usefulness of the proposed dataset, but also provide baselines for further research. The dataset is available at https://github.com/lancopku/Chinese-Literature-NER-RE-Dataset
Thermodynamic fluctuations in mechanical resonators cause uncertainty in their frequency measurement, fundamentally limiting performance of frequency-based sensors. Recently, integrating nanophotonic motion readout with micro- and nano-mechanical resonators allowed practical chip-scale sensors to routinely operate near this limit in high-bandwidth measurements. However, the exact and general expressions for either thermodynamic frequency measurement uncertainty or efficient, real-time frequency estimators are not well established, particularly for fast and weakly driven resonators. Here, we derive, and numerically validate, the Cramér-Rao lower bound (CRLB) and an efficient maximum-likelihood estimator for the frequency of a classical linear harmonic oscillator subject to thermodynamic fluctuations. For a fluctuating oscillator without external drive, the frequency Allan deviation calculated from simulated resonator motion data agrees with the derived CRLB $\sigma_f = \frac{1}{2\pi}\sqrt{\frac{\Gamma}{2\tau}}$ for averaging times $\tau$ below, as well as above, the relaxation time $1/\Gamma$. The CRLB approach is general and can be extended to driven resonators, non-negligible motion detection imprecision, as well as backaction from a continuous linear quantum measurement.
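The bound quoted above is simple to evaluate numerically. The sketch below computes the frequency deviation from the damping rate and averaging time, assuming that both are given in consistent units so that the relaxation time is the reciprocal of the damping rate. The function name `crlb_frequency_deviation` is hypothetical, and the numbers in the usage example are illustrative rather than taken from the paper.

```python
import math


def crlb_frequency_deviation(gamma, tau):
    """Evaluate sigma_f = (1 / (2*pi)) * sqrt(gamma / (2*tau)).

    gamma: damping rate (relaxation time is 1/gamma); assumed units 1/s.
    tau:   averaging time; assumed units s.
    Returns the lower bound on the frequency deviation in Hz under
    those unit assumptions.
    """
    if gamma <= 0 or tau <= 0:
        raise ValueError("gamma and tau must be positive")
    return math.sqrt(gamma / (2.0 * tau)) / (2.0 * math.pi)


if __name__ == "__main__":
    gamma = 1.0e3  # illustrative damping rate, not a value from the paper
    for tau in (1.0e-4, 1.0e-3, 1.0e-2):  # below, at, and above 1/gamma
        print(tau, crlb_frequency_deviation(gamma, tau))
```

Because the bound scales as the inverse square root of the averaging time, quadrupling the averaging time halves the bound.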
Graphical passwords (GPWs) are used in many areas today. Topological graphic passwords (Topsnut-gpws) are a new type of cryptography, and they differ from the existing GPWs. A Topsnut-gpw consists of two parts: one is a topological structure (a graph), and the other is a set of discrete elements (a graph labelling, or coloring); the topological structure connects these discrete elements together to form an interesting story. Our idea is to transfer Chinese characters into computers and touch-screen electronic devices by speech, handwriting and keyboard in order to form Hanzi-graphs and Hanzi-gpws. We will use Hanzi-gpws to produce text-based passwords (TB-paws). We will introduce flawed graph labellings on disconnected Hanzi-graphs.