ﻻ يوجد ملخص باللغة العربية
Japan is a unique country with a distinct cultural heritage, which is reflected in billions of historical documents that have been preserved. However, the change in Japanese writing system in 1900 made these documents inaccessible for the general public. A major research project has been to make these historical documents accessible and understandable. An increasing amount of research has focused on the character recognition task and the location of characters on image, yet less research has focused on how to predict the sequential ordering of the characters. This is because sequence in classical Japanese is very different from modern Japanese. Ordering characters into a sequence is important for making the document text easily readable and searchable. Additionally, it is a necessary step for any kind of natural language processing on the data (e.g. machine translation, language modeling, and word embeddings). We explore a few approaches to the task of predicting the sequential ordering of the characters: one using simple hand-crafted rules, another using hand-crafted rules with adaptive thresholds, and another using a deep recurrent sequence model trained with teacher forcing. We provide a quantitative and qualitative comparison of these techniques as well as their distinct trade-offs. Our best-performing system has an accuracy of 98.65% and has a perfect accuracy on 49% of the books in our dataset, suggesting that the technique is able to predict the order of the characters well enough for many tasks.
We analyzed historical and literary documents in Chinese to gain insights into research issues, and overview our studies which utilized four different sources of text materials in this paper. We investigated the history of concepts and transliterated
If someone is looking for a certain publication in the field of computer science, the searching person is likely to use the DBLP to find the desired publication. The DBLP data set is continuously extended with new publications, or rather their metada
We present a hierarchical convolutional document model with an architecture designed to support introspection of the document structure. Using this model, we show how to use visualisation techniques from the computer vision literature to identify and
Event extraction is a classic task in natural language processing with wide use in handling large amount of yet rapidly growing financial, legal, medical, and government documents which often contain multiple events with their elements scattered and
Historical documents present many challenges for offline handwriting recognition systems, among them, the segmentation and labeling steps. Carefully annotated textlines are needed to train an HTR system. In some scenarios, transcripts are only availa