Dependency parsing is an important natural language processing task that assigns syntactic trees to texts. Due to the wider availability of dependency corpora and improved parsing and machine learning techniques, the parsing accuracy of supervised learning-based systems has improved significantly. However, due to the nature of supervised learning, these parsing systems rely heavily on manually annotated training corpora. They work reasonably well on in-domain data, but their performance drops significantly when tested on out-of-domain texts. To bridge the performance gap between in-domain and out-of-domain data, this thesis investigates three semi-supervised techniques for out-of-domain dependency parsing, namely co-training, self-training and dependency language models. Our approaches use easily obtainable unlabelled data to improve out-of-domain parsing accuracy without the need for expensive corpus annotation. Evaluations on several English domains and on multilingual data show good improvements in parsing accuracy. Overall, this work surveys semi-supervised methods for out-of-domain dependency parsing, extending and comparing a number of important semi-supervised methods in a unified framework. The comparison shows that self-training works as well as co-training on out-of-domain parsing, while dependency language models can improve both in- and out-of-domain accuracy.
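As an illustration of the self-training idea named in this abstract, the following is a minimal sketch and not the thesis's actual system. The function `self_train`, the confidence threshold, and the toy `toy_train` and `toy_parse` stand-ins are all hypothetical; in particular, filtering automatic parses by a confidence score is an assumption made for the example.

```python
from typing import Any, Callable, List, Tuple

Sentence = List[str]
Tree = List[int]  # head index per token, 0 marks the root
Example = Tuple[Sentence, Tree]


def self_train(
    labelled: List[Example],
    unlabelled: List[Sentence],
    train: Callable[[List[Example]], Any],
    parse: Callable[[Any, Sentence], Tuple[Tree, float]],
    threshold: float = 0.9,
) -> Tuple[Any, List[Example]]:
    """One round of self-training: parse unlabelled text with the base
    model, keep confident parses, and retrain on the enlarged set."""
    base_model = train(labelled)
    auto_labelled: List[Example] = []
    for sentence in unlabelled:
        tree, confidence = parse(base_model, sentence)
        if confidence >= threshold:  # assumed selection criterion
            auto_labelled.append((sentence, tree))
    return train(labelled + auto_labelled), auto_labelled


# Toy stand-ins (hypothetical) so the sketch runs without a real parser.
def toy_train(data: List[Example]) -> int:
    return len(data)  # the "model" is just the training set size


def toy_parse(model: int, sentence: Sentence) -> Tuple[Tree, float]:
    heads = [0] + [1] * (len(sentence) - 1)  # attach all tokens to the first
    return heads, 1.0 / len(sentence)  # shorter sentences count as "confident"


labelled = [(["dogs", "bark"], [2, 0])]
unlabelled = [["hello"], ["a", "b", "c"]]
model, added = self_train(labelled, unlabelled, toy_train, toy_parse, threshold=0.5)
```

With the toy functions, only the one-token sentence passes the threshold, so the retrained "model" sees two training examples. A real setup would substitute an actual parser for both callables.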
A recent advance in monolingual dependency parsing is the idea of a treebank embedding vector, which allows all treebanks for a particular language to be used as training data while at the same time allowing the model to prefer training data from one treebank over others and to select the preferred treebank at test time. We build on this idea by 1) introducing a method to predict a treebank vector for sentences that do not come from a treebank used in training, and 2) exploring what happens when we move away from predefined treebank embedding vectors during test time and instead devise tailored interpolations. We show that 1) there are interpolated vectors that are superior to the predefined ones, and 2) treebank vectors can be predicted with sufficient accuracy, for nine out of ten test languages, to match the performance of an oracle approach that knows the most suitable predefined treebank embedding for the test set.
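To make the notion of an interpolated treebank vector concrete, here is a small sketch under stated assumptions: each treebank has a fixed embedding vector, and a test-time vector is formed as a weighted combination of them. The function name, the two-dimensional vectors, the treebank names, and the normalisation of the weights are all hypothetical choices for illustration and are not taken from the paper.

```python
from typing import Dict, List


def interpolate_treebank_vectors(
    vectors: Dict[str, List[float]], weights: Dict[str, float]
) -> List[float]:
    """Return the weighted combination of per-treebank embedding vectors.

    Weights are normalised to sum to one (an assumption of this sketch).
    """
    total = sum(weights.values())
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(next(iter(vectors.values())))
    mixed = [0.0] * dim
    for name, weight in weights.items():
        for i, value in enumerate(vectors[name]):
            mixed[i] += (weight / total) * value
    return mixed


# Hypothetical 2-dimensional embeddings for two treebanks of one language.
vectors = {"tb_a": [1.0, 0.0], "tb_b": [0.0, 1.0]}
mixed = interpolate_treebank_vectors(vectors, {"tb_a": 0.75, "tb_b": 0.25})
```

Setting one weight to 1 and the others to 0 recovers a predefined treebank vector, so the predefined choices are special cases of the interpolation.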
In this paper, we present an approach to improve the accuracy of a strong transition-based dependency parser by exploiting dependency language models that are extracted from a large parsed corpus. We integrate a small number of features based on the dependency language models into the parser. To demonstrate the effectiveness of the proposed approach, we evaluate our parser on standard English and Chinese data, on which the base parser already achieves competitive accuracy scores. Our enhanced parser achieves state-of-the-art accuracy on Chinese data and competitive results on English data, with absolute improvements of one point (UAS) on Chinese and 0.5 points on English.
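The following sketch shows one simple way such a feature could be derived; it is a deliberate simplification, not the paper's model. It assumes a bigram dependency language model that estimates P(dependent | head) from arcs in an automatically parsed corpus and maps the probability to a coarse bucket that a parser could use as a feature. The function names, the thresholds, and the toy arcs are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Arc = Tuple[str, str]  # (head word, dependent word)
Model = Tuple[Counter, Counter]


def train_dlm(arcs: List[Arc]) -> Model:
    """Count head-dependent pairs and heads in a parsed corpus."""
    pair_counts = Counter(arcs)
    head_counts = Counter(head for head, _ in arcs)
    return pair_counts, head_counts


def dlm_prob(model: Model, head: str, dep: str) -> float:
    """Relative-frequency estimate of P(dep | head); 0.0 for unseen heads."""
    pair_counts, head_counts = model
    if head_counts[head] == 0:
        return 0.0
    return pair_counts[(head, dep)] / head_counts[head]


def dlm_feature(model: Model, head: str, dep: str,
                high: float = 0.5, low: float = 0.1) -> str:
    """Map the probability to a coarse bucket (thresholds are assumptions)."""
    p = dlm_prob(model, head, dep)
    if p >= high:
        return "DLM=HIGH"
    if p >= low:
        return "DLM=MID"
    return "DLM=LOW"


# Toy arcs standing in for a large automatically parsed corpus.
arcs = [("eat", "apple"), ("eat", "apple"), ("eat", "quickly"), ("run", "quickly")]
dlm = train_dlm(arcs)
features = [dlm_feature(dlm, "eat", "apple"),
            dlm_feature(dlm, "eat", "quickly"),
            dlm_feature(dlm, "run", "apple")]
```

Bucketing keeps the number of added features small, which matches the abstract's description of integrating only a few features into the parser.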
In this paper, we study the problem of parsing structured knowledge graphs from textual descriptions. In particular, we consider the scene graph representation, which models objects together with their attributes and relations; this representation has proved useful across a variety of vision and language applications. We begin by introducing an alternative but equivalent edge-centric view of scene graphs that connects to dependency parses. Together with a careful redesign of the label and action space, we combine the two-stage pipeline used in prior work (generic dependency parsing followed by simple post-processing) into one, enabling end-to-end training. The scene graphs generated by our learned neural dependency parser achieve an F-score similarity of 49.67% to ground truth graphs on our evaluation set, surpassing the best previous approaches by 5%. We further demonstrate the effectiveness of our learned parser on image retrieval applications.
Dependency parsing is needed in many applications of natural language processing. In this paper, we present a thorough error analysis of dependency parsing for Vietnamese, using two state-of-the-art parsers: MSTParser and MaltParser. The results of the error analysis provide insights into how the performance of Vietnamese dependency parsing can be improved.
Parsers are available for only a handful of the world's languages, since they require large amounts of training data. How far can we get with just a small amount of training data? We systematically compare a set of simple strategies for improving low-resource parsers: data augmentation, which has not been tested before; cross-lingual training; and transliteration. Experimenting on three typologically diverse low-resource languages (North Sami, Galician, and Kazakh), we find that (1) when only the low-resource treebank is available, data augmentation is very helpful; (2) when a related high-resource treebank is available, cross-lingual training is helpful and complements data augmentation; and (3) when the high-resource treebank uses a different writing system, transliteration into a shared orthographic space is also very helpful.