ﻻ يوجد ملخص باللغة العربية
As the Portable Document Format (PDF) file format increases in popularity, research in analysing its structure for text extraction and analysis is necessary. Detecting headings can be a crucial component of classifying and extracting meaningful data. This research involves training a supervised learning model to detect headings with features carefully selected through recursive feature elimination. The best performing classifier had an accuracy of 96.95%, sensitivity of 0.986 and a specificity of 0.953. This research into heading detection contributes to the field of PDF based text extraction and can be applied to the automation of large scale PDF text analysis in a variety of professional and policy based contexts.
Topic models are widely used unsupervised models capable of learning topics - weighted lists of words and documents - from large collections of text documents. When topic models are used for discovery of topics in text collections, a question that ar
For many real-world classification problems, e.g., sentiment classification, most existing machine learning methods are biased towards the majority class when the Imbalance Ratio (IR) is high. To address this problem, we propose a set convolution (Se
Traditional sentiment analysis approaches tackle problems like ternary (3-category) and fine-grained (5-category) classification by learning the tasks separately. We argue that such classification tasks are correlated and we propose a multitask appro
Given a task of predicting $Y$ from $X$, a loss function $L$, and a set of probability distributions $Gamma$ on $(X,Y)$, what is the optimal decision rule minimizing the worst-case expected loss over $Gamma$? In this paper, we address this question b
Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual informa