Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Semantic classifier approach to document classification

79 0 0.0 ( 0 )

Download Cite

Added by Mieczys{\\l}aw K{\\l}opotek

Publication date 2017

fields Informatics Engineering

and research's language is English

Authors Piotr Borkowski - Krzysztof Ciesielski - Mieczys{l}aw A.n K{l}opotek

Information Retrieval Computation and Language

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

In this paper we propose a new document classification method, bridging discrepancies (so-called semantic gap) between the training set and the application sets of textual data. We demonstrate its superiority over classical text classification approaches, including traditional classifier ensembles. The method consists in combining a document categorization technique with a single classifier or a classifier ensemble (SEMCOM algorithm - Committee with Semantic Categorizer).

rate research

Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles

87 - Malte Ostendorff , Terry Ruas , Moritz Schubotz 2020

Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.

Digital Libraries Computation and Language Information Retrieval

Leveraging Semantic and Lexical Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach

88 - Saar Kuzi , Mingyang Zhang , Cheng Li 2020

Search engines often follow a two-phase paradigm where in the first stage (the retrieval stage) an initial set of documents is retrieved and in the second stage (the re-ranking stage) the documents are re-ranked to obtain the final result list. While deep neural networks were shown to improve the performance of the re-ranking stage in previous works, there is little literature about using deep neural networks to improve the retrieval stage. In this paper, we study the merits of combining deep neural network models and lexical models for the retrieval stage. A hybrid approach, which leverages both semantic (deep neural network-based) and lexical (keyword matching-based) retrieval models, is proposed. We perform an empirical study, using a publicly available TREC collection, which demonstrates the effectiveness of our approach and sheds light on the different characteristics of the semantic approach, the lexical approach, and their combination.

Information Retrieval

Knowledge Distillation in Document Retrieval

132 - Siamak Shakeri , Abhinav Sethy , Cheng Cheng 2019

Complex deep learning models now achieve state of the art performance for many document retrieval tasks. The best models process the query or claim jointly with the document. However for fast scalable search it is desirable to have document embeddings which are independent of the claim. In this paper we show that knowledge distillation can be used to encourage a model that generates claim independent document encodings to mimic the behavior of a more complex model which generates claim dependent encodings. We explore this approach in document retrieval for a fact extraction and verification task. We show that by using the soft labels from a complex cross attention teacher model, the performance of claim independent student LSTM or CNN models is improved across all the ranking metrics. The student models we use are 12x faster in runtime and 20x smaller in number of parameters than the teacher

Information Retrieval Computation and Language Machine Learning

Using Semantic Role Knowledge for Relevance Ranking of Key Phrases in Documents: An Unsupervised Approach

150 - Prateeti Mohapatra , Neelamadhav Gantayat , Gargi B. Dasgupta 2019

In this paper, we investigate the integration of sentence position and semantic role of words in a PageRank system to build a key phrase ranking method. We present the evaluation results of our approach on three scientific articles. We show that semantic role information, when integrated with a PageRank system, can become a new lexical feature. Our approach had an overall improvement on all the data sets over the state-of-art baseline approaches.

Information Retrieval Computation and Language

KRED: Knowledge-Aware Document Representation for News Recommendations

90 - Danyang Liu , Jianxun Lian , Shiyin Wang 2019

News articles usually contain knowledge entities such as celebrities or organizations. Important entities in articles carry key messages and help to understand the content in a more direct way. An industrial news recommender system contains various key applications, such as personalized recommendation, item-to-item recommendation, news category classification, news popularity prediction and local news detection. We find that incorporating knowledge entities for better document understanding benefits these applications consistently. However, existing document understanding models either represent news articles without considering knowledge entities (e.g., BERT) or rely on a specific type of text encoding model (e.g., DKN) so that the generalization ability and efficiency is compromised. In this paper, we propose KRED, which is a fast and effective model to enhance arbitrary document representation with a knowledge graph. KRED first enriches entities embeddings by attentively aggregating information from their neighborhood in the knowledge graph. Then a context embedding layer is applied to annotate the dynamic context of different entities such as frequency, category and position. Finally, an information distillation layer aggregates the entity embeddings under the guidance of the original document representation and transforms the document vector into a new one. We advocate to optimize the model with a multi-task framework, so that different news recommendation applications can be united and useful information can be shared across different tasks. Experiments on a real-world Microsoft News dataset demonstrate that KRED greatly benefits a variety of news recommendation applications.

Information Retrieval Computation and Language Social and Information Networks

comments

Fetching comments

Information Technology Institute ITI

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Semantic classifier approach to document classification

Ask ChatGPT about the research

No Arabic abstract

Read More