A Query-Driven Topic Model

135 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Zheng Fang

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Zheng Fang - Yulan He - Rob Procter

استرجاع المعلومات التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Topic modeling is an unsupervised method for revealing the hidden semantic structure of a corpus. It has been increasingly widely adopted as a tool in the social sciences, including political science, digital humanities and sociological research in general. One desirable property of topic models is to allow users to find topics describing a specific aspect of the corpus. A possible solution is to incorporate domain-specific knowledge into topic modeling, but this requires a specification from domain experts. We propose a novel query-driven topic model that allows users to specify a simple query in words or phrases and return query-related topics, thus avoiding tedious work from domain experts. Our proposed approach is particularly attractive when the user-specified query has a low occurrence in a text corpus, making it difficult for traditional topic models built on word cooccurrence patterns to identify relevant topics. Experimental results demonstrate the effectiveness of our model in comparison with both classical topic models and neural topic models.

قيم البحث

91 - Khandaker Tayef Shahriar , Iqbal H. Sarker , Muhammad Nazrul Islam andn Mohammad Ali Moni 2021

This paper formulates the problem of dynamically identifying key topics with proper labels from COVID-19 Tweets to provide an overview of wider public opinion. Nowadays, social media is one of the best ways to connect people through Internet technolo gy, which is also considered an essential part of our daily lives. In late December 2019, an outbreak of the novel coronavirus, COVID-19 was reported, and the World Health Organization declared an emergency due to its rapid spread all over the world. The COVID-19 epidemic has affected the use of social media by many people across the globe. Twitter is one of the most influential social media services, which has seen a dramatic increase in its use from the epidemic. Thus dynamic extraction of specific topics with labels from tweets of COVID-19 is a challenging issue for highlighting conversation instead of manual topic labeling approach. In this paper, we propose a framework that automatically identifies the key topics with labels from the tweets using the top Unigram feature of aspect terms cluster from Latent Dirichlet Allocation (LDA) generated topics. Our experiment result shows that this dynamic topic identification and labeling approach is effective having the accuracy of 85.48% with respect to the manual static approach.

استرجاع المعلومات التعلم الآلي

A Topic Coverage Approach to Evaluation of Topic Models

143 - Damir Korenv{c}ic 2020

Topic models are widely used unsupervised models capable of learning topics - weighted lists of words and documents - from large collections of text documents. When topic models are used for discovery of topics in text collections, a question that ar ises naturally is how well the model-induced topics correspond to topics of interest to the analyst. In this paper we revisit and extend a so far neglected approach to topic model evaluation based on measuring topic coverage - computationally matching model topics with a set of reference topics that models are expected to uncover. The approach is well suited for analyzing models performance in topic discovery and for large-scale analysis of both topic models and measures of model quality. We propose new measures of coverage and evaluate, in a series of experiments, different types of topic models on two distinct text domains for which interest for topic discovery exists. The experiments include evaluation of model quality, analysis of coverage of distinct topic categories, and the analysis of the relationship between coverage and other methods of topic model evaluation. The paper contributes a new supervised measure of coverage, and the first unsupervised measure of coverage. The supervised measure achieves topic matching accuracy close to human agreement. The unsupervised measure correlates highly with the supervised one (Spearmans $rho geq 0.95$). Other contributions include insights into both topic models and different methods of model evaluation, and the datasets and code for facilitating future research on topic coverage.

استرجاع المعلومات الحساب واللغة التعلم الآلي

Recurrent Coupled Topic Modeling over Sequential Documents

117 - Jinjin Guo , Longbing Cao , Zhiguo Gong 2021

The abundant sequential documents such as online archival, social media and news feeds are streamingly updated, where each chunk of documents is incorporated with smoothly evolving yet dependent topics. Such digital texts have attracted extensive res earch on dynamic topic modeling to infer hidden evolving topics and their temporal dependencies. However, most of the existing approaches focus on single-topic-thread evolution and ignore the fact that a current topic may be coupled with multiple relevant prior topics. In addition, these approaches also incur the intractable inference problem when inferring latent parameters, resulting in a high computational cost and performance degradation. In this work, we assume that a current topic evolves from all prior topics with corresponding coupling weights, forming the multi-topic-thread evolution. Our method models the dependencies between evolving topics and thoroughly encodes their complex multi-couplings across time steps. To conquer the intractable inference challenge, a new solution with a set of novel data augmentation techniques is proposed, which successfully discomposes the multi-couplings between evolving topics. A fully conjugate model is thus obtained to guarantee the effectiveness and efficiency of the inference technique. A novel Gibbs sampler with a backward-forward filter algorithm efficiently learns latent timeevolving parameters in a closed-form. In addition, the latent Indian Buffet Process (IBP) compound distribution is exploited to automatically infer the overall topic number and customize the sparse topic proportions for each sequential document without bias. The proposed method is evaluated on both synthetic and real-world datasets against the competitive baselines, demonstrating its superiority over the baselines in terms of the low per-word perplexity, high coherent topics, and better document time prediction.

استرجاع المعلومات التعلم الآلي

Recommending Researchers in Machine Learning based on Author-Topic Model

111 - Deepak Sharma , Bijendra Kumar , Satish Chand 2021

The aim of this paper is to uncover the researchers in machine learning using the author-topic model (ATM). We collect 16,855 scientific papers from six top journals in the field of machine learning published from 1997 to 2016 and analyze them using ATM. The dataset is broken down into 4 intervals to identify the top researchers and find similar researchers using their similarity score. The similarity score is calculated using Hellinger distance. The researchers are plotted using t-SNE, which reduces the dimensionality of the data while keeping the same distance between the points. The analysis of our study helps the upcoming researchers to find the top researchers in their area of interest.

استرجاع المعلومات

Building a Structured Query Engine

71 - Amanpreet Singh , Karthik Venkatesan , Simranjyot Singh Gill 2017

Finding patterns in data and being able to retrieve information from those patterns is an important task in Information retrieval. Complex search requirements which are not fulfilled by simple string matching and require exploring certain patterns in data demand a better query engine that can support searching via structured queries. In this article, we built a structured query engine which supports searching data through structured queries on the lines of ElasticSearch. We will show how we achieved real time indexing and retrieving of data through a RESTful API and how complex queries can be created and processed using efficient data structures we created for storing the data in structured way. Finally, we will conclude with an example of movie recommendation system built on top of this query engine.

استرجاع المعلومات قواعد البيانات