No Arabic abstract
This paper formulates the problem of dynamically identifying key topics with proper labels from COVID-19 Tweets to provide an overview of wider public opinion. Nowadays, social media is one of the best ways to connect people through Internet technology, which is also considered an essential part of our daily lives. In late December 2019, an outbreak of the novel coronavirus, COVID-19 was reported, and the World Health Organization declared an emergency due to its rapid spread all over the world. The COVID-19 epidemic has affected the use of social media by many people across the globe. Twitter is one of the most influential social media services, which has seen a dramatic increase in its use from the epidemic. Thus dynamic extraction of specific topics with labels from tweets of COVID-19 is a challenging issue for highlighting conversation instead of manual topic labeling approach. In this paper, we propose a framework that automatically identifies the key topics with labels from the tweets using the top Unigram feature of aspect terms cluster from Latent Dirichlet Allocation (LDA) generated topics. Our experiment result shows that this dynamic topic identification and labeling approach is effective having the accuracy of 85.48% with respect to the manual static approach.
Topic models are widely used unsupervised models capable of learning topics - weighted lists of words and documents - from large collections of text documents. When topic models are used for discovery of topics in text collections, a question that arises naturally is how well the model-induced topics correspond to topics of interest to the analyst. In this paper we revisit and extend a so far neglected approach to topic model evaluation based on measuring topic coverage - computationally matching model topics with a set of reference topics that models are expected to uncover. The approach is well suited for analyzing models performance in topic discovery and for large-scale analysis of both topic models and measures of model quality. We propose new measures of coverage and evaluate, in a series of experiments, different types of topic models on two distinct text domains for which interest for topic discovery exists. The experiments include evaluation of model quality, analysis of coverage of distinct topic categories, and the analysis of the relationship between coverage and other methods of topic model evaluation. The paper contributes a new supervised measure of coverage, and the first unsupervised measure of coverage. The supervised measure achieves topic matching accuracy close to human agreement. The unsupervised measure correlates highly with the supervised one (Spearmans $rho geq 0.95$). Other contributions include insights into both topic models and different methods of model evaluation, and the datasets and code for facilitating future research on topic coverage.
This report describes the participation of two Danish universities, University of Copenhagen and Aalborg University, in the international search engine competition on COVID-19 (the 2020 TREC-COVID Challenge) organised by the U.S. National Institute of Standards and Technology (NIST) and its Text Retrieval Conference (TREC) division. The aim of the competition was to find the best search engine strategy for retrieving precise biomedical scientific information on COVID-19 from the largest, at that point in time, dataset of curated scientific literature on COVID-19 -- the COVID-19 Open Research Dataset (CORD-19). CORD-19 was the result of a call to action to the tech community by the U.S. White House in March 2020, and was shortly thereafter posted on Kaggle as an AI competition by the Allen Institute for AI, the Chan Zuckerberg Initiative, Georgetown Universitys Center for Security and Emerging Technology, Microsoft, and the National Library of Medicine at the US National Institutes of Health. CORD-19 contained over 200,000 scholarly articles (of which more than 100,000 were with full text) about COVID-19, SARS-CoV-2, and related coronaviruses, gathered from curated biomedical sources. The TREC-COVID challenge asked for the best way to (a) retrieve accurate and precise scientific information, in response to some queries formulated by biomedical experts, and (b) rank this information decreasingly by its relevance to the query. In this document, we describe the TREC-COVID competition setup, our participation to it, and our resulting reflections and lessons learned about the state-of-art technology when faced with the acute task of retrieving precise scientific information from a rapidly growing corpus of literature, in response to highly specialised queries, in the middle of a pandemic.
The world has seen in 2020 an unprecedented global outbreak of SARS-CoV-2, a new strain of coronavirus, causing the COVID-19 pandemic, and radically changing our lives and work conditions. Many scientists are working tirelessly to find a treatment and a possible vaccine. Furthermore, governments, scientific institutions and companies are acting quickly to make resources available, including funds and the opening of large-volume data repositories, to accelerate innovation and discovery aimed at solving this pandemic. In this paper, we develop a novel automated theme-based visualisation method, combining advanced data modelling of large corpora, information mapping and trend analysis, to provide a top-down and bottom-up browsing and search interface for quick discovery of topics and research resources. We apply this method on two recently released publications datasets (Dimensions COVID-19 dataset and the Allen Institute for AIs CORD-19). The results reveal intriguing information including increased efforts in topics such as social distancing; cross-domain initiatives (e.g. mental health and education); evolving research in medical topics; and the unfolding trajectory of the virus in different territories through publications. The results also demonstrate the need to quickly and automatically enable search and browsing of large corpora. We believe our methodology will improve future large volume visualisation and discovery systems but also hope our visualisation interfaces will currently aid scientists, researchers, and the general public to tackle the numerous issues in the fight against the COVID-19 pandemic.
Topic modeling is an unsupervised method for revealing the hidden semantic structure of a corpus. It has been increasingly widely adopted as a tool in the social sciences, including political science, digital humanities and sociological research in general. One desirable property of topic models is to allow users to find topics describing a specific aspect of the corpus. A possible solution is to incorporate domain-specific knowledge into topic modeling, but this requires a specification from domain experts. We propose a novel query-driven topic model that allows users to specify a simple query in words or phrases and return query-related topics, thus avoiding tedious work from domain experts. Our proposed approach is particularly attractive when the user-specified query has a low occurrence in a text corpus, making it difficult for traditional topic models built on word cooccurrence patterns to identify relevant topics. Experimental results demonstrate the effectiveness of our model in comparison with both classical topic models and neural topic models.
We describe our system for WNUT-2020 shared task on the identification of informative COVID-19 English tweets. Our system is an ensemble of various machine learning methods, leveraging both traditional feature-based classifiers as well as recent advances in pre-trained language models that help in capturing the syntactic, semantic, and contextual features from the tweets. We further employ pseudo-labelling to incorporate the unlabelled Twitter data released on the pandemic. Our best performing model achieves an F1-score of 0.9179 on the provided validation set and 0.8805 on the blind test-set.