
Entropic selection of concepts unveils hidden topics in documents corpora

Published by Alessio Cardillo
Publication date: 2017
Paper language: English





The organization and evolution of science has recently become itself an object of quantitative scientific investigation, thanks to the wealth of information that can be extracted from scientific documents, such as citations between papers and co-authorship between researchers. However, only a few studies have focused on the concepts that characterize full documents and that can be extracted and analyzed, revealing the deeper organization of scientific knowledge. Unfortunately, some concepts can be so common across documents that they hinder the emergence of the underlying topical structure of the document corpus, because they give rise to a large number of spurious and trivial relations among documents. To identify and remove common concepts, we introduce a method to gauge their relevance according to an objective information-theoretic measure related to the statistics of their occurrence across the document corpus. After progressively removing concepts that, according to this metric, can be considered generic, we find that the topic organization displays a correspondingly more refined structure.
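The abstract describes the concept-relevance measure only qualitatively. As an illustrative sketch, one can score each concept by the normalized Shannon entropy of its occurrence distribution across documents: concepts spread uniformly over the corpus (entropy near 1) are candidates for removal as generic, while concentrated concepts (entropy near 0) are likely topical. The function name and toy corpus below are hypothetical, and the paper's exact measure may differ:

```python
import math
from collections import defaultdict

def concept_entropies(doc_concepts):
    """Normalized Shannon entropy of each concept's occurrence
    distribution across documents. Values near 1 mean the concept
    is spread uniformly over the corpus (likely generic); values
    near 0 mean it is concentrated in few documents (likely topical)."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc_id, concepts in doc_concepts.items():
        for c in concepts:
            counts[c][doc_id] += 1
    n_docs = len(doc_concepts)
    entropies = {}
    for c, per_doc in counts.items():
        total = sum(per_doc.values())
        h = -sum((k / total) * math.log(k / total) for k in per_doc.values())
        entropies[c] = h / math.log(n_docs) if n_docs > 1 else 0.0
    return entropies

docs = {
    "d1": ["model", "entropy", "network"],
    "d2": ["model", "quark"],
    "d3": ["model", "entropy"],
}
h = concept_entropies(docs)
# "model" occurs once in every document, so its normalized entropy is
# maximal; "quark" occurs in a single document, so its entropy is zero.
```

Progressively removing the highest-entropy concepts from such a table mimics the filtering step described in the abstract.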




Read also

We show how the Shannon entropy function can be used as a basis to set up complexity measures weighting the economic efficiency of countries and the specialization of products beyond bare diversification. This entropy function guarantees the existence of a fixed point, which is rapidly reached by an iterative scheme converging to our self-consistent measures. Our approach naturally allows us to decompose the country competitiveness measure into inter-sectorial and intra-sectorial contributions when products are partitioned into larger categories. Besides outlining the technical features and advantages of the method, we describe a wide range of results arising from the analysis of the obtained rankings and benchmark these observations against those established with other economic parameters. These comparisons allow us to partition countries and products into several main typologies with clearly identifiable characteristic features. Our methods have wide applicability to general problems of ranking in bipartite networks.
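The abstract does not spell out the entropy-based map itself; the sketch below only illustrates the general shape of such a self-consistent iterative scheme on a binary country-product matrix, using a fitness/complexity-style update rather than the paper's entropy function. All names are hypothetical:

```python
import numpy as np

def iterate_bipartite_ranking(M, tol=1e-10, max_iter=1000):
    """Alternating update on a binary country-product matrix M
    (countries x products), iterated to a fixed point. Country
    scores aggregate the scores of exported products; product
    scores penalize ubiquity. This fitness/complexity-style rule
    only illustrates the self-consistent scheme; the entropy-based
    map of the paper is not reproduced here."""
    n_c, n_p = M.shape
    f = np.ones(n_c)          # country scores
    q = np.ones(n_p)          # product scores
    for _ in range(max_iter):
        f_new = M @ q                       # sum of product scores
        q_new = 1.0 / (M.T @ (1.0 / f))     # penalize ubiquitous products
        f_new /= f_new.mean()               # normalize to fix the scale
        q_new /= q_new.mean()
        converged = (np.abs(f_new - f).max() < tol
                     and np.abs(q_new - q).max() < tol)
        f, q = f_new, q_new
        if converged:
            break
    return f, q

# Toy matrix: country 0 exports all three products, country 1 only
# the ubiquitous product 0.
M = np.array([[1.0, 1.0, 1.0],
              [1.0, 0.0, 0.0]])
f, q = iterate_bipartite_ranking(M)
```

At the fixed point the diversified country outranks the specialized one, and the product exported by both countries receives a lower score than the exclusive products, matching the intuition of ranking "beyond bare diversification".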
Herbert Roitblat, 2021
In the United States, the parties to a lawsuit are required to search through their electronically stored information to find documents that are relevant to the specific case and produce them to their opposing party. Negotiations over the scope of these searches often reflect a fear that something will be missed (Fear of Missing Out: FOMO). A Recall level of 80%, for example, means that 20% of the relevant documents will be left unproduced. This paper argues that eDiscovery is the process of identifying responsive information, not identifying documents. Documents are the carriers of the information; they are not the direct targets of the process. A given document may contain one or more topics or factoids, and a factoid may appear in more than one document. The coupon collector's problem, Heaps' law, and other analyses provide ways to model the problem of finding information from among documents. In eDiscovery, however, the parties do not know how many factoids there might be in a collection or their probabilities. This paper describes a simple model that estimates the confidence that a fact will be omitted from the produced (identified) set while being contained in the missed set. Two data sets are then analyzed: a small set involving microaggressions and a larger set involving classification of web pages. Both show that it is possible to discover at least one example of each available topic within a relatively small number of documents, meaning that further effort will not return additional novel information. The smaller data set is also used to investigate whether the non-random order of searching for responsive documents commonly used in eDiscovery (called continuous active learning) affects the distribution of topics; it does not.
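As a toy version of such a model, the probability that a factoid carried by k of N documents is entirely absent from a randomly produced set of n documents is hypergeometric. The function below (a hypothetical name, not the paper's estimator) computes that omission probability:

```python
from math import comb

def p_factoid_missed(N, k, n):
    """Probability that a factoid carried by k of N documents is
    absent from a produced set of n documents drawn at random,
    i.e. none of the k carrier documents is selected (hypergeometric).
    An illustrative model, not the paper's exact estimator."""
    if k > N - n:
        return 0.0  # more carriers than unproduced documents: one must be found
    return comb(N - k, n) / comb(N, n)

# Even when 80% of a 1000-document collection is produced, a factoid
# appearing in only 3 documents can still be missed (~(0.2)^3 chance).
p = p_factoid_missed(1000, 3, 800)
```

The model makes concrete the paper's point that document-level Recall understates confidence about information: a rare factoid's omission probability shrinks rapidly with the number of documents that carry it.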
Olga Klopp, 2021
Topic models provide a useful tool to organize and understand the structure of large corpora of text documents, in particular, to discover hidden thematic structure. Clustering documents from big unstructured corpora into topics is an important task in various areas, such as image analysis, e-commerce, social networks, and population genetics. A common approach to topic modeling is to associate each topic with a probability distribution on the dictionary of words and to consider each document as a mixture of topics. Since the number of topics is typically substantially smaller than the size of the corpus and of the dictionary, the methods of topic modeling can lead to a dramatic dimension reduction. In this paper, we study the problem of estimating the topic distribution of each document in a given corpus, that is, we focus on the clustering aspect of the problem. We introduce an algorithm, which we call Successive Projection Overlapping Clustering (SPOC), inspired by the Successive Projection Algorithm for separable matrix factorization. This algorithm is simple to implement and computationally fast. We establish theoretical guarantees on the performance of the SPOC algorithm, in particular, near-matching minimax upper and lower bounds on its estimation risk. We also propose a new method that estimates the number of topics. We complement our theoretical results with a numerical study on synthetic and semi-synthetic data to analyze the performance of this new algorithm in practice. One of the conclusions is that the error of the algorithm grows at most logarithmically with the size of the dictionary, in contrast to what one observes for Latent Dirichlet Allocation.
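The Successive Projection Algorithm that inspires SPOC can be sketched in a few lines: greedily select the row of largest residual norm as a candidate "pure" row, then project all rows onto the orthogonal complement of the chosen one and repeat. This is a textbook SPA sketch under separability assumptions, not the full SPOC algorithm:

```python
import numpy as np

def successive_projection(X, K):
    """Successive Projection Algorithm: greedily select K rows of X
    that approximately span its conical hull (candidate 'pure' rows,
    e.g. documents dominated by a single topic).
    Each step takes the row with the largest residual norm, then
    removes the component along that row from all rows."""
    R = X.astype(float).copy()
    chosen = []
    for _ in range(K):
        norms = np.linalg.norm(R, axis=1)
        j = int(np.argmax(norms))
        chosen.append(j)
        u = R[j] / np.linalg.norm(R[j])
        R = R - np.outer(R @ u, u)  # project onto orthogonal complement of u
    return chosen

# Toy separable data: rows 0 and 1 are extreme points,
# row 2 is a mixture (0.5 * row0 + 0.5 * row1).
X = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.5]])
anchors = successive_projection(X, 2)
```

On separable data such as the toy matrix above, the selected indices recover the extreme rows, which is the property SPOC builds on for overlapping clustering.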
Following the discovery of the hidden-charm pentaquark $P_c$ states by the LHCb collaboration, interest in candidate hidden-bottom pentaquark $P_b$ states is increasing. They are anticipated to exist as the analogues of the $P_c$ states in the bottom sector and are predicted by many models. We explore the search for a typical $P_b$ in the $\gamma p \to \Upsilon p$ reaction, which shows promising potential for observing it at an electron-ion collider. The possibility of searching for $P_b$ in open-bottom channels is also briefly discussed. Meanwhile, the $t$-channel non-resonant contribution, which in fact covers several interesting topics at low energies, is systematically investigated.
Do scientists follow hot topics in their scientific investigations? In this paper, by analyzing papers published in the American Physical Society (APS) Physical Review journals, it is found that papers are more likely to be attracted to hot fields, where the hotness of a field is measured by the number of papers belonging to it. This indicates that scientists generally do follow hot topics. However, there are qualitative differences among scientists from different countries and among research works with different numbers of authors, affiliations, and references. These observations could be valuable for policy makers when deciding on research funding, and also for individual researchers when searching for scientific projects.