ترغب بنشر مسار تعليمي؟ اضغط هنا

Assigning Topics to Documents by Successive Projections

71   0   0.0 ( 0 )
 نشر من قبل Suzanne Sigalla
 تاريخ النشر 2021
  مجال البحث الاحصاء الرياضي
والبحث باللغة English
 تأليف Olga Klopp




اسأل ChatGPT حول البحث

Topic models provide a useful tool to organize and understand the structure of large corpora of text documents, in particular, to discover hidden thematic structure. Clustering documents from big unstructured corpora into topics is an important task in various areas, such as image analysis, e-commerce, social networks, population genetics. A common approach to topic modeling is to associate each topic with a probability distribution on the dictionary of words and to consider each document as a mixture of topics. Since the number of topics is typically substantially smaller than the size of the corpus and of the dictionary, the methods of topic modeling can lead to a dramatic dimension reduction. In this paper, we study the problem of estimating topics distribution for each document in the given corpus, that is, we focus on the clustering aspect of the problem. We introduce an algorithm that we call Successive Projection Overlapping Clustering (SPOC) inspired by the Successive Projection Algorithm for separable matrix factorization. This algorithm is simple to implement and computationally fast. We establish theoretical guarantees on the performance of the SPOC algorithm, in particular, near matching minimax upper and lower bounds on its estimation risk. We also propose a new method that estimates the number of topics. We complement our theoretical results with a numerical study on synthetic and semi-synthetic data to analyze the performance of this new algorithm in practice. One of the conclusions is that the error of the algorithm grows at most logarithmically with the size of the dictionary, in contrast to what one observes for Latent Dirichlet Allocation.



قيم البحث

اقرأ أيضاً

199 - Herbert Roitblat 2021
In the United States, the parties to a lawsuit are required to search through their electronically stored information to find documents that are relevant to the specific case and produce them to their opposing party. Negotiations over the scope of th ese searches often reflect a fear that something will be missed (Fear of Missing Out: FOMO). A Recall level of 80%, for example, means that 20% of the relevant documents will be left unproduced. This paper makes the argument that eDiscovery is the process of identifying responsive information, not identifying documents. Documents are the carriers of the information; they are not the direct targets of the process. A given document may contain one or more topics or factoids and a factoid may appear in more than one document. The coupon collectors problem, Heaps law, and other analyses provide ways to model the problem of finding information from among documents. In eDiscovery, however, the parties do not know how many factoids there might be in a collection or their probabilities. This paper describes a simple model that estimates the confidence that a fact will be omitted from the produced set (the identified set), while being contained in the missed set. Two data sets are then analyzed, a small set involving microaggressions and larger set involving classification of web pages. Both show that it is possible to discover at least one example of each available topic within a relatively small number of documents, meaning the further effort will not return additional novel information. The smaller data set is also used to investigate whether the non-random order of searching for responsive documents commonly used in eDiscovery (called continuous active learning) affects the distribution of topics-it does not.
The organization and evolution of science has recently become itself an object of scientific quantitative investigation, thanks to the wealth of information that can be extracted from scientific documents, such as citations between papers and co-auth orship between researchers. However, only few studies have focused on the concepts that characterize full documents and that can be extracted and analyzed, revealing the deeper organization of scientific knowledge. Unfortunately, several concepts can be so common across documents that they hinder the emergence of the underlying topical structure of the document corpus, because they give rise to a large amount of spurious and trivial relations among documents. To identify and remove common concepts, we introduce a method to gauge their relevance according to an objective information-theoretic measure related to the statistics of their occurrence across the document corpus. After progressively removing concepts that, according to this metric, can be considered as generic, we find that the topic organization displays a correspondingly more refined structure.
Blockchains based on the celebrated Nakamoto consensus protocol have shown promise in several applications, including cryptocurrencies. However, these blockchains have inherent scalability limits caused by the protocols consensus properties. In parti cular, the consistency property demonstrates a tight trade-off between block production speed and the systems security in terms of resisting adversarial attacks. This paper proposes a novel method, Ironclad, that improves blockchain consistency by assigning different weights to randomly selected blocks. We analyze the fundamental properties of our method and show that the combination of our method with Nakamoto consensus protocols can lead to significant improvement in consistency. A direct result is that Nakamoto+Ironclad can enable a much faster ($10sim 50$ times with normal parameter settings) block production rate than Nakamoto protocol under the same security guarantee with the same proportion of malicious mining power.
We discuss a general approach to handling multiple hypotheses testing in the case when a particular hypothesis states that the vector of parameters identifying the distribution of observations belongs to a convex compact set associated with the hypot hesis. With our approach, this problem reduces to testing the hypotheses pairwise. Our central result is a test for a pair of hypotheses of the outlined type which, under appropriate assumptions, is provably nearly optimal. The test is yielded by a solution to a convex programming problem, so that our construction admits computationally efficient implementation. We further demonstrate that our assumptions are satisfied in several important and interesting applications. Finally, we show how our approach can be applied to a rather general detection problem encompassing several classical statistical settings such as detection of abrupt signal changes, cusp detection and multi-sensor detection.
Distance correlation is a new measure of dependence between random vectors. Distance covariance and distance correlation are analogous to product-moment covariance and correlation, but unlike the classical definition of correlation, distance correlat ion is zero only if the random vectors are independent. The empirical distance dependence measures are based on certain Euclidean distances between sample elements rather than sample moments, yet have a compact representation analogous to the classical covariance and correlation. Asymptotic properties and applications in testing independence are discussed. Implementation of the test and Monte Carlo results are also presented.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا