ترغب بنشر مسار تعليمي؟ اضغط هنا

Efficient tree-structured categorical retrieval

197   0   0.0 ( 0 )
 نشر من قبل Djamal Belazzougui
 تاريخ النشر 2020
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

We study a document retrieval problem in the new framework where $D$ text documents are organized in a {em category tree} with a pre-defined number $h$ of categories. This situation occurs e.g. with taxomonic trees in biology or subject classification systems for scientific literature. Given a string pattern $p$ and a category (level in the category tree), we wish to efficiently retrieve the $t$ emph{categorical units} containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses $n(logsigma(1+o(1))+log D+O(h)) + O(Delta)$ bits of space and $O(|p|+t)$ query time, where $n$ is the total length of the documents, $sigma$ the size of the alphabet used in the documents and $Delta$ is the total number of nodes in the category tree. Another solution uses $n(logsigma(1+o(1))+O(log D))+O(Delta)+O(Dlog n)$ bits of space and $O(|p|+tlog D)$ query time. We finally propose other solutions which are more space-efficient at the expense of a slight increase in query time.

قيم البحث

اقرأ أيضاً

91 - Sean Cleary , Roland Maio 2020
It is an open question whether there exists a polynomial-time algorithm for computing the rotation distances between pairs of extended ordered binary trees. The problem of computing the rotation distance between an arbitrary pair of trees, (S, T), ca n be efficiently reduced to the problem of computing the rotation distance between a difficult pair of trees (S, T), where there is no known first step which is guaranteed to be the beginning of a minimal length path. Of interest, therefore, is how to sample such difficult pairs of trees of a fixed size. We show that it is possible to do so efficiently, and present such an algorithm that runs in time $O(n^4)$.
119 - Haim Kaplan , Jay Tenenbaum 2021
Locality Sensitive Hashing (LSH) is an effective method of indexing a set of items to support efficient nearest neighbors queries in high-dimensional spaces. The basic idea of LSH is that similar items should produce hash collisions with higher proba bility than dissimilar items. We study LSH for (not necessarily convex) polygons, and use it to give efficient data structures for similar shape retrieval. Arkin et al. represent polygons by their turning function - a function which follows the angle between the polygons tangent and the $ x $-axis while traversing the perimeter of the polygon. They define the distance between polygons to be variations of the $ L_p $ (for $p=1,2$) distance between their turning functions. This metric is invariant under translation, rotation and scaling (and the selection of the initial point on the perimeter) and therefore models well the intuitive notion of shape resemblance. We develop and analyze LSH near neighbor data structures for several variations of the $ L_p $ distance for functions (for $p=1,2$). By applying our schemes to the turning functions of a collection of polygons we obtain efficient near neighbor LSH-based structures for polygons. To tune our structures to turning functions of polygons, we prove some new properties of these turning functions that may be of independent interest. As part of our analysis, we address the following problem which is of independent interest. Find the vertical translation of a function $ f $ that is closest in $ L_1 $ distance to a function $ g $. We prove tight bounds on the approximation guarantee obtained by the translation which is equal to the difference between the averages of $ g $ and $ f $.
The seminal work of Chow and Liu (1968) shows that approximation of a finite probabilistic system by Markov trees can achieve the minimum information loss with the topology of a maximum spanning tree. Our current paper generalizes the result to Marko v networks of tree width $leq k$, for every fixed $kgeq 2$. In particular, we prove that approximation of a finite probabilistic system with such Markov networks has the minimum information loss when the network topology is achieved with a maximum spanning $k$-tree. While constructing a maximum spanning $k$-tree is intractable for even $k=2$, we show that polynomial algorithms can be ensured by a sufficient condition accommodated by many meaningful applications. In particular, we prove an efficient algorithm for learning the optimal topology of higher order correlations among random variables that belong to an underlying linear structure.
Dual-tree algorithms are a widely used class of branch-and-bound algorithms. Unfortunately, developing dual-tree algorithms for use with different trees and problems is often complex and burdensome. We introduce a four-part logical split: the tree, t he traversal, the point-to-point base case, and the pruning rule. We provide a meta-algorithm which allows development of dual-tree algorithms in a tree-independent manner and easy extension to entirely new types of trees. Representations are provided for five common algorithms; for k-nearest neighbor search, this leads to a novel, tighter pruning bound. The meta-algorithm also allows straightforward extensions to massively parallel settings.
129 - Namrata Vaswani 2020
Phase retrieval (PR), also sometimes referred to as quadratic sensing, is a problem that occurs in numerous signal and image acquisition domains ranging from optics, X-ray crystallography, Fourier ptychography, sub-diffraction imaging, and astronomy. In each of these domains, the physics of the acquisition system dictates that only the magnitude (intensity) of certain linear projections of the signal or image can be measured. Without any assumptions on the unknown signal, accurate recovery necessarily requires an over-complete set of measurements. The only way to reduce the measurements/sample complexity is to place extra assumptions on the unknown signal/image. A simple and practically valid set of assumptions is obtained by exploiting the structure inherently present in many natural signals or sequences of signals. Two commonly used structural assumptions are (i) sparsity of a given signal/image or (ii) a low rank model on the matrix formed by a set, e.g., a time sequence, of signals/images. Both have been explored for solving the PR problem in a sample-efficient fashion. This article describes this work, with a focus on non-convex approaches that come with sample complexity guarantees under simple assumptions. We also briefly describe other different types of structural assumptions that have been used in recent literature.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا