In the big data era, massive amounts of geo-tagged multimedia data have been generated and collected by smart mobile devices equipped with mobile communication and position sensor modules. This trend places higher demands on large-scale geo-multimedia data retrieval. Spatial similarity join is one of the important problems in the area of spatial databases. Previous work has focused on textual documents with geo-tags rather than geo-multimedia data such as geo-images. In this paper, we study a novel search problem named spatial visual similarity join (SVS-JOIN for short), which aims to find geo-image pairs that are similar in both geo-location and visual content. We define SVS-JOIN for the first time and present how to measure geographical similarity and visual similarity. We then introduce a baseline inspired by methods for textual similarity join and an extension named SVS-JOIN$_G$, which applies a spatial grid strategy to improve efficiency. To further improve search performance, we develop a novel approach called SVS-JOIN$_Q$, which utilizes a quadtree and a global inverted index. Experimental evaluations on real geo-image datasets demonstrate that our solution achieves high performance.
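The grid idea behind a join of this kind can be illustrated with a minimal sketch. It is not the SVS-JOIN$_G$ algorithm itself: the abstract does not state the similarity measures, so the sketch assumes that two geo-images are geographically similar when their Euclidean distance is at most `max_dist` and visually similar when the Jaccard similarity of their visual-word sets is at least `min_visual_sim`. The names `grid_join`, `max_dist`, and `min_visual_sim` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed visual similarity measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def grid_join(images, max_dist, min_visual_sim):
    """Return sorted (id, id) pairs that are close in space and visual content.

    images: list of (image_id, x, y, frozenset_of_visual_words).
    The grid cell side equals max_dist, so any pair within max_dist lies
    in the same cell or in one of the 8 neighbouring cells.
    """
    grid = defaultdict(list)
    for idx, (_, x, y, _) in enumerate(images):
        cell = (math.floor(x / max_dist), math.floor(y / max_dist))
        grid[cell].append(idx)

    results = []
    for (cx, cy), members in grid.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            for i in members:
                for j in grid.get((cx + dx, cy + dy), ()):
                    if i >= j:  # report each unordered pair once
                        continue
                    id_i, xi, yi, words_i = images[i]
                    id_j, xj, yj, words_j = images[j]
                    if math.hypot(xi - xj, yi - yj) > max_dist:
                        continue  # spatial filter
                    if jaccard(words_i, words_j) >= min_visual_sim:
                        results.append((id_i, id_j))
    return sorted(results)


images = [
    ("p", 0.0, 0.0, frozenset({1, 2, 3})),
    ("q", 0.5, 0.5, frozenset({2, 3, 4})),
    ("r", 0.9, 1.1, frozenset({1, 2, 3})),
    ("s", 10.0, 10.0, frozenset({1, 2, 3})),
]
pairs = grid_join(images, max_dist=1.0, min_visual_sim=0.5)
```

The grid only restricts which pairs are compared; every candidate pair is still verified against both thresholds, so the result matches a brute-force comparison under the same assumed measures.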
We introduce and study the problem of computing the similarity self-join in a streaming context (SSSJ), where the input is an unbounded stream of items arriving continuously. The goal is to find all pairs of items in the stream whose similarity is greater than a given threshold. The simplest formulation of the problem requires unbounded memory, and thus, it is intractable. To make the problem feasible, we introduce the notion of time-dependent similarity: the similarity of two items decreases with the difference in their arrival time. By leveraging the properties of this time-dependent similarity function, we design two algorithmic frameworks to solve the SSSJ problem. The first one, MiniBatch (MB), uses existing index-based filtering techniques for the static version of the problem, and combines them in a pipeline. The second framework, Streaming (STR), adds time filtering to the existing indexes, and integrates new time-based bounds deep into the workings of the algorithms. We also introduce a new indexing technique (L2), which is based on an existing state-of-the-art indexing technique (L2AP), but is optimized for the streaming case. Extensive experiments show that the STR algorithm, when instantiated with the L2 index, is the most scalable option across a wide array of datasets and parameters.
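To see why a time-dependent similarity bounds the memory needed, consider the following minimal sketch. It is neither the MB nor the STR framework and uses no index. It assumes, as an illustration only, that the decayed similarity is the cosine similarity multiplied by `exp(-lam * |t_a - t_b|)`; the abstract does not specify the decay function, and the names `streaming_self_join`, `theta`, and `lam` are hypothetical. Under that assumption, no pair whose arrival times differ by more than `ln(1 / theta) / lam` can exceed the threshold, so older items can be dropped.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def streaming_self_join(stream, theta, lam):
    """Yield (older_id, newer_id, sim) for pairs with decayed similarity > theta.

    stream: iterable of (item_id, timestamp, vector) with non-decreasing
    timestamps; requires 0 < theta <= 1 and lam > 0.
    Assumed decay: sim = cosine * exp(-lam * time_gap).
    """
    # Cosine is at most 1, so a pair further apart than this cannot pass theta.
    horizon = math.log(1.0 / theta) / lam
    window = deque()  # items still young enough to match a new arrival
    for item_id, t, vec in stream:
        while window and t - window[0][1] > horizon:
            window.popleft()  # expired: safe to forget
        for old_id, old_t, old_vec in window:
            sim = cosine(vec, old_vec) * math.exp(-lam * (t - old_t))
            if sim > theta:
                yield old_id, item_id, sim
        window.append((item_id, t, vec))


stream = [
    ("a", 0.0, {"x": 1.0}),
    ("b", 1.0, {"x": 1.0}),
    ("c", 100.0, {"x": 1.0}),  # identical to a and b, but arrives too late
    ("d", 100.5, {"y": 1.0}),  # recent, but orthogonal to c
]
matches = list(streaming_self_join(stream, theta=0.5, lam=0.1))
```

The window holds only items within the time horizon, which is what turns the unbounded-memory formulation into a bounded one in this sketch.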
Similarity join, which can find similar objects (e.g., products, names, addresses) across different sources, is powerful in dealing with variety in big data, especially web data. Threshold-driven similarity join, which has been extensively studied in the past, assumes that a user is able to specify a similarity threshold, and then focuses on how to efficiently return the object pairs whose similarities pass the threshold. We argue that the assumption of a well-set similarity threshold may not be valid for two reasons. First, the optimal thresholds for different similarity join tasks may vary a lot. Second, the end-to-end time spent on similarity join is likely to be dominated by a back-and-forth threshold-tuning process. In response, we propose preference-driven similarity join. The key idea is to provide several result-set preferences, rather than a range of thresholds, for a user to choose from. Intuitively, a result-set preference can be considered an objective function that captures a user's preference on a similarity join result. Once a preference is chosen, we automatically compute the similarity join result optimizing the preference objective. As a proof of concept, we devise two useful preferences and propose a novel preference-driven similarity join framework coupled with effective optimization techniques. Our approaches are evaluated on four real-world web datasets from a diverse range of application scenarios. The experiments show that preference-driven similarity join can achieve high-quality results without a tedious threshold-tuning process.
With the proliferation of online social networking services and smart mobile devices equipped with mobile communication and position sensor modules, massive amounts of multimedia data have been collected, stored, and shared. This trend places higher demands on massive multimedia data retrieval. In this paper, we investigate a novel spatial query named region of visual interests query (RoVIQ), which aims to search for users associated with geographical information and visual words. Three baseline methods are presented to show how existing techniques can be exploited to address this problem. We then define this query and its related notions for the first time. To improve query performance, we propose a novel spatial indexing structure called the quadtree based inverted visual index, which combines a quadtree, an inverted index, and visual words. Based on it, we design an efficient search algorithm named region of visual interests search to support RoVIQ. Experimental evaluations on real geo-image datasets demonstrate that our solution outperforms the state-of-the-art method.
Online social networking techniques and large-scale multimedia systems are developing rapidly, which has not only brought great convenience to our daily life but also generated, collected, and stored large-scale multimedia data. This trend has put forward higher requirements and greater challenges for massive multimedia data retrieval. In this paper, we investigate the problem of image similarity measurement, which is used in many applications. First, we propose the definition of similarity measurement of images and the related notions. Based on it, we present a novel basic method of similarity measurement named SMIN. To improve the performance of calculation, we propose a novel indexing structure called SMI Temp Index (SMII for short). In addition, we build an index of potentially similar visual words off-line to solve the problem that the index cannot be reused. Experimental evaluations on two real image datasets demonstrate that our solution outperforms the state-of-the-art method.
We propose algorithms for performing multiway joins using a new type of coarse-grained reconfigurable hardware accelerator, ``Plasticine'', that, compared with other accelerators, emphasizes high compute capability and high on-chip communication bandwidth. Joining three or more relations in a single step, i.e., a multiway join, is efficient when the join of any two relations yields too large an intermediate relation. We show at least a 200X speedup for executing a sequence of binary hash joins on Plasticine compared with a CPU. We further show that in some realistic cases, a Plasticine-like accelerator can make 3-way joins more efficient than a cascade of binary hash joins on the same hardware, by a factor of up to 45X.
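The difference between a cascade of binary hash joins and a single-step 3-way join can be illustrated in plain software. The sketch below is not the Plasticine algorithm and says nothing about hardware performance. It assumes a hypothetical chain join of relations R(a, b), S(b, c), and T(c, d), and only shows that the cascade materializes an intermediate relation that the single-step variant avoids.

```python
from collections import defaultdict


def hash_join(left, right, left_key, right_key):
    """Binary hash join: build on `right`, probe with `left`."""
    table = defaultdict(list)
    for row in right:
        table[row[right_key]].append(row)
    return [l + r for l in left for r in table.get(l[left_key], ())]


def cascaded_join(R, S, T):
    """(R join S on b) join T on c; also returns the intermediate size."""
    rs = hash_join(R, S, 1, 0)    # rows: (a, b, b, c), fully materialized
    rst = hash_join(rs, T, 3, 0)  # rows: (a, b, b, c, c, d)
    out = [(a, b, c, d) for a, b, _, c, _, d in rst]
    return out, len(rs)


def three_way_join(R, S, T):
    """Single-step join: probe S and T per R row, no intermediate relation."""
    s_by_b = defaultdict(list)
    for b, c in S:
        s_by_b[b].append(c)
    t_by_c = defaultdict(list)
    for c, d in T:
        t_by_c[c].append(d)
    return [
        (a, b, c, d)
        for a, b in R
        for c in s_by_b.get(b, ())
        for d in t_by_c.get(c, ())
    ]


# Every R row matches every S row, but only one S row matches T.
R = [(i, 0) for i in range(3)]
S = [(0, j) for j in range(4)]
T = [(0, "z")]

cascade_out, intermediate_size = cascaded_join(R, S, T)
single_out = three_way_join(R, S, T)
```

In this toy input the cascade builds 12 intermediate rows to produce 3 output rows, while the single-step variant produces the same 3 rows without storing the intermediate relation.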