No Arabic abstract
We propose an image representation and matching approach that substantially improves visual-based location estimation for images. The main novelty of the approach, called distinctive visual element matching (DVEM), is its use of representations that are specific to the query image whose location is being predicted. These representations are based on visual element clouds, which robustly capture the connection between the query and visual evidence from candidate locations. We then maximize the influence of visual elements that are geo-distinctive because they do not occur in images taken at many other locations. We carry out experiments and analysis for both geo-constrained and geo-unconstrained location estimation cases using two large-scale, publicly-available datasets: the San Francisco Landmark dataset with $1.06$ million street-view images and the MediaEval 15 Placing Task dataset with $5.6$ million geo-tagged images from Flickr. We present examples that illustrate the highly-transparent mechanics of the approach, which are based on common sense observations about the visual patterns in image collections. Our results show that the proposed method delivers a considerable performance improvement compared to the state of the art.
With the proliferation of online social networking services and mobile smart devices equipped with mobile communications module and position sensor module, massive amount of multimedia data has been collected, stored and shared. This trend has put forward higher request on massive multimedia data retrieval. In this paper, we investigate a novel spatial query named region of visual interests query (RoVIQ), which aims to search users containing geographical information and visual words. Three baseline methods are presented to introduce how to exploit existing techniques to address this problem. Then we propose the definition of this query and related notions at the first time. To improve the performance of query, we propose a novel spatial indexing structure called quadtree based inverted visual index which is a combination of quadtree, inverted index and visual words. Based on it, we design a efficient search algorithm named region of visual interests search to support RoVIQ. Experimental evaluations on real geo-image datasets demonstrate that our solution outperforms state-of-the-art method.
Nowadays, users are encouraged to activate across multiple online social networks simultaneously. Anchor link prediction, which aims to reveal the correspondence among different accounts of the same user across networks, has been regarded as a fundamental problem for user profiling, marketing, cybersecurity, and recommendation. Existing methods mainly address the prediction problem by utilizing profile, content, or structural features of users in symmetric ways. However, encouraged by online services, users would also post asymmetric information across networks, such as geo-locations and texts. It leads to an emerged challenge in aligning users with asymmetric information across networks. Instead of similarity evaluation applied in previous works, we formalize correlation between geo-locations and texts and propose a novel anchor link prediction framework for matching users across networks. Moreover, our model can alleviate the label scarcity problem by introducing external data. Experimental results on real-world datasets show that our approach outperforms existing methods and achieves state-of-the-art results.
Audio-visual (AV) lip biometrics is a promising authentication technique that leverages the benefits of both the audio and visual modalities in speech communication. Previous works have demonstrated the usefulness of AV lip biometrics. However, the lack of a sizeable AV database hinders the exploration of deep-learning-based audio-visual lip biometrics. To address this problem, we compile a moderate-size database using existing public databases. Meanwhile, we establish the DeepLip AV lip biometrics system realized with a convolutional neural network (CNN) based video module, a time-delay neural network (TDNN) based audio module, and a multimodal fusion module. Our experiments show that DeepLip outperforms traditional speaker recognition models in context modeling and achieves over 50% relative improvements compared with our best single modality baseline, with an equal error rate of 0.75% and 1.11% on the test datasets, respectively.
Distributed visual analysis applications, such as mobile visual search or Visual Sensor Networks (VSNs) require the transmission of visual content on a bandwidth-limited network, from a peripheral node to a processing unit. Traditionally, a Compress-Then-Analyze approach has been pursued, in which sensing nodes acquire and encode the pixel-level representation of the visual content, that is subsequently transmitted to a sink node in order to be processed. This approach might not represent the most effective solution, since several analysis applications leverage a compact representation of the content, thus resulting in an inefficient usage of network resources. Furthermore, coding artifacts might significantly impact the accuracy of the visual task at hand. To tackle such limitations, an orthogonal approach named Analyze-Then-Compress has been proposed. According to such a paradigm, sensing nodes are responsible for the extraction of visual features, that are encoded and transmitted to a sink node for further processing. In spite of improved task efficiency, such paradigm implies the central processing node not being able to reconstruct a pixel-level representation of the visual content. In this paper we propose an effective compromise between the two paradigms, namely Hybrid-Analyze-Then-Compress (HATC) that aims at jointly encoding visual content and local image features. Furthermore, we show how a target tradeoff between image quality and task accuracy might be achieved by accurately allocating the bitrate to either visual content or local features.
Due to the rapid development of mobile Internet techniques, cloud computation and popularity of online social networking and location-based services, massive amount of multimedia data with geographical information is generated and uploaded to the Internet. In this paper, we propose a novel type of cross-modal multimedia retrieval called geo-multimedia cross-modal retrieval which aims to search out a set of geo-multimedia objects based on geographical distance proximity and semantic similarity between different modalities. Previous studies for cross-modal retrieval and spatial keyword search cannot address this problem effectively because they do not consider multimedia data with geo-tags and do not focus on this type of query. In order to address this problem efficiently, we present the definition of $k$NN geo-multimedia cross-modal query at the first time and introduce relevant conceptions such as cross-modal semantic representation space. To bridge the semantic gap between different modalities, we propose a method named cross-modal semantic matching which contains two important component, i.e., CorrProj and LogsTran, which aims to construct a common semantic representation space for cross-modal semantic similarity measurement. Besides, we designed a framework based on deep learning techniques to implement common semantic representation space construction. In addition, a novel hybrid indexing structure named GMR-Tree combining geo-multimedia data and R-Tree is presented and a efficient $k$NN search algorithm called $k$GMCMS is designed. Comprehensive experimental evaluation on real and synthetic dataset clearly demonstrates that our solution outperforms the-state-of-the-art methods.