Normalized Google distance (NGD) is a relative semantic distance based on the World Wide Web (or any other large electronic database, for instance Wikipedia) and a search engine that returns aggregate page counts. The earlier NGD between pairs of search terms (including phrases) is not sufficient for all applications. We propose an NGD of finite multisets of search terms that is better for many applications. This gives a relative semantics shared by a multiset of search terms. We give applications and compare the results with those obtained using the pairwise NGD. The derivation of the NGD method is based on Kolmogorov complexity.
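As a concrete reference point, the earlier pairwise NGD mentioned above can be computed directly from aggregate page counts. The minimal Python sketch below uses the standard pairwise formula (the multiset NGD proposed here generalizes it); the counts and the index size N are hypothetical placeholders, not live search-engine results.

```python
import math

def ngd(f_x: int, f_y: int, f_xy: int, N: int) -> float:
    """Pairwise normalized Google distance from raw page counts.

    f_x, f_y : page counts for terms x and y alone
    f_xy     : page count for pages containing both x and y
    N        : total number of pages indexed by the search engine
    """
    lx, ly, lxy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(lx, ly) - lxy) / (math.log(N) - min(lx, ly))

# Hypothetical counts for illustration only:
print(ngd(f_x=800_000, f_y=600_000, f_xy=300_000, N=50_000_000_000))
```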
The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation.
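The compression-based realization is straightforward to sketch for string-represented objects. The snippet below uses zlib as a stand-in approximator of Kolmogorov complexity; any real-world compressor could be substituted, and stronger compressors generally give better approximations.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: a computable stand-in for the
    (uncomputable) normalized information distance, with the compressed
    length playing the role of Kolmogorov complexity."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"the quick brown fox", b"the quick brown dog"))  # similar: small
print(ncd(b"the quick brown fox", b"01011001110100011011"))  # dissimilar: larger
```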
Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers the equivalent of 'society' is 'database', and the equivalent of 'use' is 'way to search the database'. We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts we use the world-wide-web as database, and Google as search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the world-wide-web using Google page counts. The world-wide-web is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples: distinguishing colors from numbers, clustering names of paintings by 17th-century Dutch masters and names of books by English novelists, recognizing emergencies and primes, and a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87% with the expert-crafted WordNet categories.
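To illustrate the hierarchical clustering application, here is a minimal sketch assuming a precomputed pairwise distance matrix for a handful of color and number terms; the distance values are hypothetical placeholders rather than actual Google page-count statistics, and in practice each entry would come from an NGD computation like the one sketched earlier.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

terms = ["red", "blue", "green", "one", "two", "three"]
# Hypothetical symmetric NGD matrix with zero diagonal (illustrative values):
D = np.array([
    [0.00, 0.15, 0.18, 0.70, 0.72, 0.75],
    [0.15, 0.00, 0.16, 0.71, 0.69, 0.74],
    [0.18, 0.16, 0.00, 0.73, 0.70, 0.72],
    [0.70, 0.71, 0.73, 0.00, 0.12, 0.14],
    [0.72, 0.69, 0.70, 0.12, 0.00, 0.13],
    [0.75, 0.74, 0.72, 0.14, 0.13, 0.00],
])
# Average-linkage tree; with distances like these it separates
# the color terms from the number terms.
Z = linkage(squareform(D), method="average")
```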
Embedding is a useful technique for projecting high-dimensional features into a low-dimensional space, and it has many successful applications including link prediction, node classification, and natural language processing. Current approaches mainly focus on static data, which usually leads to unsatisfactory performance in applications involving large changes over time. How to dynamically characterize the variation of the embedded features is still largely unexplored. In this paper, we introduce a dynamic variational embedding (DVE) approach for sequence-aware data based on recent advances in recurrent neural networks. DVE can model a node's intrinsic nature and temporal variation explicitly and simultaneously, which is crucial for exploration. We further apply DVE to sequence-aware recommender systems, and develop an end-to-end neural architecture for link prediction.
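A minimal sketch of the general idea, assuming a GRU-based recurrent encoder with a Gaussian head per time step: a static embedding captures a node's intrinsic nature, while the recurrent state tracks its temporal variation. The layer sizes and architecture here are illustrative assumptions, not the paper's DVE configuration.

```python
import torch
import torch.nn as nn

class DynamicVariationalEmbedding(nn.Module):
    """Sequence-aware variational embedding sketch: a per-item embedding
    (intrinsic nature) feeds a GRU (temporal variation), and a Gaussian
    head emits a latent embedding at each step via reparameterization."""
    def __init__(self, num_items: int, emb_dim: int = 64, latent_dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(num_items, emb_dim)          # intrinsic nature
        self.gru = nn.GRU(emb_dim, emb_dim, batch_first=True)  # temporal variation
        self.mu = nn.Linear(emb_dim, latent_dim)
        self.logvar = nn.Linear(emb_dim, latent_dim)

    def forward(self, item_ids):                  # item_ids: (batch, seq_len)
        h, _ = self.gru(self.embed(item_ids))     # (batch, seq_len, emb_dim)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return z, mu, logvar
```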
We study the properties of the Google matrix of an Ulam network generated by intermittency maps. This network is created by the Ulam method, which gives a matrix approximant for the Perron-Frobenius operator of the dynamical map. The spectral properties of the eigenvalues and eigenvectors of this matrix are analyzed. We show that the PageRank of the system is characterized by a power-law decay with the exponent $\beta$ dependent on the map parameters and the Google damping factor $\alpha$. Under certain conditions the PageRank is completely delocalized, so that Google search in such a situation becomes inefficient.
We study the properties of the Google matrix generated by a coarse-grained Perron-Frobenius operator of the Chirikov typical map with dissipation. The finite-size matrix approximant of this operator is constructed by the Ulam method. Applied to this simple dynamical model, the method creates directed Ulam networks with approximately scale-free scaling and characteristics rather similar to those of the World Wide Web. Here the simple dynamical attractors play the role of popular web sites with a strong concentration of PageRank. A variation of the Google parameter $\alpha$ or other parameters of the dynamical map can drive the PageRank of the Google matrix to a delocalized phase with a strange attractor, where Google search becomes inefficient.
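A minimal sketch of the machinery common to these two studies, assuming a one-dimensional map on [0, 1] as the dynamics: the Ulam method samples trajectories cell by cell to build a column-stochastic approximant of the Perron-Frobenius operator, which is damped into a Google matrix whose PageRank is obtained by power iteration. The example map (a Pomeau-Manneville-type intermittency map), the cell count, and $\alpha$ are illustrative choices, not the papers' exact settings.

```python
import numpy as np

def ulam_google_pagerank(f, n_cells=200, n_points=1000, alpha=0.85, n_iter=500):
    """Ulam-method approximant of the Perron-Frobenius operator of a 1D map
    f on [0, 1], damped into a Google matrix; returns the PageRank vector."""
    rng = np.random.default_rng(0)
    edges = np.linspace(0.0, 1.0, n_cells + 1)
    S = np.zeros((n_cells, n_cells))
    for i in range(n_cells):
        # Sample points in cell i, apply the map, and count destination cells.
        x = rng.uniform(edges[i], edges[i + 1], n_points)
        j = np.minimum((f(x) * n_cells).astype(int), n_cells - 1)
        S[:, i] = np.bincount(j, minlength=n_cells)
    S /= S.sum(axis=0)                       # column-stochastic transfer matrix
    G = alpha * S + (1.0 - alpha) / n_cells  # Google matrix with damping alpha
    p = np.full(n_cells, 1.0 / n_cells)
    for _ in range(n_iter):                  # power iteration
        p = G @ p
    return p / p.sum()

# Pomeau-Manneville-type intermittency map as an example of the dynamics:
pagerank = ulam_google_pagerank(lambda x: (x + x**2) % 1.0)
```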