No Arabic abstract
Multidimensional data distributions can have complex topologies and variable local dimensions. To approximate complex data, we propose a new type of low-dimensional ``principal object: a principal cubic complex. This complex is a generalization of linear and non-linear principal manifolds and includes them as a particular case. To construct such an object, we combine a method of topological grammars with the minimization of an elastic energy defined for its embedment into multidimensional data space. The whole complex is presented as a system of nodes and springs and as a product of one-dimensional continua (represented by graphs), and the grammars describe how these continua transform during the process of optimal complex construction. The simplest case of a topological grammar (``add a node, ``bisect an edge) is equivalent to the construction of ``principal trees, an object useful in many practical applications. We demonstrate how it can be applied to the analysis of bacterial genomes and for visualization of cDNA microarray data using the ``metro map representation. The preprint is supplemented by animation: ``How the topological grammar constructs branching principal components (AnimatedBranchingPCA.gif).
Principal manifolds are defined as lines or surfaces passing through ``the middle of data distribution. Linear principal manifolds (Principal Components Analysis) are routinely used for dimension reduction, noise filtering and data visualization. Recently, methods for constructing non-linear principal manifolds were proposed, including our elastic maps approach which is based on a physical analogy with elastic membranes. We have developed a general geometric framework for constructing ``principal objects of various dimensions and topologies with the simplest quadratic form of the smoothness penalty which allows very effective parallel implementations. Our approach is implemented in three programming languages (C++, Java and Delphi) with two graphical user interfaces (VidaExpert http://bioinfo.curie.fr/projects/vidaexpert and ViMiDa http://bioinfo-out.curie.fr/projects/vimida applications). In this paper we overview the method of elastic maps and present in detail one of its major applications: the visualization of microarray data in bioinformatics. We show that the method of elastic maps outperforms linear PCA in terms of data approximation, representation of between-point distance structure, preservation of local point neighborhood and representing point classes in low-dimensional spaces.
We provide a short introduction to the field of topological data analysis and discuss its possible relevance for the study of complex systems. Topological data analysis provides a set of tools to characterise the shape of data, in terms of the presence of holes or cavities between the points. The methods, based on notion of simplicial complexes, generalise standard network tools by naturally allowing for many-body interactions and providing results robust under continuous deformations of the data. We present strengths and weaknesses of current methods, as well as a range of empirical studies relevant to the field of complex systems, before identifying future methodological challenges to help understand the emergence of collective phenomena.
We analyze the connectivity structure of weighted brain networks extracted from spontaneous magnetoencephalographic (MEG) signals of healthy subjects and epileptic patients (suffering from absence seizures) recorded at rest. We find that, for the activities in the 5-14 Hz range, healthy brains exhibit a sparse connectivity, whereas the brain networks of patients display a rich connectivity with clear modular structure. Our results suggest that modularity plays a key role in the functional organization of brain areas during normal and pathological neural activities at rest.
Life and language are discrete combinatorial systems (DCSs) in which the basic building blocks are finite sets of elementary units: nucleotides or codons in a DNA sequence and letters or words in a language. Different combinations of these finite units give rise to potentially infinite numbers of genes or sentences. This type of DCS can be represented as an Alphabetic Bipartite Network ($alpha$-BiN) where there are two kinds of nodes, one type represents the elementary units while the other type represents their combinations. There is an edge between a node corresponding to an elementary unit $u$ and a node corresponding to a particular combination $v$ if $u$ is present in $v$. Naturally, the partition consisting of the nodes representing elementary units is fixed, while the other partition is allowed to grow unboundedly. Here, we extend recently analytical findings for $alpha$-BiNs derived in [Peruani et al., Europhys. Lett. 79, 28001 (2007)] and empirically investigate two real world systems: the codon-gene network and the phoneme-language network. The evolution equations for $alpha$-BiNs under different growth rules are derived, and the corresponding degree distributions computed. It is shown that asymptotically the degree distribution of $alpha$-BiNs can be described as a family of beta distributions. The one-mode projections of the theoretical as well as the real world $alpha$-BiNs are also studied. We propose a comparison of the real world degree distributions and our theoretical predictions as a means for inferring the mechanisms underlying the growth of real world systems.
An algorithm for optimization of signal significance or any other classification figure of merit suited for analysis of high energy physics (HEP) data is described. This algorithm trains decision trees on many bootstrap replicas of training data with each tree required to optimize the signal significance or any other chosen figure of merit. New data are then classified by a simple majority vote of the built trees. The performance of this algorithm has been studied using a search for the radiative leptonic decay B->gamma l nu at BaBar and shown to be superior to that of all other attempted classifiers including such powerful methods as boosted decision trees. In the B->gamma e nu channel, the described algorithm increases the expected signal significance from 2.4 sigma obtained by an original method designed for the B->gamma l nu analysis to 3.0 sigma.