No Arabic abstract
In community detection on graphs, the semi-supervised learning problem entails inferring the ground-truth membership of each node in a graph, given the connectivity structure and a limited number of revealed node labels. Different subsets of revealed labels can in principle lead to higher or lower information gains and induce different reconstruction accuracies. In the framework of the dense stochastic block model, we employ statistical physics methods to derive a large deviation analysis for this problem, in the high-dimensional limit. This analysis allows the characterization of the fluctuations around the typical behaviour, capturing the effect of correlated label choices and yielding an estimate of their informativeness and their rareness among subsets of the same size. We find theoretical evidence of a non-monotonic relationship between reconstruction accuracy and the free energy associated to the posterior measure of the inference problem. We further discuss possible implications for active learning applications in community detection.
Active learning is a branch of machine learning that deals with problems where unlabeled data is abundant yet obtaining labels is expensive. The learning algorithm has the possibility of querying a limited number of samples to obtain the corresponding labels, subsequently used for supervised learning. In this work, we consider the task of choosing the subset of samples to be labeled from a fixed finite pool of samples. We assume the pool of samples to be a random matrix and the ground truth labels to be generated by a single-layer teacher random neural network. We employ replica methods to analyze the large deviations for the accuracy achieved after supervised learning on a subset of the original pool. These large deviations then provide optimal achievable performance boundaries for any active learning algorithm. We show that the optimal learning performance can be efficiently approached by simple message-passing active learning algorithms. We also provide a comparison with the performance of some other popular active learning strategies.
Random constraint satisfaction problems undergo several phase transitions as the ratio between the number of constraints and the number of variables is varied. When this ratio exceeds the satisfiability threshold no more solutions exist; the satisfiable phase, for less constrained problems, is itself divided in an unclustered regime and a clustered one. In the latter solutions are grouped in clusters of nearby solutions separated in configuration space from solutions of other clusters. In addition the rigidity transition signals the appearance of so-called frozen variables in typical solutions: beyond this threshold most solutions belong to clusters with an extensive number of variables taking the same values in all solutions of the cluster. In this paper we refine the description of this phenomenon by estimating the location of the freezing transition, corresponding to the disappearance of all unfrozen solutions (not only typical ones). We also unveil phase transitions for the existence and uniqueness of locked solutions, in which all variables are frozen. From a technical point of view we characterize atypical solutions with a number of frozen variables different from the typical value via a large deviation study of the dynamics of a stripping process (whitening) that unveils the frozen variables of a solution, building upon recent works on atypical trajectories of the bootstrap percolation dynamics. Our results also bear some relevance from an algorithmic perspective, previous numerical studies having shown that heuristic algorithms of various kinds usually output unfrozen solutions.
The ground-state energy E_0 of a spin glass is an example of an extreme statistic. We consider the large deviations of this energy for a variety of models when the number of spins N goes to infinity. In most cases, the behavior can be understood qualitatively, in particular with the help of semi-analytical results for hierarchical lattices. Particular attention is paid to the Sherrington-Kirkpatrick model; after comparing to the Tracy-Widom distribution which follows from the spherical approximation, we find that the large deviations give rise to non-trivial scaling laws with N.
Supervised models of NLP rely on large collections of text which closely resemble the intended testing setting. Unfortunately matching text is often not available in sufficient quantity, and moreover, within any domain of text, data is often highly heterogenous. In this paper we propose a method to distill the important domain signal as part of a multi-domain learning system, using a latent variable model in which parts of a neural model are stochastically gated based on the inferred domain. We compare the use of discrete versus continuous latent variables, operating in a domain-supervised or a domain semi-supervised setting, where the domain is known only for a subset of training inputs. We show that our model leads to substantial performance improvements over competitive benchmark domain adaptation methods, including methods using adversarial learning.
Mesoscopic pattern extraction (MPE) is the problem of finding a partition of the nodes of a complex network that maximizes some objective function. Many well-known network inference problems fall in this category, including, for instance, community detection, core-periphery identification, and imperfect graph coloring. In this paper, we show that the most popular algorithms designed to solve MPE problems can in fact be understood as special cases of the maximum likelihood formulation of the stochastic block model (SBM), or one of its direct generalizations. These equivalence relations show that the SBM is nearly universal with respect to MPE problems.