In this paper we investigate how gradient-based algorithms such as gradient descent, (multi-pass) stochastic gradient descent, its persistent variant, and the Langevin algorithm navigate non-convex loss landscapes, and which of them is able to reach the best generalization error at limited sample complexity. We consider the loss landscape of the high-dimensional phase retrieval problem as a prototypical highly non-convex example. We observe that for phase retrieval the stochastic variants of gradient descent are able to reach perfect generalization in regions of the control parameters where plain gradient descent is not. We apply dynamical mean-field theory from statistical physics to characterize analytically the full trajectories of these algorithms in their continuous-time limit, with a warm start, and for large system sizes. We further unveil several intriguing properties of the landscape and of the algorithms, such as the fact that gradient descent can achieve better generalization from less informed initializations.
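As an illustration of the setting (not of the dynamical mean-field analysis itself), here is a minimal numpy sketch of full-batch gradient descent on a phase retrieval loss; the quartic loss form, sample ratio, step size, and warm start are all assumptions chosen for illustration:

import numpy as np

rng = np.random.default_rng(0)
d, alpha = 200, 3.0                        # dimension and sample ratio n/d (assumed)
n = int(alpha * d)
A = rng.normal(size=(n, d)) / np.sqrt(d)   # Gaussian sensing vectors
x_star = rng.normal(size=d)                # ground-truth signal
y = (A @ x_star) ** 2                      # phaseless measurements

def loss_grad(x):
    r = (A @ x) ** 2 - y                   # residuals of the squared projections
    loss = 0.25 * np.mean(r ** 2)          # quartic loss (one common choice)
    grad = A.T @ (r * (A @ x)) / n
    return loss, grad

x = x_star + 0.5 * rng.normal(size=d)      # warm start near the signal (assumed)
lr = 0.1
for t in range(5000):                      # full-batch GD; an SGD variant would
    loss, g = loss_grad(x)                 # use a random mini-batch of rows here
    x -= lr * g

overlap = abs(x @ x_star) / (np.linalg.norm(x) * np.linalg.norm(x_star))
print(f"final loss {loss:.3e}, overlap with signal {overlap:.3f}")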
Using a non-thermal local search, called Extremal Optimization (EO), in conjunction with a recently developed scheme for classifying the valley structure of complex systems, we analyze a short-range spin glass. By comparison with earlier studies that used a thermal algorithm satisfying detailed balance, we determine which features of the landscape are algorithm dependent and which are inherently geometrical. Apparently characteristic of any local search in complex energy landscapes, the time series of successive energy records found by EO is also described approximately by log-Poisson statistics. Differences in the results provide additional insight into the performance of EO: in contrast with a thermal search, the extremal search visits dramatically higher energies while returning to more widely separated low-energy configurations. Two important properties of the energy landscape are independent of either algorithm: first, to find lower energy records, progressively higher energy barriers must be overcome; second, the Hamming distance between two consecutive low-energy records is linearly related to the height of the intervening barrier.
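A minimal sketch of the tau-EO heuristic on a small +/-J Edwards-Anderson lattice; the lattice size, the value of tau, and the per-spin fitness definition follow the standard tau-EO recipe and are assumptions here, not necessarily the paper's exact setup:

import numpy as np

rng = np.random.default_rng(1)
L, tau = 16, 1.4                           # lattice size and EO exponent (assumed)
N = L * L
s = rng.choice([-1, 1], size=(L, L))       # Ising spins
Jx = rng.choice([-1, 1], size=(L, L))      # +/-J bonds to the right neighbour
Jy = rng.choice([-1, 1], size=(L, L))      # +/-J bonds to the lower neighbour

def local_field(s):
    # sum of J_ij s_j over the four neighbours, periodic boundaries
    return (Jx * np.roll(s, -1, 1) + np.roll(Jx, 1, 1) * np.roll(s, 1, 1)
            + Jy * np.roll(s, -1, 0) + np.roll(Jy, 1, 0) * np.roll(s, 1, 0))

def energy(s):
    return -np.sum(s * (Jx * np.roll(s, -1, 1) + Jy * np.roll(s, -1, 0)))

ranks = np.arange(1, N + 1)
p = ranks ** (-tau)
p /= p.sum()                               # P(rank k) ~ k^(-tau), worst spin first
best_E = energy(s)
for step in range(200_000):
    fitness = (s * local_field(s)).ravel() # satisfied local energy of each spin
    order = np.argsort(fitness)            # most frustrated spins first
    i = order[rng.choice(N, p=p)]
    s.flat[i] *= -1                        # flip unconditionally: no detailed balance
    E = energy(s)
    if E < best_E:
        best_E = E                         # a new energy record
print("best energy per spin:", best_E / N)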
The conjugate gradient (CG) method, a standard and vital way of minimizing the energy of a variational state, is applied to solve several problems in Skyrmion physics. The single-Skyrmion profile that optimizes the energy of a two-dimensional chiral magnet is found without relying on specific boundary conditions. The two-dimensional Skyrmion lattice and the three-dimensional hedgehog crystal state are recovered efficiently using the modified CG (p-GD) method. The p-GD method is proposed as a complement to the traditional Monte Carlo annealing method, which still gives better results for the ground state but at a far greater cost in computation time.
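A minimal sketch of nonlinear CG energy minimization, using SciPy's conjugate-gradient routine on a toy periodic lattice of in-plane spin angles with an illustrative DMI-like twist term; this toy functional and all parameters are assumptions, not the paper's chiral-magnet energy or its p-GD variant:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
L = 32                                     # lattice size (assumed)

def energy(theta_flat):
    th = theta_flat.reshape(L, L)          # in-plane spin angles, periodic lattice
    d0 = th - np.roll(th, -1, 0)           # angle differences to lower neighbours
    d1 = th - np.roll(th, -1, 1)           # angle differences to right neighbours
    # exchange term plus an illustrative DMI-like twist term
    return -np.cos(d0).sum() - np.cos(d1).sum() - 0.5 * np.sin(d1).sum()

def grad(theta_flat):
    th = theta_flat.reshape(L, L)
    d0 = th - np.roll(th, -1, 0)
    d1 = th - np.roll(th, -1, 1)
    g = (np.sin(d0) - np.roll(np.sin(d0), 1, 0)
         + np.sin(d1) - np.roll(np.sin(d1), 1, 1)
         - 0.5 * np.cos(d1) + 0.5 * np.roll(np.cos(d1), 1, 1))
    return g.ravel()

theta0 = rng.uniform(0.0, 2.0 * np.pi, size=L * L)   # random initial state
res = minimize(energy, theta0, jac=grad, method="CG")
print("CG energy per site:", res.fun / (L * L), "converged:", res.success)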
We study the scaling properties of the solid-on-solid front of the infinite cluster in two-dimensional gradient percolation. We show that such an object is self-affine with a Hurst exponent equal to 2/3 up to a cutoff length proportional to the gradient raised to the power -4/7. Beyond this length scale, the front position has the character of uncorrelated noise. Importantly, the self-affine behavior is robust even after removing local jumps of the front. The previously observed multi-affinity is due to the dominance of overhangs at small distances in the structure function and is a crossover effect.
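A sketch of how a Hurst exponent is read off a front profile h(x) from the second-order structure function S2(l) = <|h(x+l) - h(x)|^2> ~ l^(2H); the profile here is synthesized spectrally with H = 2/3 for illustration, standing in for a front extracted from an actual gradient-percolation simulation:

import numpy as np

rng = np.random.default_rng(3)
N, H = 2 ** 14, 2.0 / 3.0                  # profile length and target Hurst exponent

# Spectral synthesis of a self-affine profile: power spectrum S(k) ~ k^-(2H+1)
k = np.fft.rfftfreq(N)[1:]
spectrum = np.concatenate(
    ([0.0], k ** (-(2 * H + 1) / 2) * np.exp(2j * np.pi * rng.random(k.size))))
h = np.fft.irfft(spectrum, n=N)

# Second-order structure function, fitted on a log-log scale
lags = np.unique(np.logspace(0, 3, 20).astype(int))
S2 = [np.mean((h[l:] - h[:-l]) ** 2) for l in lags]
slope = np.polyfit(np.log(lags), np.log(S2), 1)[0]
print(f"estimated Hurst exponent: {slope / 2:.3f} (target {H:.3f})")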
We learn recurrent neural network optimizers trained on simple synthetic functions by gradient descent. We show that these learned optimizers exhibit a remarkable degree of transfer: they can be used to efficiently optimize a broad range of derivative-free black-box functions, including Gaussian process bandits, simple control objectives, global optimization benchmarks, and hyper-parameter tuning tasks. Up to the training horizon, the learned optimizers learn to trade off exploration and exploitation, and compare favourably with heavily engineered Bayesian optimization packages for hyper-parameter tuning.
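A minimal PyTorch sketch of the idea: an LSTM proposes the next query point from the previous query and its observed value, and is meta-trained by backpropagating the summed function values over a short horizon. The architecture sizes, horizon, and random-quadratic task distribution are illustrative assumptions, not the paper's setup:

import torch
import torch.nn as nn

torch.manual_seed(0)
DIM, HID, HORIZON = 2, 32, 20              # all illustrative choices

class RNNOptimizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.cell = nn.LSTMCell(DIM + 1, HID)   # input: last query and its value
        self.out = nn.Linear(HID, DIM)          # output: increment to the query
    def forward(self, x, y, state):
        h, c = self.cell(torch.cat([x, y], dim=-1), state)
        return x + self.out(h), (h, c)

def sample_task(batch):
    # task distribution: random convex quadratics f(x) = |x - m|^2 (assumed)
    m = torch.randn(batch, DIM)
    return lambda x: ((x - m) ** 2).sum(-1, keepdim=True)

opt_net = RNNOptimizer()
meta_opt = torch.optim.Adam(opt_net.parameters(), lr=1e-3)
for step in range(2000):
    f = sample_task(64)
    x = torch.randn(64, DIM)
    state = (torch.zeros(64, HID), torch.zeros(64, HID))
    meta_loss = 0.0
    for t in range(HORIZON):               # the inner loop sees only f(x) values,
        y = f(x)                           # never gradients of f: black-box use
        meta_loss = meta_loss + y.mean()   # meta-loss: summed values over horizon
        x, state = opt_net(x, y, state)
    meta_opt.zero_grad()
    meta_loss.backward()                   # backprop through the whole unroll
    meta_opt.step()
print("final meta-loss:", float(meta_loss))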
We study a bandit version of phase retrieval where the learner chooses actions $(A_t)_{t=1}^n$ in the $d$-dimensional unit ball and the expected reward is $\langle A_t, \theta_\star\rangle^2$, where $\theta_\star \in \mathbb{R}^d$ is an unknown parameter vector. We prove that the minimax cumulative regret in this problem is $\smash{\tilde\Theta(d\sqrt{n})}$, which improves on the best known bounds by a factor of $\smash{\sqrt{d}}$. We also show that the minimax simple regret is $\smash{\tilde\Theta(d/\sqrt{n})}$ and that this is only achievable by an adaptive algorithm. Our analysis shows that an apparently convincing heuristic for guessing lower bounds can be misleading and that uniform bounds on the information ratio for information-directed sampling are not sufficient for optimal regret.
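A sketch of the reward model together with a simple explore-then-commit baseline; since the paper shows that optimal simple regret requires an adaptive algorithm, this non-adaptive baseline is illustrative only, and the noise level and estimator are assumptions:

import numpy as np

rng = np.random.default_rng(4)
d, n_explore = 5, 2000                     # dimension and exploration budget (assumed)
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)             # unknown unit parameter vector

def reward(a):
    return (a @ theta) ** 2 + 0.1 * rng.normal()    # noise level assumed

# E[r|A] = <A, theta>^2 = A^T (theta theta^T) A is linear in the outer product
# A A^T, so M = theta theta^T can be estimated by least squares on random actions.
X, r = [], []
for _ in range(n_explore):
    a = rng.normal(size=d)
    a /= np.linalg.norm(a)                 # random action on the unit sphere
    X.append(np.outer(a, a).ravel())
    r.append(reward(a))
m, *_ = np.linalg.lstsq(np.array(X), np.array(r), rcond=None)
M_hat = m.reshape(d, d)

# Commit: play the top eigenvector of the symmetrized estimate.
w, V = np.linalg.eigh((M_hat + M_hat.T) / 2)
a_hat = V[:, -1]
print("simple-regret proxy:", 1.0 - (a_hat @ theta) ** 2)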