Let $z=(x,y)$ be coordinates for the product space $\mathbb{R}^{m_1}\times \mathbb{R}^{m_2}$. Let $f:\mathbb{R}^{m_1}\times \mathbb{R}^{m_2}\rightarrow \mathbb{R}$ be a $C^1$ function, and $\nabla f=(\partial _xf,\partial _yf)$ its gradient. Fix $0<\alpha <1$. For a point $(x,y)\in \mathbb{R}^{m_1}\times \mathbb{R}^{m_2}$, a number $\delta >0$ satisfies Armijo's condition at $(x,y)$ if the following inequality holds: \begin{eqnarray*} f(x-\delta \partial _xf,y-\delta \partial _yf)-f(x,y)\leq -\alpha \delta (||\partial _xf||^2+||\partial _yf||^2). \end{eqnarray*} When $f(x,y)=f_1(x)+f_2(y)$ is a coordinate-wise sum map, we propose the following {\bf coordinate-wise} Armijo condition. Fix again $0<\alpha <1$. A pair of positive numbers $\delta _1,\delta _2>0$ satisfies the coordinate-wise variant of Armijo's condition at $(x,y)$ if the following inequality holds: \begin{eqnarray*} [f_1(x-\delta _1\nabla f_1(x))+f_2(y-\delta _2\nabla f_2(y))]-[f_1(x)+f_2(y)]\leq -\alpha (\delta _1||\nabla f_1(x)||^2+\delta _2||\nabla f_2(y)||^2). \end{eqnarray*} We then extend the results of our recent work on Backtracking Gradient Descent and some of its variants to this setting. We show by an example the advantage of using the coordinate-wise Armijo condition over the usual Armijo condition.
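Since the objective here is a sum $f_1(x)+f_2(y)$, the coordinate-wise condition decouples, and each $\delta_i$ can be found by ordinary backtracking on its own summand. A minimal Python sketch, where the shrink factor `beta` and the starting step `delta0` are illustrative choices rather than values from the paper:

```python
import numpy as np

def blockwise_backtracking(f1, g1, f2, g2, x, y, delta0=1.0, alpha=0.5, beta=0.5):
    """For a sum map f(x, y) = f1(x) + f2(y) the coordinate-wise Armijo
    condition decouples, so each delta_i is found by ordinary backtracking
    on its own summand.  Illustrative sketch only."""
    def backtrack(fi, gi, v):
        g = gi(v)
        delta = delta0
        # Shrink delta until Armijo's condition holds for this summand
        while fi(v - delta * g) - fi(v) > -alpha * delta * np.dot(g, g):
            delta *= beta
        return delta
    return backtrack(f1, g1, x), backtrack(f2, g2, y)

# Usage on f1(x) = ||x||^2 / 2 and f2(y) = ||y||^2
f1 = lambda x: 0.5 * np.dot(x, x)
g1 = lambda x: x
f2 = lambda y: np.dot(y, y)
g2 = lambda y: 2.0 * y
d1, d2 = blockwise_backtracking(f1, g1, f2, g2,
                                np.array([3.0, 4.0]), np.array([2.0]))
```

Note that the two accepted step sizes differ, reflecting the different curvature of the two summands.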
Let $z=(x,y)$ be coordinates for the product space $\mathbb{R}^{m_1}\times \mathbb{R}^{m_2}$. Let $f:\mathbb{R}^{m_1}\times \mathbb{R}^{m_2}\rightarrow \mathbb{R}$ be a $C^1$ function, and $\nabla f=(\partial _xf,\partial _yf)$ its gradient. Fix $0<\alpha <1$. For a point $(x,y)\in \mathbb{R}^{m_1}\times \mathbb{R}^{m_2}$, a number $\delta >0$ satisfies Armijo's condition at $(x,y)$ if the following inequality holds: \begin{eqnarray*} f(x-\delta \partial _xf,y-\delta \partial _yf)-f(x,y)\leq -\alpha \delta (||\partial _xf||^2+||\partial _yf||^2). \end{eqnarray*} In a previous paper, we proposed the following {\bf coordinate-wise} Armijo condition. Fix again $0<\alpha <1$. A pair of positive numbers $\delta _1,\delta _2>0$ satisfies the coordinate-wise variant of Armijo's condition at $(x,y)$ if the following inequality holds: \begin{eqnarray*} [f(x-\delta _1\partial _xf(x,y), y-\delta _2\partial _y f(x,y))]-[f(x,y)]\leq -\alpha (\delta _1||\partial _xf(x,y)||^2+\delta _2||\partial _yf(x,y)||^2). \end{eqnarray*} Previously we applied this condition to functions of the form $f(x,y)=f(x)+g(y)$ and proved various convergence results for them. For a general function, it is crucial - in order to carry out real computations - to have a systematic algorithm for obtaining $\delta _1$ and $\delta _2$ satisfying the coordinate-wise version of Armijo's condition, much like Backtracking for the usual Armijo condition. In this paper we propose such an algorithm and prove corresponding convergence results. We then analyse and present experimental results for some functions such as $f(x,y)=a|x|+y$ (given by Asl and Overton in connection with Wolfe's method), $f(x,y)=x^3 \sin (1/x) + y^3 \sin(1/y)$ and Rosenbrock's function.
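One simple way to search for such a pair $(\delta_1,\delta_2)$ is to shrink both step sizes geometrically until the joint inequality holds. The sketch below is a heuristic illustration of that idea, not necessarily the algorithm proposed in the paper:

```python
import numpy as np

def coordinatewise_backtracking(f, grad_f, x, y, delta0=1.0, alpha=0.5,
                                beta=0.5, max_iter=60):
    """Shrink (delta1, delta2) geometrically until the coordinate-wise
    Armijo inequality holds at (x, y).  Heuristic sketch only."""
    gx, gy = grad_f(x, y)
    delta1 = delta2 = delta0
    for _ in range(max_iter):
        lhs = f(x - delta1 * gx, y - delta2 * gy) - f(x, y)
        rhs = -alpha * (delta1 * np.dot(gx, gx) + delta2 * np.dot(gy, gy))
        if lhs <= rhs:  # coordinate-wise Armijo condition satisfied
            break
        delta1 *= beta
        delta2 *= beta
    return delta1, delta2

# Usage on f(x, y) = (||x||^2 + ||y||^2) / 2, whose gradient is (x, y)
f = lambda x, y: 0.5 * (np.dot(x, x) + np.dot(y, y))
grad_f = lambda x, y: (x, y)
d1, d2 = coordinatewise_backtracking(f, grad_f, np.array([3.0, 4.0]),
                                     np.array([1.0, 2.0]))
```

A more refined scheme could shrink the two step sizes independently; the joint geometric shrink is just the simplest variant that terminates for $C^1$ functions with nonzero gradient.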
Fix a constant $0<\alpha <1$. For a $C^1$ function $f:\mathbb{R}^k\rightarrow \mathbb{R}$, a point $x$ and a positive number $\delta >0$, we say that Armijo's condition is satisfied if $f(x-\delta \nabla f(x))-f(x)\leq -\alpha \delta ||\nabla f(x)||^2$. It is the basis for the well-known Backtracking Gradient Descent (Backtracking GD) algorithm. Consider a sequence $\{x_n\}$ defined by $x_{n+1}=x_n-\delta _n\nabla f(x_n)$, for positive numbers $\delta _n$ for which Armijo's condition is satisfied. We show that if $\{x_n\}$ converges to a non-degenerate critical point, then $\{\delta _n\}$ must be bounded. Moreover, this boundedness can be quantified in terms of the norms of the Hessian $\nabla ^2f$ and its inverse at the limit point. This complements the first author's results on Unbounded Backtracking GD, and shows that in the case of convergence to a non-degenerate critical point the behaviour of Unbounded Backtracking GD is not too different from that of the usual Backtracking GD. On the other hand, in the case of convergence to a degenerate critical point the behaviours can be very different. We run some experiments to illustrate that both scenarios can indeed happen. In another part of the paper, we argue that Backtracking GD has the correct unit (according to a definition by Zeiler in his Adadelta paper). The main point is that since the learning rate in Backtracking GD is bound by Armijo's condition, it is not unitless.
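The Backtracking GD iteration described here can be sketched as follows; the shrink factor `beta` and starting step `delta0` are illustrative choices, not prescribed by the paper:

```python
import numpy as np

def backtracking_gd_step(f, grad_f, x, delta0=1.0, alpha=0.5, beta=0.5):
    """One Backtracking GD step: shrink delta until Armijo's condition
    f(x - delta*grad) - f(x) <= -alpha*delta*||grad||^2 holds, then move."""
    g = grad_f(x)
    delta = delta0
    while f(x - delta * g) - f(x) > -alpha * delta * np.dot(g, g):
        delta *= beta
    return x - delta * g, delta

# Usage: minimize f(x) = ||x||^2 / 2 starting from x0 = (3, 4)
f = lambda x: 0.5 * np.dot(x, x)
grad_f = lambda x: x
x = np.array([3.0, 4.0])
for _ in range(50):
    x, delta = backtracking_gd_step(f, grad_f, x)
```

Note that when the gradient vanishes the Armijo test is satisfied immediately (both sides are zero), so the inner loop always terminates at a critical point.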
The method of block coordinate gradient descent (BCD) has been a powerful tool for large-scale optimization. This paper considers the BCD method that successively updates a series of blocks selected according to a Markov chain. This kind of block selection is neither i.i.d. random nor cyclic. On the other hand, it is a natural choice for some applications in distributed optimization and Markov decision processes, where i.i.d. random and cyclic selections are either infeasible or very expensive. By applying mixing-time properties of a Markov chain, we prove the convergence of Markov chain BCD for minimizing Lipschitz differentiable functions, which can be nonconvex. When the functions are convex and strongly convex, we establish sublinear and linear convergence rates, respectively. We also present a method of Markov chain inertial BCD. Finally, we discuss potential applications.
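A toy illustration of BCD with Markov-chain block selection follows; the transition matrix, step size and separable objective are invented for the example and are not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def markov_chain_bcd(grad_blocks, x_blocks, P, steps=2000, lr=0.1):
    """Block coordinate gradient descent where the index of the updated
    block follows a Markov chain with transition matrix P (rows sum to 1).
    Block selection is neither i.i.d. random nor cyclic."""
    i = 0  # start the chain from block 0
    for _ in range(steps):
        x_blocks[i] = x_blocks[i] - lr * grad_blocks[i](x_blocks)
        i = rng.choice(len(x_blocks), p=P[i])  # next block via the chain
    return x_blocks

# Toy problem: f(x) = sum_j ||x_j - c_j||^2 over 3 blocks, with blocks
# selected by a lazy random walk on a 3-cycle (irreducible and aperiodic).
c = [np.array([1.0]), np.array([2.0]), np.array([3.0])]
grads = [lambda xb, j=j: 2.0 * (xb[j] - c[j]) for j in range(3)]
P = np.array([[0.5, 0.25, 0.25],
              [0.25, 0.5, 0.25],
              [0.25, 0.25, 0.5]])
x = [np.zeros(1) for _ in range(3)]
x = markov_chain_bcd(grads, x, P, steps=2000, lr=0.1)
```

Because the chain is irreducible and aperiodic, every block is visited with positive long-run frequency, which is the property the mixing-time analysis exploits.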
Difference-of-Convex (DC) minimization, referring to the problem of minimizing the difference of two convex functions, has found rich applications in statistical learning and has been studied extensively for decades. However, existing methods are primarily based on multi-stage convex relaxation, leading only to weak optimality of critical points. This paper proposes a coordinate descent method for minimizing DC functions based on sequential nonconvex approximation. Our approach iteratively solves a nonconvex one-dimensional subproblem globally, and it is guaranteed to converge to a coordinate-wise stationary point. We prove that this new optimality condition is always stronger than the critical point condition and the directional point condition when the objective function is weakly convex. For comparison, we also include a naive variant of coordinate descent methods based on sequential convex approximation in our study. When the objective function satisfies an additional regularity condition called \emph{sharpness}, coordinate descent methods with an appropriate initialization converge \emph{linearly} to the optimal solution set. Also, for many applications of interest, we show that the nonconvex one-dimensional subproblem can be computed exactly and efficiently using a breakpoint searching method. We present some discussions and extensions of our proposed method. Finally, we have conducted extensive experiments on several statistical learning tasks to show the superiority of our approach. Keywords: Coordinate Descent, DC Minimization, DC Programming, Difference-of-Convex Programs, Nonconvex Optimization, Sparse Optimization, Binary Optimization.
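As a concrete toy instance of this idea, minimizing a convex quadratic minus a multiple of the $\ell_1$ norm is a DC problem whose one-dimensional subproblems are piecewise quadratic with a single breakpoint at zero, so each can be solved globally in closed form by comparing the two branch minimizers. The sketch below is illustrative only and is not the paper's algorithm:

```python
import numpy as np

def dc_coordinate_descent(A, b, mu, iters=100):
    """Coordinate descent for the toy DC problem
        min_x 0.5*||A x - b||^2 - mu*||x||_1
    (convex quadratic minus the convex l1 norm).  Each 1-D subproblem is a
    quadratic minus mu*|t|: nonconvex, but piecewise quadratic with one
    breakpoint at t = 0, so it is solved exactly by comparing the global
    minimizers of the two branches."""
    m, n = A.shape
    x = np.zeros(n)
    res = A @ x - b                       # running residual A x - b
    for _ in range(iters):
        for i in range(n):
            a_i = A[:, i]
            res -= a_i * x[i]             # remove coordinate i's contribution
            a = a_i @ a_i                 # curvature of the 1-D quadratic
            c = a_i @ res                 # linear coefficient
            t_pos = max(0.0, (mu - c) / a)     # branch minimizer on t >= 0
            t_neg = min(0.0, -(mu + c) / a)    # branch minimizer on t <= 0
            phi = lambda t: 0.5 * a * t * t + c * t - mu * abs(t)
            x[i] = t_pos if phi(t_pos) <= phi(t_neg) else t_neg
            res += a_i * x[i]
    return x

# Usage: with A = I the solution can be checked by hand per coordinate
x = dc_coordinate_descent(np.eye(2), np.array([1.0, -2.0]), 0.5)
```

The $\ell_1$ term here has the opposite sign from sparse regularization, so it pushes coordinates away from zero; the exact 1-D solve picks the correct branch at each breakpoint.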
This paper is concerned with improving the empirical convergence speed of block-coordinate descent algorithms for approximate nonnegative tensor factorization (NTF). We propose an extrapolation strategy in-between block updates, referred to as heuristic extrapolation with restarts (HER). HER significantly accelerates the empirical convergence speed of most existing block-coordinate algorithms for dense NTF, in particular for challenging computational scenarios, while requiring a negligible additional computational budget.
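A heuristic sketch of extrapolation-with-restarts between block updates, applied to plain matrix NMF for simplicity (the parameter names, update rules and restart policy below are illustrative assumptions, not the paper's HER algorithm):

```python
import numpy as np

def nmf_her(M, r, iters=200, beta0=0.5, eta=1.5, gamma=0.9):
    """Alternating projected-gradient NMF with extrapolation between the
    W- and H-updates and a restart rule: beta grows while the residual
    decreases and the extrapolation is discarded when it increases.
    Heuristic sketch only."""
    rng = np.random.default_rng(1)
    m, n = M.shape
    W = rng.random((m, r)); H = rng.random((r, n))
    We, He = W.copy(), H.copy()          # extrapolated iterates
    beta = beta0
    err_prev = np.linalg.norm(M - W @ H)
    for _ in range(iters):
        # W-update: projected gradient step with a 1/L step size
        GW = (We @ He - M) @ He.T
        LW = np.linalg.norm(He @ He.T, 2) + 1e-12
        W_new = np.maximum(We - GW / LW, 0.0)
        We = W_new + beta * (W_new - W)  # extrapolate before next block
        W = W_new
        # H-update
        GH = We.T @ (We @ He - M)
        LH = np.linalg.norm(We.T @ We, 2) + 1e-12
        H_new = np.maximum(He - GH / LH, 0.0)
        He = H_new + beta * (H_new - H)
        H = H_new
        err = np.linalg.norm(M - W @ H)
        if err > err_prev:               # restart: drop the extrapolation
            We, He, beta = W.copy(), H.copy(), beta0 * gamma
        else:
            beta = min(1.0, beta * eta)  # grow the extrapolation weight
        err_prev = err
    return W, H

# Usage on a synthetic nonnegative low-rank matrix
rng = np.random.default_rng(0)
M = rng.random((5, 2)) @ rng.random((2, 6))
W, H = nmf_her(M, 2)
```

The restart test costs one residual evaluation per outer iteration, which is consistent with the abstract's claim of a negligible additional computational budget.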