FEG: Fast Extra Gradient Method

7 minute read

Published:

论文阅读笔记: Fast Extra Gradient Methods for Smooth Structured Nonconvex-Nonconcave Minimax Problems

Method Overview

文章关注于非凸非凹的极小极大问题,可以看作是 EG+ 算法的加速版本,利用锚点 (Anchor)的思想进行加速,

将EG+中 $\mathcal{O}(\frac{1}{k})$ 的收敛率加速为 $\mathcal{O}(\frac{1}{k^2})$.

对于问题, \(\begin{align} \min_x \max_y f(x,y) \end{align}\)

以及定义的算子,

\[\begin{align} Fz = [\nabla_x f(x,y), \nabla_t f(x,y)]^\top, z = [x,y]^\top \end{align}\]

满足假设为,

\[\begin{align} (Fz - Fz')^\top(z -z') \ge \frac{\rho}{2} \Vert Fz - Fz' \Vert, \rho \in (-L, +\infty) \end{align}\]

可以证明的是,该假设的蕴含了EG+中的假设,而且在该算法中 $\rho$ 的范围更大,同时该算法的收敛率也更快,

算法使用如下迭代格式,

\[\begin{align} z_{k+1/2} &= z_k + \beta_k (z_0 - z_k) - (1- \beta_k) (\alpha_k + \rho_k) Fz_k \\ z_{k+1} &= z_k + \beta_k (z_0 - z_k) - \alpha_k Fz_{k+1/2} - (1- \beta_k)\rho_k Fz_{k} \end{align}\]

Convergence Analysis

算法的收敛率证明依赖于下述定义的函数,

\[\begin{align} \mathcal{V}_k &= A_k \Vert Fz_k \Vert^2 - B_k Fz_k^\top (z_0 - z_k) \\ \text{With } A_k &= \frac{B_k(1 - \beta_k)(\alpha_k + \rho_k)}{2 \beta_k} - \frac{B_k \rho_k}{2}, B_{k+1} = \frac{B_k}{1 - \beta_k} \end{align}\]

文章将证明该序列满足单调递减的性质,从而可以通过简单地推导得到 $\mathcal{O}(\frac{1}{k^2})$ 的收敛率,

根据迭代公式,有,

\[\begin{align} z_{k+1} - z_k &= \beta_k (z_0 - z_k) - \alpha_k Fz_{k+1/2} - (1- \beta_k)\rho_k Fz_{k} \\ z_{k+1} - z_k &= \frac{\beta_k}{1 - \beta_k} (z_0 - z_{k+1}) - \frac{\alpha_k}{1 - \beta_k} Fz_{k+1/2} - \rho_k Fz_{k} \\ z_{k+1} - z_{k+1/2} &= \alpha_k (1- \beta_k) Fz_k - \alpha_k Fz_{k+1/2} \end{align}\]

因此,

\[\begin{align} \mathcal{V}_{k+1} - \mathcal{V}_k &\le \mathcal{V}_{k+1} - \mathcal{V}_k + \frac{B_k}{\beta_k} (Fz_{k+1} - Fz_k )^\top(z_{k+1} - z_k) - \frac{\rho_k B_k}{2 \beta_k} \Vert F z_{k+1} - F z_{k} \Vert^2 \\ &= \mathcal{V}_{k+1} - \mathcal{V}_k + \frac{B_k}{\beta_k} Fz_{k+1}^\top(z_{k+1} - z_k)- \frac{B_k}{\beta_k} Fz_{k}^\top(z_{k+1} - z_k) - \frac{\rho_k B_k}{2 \beta_k} \Vert F z_{k+1} - F z_{k} \Vert^2 \\ &= (A_{k+1} \Vert Fz_{k+1} \Vert^2 - B_{k+1} Fz_{k+1}^\top (z_0 - z_{k+1})) - (A_k \Vert Fz_k \Vert^2 - B_k Fz_k^\top (z_0 - z_k) ) \\ &\quad + \frac{B_k}{\beta_k} Fz_{k+1}^\top(\frac{\beta_k}{1 - \beta_k} (z_0 - z_{k+1}) - \frac{\alpha_k}{1 - \beta_k} Fz_{k+1/2} - \rho_k Fz_{k}) \\ &\quad - \frac{B_k}{\beta_k} Fz_{k}^\top(\beta_k (z_0 - z_k) - \alpha_k Fz_{k+1/2} - (1- \beta_k)\rho_k Fz_{k} ) - \frac{\rho_k B_k}{2 \beta_k} \Vert F z_{k+1} - F z_{k} \Vert^2 \\ &= (A_{k+1} - \frac{\rho_k B_k}{2 \beta_k}) \Vert F z_{k+1} \Vert^2 -( A_k - \frac{B_k (1 - \beta_k) \rho_k}{\beta_k} + \frac{\rho_k B_k}{2 \beta_k}) \Vert Fz_k \Vert^2 \\ &\quad + (\frac{B_k}{1 - \beta_k}- B_{k+1})Fz_{k+1}^\top(z_0 - z_{k+1}) - \frac{B_k \alpha_k}{\beta_k(1- \beta_k)} F z_{k+1}^\top F_{k+ 1/2} + \frac{B_k \alpha_k}{\beta_k} F z_k^\top Fz_{k+1/2} \\ &= (A_{k+1} - \frac{\rho_k B_k}{2 \beta_k}) \Vert F z_{k+1} \Vert^2 -( A_k - \frac{B_k (1 - \beta_k) \rho_k}{\beta_k} + \frac{\rho_k B_k}{2 \beta_k}) \Vert Fz_k \Vert^2 \\ &\quad-\frac{B_k \alpha_k}{\beta_k(1- \beta_k)} F z_{k+1}^\top F_{k+ 1/2} + \frac{B_k \alpha_k}{\beta_k} F z_k^\top Fz_{k+1/2}, \text{Let } B_{k+1} = \frac{B_k}{1 - \beta_k}\\ \end{align}\]

利用Lipschitz连续性,

\[\begin{align} \Vert Fz_{k+1} - Fz_{k+1/2} \Vert^2 &\le L^2 \Vert z_{k+1} - z_{k+1/2} \Vert^2 \\ &= L^2 \Vert \alpha_k (1- \beta_k) Fz_k - \alpha_k Fz_{k+1/2} \Vert^2 \\ &= L^2\alpha_k^2 ( 1- \beta_k)^2 \Vert Fz_k \Vert^2 -2 L^2 \alpha_k^2(1- \beta_k) Fz_k^\top Fz_{k+1/2} + L^2 \alpha_k^2 \Vert Fz_{k+ 1/2} \Vert^2 \\ \end{align}\]

移项后有,

\[\begin{align} Fz_k^\top Fz_{k+1/2} \le \frac{1 - \beta_k}{2} \Vert Fz_k \Vert^2 + \frac{1}{2(1 - \beta_k)} \Vert Fz_{k+1/2} \Vert^2 - \frac{1}{2L^2 \alpha_k^2(1- \beta_k)} \Vert Fz_{k+1} - F z_{k+ 1/2} \Vert^2 \end{align}\]

代入上式得到,

\[\begin{align} \mathcal{V}_{k+1} - \mathcal{V}_k &\le (A_{k+1} - \frac{\rho_k B_k}{2 \beta_k}) \Vert F z_{k+1} \Vert^2 -( A_k - \frac{B_k (1 - \beta_k) \rho_k}{\beta_k} + \frac{\rho_k B_k}{2 \beta_k}) \Vert Fz_k \Vert^2 -\frac{B_k \alpha_k}{\beta_k(1- \beta_k)} F z_{k+1}^\top F_{k+ 1/2}\\ &\quad+\frac{B_k \alpha_k}{\beta_k}(\frac{1 - \beta_k}{2} \Vert Fz_k \Vert^2 + \frac{1}{2(1 - \beta_k)} \Vert Fz_{k+1/2} \Vert^2 - \frac{1}{2L^2 \alpha_k^2(1- \beta_k)} \Vert Fz_{k+1} - F z_{k+ 1/2} \Vert^2) \\ &= (A_{k+1} - \frac{\rho_k B_k}{2 \beta_k}) \Vert F z_{k+1} \Vert^2 -( A_k - \frac{B_k (1 - \beta_k) \rho_k}{\beta_k} + \frac{\rho_k B_k}{2 \beta_k}) \Vert Fz_k \Vert^2 -\frac{B_k \alpha_k}{\beta_k(1- \beta_k)} F z_{k+1}^\top F_{k+ 1/2} \\ &+ \frac{B_k \alpha_k(1-\beta_k)}{2 \beta_k} \Vert Fz_k \Vert^2 + \frac{B_k \alpha_k}{2 \beta_k(1-\beta_k)} \Vert Fz_{k+1/2} \Vert^2 - \frac{B_k}{2L^2 \beta_k \alpha_k(1 - \beta_k)} \Vert F z_{k+1} - Fz_{k+1/2} \Vert^2 \\ &= (A_{k+1} - \frac{\rho_k B_k}{2 \beta_k} - \frac{B_k}{2L^2 \beta_k \alpha_k(1 - \beta_k)}) \Vert F z_{k+1} \Vert^2 - ( A_k+ \frac{\rho_k B_k}{2 \beta_k} - \frac{B_k (1 - \beta_k) \rho_k}{\beta_k} - \frac{B_k \alpha_k(1-\beta_k)}{2 \beta_k}) \Vert Fz_k \Vert^2 \\ &\quad + (\frac{B_k \alpha_k}{2 \beta_k(1-\beta_k)} - \frac{B_k}{2L^2 \beta_k \alpha_k(1 - \beta_k)}) \Vert Fz_{k+1/2} \Vert^2 + (\frac{B_k}{L^2 \beta_k \alpha_k(1 - \beta_k)}- \frac{B_k \alpha_k}{\beta_k(1-\beta_k)}) Fz_{k+1}^\top F z_{k+1/2} \\ \end{align}\]

为了消去 $\Vert Fz_k \Vert^2 $ 项,只需要令,

\[\begin{align} A_k &= \frac{B_k \alpha_k(1-\beta_k)}{2 \beta_k} + \frac{B_k (1 - \beta_k) \rho_k}{\beta_k} - \frac{\rho_k B_k}{2 \beta_k} \\ &= \frac{B_k(1-\beta_k)(\alpha_k + \rho_k)}{2 \beta_k} - \frac{B_k \rho_k }{2} \end{align}\]

并且令 $\alpha_{k+1} = \alpha_k =\alpha, \rho_{k+1} = \rho_k = \rho, \beta_{k+1} \le \beta_k$ ,此时很神奇地成立,

\[\begin{align} A_{k+1} &= \frac{B_{k+1}(1-\beta_{k+1})(\alpha_k + \rho_k)}{2 \beta_{k+1}} - \frac{B_{k+1} \rho_k }{2} \\ &\le \frac{B_{k+1}(\alpha_k + \rho_k)}{2 \beta_{k}} - \frac{B_{k+1} \rho_k }{2} \\ &= \frac{B_{k+1} \alpha_k}{ 2 \beta_k} + \frac{B_{k+1} (1-\beta_k) \rho_k}{2 \beta_k} \\ &= \frac{B_{k} \alpha_k}{ 2 \beta_k(1- \beta_k)} + \frac{B_{k} \rho_k}{2 \beta_k} \\ \end{align}\]

因此,

\[\begin{align} \mathcal{V}_{k+1} - \mathcal{V}_k &\le (A_{k+1} - \frac{\rho_k B_k}{2 \beta_k} - \frac{B_k}{2L^2 \beta_k \alpha_k(1 - \beta_k)}) \Vert F z_{k+1} \Vert^2 - ( A_k+ \frac{\rho_k B_k}{2 \beta_k} - \frac{B_k (1 - \beta_k) \rho_k}{\beta_k} - \frac{B_k \alpha_k(1-\beta_k)}{2 \beta_k}) \Vert Fz_k \Vert^2 \\ &\quad + (\frac{B_k \alpha_k}{2 \beta_k(1-\beta_k)} - \frac{B_k}{2L^2 \beta_k \alpha_k(1 - \beta_k)}) \Vert Fz_{k+1/2} \Vert^2 + (\frac{B_k}{L^2 \beta_k \alpha_k(1 - \beta_k)}- \frac{B_k \alpha_k}{\beta_k(1-\beta_k)}) Fz_{k+1}^\top F z_{k+1/2} \\ &= (A_{k+1} - \frac{\rho_k B_k}{2 \beta_k} - \frac{B_k}{2L^2 \beta_k \alpha_k(1 - \beta_k)}) \Vert F z_{k+1} \Vert^2 \\ &\quad + (\frac{B_k \alpha_k}{2 \beta_k(1-\beta_k)} - \frac{B_k}{2L^2 \beta_k \alpha_k(1 - \beta_k)}) \Vert Fz_{k+1/2} \Vert^2 + (\frac{B_k}{L^2 \beta_k \alpha_k(1 - \beta_k)}- \frac{B_k \alpha_k}{\beta_k(1-\beta_k)}) Fz_{k+1}^\top F z_{k+1/2} \\ &\le (\frac{B_k \alpha_k}{2 \beta_k(1-\beta_k)} - \frac{B_k}{2L^2 \beta_k \alpha_k(1 - \beta_k)}) \Vert F z_{k+1} \Vert^2 \\ &\quad + (\frac{B_k \alpha_k}{2 \beta_k(1-\beta_k)} - \frac{B_k}{2L^2 \beta_k \alpha_k(1 - \beta_k)}) \Vert Fz_{k+1/2} \Vert^2 + (\frac{B_k}{L^2 \beta_k \alpha_k(1 - \beta_k)}- \frac{B_k \alpha_k}{\beta_k(1-\beta_k)}) Fz_{k+1}^\top F z_{k+1/2} \\ &= (\frac{B_k \alpha_k}{2 \beta_k(1-\beta_k)} - \frac{B_k}{2L^2 \beta_k \alpha_k(1 - \beta_k)}) \Vert Fz_{k+1} - F_{k+1/2} \Vert^2 \\ &= \frac{(1- L^2 \alpha_k^2)B_k}{2L^2 \beta_k \alpha_k(1 - \beta_k)}) \Vert Fz_{k+1} - F_{k+1/2} \Vert^2 \\ &\le 0 , \text{Set } \alpha_k = \frac{1}{L} \end{align}\]

从而证明了该Lyapunov函数单调递减,也即,

\[\begin{align} \mathcal{V}_{k+1} &\le \mathcal{V}_k\\ \text{With } \mathcal{V}_k &= A_k \Vert Fz_k \Vert^2 - B_k Fz_k^\top (z_0 - z_k) \\ A_k &= \frac{B_k(1 - \beta_k)(\alpha_k + \rho_k)}{2 \beta_k} - \frac{B_k \rho_k}{2}, B_{k+1} = \frac{B_k}{1 - \beta_k} \end{align}\]

然后令,

\[\begin{align} \beta_k &= \frac{1}{k+1}, \alpha_k = \frac{1}{L}, \rho_k = \rho \\ \text{Then } B_k &= k,A_k = \frac{k^2 }{2} (\frac{1}{L} + \rho) - \frac{k \rho}{2} \end{align}\]

进而根据 $\mathcal{V}_0 = 0$, 以及单调递减性,

\[\begin{align} \mathcal{V}_k &\le \mathcal{V}_0 \\ A_k \Vert Fz_k \Vert^2 &\le B_k Fz_k^\top (z_0 - z_k) \\ \frac{B_k(1 - \beta_k)(\alpha_k + \rho_k)}{2 \beta_k} \Vert Fz_k \Vert^2 &\le B_k Fz_k^\top (z_0 - z_k) + \frac{B_k \rho_k}{2} \Vert Fz_k \Vert^2 \\ &= B_k Fz_k^\top (z_0 - z_{\ast}) + B_k Fz_k^\top (z_{\ast} - z_{k}) + \frac{B_k \rho_k}{2} \Vert Fz_k - Fz_{\ast}\Vert^2 \\ &\le B_k Fz_k^\top (z_0 - z_{\ast}) \\ &\le B_k \Vert Fz_k \Vert \Vert z_0 - z_{\ast} \Vert \end{align}\]

代入设定的参数,

\[\begin{align} \Vert Fz_k \Vert &\le \frac{2 \beta_k\Vert z_0 - z_{\ast} \Vert}{(1 - \beta_k)(\alpha_k + \rho_k)} = \frac{2 \Vert z_0 -z_{\ast} \Vert}{k (\frac{1}{L} +\rho)} \end{align}\]

因而算法达到了 $\mathcal{O}(\frac{1}{k^2})$ 的收敛阶,

\[\begin{align} \Vert Fz_k \Vert^2 \le \frac{4 \Vert z_0 -z_{\ast} \Vert^2}{k^2 (\frac{1}{L} +\rho)^2} \end{align}\]

将上述参数代入则得到了文章的FEG算法类中最基础算法 FEG,其实也是最核心的算法,

\[\begin{align} z_{k+1/2} &= z_k + \frac{1}{k+1} (z_0 - z_k) - \frac{k}{k+1} (\frac{1}{L} + \rho) Fz_k \\ z_{k+1} &= z_k + \frac{1}{k+1} (z_0 - z_k) - \frac{1}{L} Fz_{k+1/2} - \frac{k}{k+1} \rho Fz_{k} \end{align}\]

基于FEG算法,加入线搜索可以产生自适应的方法FEG-A(Adaptive ),

如果在随机设定下采用类似SGD的更新可以产生 随机性算法FEG-S(Stochastic), 但核心的分析内容和理论基础并无大差,