AGDA: Alternative Gradient Descent Ascent
Published:
Problem Set-up
文章关注于以下非凸非凹的极小极大问题,且问题形式为有限和,
\[\min_x g(x) = \min_x \max_y f(x,y) =\min_x \max_y \frac{1}{n} \sum_{i=1}^n f_i(x,y)\]优化的函数满足 $L$- 光滑,
\[\begin{align} \Vert \nabla f_x(x,y) - \nabla f_x(x',y') \Vert &\le L\Vert x - x' \Vert + L \Vert y - y' \Vert \\ \Vert \nabla f_y(x,y) - \nabla f_y(x',y') \Vert &\le L\Vert x - x' \Vert + L \Vert y - y' \Vert \\ \end{align}\]且文章假设函数满足双边的PL条件,可以理解为函数虽然非凸非凹,但非凸非凹性并不严重,
\[\Vert \nabla_x f(x,y) \Vert^2 \ge 2 \mu (f(x,y ) -\min_x f(x,y)) \\ \Vert \nabla_y f(x,y) \Vert^2 \ge 2\mu (\max_y(x,y) - f(x,y)) \\\]且可以简单证明,在上述的PL条件下,以下三种点是等价的,因此只需要寻找某一个点即可,证明依照定义即可,此处从略,
\[\begin{align} \text{Gloabal Minimax Point: }f(x_{\ast}, y) &\le f(x_{\ast},y_{\ast}) \le \max_y (x,y) , \forall x,y \\ \text{Saddle Point: }f(x_{\ast}, y) &\le f(x_{\ast},y_{\ast}) \le (x,y_{\ast}) , \forall x,y \\ \text{Stationary Point: } \nabla f_x(x_{\ast},y_{\ast}) &= \nabla_y f(x_{\ast},y_{\ast} ) = 0 \end{align}\]且此时可以证明函数 $g$ 也同样满足系数为 $\mu$ 的PL条件,
\[\begin{align} \text{By } g(x) &= f(x,y^{\ast}(x)), \text{With } y = \text{argmax}_y f(x,y) \\ \Vert \nabla_x g(x) \Vert^2 &= \Vert \nabla_x f(x,y^{\ast}(x)) \Vert^2 \\ &\ge 2 \mu (f(x,y^{\ast}(x) ) -\min_{x'} f(x',y^{\ast}(x))) \\ &\ge 2 \mu(f(x,y^{\ast}(x) ) -\min_{x'} f(x',y^{\ast}(x))) \\ &\ge 2 \mu(f(x,y^{\ast}(x) ) -\min_{x'} f(x',y^{\ast}(x'))) \\ &=2\mu (g(x) - g(x_{\ast})) \end{align}\]Preparation
在正式进入文章的方法之前,首先看较为简单的满足PL条件的非凸极小问题如何使用随机梯度下降法,以及其收敛性分析,算法如下,例如我们如果已经知道 $g(x)$ 的显示表达式,可以直接采用下面的随机梯度下降求解,
\[x_{k+1} = x_k - \tau G(x_k,\xi_k)\]假设根据 $\xi_k$ 所采样的随机梯度为真实梯度的无偏估计,且方差有限,记作为 $\sigma^2$,
\[\begin{align} \mathbb{E} [G(x, \xi) - \nabla f(x)] &= 0 \\ \mathbb{E} [\Vert G(x,\xi) - \nabla f(x) \Vert^2] &\le \sigma^2 \end{align}\]则算法的收敛性可以采用如下方式证明,
\[\begin{align} \mathbb{E}[g(x_{k+1}) - g(x_{\ast})] &\le g(x_k) - g(x_{\ast}) -\tau \mathbb{E}[\nabla g(x_k)^\top G(x_k, \xi_k)]+ \frac{\tau^2L}{2} \mathbb{E} [\Vert G(x_k,\xi_k) \Vert^2] \\ &= g(x_k) - g(x_{\ast}) - \tau \mathbb{E}[\nabla g(x_k)^\top G(x_k, \xi_k)]+ \frac{\tau^2L}{2} \mathbb{E} [\Vert G(x_k,\xi_k) - \nabla g(x_k) \Vert^2]+ \frac{\tau^2L}{2} \Vert \nabla g(x_k) \Vert^2 \\ &= g(x_k) - g(x_{\ast}) +(\frac{\tau^2 L}{2} - \tau) \Vert \nabla g(x_k) \Vert^2+ \frac{\tau^2L}{2} \mathbb{E} [\Vert G(x_k,\xi_k) - \nabla g(x_k) \Vert^2] \\ &\le g(x_k) - g(x_{\ast}) - \frac{\tau}{2} \Vert \nabla g(x_k) \Vert^2+ \frac{\sigma^2 \tau^2 L}{2}, \text{Set } \tau \le \frac{1}{L} \\ &\le (1-\mu \tau)(g(x_k) - g(x_{\ast})) + \frac{\sigma^2\tau^2L}{2} \\ \end{align}\]当方差为 $0$ 的情况,或者使用全批量梯度下降的时候,只需要选取步长 $\tau = \frac{1}{L}$, 则可以得到,
\[\begin{align} \mathbb{E}[g(x_{k+1}) - g(x_{\ast})] &\le (1-\kappa)(g(x_k) - g(x_{\ast})) ,\text{With } \kappa = \frac{L}{\mu}\\ \end{align}\]再考虑到全批量梯度的计算代价,达到 $\epsilon$-最优解的计算代价为,
\[T = O(n \kappa \log \frac{1}{\epsilon})\]而对于方差不为 $0$ 的情况,虽然达不到线性收敛,但选取步长逐渐减小的合适的 $\tau$ ,可以达到 $O(\frac{1}{k})$ 速率的次线性收敛,
下面使用数学归纳法证明该结论,假设在 $k=0$ 的时候成立,只需要根据如下条件选取对应的 $\nu$ 即可,
\[g(x_0) - g(x_{\ast}) \le \frac{\nu}{k+1}\]希望递推式可以成立,可以选取合适的 $\nu$ 所实现,
\[\begin{align} \mathbb{E}[g(x_{k+1}) - g(x_{\ast})] &\le (1-\mu \tau)(g(x_k) - g(x_{\ast})) + \frac{\sigma^2\tau^2L}{2} \\ &= (1- \frac{\beta\mu}{k+\gamma})(g(x_k ) - g(x_{\ast})) + \frac{\sigma^2 \beta^2L}{2(k+\gamma)^2} ,\text{Set } \tau = \frac{\beta}{k+\gamma}\\ &\le (1- \frac{\beta\mu}{k+\gamma}) \frac{\nu}{k+\gamma} + \frac{\sigma^2 \beta^2 L}{2(k+\gamma)^2} \\ &= \frac{k \nu}{(k+1)^2} + \frac{(1- \beta\mu) \nu}{(k+\gamma)^2} + \frac{\beta^2\sigma^2 L}{2(k+\gamma)^2} \\ &\le\frac{\nu}{k+\gamma+1} + \frac{(1- \beta\mu) \nu}{(k+\gamma)^2} + \frac{\beta^2\sigma^2 L}{2(k+\gamma)^2} \\ &\le\frac{\nu}{k+\gamma+1}, \text{Set } \nu \ge \frac{\sigma^2 \beta^2L}{2(\beta\mu-1)} \\ &\le \frac{\sigma^2 \beta^2 L}{2(\beta\mu - 1)(k+\gamma+1)} \\ &= \frac{2 \sigma^2 L}{\mu^2(k+\gamma+1)}, \text{Set } \beta = \frac{2}{\mu} \end{align}\]综上,基于随机梯度的方法,为了达到 $\epsilon$- 最优解需要的梯度计算次数为,
\[T = O(\frac{\kappa \sigma^2}{\mu \epsilon})\]可以看到在随机优化中,限制收敛率的主要因素为方差 $\sigma^2$, 因此方差缩减技术应运而生,此类技术通过方差缩减使得算法达到线性收敛的结果,代表性的方法如 SVRG 以及对应的动量加速版本 Katyusha
AGDA
下面正式提出文章的方法,随机交替梯度下降上升算法,Stochastic Alternative Gradient Descent Ascent(AGDA),
\[\begin{align} x_{k+1} &= x_k - \tau_{k_1} G_x(x_k,y_k ,\xi_{k_1}) \\ y_{k+1} &= y_k + \tau_{k_2} G_y(x_k,y_k, \xi_{k_2}) \end{align}\]此时用到了在PL条件之下的一个重要性质,也即此时的函数 $g(x)$ 满足 $L’ = L + \frac{L^2}{\mu}$- 光滑, 此处与论文一样直接使用该结论,
基于同样的方法,一方面有,
\[\begin{align} \mathbb{E}[g(x_{k+1}) - g(x_{\ast})] &\le g(x_k) - g(x_{\ast}) -\tau_1 \mathbb{E}[\nabla g(x_k)^\top G_{x}]+ \frac{\tau_1^2L'}{2} \mathbb{E} [\Vert G_x \Vert^2] \\ &\le g(x_k) - g(x_{\ast}) -\tau_1 \nabla g(x_k)^\top \nabla_x(x_k , y_k) + \frac{\tau_1^2L'}{2} \Vert \nabla_x(x_k,y_k) \Vert^2 + \frac{\sigma^2\tau_1^2L'}{2} \\ &\le g(x_k) - g(x_{\ast}) -\tau_1 \nabla g(x_k)^\top \nabla_x(x_k , y_k) + \frac{\tau_1}{2} \Vert \nabla_x(x_k,y_k) \Vert^2 + \frac{\sigma^2\tau_1^2L}{2} ,\text{Set } \tau_1 \le \frac{1}{L'} \le \frac{1}{L}\\ &\le g(x_k) - g(x_{\ast}) - \frac{\tau_1}{2} \Vert \nabla g(x_k) \Vert^2 + \frac{\tau_1}{2} \Vert g(x_k) - \nabla_x(x_k,y_k) \Vert^2 + \frac{\sigma^2\tau_1^2 (L + \frac{L^2}{\mu})}{2} \\ \end{align}\]另一方面,
\[\begin{align} \mathbb{E}[g(x_{k+1}) - f(x_{k+1},y_{k+1})] &\le \mathbb{E}[g(x_{k+1}) -f(x_{k+1},y_k)] -\tau_2 \mathbb{E}[\nabla_yf(x_{k+1},y_k)^\top G_y)] + \frac{\tau_2^2 L}{2} \mathbb{E}[\Vert G_y \Vert^2 ] \\ &\le \mathbb{E}[g(x_{k+1}) -f(x_{k+1},y_k)] + (\tau_2 - \frac{\tau_2^2L}{2}) \Vert \nabla_y f(x_{k+1},y_k) \Vert^2 + \frac{\sigma^2\tau_2^2 L}{2} \\ &\le (1-\mu \tau_2) \mathbb{E}[g(x_{k+1}) - f(x_{k+1},y_k)] + \frac{\sigma^2\tau_2^2 L}{2} ,\text{Set } \tau_2 \le \frac{1}{L} \end{align}\]对于出现的 $g(x_{k+1}) - f(x_{k+1},y_k)$ 一项,利用拆分的方式进行Bound,
\[\begin{align} \mathbb{E} [f(x_k,y_k) - f(x_{k+1},y_k)] &\le \tau_1 \mathbb{E} [\nabla_x f(x_k,y_k)^\top G_x] + \frac{\tau_1^2L }{2} \mathbb{E}[\Vert G_x \Vert^2] \\ &\le \tau_1 \mathbb{E} [\nabla_x f(x_k,y_k)^\top G_x] + \frac{\tau_1^2L}{2} \Vert\nabla_x f(x_k,y_k) \Vert^2 + \frac{\sigma^2\tau_1^2L}{2} \\ &\le (\tau_1 +\frac{\tau_1^2L}{2}) \Vert\nabla_x f(x_k,y_k) \Vert^2 + \frac{\sigma^2\tau_1^2L}{2} \\ \end{align}\]以及根据上面的结论,
\[\begin{align} \mathbb{E}[g(x_{k+1}) - g(x_k)] \le - \frac{\tau_1}{2} \Vert \nabla g(x_k) \Vert^2 + \frac{\tau_1}{2} \Vert g(x_k) - \nabla_x(x_k,y_k) \Vert^2 + \frac{\sigma^2\tau_1^2 (L+ \frac{L^2}{\mu})}{2} \end{align}\]因此合起来有,
\[\begin{align} \mathbb{E}[g(x_{k+1}) - f(x_{k+1},y_{k+1})] &\le (1-\mu \tau_2) \mathbb{E}[g(x_{k+1}) - f(x_{k+1},y_k)] + \frac{\sigma^2\tau_2^2 L}{2} \\ &=(1-\mu \tau_2) \mathbb{E}[g(x_{k}) - f(x_k,y_k) + f(x_k,y_k)-f(x_{k+1},y_k) + g(x_{k+1}) - g(x_k)] + \frac{\sigma^2\tau_2^2 L}{2} \\ &\le(1-\mu \tau_2) (g(x_k) - f(x_k ,y_k))+\frac{\sigma^2\tau_2^2 L}{2} \\ &\quad +(1-\mu \tau_2)(\tau_1 +\frac{\tau_1^2L}{2}) \Vert\nabla_x f(x_k,y_k) \Vert^2 + \frac{\sigma^2\tau_1^2L}{2} ) \\ &\quad + (1-\mu \tau_2)(- \frac{\tau_1}{2} \Vert \nabla g(x_k) \Vert^2 + \frac{\tau_1}{2} \Vert g(x_k) - \nabla_x(x_k,y_k) \Vert^2 + \frac{\sigma^2\tau_1^2 (L+ \frac{L^2}{\mu})}{2}) \\ \end{align}\]考虑仍然需要Bound的项,可以根据目标函数的性质进行处理,首先论文同样不加证明地利用以下一个PL条件下的引理,
\[g(x_k)- f(x_k,y_k) \ge \frac{\mu}{2}\Vert y^{\ast} (x_k) - y_k \Vert^2\]从而可以得到,
\[\begin{align} \Vert \nabla_x f(x_k ,y_k) - g(x_k) \Vert^2 \le L^2 \Vert y^{\ast}(x_k) - y_k \Vert^2 \le \frac{2L^2}{\mu}(g(x_k) - f(x_k,y_k)) \end{align}\]以及利用函数 $g$ 也满足参数为 $\mu$ 的PL条件,
\[\Vert \nabla g(x_k) \Vert^2 \ge 2 \mu (g(x_k) - g(x_{\ast}))\]利用参数 $\lambda$ 对上述两个衡量最优性质的量进行线性组合,首先为了符号的简洁,如下定义,
\[\begin{align} a_k &= g(x_{k}) - g(x_{\ast}) \\ b_k &= g(x_k) - f(x_k,y_k) \end{align}\]因此,
\[\begin{align} \mathbb{E}[a_{k+1} + \lambda b_k] &\le a_k+\lambda(1-\mu \tau_2) b_k - \frac{\tau_1}{2} \Vert \nabla g(x_k) \Vert^2 + \frac{\tau_1}{2} \Vert g(x_k) - \nabla_x(x_k,y_k) \Vert^2 + \frac{\sigma^2\tau_1^2 (L + \frac{L^2}{\mu})}{2} +\frac{\lambda \sigma^2\tau_2^2 L}{2} +\lambda(1-\mu \tau_2)(\tau_1 +\frac{\tau_1^2L}{2}) \Vert\nabla_x f(x_k,y_k) \Vert^2 \\ &\quad + \lambda(1-\mu \tau_2)(- \frac{\tau_1}{2} \Vert \nabla g(x_k) \Vert^2 + \frac{\tau_1}{2} \Vert g(x_k) - \nabla_x(x_k,y_k) \Vert^2 + \frac{\sigma^2\tau_1^2 (L+ \frac{L^2}{\mu})}{2}) \\ &\le a_k + \lambda(1- \mu \tau_2) b_k +\frac{\lambda\sigma^2\tau_2^2 L}{2}+ \frac{(\lambda(1-\mu \tau_2)+1)\sigma^2 \tau_1^2(L + \frac{L^2}{\mu})}{2} \\ &\quad -\frac{(\lambda (1- \mu \tau_2) +1)\tau_1}{2} \Vert \nabla g(x_k) \Vert^2+ \lambda(1-\mu \tau_2) (\tau_1 +\frac{\tau_1^2L}{2}) \Vert \nabla_x f(x_k,y_k) \Vert^2 + \frac{(\lambda (1-\mu \tau_2) +1)\tau_1}{2} \Vert g(x_k) - \nabla_x f(x_k,y_k) \Vert^2 \\ &\le a_k + \lambda(1-\mu \tau_2) b_k +\frac{\lambda\sigma^2\tau_2^2 L}{2}+ \frac{\sigma^2 \tau_1^2(L + \frac{L^2}{\mu})}{2} + \frac{\lambda(1-\mu \tau_2)\sigma^2 \tau_1^2(2L + \frac{L^2}{\mu})}{2} \\ &\quad -(1 - \lambda(1-\mu \tau_2)(3+2\tau_1L)) \frac{\tau_1}{2} \Vert \nabla g(x_k) \Vert^2 + (1+ \lambda(1-\mu \tau_2)(5+ 2\tau_1L)) \frac{\tau_1}{2} \Vert g(x_k) - \nabla f_x(x_k,y_k) \Vert^2 \\ &\le [1-\mu \tau_1(1-\lambda(1-\mu \tau_2)(3 +2 \tau_1L))]a_k + [1-\mu \tau_2 + \frac{L^2 \tau_1}{\mu}(\frac{1}{\lambda}+ (1-\mu \tau_2)(5+2 \tau_1L)) ]\lambda b_k \\ &\quad + \frac{\lambda\sigma^2\tau_2^2 L}{2}+ \frac{\sigma^2 \tau_1^2(L + \frac{L^2}{\mu})}{2} + \frac{\lambda(1-\mu \tau_2)\sigma^2 \tau_1^2(2L + \frac{L^2}{\mu})}{2} \\ &= k_1 a_k + k_2 \lambda b_k + \frac{\lambda\sigma^2\tau_2^2 L}{2}+ \frac{\sigma^2 \tau_1^2(L + \frac{L^2}{\mu})}{2} + \frac{\lambda(1-\mu \tau_2)\sigma^2 \tau_1^2(2L + \frac{L^2}{\mu})}{2} \\ \text{Let } k_1 &= 1-\mu \tau_1(1-\lambda(1-\mu \tau_2)(3 +2 \tau_1L)) \\ k_2 &= 1-\mu \tau_2 + \frac{L^2 \tau_1}{\mu}(\frac{1}{\lambda}+ (1-\mu \tau_2)(5+2 \tau_1L)) \end{align}\]由于极小极大问题的困难之处,上式比原本的极小问题复杂了很多,但整体的形式类似,下面同样基于上面这个关键的不等式分别对全批量梯度算法和随机梯度算法进行讨论。
如果使用全批量梯度算法,方差 $\sigma^2 =0$, 此时我们需要选取合适的步长使得 $\max (k_1,k_2)<1$,从而达到线性收敛的结果,
\[\begin{align} k_1 &= 1-\mu \tau_1(1-\lambda(1-\mu \tau_2)(3 +2 \tau_1L)) \\ &= 1 - \mu \tau_1 + \lambda \mu \tau_1 (1 - \mu \tau_2)(3+ 2 \tau_1 L) \\ &\le 1- \mu \tau_1 + 5\lambda \mu \tau_1 , \text{By } \tau_1 \le \frac{1}{L} \\ &=1- (1-5\lambda) \mu \tau_1 \\ &= 1- \frac{1}{2}\mu \tau_1 , \text{Let } \lambda = \frac{1}{10}\\ k_2&= 1-\mu \tau_2 + \frac{L^2 \tau_1}{\mu}(\frac{1}{\lambda}+ (1-\mu \tau_2)(5+2 \tau_1L)) \\ &= 1- \mu \tau_2 + \frac{L^2 \tau_1}{\mu \lambda} + \frac{L^2 \tau_1}{\mu} (1-\mu \tau_2)(5+2 \tau_1L) \\ &\le 1- \mu \tau_2 + \frac{L^2 \tau_1}{\mu \lambda} + \frac{7L^2 \tau_1}{\mu}, \text{By } \tau_1 \le \frac{1}{L} \\ &= 1- \mu \tau_2 + \frac{17 L^2 \tau_1}{\mu} , \text{By } \lambda = \frac{1}{2}\\ &\le 1-\frac{1}{2}\mu \tau_2 , \text{Set } \tau_1 \le \frac{ \mu^2 \tau_2}{34L^2} \end{align}\]合起来,
\[\begin{align} a_{k+1} + \frac{b_{k+1}}{10} &\le k_1 a_k + k_2 \frac{b_k}{10} \\ &\le (1- \frac{\mu \tau_1}{2}) a_k + (1- \frac{\mu \tau_2}{2}) \frac{b_k}{10} \\ &\le \max(1- \frac{\mu \tau_1}{2}, 1 - \frac{\mu \tau_2}{2}) (a_k + \frac{b_k}{10}) \\ &= (1- \frac{\mu \tau_1}{2})(a_k + \frac{b_k}{10}), \text{By } \mu \tau_1 \le \mu \tau_2 \end{align}\]因此可以得到为了达到 $\epsilon$ -最优解,需要的梯度计算复杂度为,
\[\begin{align} T &= O( \frac{2n}{\mu \tau_1} \log \frac{1}{\epsilon}) = O(n \kappa^3 \log \frac{1}{\epsilon}) \end{align}\]对于随机梯度算法,同样地使用数学归纳法,并且在批量梯度算法所选取的步长的基础上,进一步具体化步长的系数,
\[\begin{align} \mathbb{E}[a_{k+1} + \frac{b_{k+1}}{10}] &\le (1- \frac{\mu \tau_1}{2})(a_k+ \frac{b_k}{10}) + \frac{\lambda\sigma^2\tau_2^2 L}{2}+ \frac{\sigma^2 \tau_1^2(L + \frac{L^2}{\mu})}{2} + \frac{\lambda(1-\mu \tau_2)\sigma^2 \tau_1^2(2L + \frac{L^2}{\mu})}{2} \\ &\le (1- \frac{\mu \tau_1}{2})(a_k+ \frac{b_k}{10}) + \frac{\lambda\sigma^2\tau_2^2 L}{2}+ \frac{(1+ \lambda)\sigma^2 \tau_1^2(2L + \frac{L^2}{\mu})}{2} \\ &\le (1- \frac{\mu \tau_1}{2}) \frac{\nu}{k+\gamma} + \frac{\lambda\sigma^2\tau_2^2 L}{2}+ \frac{(1+ \lambda)\sigma^2 \tau_1^2(2L + \frac{L^2}{\mu})}{2} ,\text{Assume } a_k + \frac{b_k}{10} \le \frac{\nu}{k+\gamma}\\ &= (1- \frac{\mu \beta}{k+\gamma}) \frac{\nu}{k+\gamma} +\frac{2 \times 34^2 L^5 \beta^2 \sigma^2}{\mu^4 (k+\gamma)^2}+ \frac{(1+ \lambda) \beta^2(2L + \frac{L^2}{\mu})\sigma^2}{(k+\gamma)^2} ,\text{Set } \tau_1 = \frac{2 \beta}{k+\gamma}, \tau_2 = \frac{34 L^2 \tau_1 }{\mu^2} = \frac{78 L^2 \beta }{\mu^2(k+\gamma)^2} \\ &\le \frac{\nu}{k+\gamma+1} + \frac{(1- \mu \beta)\nu}{(k+\gamma)^2} +\frac{2312 L^5 \beta^2 \sigma^2}{\mu^4 (k+\gamma)^2}+ \frac{(1+ \lambda) \beta^2(2L + \frac{L^2}{\mu})\sigma^2}{(k+\gamma)^2} \\ &\le \frac{\nu}{k+\gamma+1} , \text{Let } \nu \ge \frac{\sigma^2}{\mu \beta - 1} ( \frac{2312 L^5 \beta^2}{\mu^4} +\frac{1.1 \beta^2 L^2}{\mu} + 2.2 \beta^2 L ) \\ &\le \frac{(2312 \kappa^5 + 1.1 \kappa^2+ 2.2 \kappa) \sigma^2}{\mu(k+\gamma+1)} \end{align}\]因此达到 $\epsilon$ -最优解所需要的计算复杂度为,
\[T = O(\frac{\kappa^5 \sigma^2}{\mu\epsilon})\]由此完成了AGDA和全批量梯度版本和随机梯度版本的复杂度证明,可以看到结果虽然与极小优化问题类似,但由于极小极大问题的困难性,复杂度中对于条件数 $\kappa$ 的依赖更高阶,可以说明此类问题的确更加困难。