VR-AGDA: Variance Reduced Alternating Gradient Descent Ascent

Paper reading notes: Global convergence and variance-reduced optimization for a class of nonconvex-nonconcave minimax problems

Method Overview

The AGDA algorithm proposed in the paper comes in two variants, full-batch and stochastic. The full-batch version guarantees linear convergence but is computationally expensive, while the stochastic-gradient version is cheap per iteration yet only achieves a sublinear $\mathcal{O}(\frac{1}{k})$ rate. The paper studies how variance-reduction techniques, with SVRG as the representative example, can be used to recover linear convergence for the stochastic algorithm.

The algorithm can be viewed as an SVRG-type method with restarts. In the outer loop, every $T$ epochs a point is selected uniformly at random from the previous round and used as the restart point; in the inner loop, every $N$ iterations a point is taken as the anchor (reference point) for variance reduction of the gradients. The inner loop performs the following updates (a code sketch of the full nested loop follows the equations),

\[\begin{align} g_{k} &= \nabla_x f(x_{k}, y_{k}, \xi_1) - \nabla_x f(\tilde x, \tilde y, \xi_1) + \nabla_x f(\tilde x, \tilde y) \\ h_{k} &= \nabla_y f(x_{k+1}, y_{k}, \xi_2) - \nabla_y f(\tilde x, \tilde y, \xi_2) + \nabla_y f(\tilde x, \tilde y) \\ x_{k+1} &= x_{k} - \alpha_k g_k \\ y_{k+1} &= y_{k} + \beta_k h_k \\ \end{align}\]
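
To make the nested structure concrete, here is a minimal NumPy sketch of the loops described above. The oracle signatures `grad_x(x, y, i)` / `grad_y(x, y, i)` and the exact restart rule are illustrative assumptions for this note, not the paper's reference implementation.

```python
import numpy as np

def vr_agda(grad_x, grad_y, x0, y0, n, alpha, beta, S, T, N, seed=0):
    """Minimal sketch of the VR-AGDA loop structure.

    grad_x(x, y, i) / grad_y(x, y, i): stochastic partial gradients on sample i;
    i=None is taken to mean the full-batch gradient over all n samples.
    S restarts, T anchors per restart, N inner steps per anchor.
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x0, float).copy(), np.asarray(y0, float).copy()
    for _ in range(S):                              # outer loop: restarts
        history = []                                # iterates of this restart round
        for _ in range(T):                          # anchors (full-gradient points)
            x_tld, y_tld = x.copy(), y.copy()
            gx_full = grad_x(x_tld, y_tld, None)    # full gradients at the anchor
            gy_full = grad_y(x_tld, y_tld, None)
            for _ in range(N):                      # inner SVRG-corrected GDA steps
                i = rng.integers(n)
                g = grad_x(x, y, i) - grad_x(x_tld, y_tld, i) + gx_full
                x_new = x - alpha * g               # descent step in x
                j = rng.integers(n)
                h = grad_y(x_new, y, j) - grad_y(x_tld, y_tld, j) + gy_full
                y = y + beta * h                    # ascent step in y
                x = x_new
                history.append((x.copy(), y.copy()))
        x, y = history[rng.integers(len(history))]  # restart from a random iterate
    return x, y
```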

with constant step sizes. One can check that $g_k$ and $h_k$ are unbiased estimates of the corresponding gradients,

\[\begin{align} \mathbb{E}[g_k] &= u_k , \text{With } u_k = \nabla_x f(x_k,y_k) \\ \mathbb{E}[h_k] &= v_k , \text{With } v_k = \nabla_y f(x_{k+1},y_k) \\ \end{align}\]
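
Indeed, taking the expectation over $\xi_1$ (conditioned on the current iterates), the two stochastic terms have means $\nabla_x f(x_k, y_k)$ and $\nabla_x f(\tilde x, \tilde y)$, while the full-gradient correction is deterministic,

\[\begin{align} \mathbb{E}_{\xi_1}[g_k] = \nabla_x f(x_k, y_k) - \nabla_x f(\tilde x, \tilde y) + \nabla_x f(\tilde x, \tilde y) = \nabla_x f(x_k, y_k) = u_k \end{align}\]

and the same argument over $\xi_2$ gives $\mathbb{E}[h_k] = v_k$.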

Convergence Sequence

As in the AGDA analysis, define $g(x) = \max_y f(x, y)$ and use the same constants as there.

With variance reduction, the variance of the gradient estimates can be bounded as follows,

\[\begin{align} Var [g_k ] &=\mathbb{E}[\Vert g_k - \nabla_x f(x_k, y_k) \Vert^2] \\&= \mathbb{E}[\Vert \nabla_x f(x_{k}, y_{k}, \xi_1) - \nabla_x f(\tilde x, \tilde y, \xi_1) + \nabla_x f(\tilde x, \tilde y) - \nabla_x f(x_k,y_k) \Vert^2] \\ &\le \mathbb{E}[ \Vert \nabla_x f(x_{k}, y_{k}, \xi_1) - \nabla_x f(\tilde x, \tilde y, \xi_1) \Vert^2] \\ &\le L^2 \Vert x_k - \tilde x \Vert^2 + L^2 \Vert y_k - \tilde y \Vert^2 \end{align}\]

as well as,

\[\begin{align} Var[h_k ] &=\mathbb{E}[\Vert h_k - \nabla_y f(x_{k+1},y_k) \Vert^2] \\ &\le \mathbb{E}[ \Vert \nabla_y f(x_{k+1}, y_{k}, \xi_2) - \nabla_y f(\tilde x, \tilde y, \xi_2) \Vert^2] \\ &\le L^2 \mathbb{E}[ \Vert x_{k+1} - \tilde x \Vert^2] + L^2 \Vert y_{k} - \tilde y \Vert^2 \end{align}\]
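
Both bounds above use the fact that the variance of a random vector is at most its second moment, followed by the $L$-Lipschitz continuity of the stochastic gradients jointly in $(x, y)$,

\[\begin{align} \mathbb{E}[\Vert X - \mathbb{E}[X] \Vert^2] &= \mathbb{E}[\Vert X \Vert^2] - \Vert \mathbb{E}[X] \Vert^2 \le \mathbb{E}[\Vert X \Vert^2], \quad X = \nabla_x f(x_{k}, y_{k}, \xi_1) - \nabla_x f(\tilde x, \tilde y, \xi_1) \\ \Vert \nabla_x f(x, y, \xi) - \nabla_x f(\tilde x, \tilde y, \xi) \Vert^2 &\le L^2 \Vert x - \tilde x \Vert^2 + L^2 \Vert y - \tilde y \Vert^2 \end{align}\]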

Define the following Lyapunov function, analogous to the one used for AGDA,

\[\begin{align} \mathcal{R_k} &= \mathcal{A_k} + \lambda \mathcal{B_k} +c_k \Vert x_{k} - \tilde x \Vert^2 + d_k \Vert y_k - \tilde y \Vert^2 \\ \text{With } \mathcal{A_k} &= g( x_k) - g(x_{\ast}) , \mathcal{B_k} = g( x_k) - f( x_k, y_k) \end{align}\]

Using arguments similar to those for AGDA,

\[\begin{align} \mathbb{E}[\mathcal{A_{k+1}}] &= \mathbb{E}[ g(x_{k+1})- g(x_{\ast})] \\ &\le g(x_k) - g(x_{\ast}) - \alpha_k \mathbb{E}[ \nabla g( x_k)^\top g_k ] + \frac{\alpha_k^2 L'}{2} \mathbb{E}[\Vert g_k \Vert^2] ,\text{With } L' = L + \frac{L^2}{\mu}\\ &\le g(x_k) - g(x_{\ast}) - \alpha_k \nabla g( x_k)^\top u_k + \frac{\alpha_k^2 L'}{2} \Vert u_k \Vert^2 + \frac{\alpha_k^2 L'}{2} Var[g_k] \\ &\le g( x_k) - g(x_{\ast}) - \frac{\alpha_k}{2} \Vert \nabla g(x_k) \Vert^2 + \frac{\alpha_k}{2} \Vert \nabla g(x_k) - u_k \Vert^2 + \frac{\alpha_k^2 L'}{2} Var[g_k] \\ \end{align}\]

as well as,

\[\begin{align} \mathbb{E}[ \mathcal{B}_{k+1}] &=\mathbb{E}[g( x_{k+1})- f( x_{k+1}, y_{k+1})]\\ &\le g( x_{k+1}) - f(x_{k+1}, y_{k}) - \beta_k \Vert v_k \Vert^2 + \frac{\beta_k^2 L}{2} \Vert v_k \Vert^2 + \frac{\beta_k^2 L}{2} Var[h_k]\\ &\le g( x_{k+1}) - f(x_{k+1}, y_{k}) - \frac{\beta_k}{2} \Vert v_k \Vert^2 + \frac{ \beta_k^2 L}{2} Var[h_k]\\ \end{align}\]
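
The first inequality follows from the $L$-smoothness of $f(x_{k+1}, \cdot)$ in $y$ together with $\mathbb{E}[h_k] = v_k$ and $\mathbb{E}[\Vert h_k \Vert^2] = \Vert v_k \Vert^2 + Var[h_k]$,

\[\begin{align} \mathbb{E}[f(x_{k+1}, y_{k+1})] &\ge \mathbb{E}\left[ f(x_{k+1}, y_{k}) + \beta_k \nabla_y f(x_{k+1}, y_k)^\top h_k - \frac{\beta_k^2 L}{2} \Vert h_k \Vert^2 \right] \\ &= \mathbb{E}[f(x_{k+1}, y_{k})] + \beta_k \Vert v_k \Vert^2 - \frac{\beta_k^2 L}{2} \left( \Vert v_k \Vert^2 + Var[h_k] \right) \end{align}\]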

Taking the linear combination gives,

\[\begin{align} \mathbb{E}[\mathcal{A}_{k+1} + \lambda \mathcal{B}_{k+1}] &\le \mathcal{A}_k - \frac{\alpha_k}{2} \Vert \nabla g(x_k) \Vert^2 + \frac{\alpha_k}{2} \Vert \nabla g(x_k) - u_k \Vert^2 + \frac{\alpha_k^2 L'}{2} Var[g_k] \\ &\quad + \lambda \mathbb{E} [ g(x_{k+1}) - f(x_{k+1},y_k)] - \frac{\lambda \beta_k}{2} \mathbb{E} [\Vert v_k \Vert^2] + \frac{\lambda \beta_k^2 L}{2} Var[h_k]\\ \end{align}\]

For the distance terms, Young's inequality gives,

\[\begin{align} \mathbb{E}[\Vert x_{k+1} - \tilde x \Vert^2] &= \mathbb{E}[\Vert x_{k} - \alpha_k g_k - \tilde x \Vert^2 ] \\ &\le \Vert x_k - \tilde x \Vert^2 - 2 \alpha_k \mathbb{E} [g_k^\top (x_k - \tilde x)] + \alpha_k^2\mathbb{E}[ \Vert g_k \Vert^2] \\ &= \Vert x_k - \tilde x \Vert^2 - 2 \alpha_k u_k^\top(x_k - \tilde x) + \alpha_k^2 \Vert u_k \Vert^2 + \alpha_k^2 Var[g_k] \\ &\le (1 + \alpha_k \lambda_1) \Vert x_k - \tilde x \Vert^2 + (\alpha_k^2 + \frac{\alpha_k}{\lambda_1}) \Vert u_k \Vert^2 + \alpha_k^2 Var[g_k] \end{align}\]
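
where the last step applies Young's inequality to the cross term, with a free parameter $\lambda_1 > 0$,

\[\begin{align} - 2 \alpha_k u_k^\top (x_k - \tilde x) \le \alpha_k \lambda_1 \Vert x_k - \tilde x \Vert^2 + \frac{\alpha_k}{\lambda_1} \Vert u_k \Vert^2 \end{align}\]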

and similarly,

\[\begin{align} \mathbb{E}[\Vert y_{k+1} -\tilde y \Vert^2] &= \mathbb{E}[ \Vert y_k + \beta_k h_k - \tilde y \Vert^2] \\ &\le(1+ \beta_k \lambda_2) \Vert y_k - \tilde y \Vert^2 +( \beta_k^2 + \frac{\beta_k}{\lambda_2}) \Vert v_k \Vert^2 + \beta_k^2 Var[h_k] \end{align}\]

Therefore, for the Lyapunov function,

\[\begin{align} \mathbb{E}[\mathcal{R}_{k+1}] &= \mathbb{E}[ \mathcal{A}_{k+1} + \lambda \mathcal{B}_{k+1} + c_{k+1} \Vert x_{k+1} - \tilde x \Vert^2 + d_{k+1} \Vert y_{k+1} - \tilde y \Vert^2] \\ &\le \mathcal{A}_k - \frac{\alpha_k}{2} \Vert \nabla g(x_k) \Vert^2 + \frac{\alpha_k}{2} \Vert \nabla g(x_k) - u_k \Vert^2 + \frac{\alpha_k^2 L'}{2} Var[g_k] \\ &\quad + \lambda \mathbb{E} [ g(x_{k+1}) - f(x_{k+1},y_k)] - \frac{\lambda \beta_k}{2} \mathbb{E} [\Vert v_k \Vert^2] + \frac{\lambda \beta_k^2 L}{2} Var[h_k]\\ &\quad + c_{k+1} \Vert x_{k+1} - \tilde x \Vert^2 +d_{k+1}(1+ \beta_k \lambda_2) \Vert y_k - \tilde y \Vert^2 +d_{k+1}( \beta_k^2 + \frac{\beta_k}{\lambda_2}) \Vert v_k \Vert^2 + d_{k+1}\beta_k^2 Var[h_k] \end{align}\]

Substituting the bound on the variance of $h_k$ gives,

\[\begin{align} \mathbb{E}[\mathcal{R}_{k+1}] &\le \mathcal{A}_k - \frac{\alpha_k}{2} \Vert \nabla g(x_k) \Vert^2 + \frac{\alpha_k}{2} \Vert \nabla g(x_k) - u_k \Vert^2 + \frac{\alpha_k^2 L'}{2} Var[g_k] \\ &\quad + \lambda \mathbb{E} [ g(x_{k+1}) - f(x_{k+1},y_k)] - (\frac{\lambda \beta_k}{2} -d_{k+1}( \beta_k^2 + \frac{\beta_k}{\lambda_2})) \mathbb{E} [\Vert v_k \Vert^2] \\ &\quad + (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) \Vert x_{k+1} - \tilde x \Vert^2 +(d_{k+1}(1+ \beta_k \lambda_2) +d_{k+1} \beta_k^2 L^2+ \frac{\lambda \beta_k^2 L^3}{2}) \Vert y_k - \tilde y \Vert^2 \end{align}\]

Using the relation derived above,

\[\begin{align} \mathbb{E}[\Vert x_{k+1} - \tilde x \Vert^2] &\le (1 + \alpha_k \lambda_1) \Vert x_k - \tilde x \Vert^2 + (\alpha_k^2 + \frac{\alpha_k}{\lambda_1}) \Vert u_k \Vert^2 + \alpha_k^2 Var[g_k] \end{align}\]

and further eliminating $x_{k+1}$, we obtain,

\[\begin{align} \mathbb{E} [\mathcal{R}_{k+1}] &\le \mathcal{A}_k - \frac{\alpha_k}{2} \Vert \nabla g(x_k) \Vert^2 + \frac{\alpha_k}{2} \Vert \nabla g(x_k) - u_k \Vert^2 \\ &\quad+ \lambda \mathbb{E}[ g(x_{k+1})- f(x_{k+1},y_k)] - (\frac{\lambda \beta_k}{2} -d_{k+1}( \beta_k^2 + \frac{\beta_k}{\lambda_2})) \mathbb{E} [\Vert v_k \Vert^2] \\ &\quad + (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2} + \frac{L'}{2}) \alpha_k^2 Var[g_k] \\ &\quad + (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (1+ \alpha_k \lambda_1) \Vert x_k- \tilde x \Vert^2 \\ &\quad + (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (\alpha_k^2 + \frac{\alpha_k}{\lambda_1}) \Vert u_k \Vert^2 \\ &\quad + (d_{k+1}(1+ \beta_k \lambda_2) +d_{k+1} \beta_k^2 L^2+ \frac{\lambda \beta_k^2 L^3}{2}) \Vert y_k - \tilde y \Vert^2 \end{align}\]

Assume $d_{k+1}$ is chosen small enough that the following holds,

\[\begin{align} \frac{\lambda \beta_k}{2} -d_{k+1}( \beta_k^2 + \frac{\beta_k}{\lambda_2}) &> 0 \\ \frac{\lambda}{2} - d_{k+1}(\beta_k+ \frac{1}{\lambda_2}) &> 0 \end{align}\]

Then, applying the PL condition in $y$ to $\Vert v_k \Vert^2$, we obtain

\[\begin{align} \mathbb{E} [\mathcal{R}_{k+1}] &\le \mathcal{A}_k - \frac{\alpha_k}{2} \Vert \nabla g(x_k) \Vert^2 + \frac{\alpha_k}{2} \Vert \nabla g(x_k) - u_k \Vert^2 \\ &\quad+ \lambda (1- \mu \beta_k + \frac{\mu d_{k+1} }{\lambda} ( \beta_k^2 + \frac{\beta_k}{\lambda_2}))\mathbb{E}[ g(x_{k+1})- f(x_{k+1},y_k)] \\ &\quad + (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2} + \frac{L'}{2}) \alpha_k^2 Var[g_k] \\ &\quad + (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (1+ \alpha_k \lambda_1) \Vert x_k- \tilde x \Vert^2 \\ &\quad + (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (\alpha_k^2 + \frac{\alpha_k}{\lambda_1}) \Vert u_k \Vert^2 \\ &\quad +(d_{k+1}(1+ \beta_k \lambda_2) +d_{k+1} \beta_k^2 L^2+ \frac{\lambda \beta_k^2 L^3}{2}) \Vert y_k - \tilde y \Vert^2 \end{align}\]

Using,

\[\begin{align} \mathbb{E}[g(x_{k+1}) - g(x_k)] &\le - \frac{\alpha_k}{2} \Vert \nabla g(x_k) \Vert^2 + \frac{\alpha_k}{2} \Vert \nabla g(x_k) - u_k \Vert^2 + \frac{\alpha_k^2 L'}{2} Var[g_k] \\ \end{align}\]

as well as,

\[\begin{align} \mathbb{E} [f(x_{k},y_k) - f(x_{k+1},y_k)] & \le -\mathbb{E}[u_k^\top(x_{k+1} -x_k)] + \frac{L}{2} \mathbb{E} [\Vert x_{k+1} - x_k \Vert^2] \\ &= \alpha_k \Vert u_k \Vert^2+ \frac{\alpha_k^2L}{2} \Vert u_k \Vert^2 + \frac{\alpha_k^2L}{2} Var[g_k] \\ &=(\alpha_k + \frac{\alpha_k^2 L }{2}) \Vert u_k \Vert^2 + \frac{\alpha_k^2L}{2} Var[g_k] \\ \end{align}\]

we obtain,

\[\begin{align} \mathbb{E} [g(x_{k+1}) - f(x_{k+1},y_k)] &= \mathbb{E} [g(x_{k+1}) - g(x_k) + g(x_k) - f(x_k,y_k ) + f(x_k,y_k) - f(x_{k+1},y_k)] \\ &\le g(x_k) - f(x_k,y_k) - \frac{\alpha_k}{2} \Vert \nabla g(x_k) \Vert^2 + \frac{\alpha_k}{2} \Vert \nabla g(x_k) - u_k \Vert^2 \\ &\quad + (\alpha_k + \frac{\alpha_k^2 L }{2}) \Vert u_k \Vert^2 + \frac{\alpha_k^2(L+L')}{2} Var[g_k] \end{align}\]

For notational convenience, define,

\[\begin{align} \phi_1 = 1- \mu \beta_k + \frac{\mu d_{k+1} }{\lambda} ( \beta_k^2 + \frac{\beta_k}{\lambda_2}) \end{align}\]

Substituting back then gives,

\[\begin{align} \mathbb{E} [\mathcal{R}_{k+1}] &\le \mathcal{A}_k + \lambda \phi_1 \mathcal{B}_k \\ &\quad - \frac{\alpha_k( \lambda \phi_1+1)}{2} \Vert \nabla g(x_k) \Vert^2 + \frac{\alpha_k(\lambda \phi_1+1)}{2} \Vert \nabla g(x_k) - u_k \Vert^2 \\ &\quad + (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2} + \frac{L'}{2} +\frac{\lambda \phi_1(L +L')}{2}) \alpha_k^2 Var[g_k] \\ &\quad + (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (1+ \alpha_k \lambda_1) \Vert x_k- \tilde x \Vert^2 \\ &\quad + (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (\alpha_k^2 + \frac{\alpha_k}{\lambda_1}) \Vert u_k \Vert^2 + \lambda \phi_1(\alpha_k + \frac{\alpha_k^2 L }{2}) \Vert u_k \Vert^2 \\ &\quad + (d_{k+1}(1+ \beta_k \lambda_2) +d_{k+1} \beta_k^2 L^2+ \frac{\lambda \beta_k^2 L^3}{2}) \Vert y_k - \tilde y \Vert^2 \end{align}\]

Define the coefficient,

\[\begin{align} \phi_2 = (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (\alpha_k^2 + \frac{\alpha_k}{\lambda_1}) + \lambda \phi_1(\alpha_k + \frac{\alpha_k^2 L }{2}) \\ \end{align}\]

and use,

\[\begin{align} \Vert u_k \Vert^2 \le 2\Vert u_k - \nabla g(x_k) \Vert^2 + 2 \Vert \nabla g(x_k) \Vert^2 \end{align}\]

to obtain,

\[\begin{align} \mathbb{E} [\mathcal{R}_{k+1}] &\le \mathcal{A}_k + \lambda \phi_1 \mathcal{B}_k - (\frac{\alpha_k( \lambda \phi_1+1)}{2} - 2 \phi_2)\Vert \nabla g(x_k) \Vert^2 + (\frac{\alpha_k(\lambda \phi_1+1)}{2} + 2 \phi_2)\Vert \nabla g(x_k) - u_k \Vert^2 \\ &\quad + (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2} + \frac{L'}{2} +\frac{\lambda \phi_1(L +L')}{2}) \alpha_k^2 Var[g_k] \\ &\quad + (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (1+ \alpha_k \lambda_1) \Vert x_k- \tilde x \Vert^2 \\ &\quad + (d_{k+1}(1+ \beta_k \lambda_2) +d_{k+1} \beta_k^2 L^2+ \frac{\lambda \beta_k^2 L^3}{2}) \Vert y_k - \tilde y \Vert^2 \end{align}\]

Assume that the following holds; this condition will be verified later,

\[\begin{align} \frac{\alpha_k( \lambda \phi_1+1)}{2} - 2 \phi_2 > 0 \end{align}\]

Further define,

\[\begin{align} \phi_3 &= (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2} + \frac{L'}{2} +\frac{\lambda \phi_1(L +L')}{2}) \alpha_k^2 \end{align}\]

Substituting the bound on the variance of $g_k$ gives,

\[\begin{align} \mathbb{E} [\mathcal{R}_{k+1}] &\le \mathcal{A}_k + \lambda \phi_1 \mathcal{B}_k - (\frac{\alpha_k( \lambda \phi_1+1)}{2} - 2 \phi_2)\Vert \nabla g(x_k) \Vert^2 + (\frac{\alpha_k(\lambda \phi_1+1)}{2} + 2 \phi_2)\Vert \nabla g(x_k) - u_k \Vert^2 \\ &\quad + (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (1+ \alpha_k \lambda_1) \Vert x_k- \tilde x \Vert^2 + \phi_3 L^2 \Vert x_k - \tilde x \Vert^2\\ &\quad + (d_{k+1}(1+ \beta_k \lambda_2) +d_{k+1} \beta_k^2 L^2+ \frac{\lambda \beta_k^2 L^3}{2}) \Vert y_k - \tilde y \Vert^2 + \phi_3 L^2 \Vert y_k - \tilde y\Vert^2 \end{align}\]

Recalling the assumption that

\[\begin{align} \frac{\alpha_k(\lambda \phi_1+1)}{2} - 2 \phi_2 > 0 \end{align}\]

we can invoke the following consequences of the PL condition,

\[\begin{align} \Vert u_k - \nabla g(x_k) \Vert^2 &\le \frac{2 L^2}{\mu}(g(x_k) - f(x_k ,y_k)) = \frac{2L^2}{\mu} \mathcal{B}_k \\ \Vert \nabla g(x_k) \Vert^2 & \ge 2 \mu (g(x_k)- g(x_{\ast}) ) = 2 \mu \mathcal{A_k} \\ \end{align}\]
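
For completeness, the first bound can be obtained as follows: writing $y^{\ast}(x_k)$ for a maximizer of $f(x_k, \cdot)$ (the one closest to $y_k$ if it is not unique), Danskin's theorem gives $\nabla g(x_k) = \nabla_x f(x_k, y^{\ast}(x_k))$, and the PL condition in $y$ implies the quadratic-growth bound $\Vert y_k - y^{\ast}(x_k) \Vert^2 \le \frac{2}{\mu} (g(x_k) - f(x_k, y_k))$, so that

\[\begin{align} \Vert u_k - \nabla g(x_k) \Vert^2 = \Vert \nabla_x f(x_k, y_k) - \nabla_x f(x_k, y^{\ast}(x_k)) \Vert^2 \le L^2 \Vert y_k - y^{\ast}(x_k) \Vert^2 \le \frac{2L^2}{\mu} \mathcal{B}_k \end{align}\]

The second bound is the PL inequality for $g$ itself.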

Simplifying yields the recursive inequality,

\[\begin{align} \mathbb{E} [\mathcal{R}_{k+1}] &\le (1- \mu \alpha_k( \lambda \phi_1+1) +4 \mu \phi_2) \mathcal{A_k} + (\lambda\phi_1+ \frac{\alpha_kL^2(\lambda \phi_1 +1)}{\mu} + \frac{4L^2 \phi_2}{\mu}) \mathcal{B_k} \\ &\quad + (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (1+ \alpha_k \lambda_1) \Vert x_k- \tilde x \Vert^2 + \phi_3 L^2 \Vert x_k - \tilde x \Vert^2\\ &\quad + (d_{k+1}(1+ \beta_k \lambda_2) +d_{k+1} \beta_k^2 L^2+ \frac{\lambda \beta_k^2 L^3}{2}) \Vert y_k - \tilde y \Vert^2 + \phi_3 L^2 \Vert y_k - \tilde y\Vert^2 \end{align}\]

Define the following sequences by backward recursion, with boundary values $c_N = d_N = 0$, where $N$ is the length of the variance-reduction epoch,

\[\begin{align} c_k &= (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (1+ \alpha_k \lambda_1) + \phi_3L^2 \\ &= (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (1+ \alpha_k \lambda_1)+(c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2} + \frac{L'}{2} +\frac{\lambda \phi_1(L +L')}{2}) \alpha_k^2L^2 \\ d_k &= d_{k+1}(1+ \beta_k \lambda_2) +d_{k+1} \beta_k^2 L^2+ \frac{\lambda \beta_k^2 L^3}{2} + \phi_3 L^2 \\ &= d_{k+1}(1+ \beta_k \lambda_2) +d_{k+1} \beta_k^2 L^2+\frac{\lambda \beta_k^2 L^3}{2} + (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2} + \frac{L'}{2} +\frac{\lambda \phi_1(L +L')}{2}) \alpha_k^2L^2 \\ m_k^{(1)} &= \mu \alpha_k (\lambda \phi_1+1) -4 \mu \phi_2 \\ &= \mu \alpha_k (\lambda \phi_1 + 1) - 4 \mu (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (\alpha_k^2 + \frac{\alpha_k}{\lambda_1}) -4 \mu \lambda \phi_1(\alpha_k + \frac{\alpha_k^2 L }{2}) \\ m_k^{(2)} &=(1 -\phi_1) -\frac{\alpha_kL^2(\lambda \phi_1 +1)}{\lambda \mu} - \frac{4L^2 \phi_2}{\lambda \mu} \\ &= (1 -\phi_1) -\frac{\alpha_kL^2(\lambda \phi_1 +1)}{\lambda \mu} - \frac{4L^2}{\lambda \mu} (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (\alpha_k^2 + \frac{\alpha_k}{\lambda_1}) - \frac{4L^2 \phi_1}{\mu} (\alpha_k + \frac{\alpha_k^2 L }{2}) \end{align}\]

Then we obtain,

\[\begin{align} \mathbb{E}[\mathcal{R}_{k+1}] \le \mathcal{R}_k - m_k^{(1)} \mathcal{A}_k - m_k^{(2)} \lambda\mathcal{B}_k \end{align}\]

Telescoping over $k = 0$ to $N-1$, and writing $(s, t, k)$ for the $k$-th gradient descent-ascent step at the $t$-th anchor within the $s$-th restart round, we obtain (with $\gamma$ denoting the smallest of the coefficients $m_k^{(1)}, m_k^{(2)}$ over the inner loop),

\[\begin{align} \gamma \sum_{k=0}^{N-1} \left( \mathcal{A}_{s,t,k} + \lambda \mathcal{B}_{s,t,k} \right) \le \mathcal{R}_{s,t,0} - \mathcal{R}_{s,t,N} \end{align}\]

At the anchor points, i.e., the points where the full gradient is computed for variance reduction, define the corresponding Lyapunov function,

\[\begin{align} \mathcal{V}_{s,t} &= \tilde{\mathcal{R}}_{s,t} = \tilde{\mathcal{A}}_{s,t} + \lambda \tilde{\mathcal{B}}_{s,t} \end{align}\]

By definition, we have the relations,

\[\begin{align} \mathcal{V}_{s,t} &= \mathcal{R}_{s,t,0} , \mathcal{V}_{s,t+1} = \mathcal{R}_{s,t,N}\\ \end{align}\]

Hence, telescoping over $t = 0$ to $T-1$,

\[\begin{align} \gamma \sum_{k=0}^{N-1} \left( \mathcal{A}_{s,t,k} + \lambda \mathcal{B}_{s,t,k} \right) &\le \mathcal{V}_{s,t} - \mathcal{V}_{s,t+1} \\ \gamma \sum_{t=0}^{T-1} \sum_{k=0}^{N-1} \left( \mathcal{A}_{s,t,k} + \lambda \mathcal{B}_{s,t,k} \right) &\le {\mathcal{V}}_{s,0} - \mathcal{V}_{s,T} \end{align}\]

By the definition of the restart, where each restart picks an index $(t, k)$ uniformly at random from the previous round, the decrease of the Lyapunov function after each restart satisfies,

\[\begin{align} \mathcal{V}_{s+1} \le \frac{1}{NT \gamma} \mathcal{V}_s \end{align}\]
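
The factor $\frac{1}{NT\gamma}$ comes from averaging over the uniform random choice of $(t, k)$, then using the bound above together with $\mathcal{V}_{s} = \mathcal{V}_{s,0}$ and $\mathcal{V}_{s,T} \ge 0$,

\[\begin{align} \mathbb{E}[\mathcal{V}_{s+1}] = \frac{1}{NT} \sum_{t=0}^{T-1} \sum_{k=0}^{N-1} \left( \mathcal{A}_{s,t,k} + \lambda \mathcal{B}_{s,t,k} \right) \le \frac{1}{NT \gamma} \left( \mathcal{V}_{s,0} - \mathcal{V}_{s,T} \right) \le \frac{1}{NT \gamma} \mathcal{V}_{s} \end{align}\]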

The problem therefore reduces to choosing parameters such that $N T \gamma > 1$; iterating the contraction then yields $\mathcal{V}_s \le (NT\gamma)^{-s} \mathcal{V}_0$, i.e., linear convergence of the algorithm.

Taking inspiration from the AGDA step sizes, matching the scales of the terms in Young's inequality, and using the method of undetermined coefficients, we set,

\[\begin{align} \alpha_k &= \frac{k_1 \mu^2}{L^3} , \lambda_1 = \frac{1}{\alpha_k} \\ \beta_k &= \frac{k_2}{L}, \lambda_2 = \frac{1}{\beta_k} \end{align}\]
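
With these choices $\alpha_k \lambda_1 = \beta_k \lambda_2 = 1$, so the terms coming from Young's inequality simplify to

\[\begin{align} 1 + \alpha_k \lambda_1 = 2, \quad \alpha_k^2 + \frac{\alpha_k}{\lambda_1} = 2\alpha_k^2, \quad 1 + \beta_k \lambda_2 = 2, \quad \beta_k^2 + \frac{\beta_k}{\lambda_2} = 2\beta_k^2 \end{align}\]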

Once these parameters are fixed, the quantities $\phi_1, \phi_2, \phi_3$ and the recursive sequences are determined as well. Recall,

\[\begin{align} \phi_1 &= 1- \mu \beta_k + \frac{\mu d_{k+1} }{\lambda} ( \beta_k^2 + \frac{\beta_k}{\lambda_2}) \\ \phi_2 &= (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (\alpha_k^2 + \frac{\alpha_k}{\lambda_1}) + \lambda \phi_1(\alpha_k + \frac{\alpha_k^2 L }{2}) \\ \phi_3 &= (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2} + \frac{L'}{2} +\frac{\lambda \phi_1(L +L')}{2}) \alpha_k^2 \end{align}\]

as well as the sequences,

\[\begin{align} m_k^{(1)} &= \mu \alpha_k (\lambda \phi_1+1) -4 \mu \phi_2 \\ m_k^{(2)} &=(1 -\phi_1) -\frac{\alpha_kL^2(\lambda \phi_1 +1)}{\lambda \mu} - \frac{4L^2 \phi_2}{\lambda \mu} \\ c_k &= (c_{k+1} +d_{k+1} \beta_k^2 L^2 + \frac{\lambda \beta_k^2 L^3}{2}) (1+ \alpha_k \lambda_1) + \phi_3L^2 \\ d_k &= d_{k+1}(1+ \beta_k \lambda_2) +d_{k+1} \beta_k^2 L^2+ \frac{\lambda \beta_k^2 L^3}{2} + \phi_3 L^2 \\ \end{align}\]
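
As a purely illustrative sketch, the backward recursion above can be evaluated numerically to check the positivity conditions for given constants. The values of $\lambda$, $k_1$, $k_2$, $N$ below are placeholder assumptions, not the parameters derived in the paper.

```python
def check_recursion(mu, L, lam, k1, k2, N):
    """Evaluate the backward recursion c_k, d_k, m_k^(1), m_k^(2) from this note.

    Uses alpha = k1*mu^2/L^3, beta = k2/L and lambda_1 = 1/alpha, lambda_2 = 1/beta.
    Returns gamma = min_k min(m1_k, m2_k); the conditions require gamma > 0.
    """
    alpha, beta = k1 * mu**2 / L**3, k2 / L
    lam1, lam2 = 1.0 / alpha, 1.0 / beta
    Lp = L + L**2 / mu                       # L' = L + L^2 / mu
    c, d = 0.0, 0.0                          # boundary values c_N = d_N = 0
    m1, m2 = [], []
    for _ in range(N):                       # k = N-1, ..., 0
        phi1 = 1 - mu * beta + (mu * d / lam) * (beta**2 + beta / lam2)
        base = c + d * beta**2 * L**2 + lam * beta**2 * L**3 / 2
        phi2 = base * (alpha**2 + alpha / lam1) + lam * phi1 * (alpha + alpha**2 * L / 2)
        phi3 = (base + Lp / 2 + lam * phi1 * (L + Lp) / 2) * alpha**2
        m1.append(mu * (alpha * (lam * phi1 + 1) - 4 * phi2))
        m2.append((1 - phi1) - alpha * L**2 * (lam * phi1 + 1) / (lam * mu)
                  - 4 * L**2 * phi2 / (lam * mu))
        c = base * (1 + alpha * lam1) + phi3 * L**2
        d = d * (1 + beta * lam2) + d * beta**2 * L**2 + lam * beta**2 * L**3 / 2 + phi3 * L**2
    return min(min(m1), min(m2))

# example with placeholder constants
gamma = check_recursion(mu=1.0, L=10.0, lam=0.1, k1=0.01, k2=0.1, N=20)
print("gamma =", gamma, "(need N*T*gamma > 1 for linear convergence)")
```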

The paper shows that suitable choices of these parameters satisfy all of the conditions above, and that the resulting algorithm attains a complexity of $\mathcal{O}(n + \kappa^9)$.