FISTA: Fast Iterative Shrinkage-Thresholding Algorithm

4 minute read

Published:

FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) is Nesterov's acceleration specialized to the proximal gradient method. The first time I saw the update formulas of Nesterov-type acceleration they read like an oracle: I could sense the remarkable speedup, but not why it worked. This post records a derivation of FISTA.

FISTA

Since FISTA is an accelerated version of the proximal gradient method, readers should first be familiar with basic proximal gradient descent; interested readers can refer to the earlier post Subgradient and Proximal Gradient Methods (次梯度和近端梯度方法).

Throughout, we minimize $f(x) = g(x) + h(x)$, where $g$ is convex with an $L$-Lipschitz gradient $\nabla g$ and $h$ is convex (possibly nonsmooth) with an inexpensive proximal operator. The algorithm accelerates the plain proximal gradient method with the following updates:

\[\begin{align} y_k &= (1- \theta_k) x_{k} + \theta_k v_{k} \\ x_{k+1} &= \text{prox}_{t_k h}(y_k - t_k \nabla g(y_k)) \\ v_{k+1} &= x_{k} + \frac{1}{\theta_{k}} (x_{k+1} - x_{k}) \end{align}\]

The same iteration can also be written in the symmetric form

\[\begin{align} v_k &= x_k + \frac{1}{\theta_k}(y_k - x_k) \\ x_{k+1} &= \text{prox}_{t_k h}(y_k - t_k \nabla g(y_k)) \\ v_{k+1} &= x_{k} + \frac{1}{\theta_{k}} (x_{k+1} - x_{k}) \end{align}\]
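To make the updates concrete, here is a minimal Python sketch of the iteration. It assumes, purely as a running example, the Lasso objective $g(x) = \frac{1}{2}\Vert Ax - b\Vert_2^2$, $h(x) = \lambda \Vert x \Vert_1$, so that $\text{prox}_{t_k h}$ is soft-thresholding; the function names `fista` and `soft_threshold` and the choice of problem are mine, not part of the original post.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista(A, b, lam, num_iters=500):
    """Accelerated proximal gradient for min 0.5*||Ax - b||^2 + lam*||x||_1,
    following the (y_k, x_{k+1}, v_{k+1}) updates above with theta_k = 2/(k+2)."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad g
    t = 1.0 / L                            # constant step size t_k = 1/L
    x = np.zeros(A.shape[1])
    v = x.copy()                           # v_0 = x_0
    for k in range(num_iters):
        theta = 2.0 / (k + 2)
        y = (1 - theta) * x + theta * v            # y_k = (1 - theta_k) x_k + theta_k v_k
        grad = A.T @ (A @ y - b)                   # grad g(y_k)
        x_next = soft_threshold(y - t * grad, t * lam)   # prox step
        v = x + (x_next - x) / theta               # v_{k+1} = x_k + (x_{k+1} - x_k)/theta_k
        x = x_next
    return x
```

With $\theta_k = 2/(k+2)$ and the constant step $t_k = 1/L$ this is exactly the form analyzed below; backtracking line-search variants only change how $t_k$ is chosen.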

We now go through the derivation step by step to see where the acceleration comes from. Write the proximal step as $x_{k+1} = y_k - t_k G_k$, where $G_k = \frac{1}{t_k}\left(y_k - \text{prox}_{t_k h}(y_k - t_k \nabla g(y_k))\right)$ is the gradient mapping; by the optimality condition of the prox, $G_k - \nabla g(y_k) \in \partial h(x_{k+1})$, which is the subgradient used below. First, bound $f(x_{k+1})$ against $f(x_k)$:

\[\begin{align} f(x_{k+1}) &= f(y_k - t_k G_k) \\ &= g(y_k - t_k G_k) + h(y_k - t_k G_k) \\ &\le g(y_k) - t_k \nabla g(y_k)^T G_k + \frac{L t_k^2}{2} \Vert G_k \Vert_2^2 + h(y_k - t_k G_k), \text{ by Lipschitz continuity of } \nabla g \\ &\le g(y_k) - t_k \nabla g(y_k)^T G_k + \frac{L t_k^2}{2} \Vert G_k \Vert_2^2 + h(x_k) - (G_k - \nabla g(y_k))^T(x_k - y_k + t_k G_k), \text{ by the subgradient inequality} \\ &= g(y_k) + h(x_k) + (\nabla g(y_k) - G_k)^T(x_k - y_k) + \left(\frac{L t_k^2}{2} - t_k\right) \Vert G_k \Vert_2^2 \\ &\le g(x_k) + h(x_k) - G_k^T(x_k - y_k) + \left(\frac{L t_k^2}{2} - t_k\right) \Vert G_k \Vert_2^2, \text{ by convexity of } g \\ &= f(x_k) + G_k^T(y_k - x_k) + \left(\frac{L t_k^2}{2} - t_k\right) \Vert G_k \Vert_2^2 \end{align}\]
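The per-iteration bound just derived can be sanity-checked numerically. The snippet below is my own check on the same hypothetical Lasso example (not part of the original derivation): it draws random $x_k$ and $y_k$, forms the gradient mapping $G_k$ from the prox step, and verifies $f(x_{k+1}) \le f(x_k) + G_k^T(y_k - x_k) + (\frac{L t_k^2}{2} - t_k)\Vert G_k \Vert_2^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 10, 0.1
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad g
t = 1.0 / L                            # step size t_k

def g(x): return 0.5 * np.sum((A @ x - b) ** 2)
def h(x): return lam * np.sum(np.abs(x))
def f(x): return g(x) + h(x)
def prox(z, tau): return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

for _ in range(1000):
    x_k = rng.normal(size=d)
    y_k = rng.normal(size=d)                              # the bound holds for any x_k, y_k
    x_next = prox(y_k - t * (A.T @ (A @ y_k - b)), t * lam)
    G_k = (y_k - x_next) / t                              # gradient mapping: x_{k+1} = y_k - t G_k
    lhs = f(x_next)
    rhs = f(x_k) + G_k @ (y_k - x_k) + (L * t**2 / 2 - t) * (G_k @ G_k)
    assert lhs <= rhs + 1e-9, "per-iteration bound violated"
```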

The same argument with $x_k$ replaced by the optimum $x_{\star}$ gives

\[\begin{align} f(x_{k+1}) &= f(y_k - t_k G_k) \\ &= g(y_k - t_k G_k) + h(y_k - t_k G_k) \\ &\le g(y_k) - t_k \nabla g(y_k)^T G_k + \frac{L t_k^2}{2} \Vert G_k \Vert_2^2 + h(y_k - t_k G_k), \text{ by Lipschitz continuity of } \nabla g \\ &\le g(y_k) - t_k \nabla g(y_k)^T G_k + \frac{L t_k^2}{2} \Vert G_k \Vert_2^2 + h(x_{\star}) - (G_k - \nabla g(y_k))^T(x_{\star} - y_k + t_k G_k), \text{ by the subgradient inequality} \\ &= g(y_k) + h(x_{\star}) + (\nabla g(y_k) - G_k)^T(x_{\star} - y_k) + \left(\frac{L t_k^2}{2} - t_k\right) \Vert G_k \Vert_2^2 \\ &\le g(x_{\star}) + h(x_{\star}) - G_k^T(x_{\star} - y_k) + \left(\frac{L t_k^2}{2} - t_k\right) \Vert G_k \Vert_2^2, \text{ by convexity of } g \\ &= f(x_{\star}) + G_k^T(y_k - x_{\star}) + \left(\frac{L t_k^2}{2} - t_k\right) \Vert G_k \Vert_2^2 \end{align}\]

Combining the two bounds as a convex combination with weight $\theta_k$,

\[\begin{align} f(x_{k+1}) & \le (1-\theta_k)\left(f(x_k) + G_k^T(y_k - x_k) + \left(\frac{L t_k^2}{2} - t_k\right) \Vert G_k \Vert_2^2\right) + \theta_k \left(f(x_{\star}) + G_k^T(y_k - x_{\star}) + \left(\frac{L t_k^2}{2} - t_k\right) \Vert G_k \Vert_2^2\right) \\ &= (1-\theta_k) f(x_k) + \theta_k f(x_{\star}) + G_k^T(y_k - (1-\theta_k)x_k - \theta_k x_{\star}) + \left(\frac{L t_k^2}{2} - t_k\right) \Vert G_k \Vert_2^2 \\ & \le (1-\theta_k) f(x_k) + \theta_k f(x_{\star}) + G_k^T(x_{k+1} - (1-\theta_k)x_k - \theta_k x_{\star}) + \frac{t_k}{2} \Vert G_k \Vert_2^2, \text{ choosing } t_k \le \frac{1}{L} \\ &= (1-\theta_k) f(x_k) + \theta_k f(x_{\star}) + \frac{1}{2t_k}\left[2 t_k G_k^T(x_{k+1} - (1-\theta_k)x_k - \theta_k x_{\star}) + t_k^2 \Vert G_k \Vert_2^2\right] \\ &= (1-\theta_k) f(x_k) + \theta_k f(x_{\star}) + \frac{1}{2t_k}\left[\Vert t_k G_k + x_{k+1} - (1-\theta_k)x_k - \theta_k x_{\star} \Vert_2^2 - \Vert x_{k+1} - (1-\theta_k)x_k - \theta_k x_{\star} \Vert_2^2\right] \\ &= (1-\theta_k) f(x_k) + \theta_k f(x_{\star}) + \frac{1}{2t_k}\left[\Vert y_k - (1-\theta_k)x_k - \theta_k x_{\star} \Vert_2^2 - \Vert x_{k+1} - (1-\theta_k)x_k - \theta_k x_{\star} \Vert_2^2\right] \\ &= (1-\theta_k) f(x_k) + \theta_k f(x_{\star}) + \frac{\theta_k^2}{2t_k}\left[\Vert v_k - x_{\star} \Vert_2^2 - \Vert v_{k+1} - x_{\star} \Vert_2^2\right] \\ &= (1-\theta_k)\left(f(x_k) - f(x_{\star})\right) + f(x_{\star}) + \frac{\theta_k^2}{2t_k}\left[\Vert v_k - x_{\star} \Vert_2^2 - \Vert v_{k+1} - x_{\star} \Vert_2^2\right] \end{align}\]

Rearranging and dividing by $\theta_k^2$ sets up a telescoping sum and reveals the condition $\theta_k$ must satisfy; we also fix a constant step size $t_k = \frac{1}{L}$.

\[\begin{align} f(x_{k+1}) - f(x_{\star}) &\le (1-\theta_k)\left(f(x_k) - f(x_{\star})\right) + \frac{\theta_k^2}{2t_k}\left[\Vert v_k - x_{\star} \Vert_2^2 - \Vert v_{k+1} - x_{\star} \Vert_2^2\right] \\ \frac{1}{\theta_k^2}\left(f(x_{k+1}) - f(x_{\star})\right) &\le \frac{1-\theta_k}{\theta_k^2}\left(f(x_k) - f(x_{\star})\right) + \frac{L}{2}\left[\Vert v_k - x_{\star} \Vert_2^2 - \Vert v_{k+1} - x_{\star} \Vert_2^2\right] \end{align}\]

For the telescoping to go through, $\theta_k$ needs to satisfy the condition below; the choice $\theta_k = \frac{2}{k+2}$ works, since $\frac{1-\theta_k}{\theta_k^2} = \frac{k(k+2)}{4} \le \frac{(k+1)^2}{4} = \frac{1}{\theta_{k-1}^2}$.

\[\begin{align} \theta_k &= \frac{2}{k+2} \\ \frac{1-\theta_k}{\theta_k^2} &\le \frac{1}{\theta_{k-1}^2} \\ \frac{1}{\theta_k^2}\left(f(x_{k+1}) - f(x_{\star})\right) &\le \frac{1-\theta_0}{\theta_0^2}\left(f(x_0) - f(x_{\star})\right) + \frac{L}{2}\left[\Vert v_0 - x_{\star} \Vert_2^2 - \Vert v_{k+1} - x_{\star} \Vert_2^2\right] \\ &\le \frac{1-\theta_0}{\theta_0^2}\left(f(x_0) - f(x_{\star})\right) + \frac{L}{2} \Vert v_0 - x_{\star} \Vert_2^2 \\ f(x_{k+1}) - f(x_{\star}) &\le \theta_k^2 \frac{1-\theta_0}{\theta_0^2}\left(f(x_0) - f(x_{\star})\right) + \frac{L \theta_k^2}{2} \Vert v_0 - x_{\star} \Vert_2^2 \\ &= \frac{L \theta_k^2}{2} \Vert v_0 - x_{\star} \Vert_2^2, \text{ with } \theta_0 = 1 \\ &= \frac{2L}{(k+2)^2} \Vert x_0 - x_{\star} \Vert_2^2, \text{ with } v_0 = x_0 \\ f(x_k) - f(x_{\star}) &\le \frac{2L}{(k+1)^2} \Vert x_0 - x_{\star} \Vert_2^2 \end{align}\]
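As a quick sanity check of the condition on $\theta_k$ (my own check, not from the post), the inequality $\frac{1-\theta_k}{\theta_k^2} \le \frac{1}{\theta_{k-1}^2}$ can be verified numerically for the choice $\theta_k = \frac{2}{k+2}$:

```python
# Verify (1 - theta_k) / theta_k^2 <= 1 / theta_{k-1}^2 for theta_k = 2/(k+2).
for k in range(1, 100000):
    theta, theta_prev = 2.0 / (k + 2), 2.0 / (k + 1)
    assert (1 - theta) / theta**2 <= 1 / theta_prev**2 + 1e-12
```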

Compared with the unaccelerated method: plain proximal gradient descent converges at rate $O(\frac{1}{k})$, so reaching an $\epsilon$-accurate solution takes $O(\frac{1}{\epsilon})$ iterations, whereas the accelerated version converges at rate $O(\frac{1}{k^2})$ and therefore needs only $O(\frac{1}{\sqrt\epsilon})$ iterations.
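To see the gap in practice, the sketch below (my own illustration on the same hypothetical Lasso example; `run`, the data sizes, and the use of a long accelerated run as a surrogate for $f(x_{\star})$ are all my choices) compares plain proximal gradient descent (ISTA) with the accelerated iteration and prints the suboptimality gap at a few iteration counts; the accelerated gap shrinks roughly like $1/k^2$ versus $1/k$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 200, 50, 0.5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of grad g
t = 1.0 / L

def f(x):    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
def grad(x): return A.T @ (A @ x - b)
def prox(z, tau): return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def run(accelerated, num_iters):
    """Return f(x_k) for k = 1..num_iters, with or without acceleration."""
    x = v = np.zeros(d)
    vals = []
    for k in range(num_iters):
        if accelerated:
            theta = 2.0 / (k + 2)
            y = (1 - theta) * x + theta * v
            x_next = prox(y - t * grad(y), t * lam)
            v = x + (x_next - x) / theta
        else:
            x_next = prox(x - t * grad(x), t * lam)
        x = x_next
        vals.append(f(x))
    return np.array(vals)

f_star = run(True, 20000)[-1]          # surrogate for the optimal value f(x_star)
gap_ista, gap_fista = run(False, 1000) - f_star, run(True, 1000) - f_star
for k in (10, 100, 1000):
    print(f"k={k:4d}   ISTA gap={gap_ista[k-1]:.3e}   FISTA gap={gap_fista[k-1]:.3e}")
```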