Momentum Improves Normalized SGD
Published:
Paper Reading: Momentum Improves Normalized SGD
考虑如下的经典的随机优化问题
\[\begin{align*} \min_x \left\{f(x) \triangleq F(x;\xi) \right\}. \end{align*}\]并且采用如下基于动量与归一化相结合的SGD更新进行优化
\[\begin{align*} g_t &= (1-\alpha) g_{t-1} + \alpha \nabla F(x_t;\xi_t) \\ x_{t+1} &= x_t - \eta \frac{g_t}{\Vert g_t \Vert}. \end{align*}\]文章使用非常简洁的方式从理论上证明了Normalized SGD加上动量可以达到更快的收敛
Normalized SGD
首先分析只有 $L$-smooth 下面的时候,也即满足假设
\[\begin{align*} \Vert \nabla f(x) - \nabla f(y) \Vert \le L \Vert x - y \Vert, \quad \forall x,y. \end{align*}\]此时成立
\[\begin{align*} f(x_{t+1}) &\le f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{L}{2} \Vert x_{t+1} - x_t \Vert^2 \\ &= f(x_t)- \frac{\eta \nabla f(x_t)^\top g_t}{\Vert g_t \Vert} + \frac{\eta^2 L}{2} \\ &\le f(x_t) - \frac{\eta}{3} \Vert \nabla f(x_t) \Vert + \frac{8\eta}{3} \Vert g_t - \nabla f(x_t) \Vert + \frac{\eta^2 L}{2}, \end{align*}\]其中最后一步使用了一个稍微技巧性的不等式 $- a^\top b/ \Vert b \Vert \le - \Vert a \Vert / 3 + 8 \Vert a - b \Vert/3 $.
根据动量更新公式,
\[\begin{align*} &\quad g_{t+1} - \nabla f(x_{t+1}) \\ &= (1-\alpha) g_t + \alpha \nabla F(x_{t+1};\xi_t) - \nabla f(x_{t+1}) \\ &= (1-\alpha) g_t - (1-\alpha) \nabla f(x_{t+1}) + \alpha \epsilon_{t+1} \\ &= (1-\alpha) (g_t - \nabla f(x_t)) + (1-\alpha ) (\nabla f(x_t) - \nabla f(x_{t+1})) + \alpha \epsilon_{t+1}. \end{align*}\]展开递推式得到
\[\begin{align*} g_{t} - \nabla f(x_{t}) &= (1-\alpha)^t (g_ 0- \nabla f(x_0)) + \sum_{k=0}^{t-1}(1-\alpha)^{t-k} ( \nabla f(x_k) - \nabla f(x_{k+1})) + \alpha \sum_{k=0}^{t-1} (1-\alpha)^{t-k-1} \epsilon_{k+1}. \end{align*}\]取期望后得到
\[\begin{align*} \mathbb{E} \Vert g_t - \nabla f(x_t) \Vert &\le (1-\alpha)^t \sqrt{\mathbb{E} \Vert \epsilon_0 \Vert^2} + \sum_{k=0}^{t-1} (1-\alpha)^{t-k} \mathbb{E}\Vert \nabla f(x_k) - \nabla f(x_{k+1}) \Vert + \alpha \sqrt{ \mathbb{E} \left \Vert \sum_{k=0}^{t-1}(1-\alpha)^{t-k-1} \epsilon_{k+1} \right \Vert^2 } \\ &=(1-\alpha)^t \sqrt{\mathbb{E} \Vert \epsilon_0 \Vert^2} + \sum_{k=0}^{t-1} (1-\alpha)^{t-k} \mathbb{E}\Vert \nabla f(x_k) - \nabla f(x_{k+1}) \Vert +\alpha \sqrt{ \sum_{k=0}^{t-1}(1-\alpha)^{2(t-k-1)} \left \Vert \epsilon_{k+1} \right \Vert^2 } \\ &\le (1-\alpha)^t \sigma + \sum_{k=0}^{t-1} (1-\alpha)^{t-k} \eta L + \alpha \sigma \sqrt{ \sum_{k=0}^{t-1} (1-\alpha)^{2(t-k-1)}} \\ &\le (1-\alpha)^{t} \sigma + \frac{\eta L}{\alpha} + \sqrt{\alpha} \sigma. \end{align*}\]代入并且递推后得到
\[\begin{align*} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E} \Vert \nabla f(x_t) \Vert \le \mathcal{O} \left( \frac{\Delta}{\eta T} + \eta L + \frac{\sigma}{\alpha T} +\frac{\eta L}{\alpha} + \sqrt{\alpha} \sigma \right) = \mathcal{O} \left( T^{-1/4}\right) \end{align*}\]换句话说,找到一个 $\epsilon$-稳定点的复杂度是 $\mathcal{O}(\epsilon^{-4})$.
Second Order Smoothness
该文章有意思的结论是,当函数进一步满足Hessian Lipschitz连续的时候,可以得到比SGD更快的收敛率,记Hessian的连续性系数为 $\rho$, 该假设下有
\[\begin{align*} \Vert \nabla f(x) - \nabla f(y) - \nabla^2 f(y) (x-y) \Vert \le \frac{\rho}{2} \Vert x- y \Vert^2, \quad \forall x,y. \end{align*}\]考虑如下的更新方式,
\[\begin{align*} y_t &= \frac{1}{\alpha} x_t - \frac{1-\alpha}{\alpha} x_{t-1} \\ g_t &= (1-\alpha) g_{t-1} + \alpha \nabla f(y_t;\xi_t) \\ x_{t-1} &= x_t - \eta \frac{g_t}{ \Vert g_t \Vert } \end{align*}\]该更新方式的核心在于利用更高阶的连续性,
\[\begin{align*} &\quad g_{t+1} - \nabla f(x_{t+1}) \\ &= (1-\alpha) g_t + \alpha \nabla F(y_{t+1};\xi_t) - \nabla f(x_{t+1}) \\ &= (1-\alpha) (g_t - \nabla f(x_t)) + (1-\alpha) (\nabla f(x_t) - \nabla f(x_{t+1})) +\alpha (\nabla f(y_{t+1}) - \nabla f(x_{t+1})) + \alpha \epsilon_{t+1} \\ &= (1-\alpha) (g_t - \nabla f(x_t)) + (1-\alpha) (\nabla f(x_t) - \nabla f(x_{t+1}) - \nabla^2 f(x_{t+1}) (x_t - x_{t+1})) \\ &\quad +\alpha (\nabla f(y_{t+1}) - \nabla f(x_{t+1}) - \nabla^2 f(x_{t+1})(y_{t+1} - x_{t+1}) ) + \alpha \epsilon_{t+1}. \end{align*}\]展开递推式并取期望后得到
\[\begin{align*} &\quad \mathbb{E} \Vert g_t - \nabla f(x_t) \Vert \\ &\le(1-\alpha)^t \sqrt{\mathbb{E} \Vert \epsilon_0 \Vert^2} + \sum_{k=0}^{t-1} (1-\alpha)^{t-k} \mathbb{E}\Vert \nabla f(x_k) - \nabla f(x_{k+1}) - \nabla^2 f(x_{k+1}) (x_k - x_{k+1}) \Vert \\ &\quad + \alpha\sum_{k=0}^{t-1} (1-\alpha)^{t-k-1} \mathbb{E}\Vert \nabla f(y_{k+1}) - \nabla f(x_{k+1}) - \nabla^2 f(x_{k+1}) (y_{k+1} - x_{k+1}) \Vert +\alpha \sqrt{ \sum_{k=0}^{t-1}(1-\alpha)^{2(t-k-1)} \left \Vert \epsilon_{k+1} \right \Vert^2 } \\ &\le (1-\alpha)^t \sigma + 2\sum_{k=0}^{t-1} (1-\alpha)^{t-k} \eta^2 \rho + \alpha \sigma \sqrt{ \sum_{k=0}^{t-1} (1-\alpha)^{2(t-k-1)}} \\ &\le (1-\alpha)^{t} \sigma + \frac{\eta^2 \rho}{\alpha^2} + \sqrt{\alpha} \sigma. \end{align*}\]代入并且递推后得到
\[\begin{align*} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E} \Vert \nabla f(x_t) \Vert \le \mathcal{O}\left(\frac{\Delta}{\eta T} + \eta L + \frac{\sigma}{\alpha T} +\frac{\eta^2 \rho}{\alpha^2} + \sqrt{\alpha} \sigma \right) = \mathcal{O} \left( T^{-2/7}\right) \end{align*}\]当取 $\alpha = \mathcal{O} \left( T^{-4/7} \right)$ 以及 $\eta = \mathcal{O} \left( T^{-5/7} \right)$ 的时候取到等号。
换句话说,找到一个 $\epsilon$-稳定点的复杂度是 $\mathcal{O}(\epsilon^{-3.5})$.
