Momentum Improves Normalized SGD
Published:
Paper Reading: Momentum Improves Normalized SGD
考虑如下的经典的随机优化问题
\[\begin{align*} \min_x \left\{f(x) \triangleq F(x;\xi) \right\}. \end{align*}\]并且采用如下基于动量与归一化相结合的SGD更新进行优化
\[\begin{align*} g_t &= (1-\alpha) g_{t-1} + \alpha \nabla F(x_t;\xi_t) \\ x_{t+1} &= x_t - \eta \frac{g_t}{\Vert g_t \Vert}. \end{align*}\]文章使用非常简洁的方式从理论上证明了Normalized SGD加上动量可以达到更快的收敛
Normalized SGD
首先分析只有 $L$-smooth 下面的时候,成立
\[\begin{align*} f(x_{t+1}) &\le f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{L}{2} \Vert x_{t+1} - x_t \Vert^2 \\ &= f(x_t)- \frac{\eta \nabla f(x_t)^\top g_t}{\Vert g_t \Vert} + \frac{\eta^2 L}{2} \\ &\le f(x_t) - \eta \Vert g_t \Vert + \eta \Vert g_t - \nabla f(x_t) \Vert + \frac{\eta^2 L}{2} \\ &\le f(x_t) - \eta \Vert \nabla f(x_t) \Vert + 2\eta \Vert g_t - \nabla f(x_t) \Vert + \frac{\eta^2 L}{2} \end{align*}\]根据动量更新公式,
\[\begin{align*} &\quad g_{t+1} - \nabla f(x_{t+1}) \\ &= (1-\alpha) g_t + \alpha \nabla F(x_{t+1};\xi_t) - \nabla f(x_{t+1}) \\ &= (1-\alpha) g_t - (1-\alpha) \nabla f(x_{t+1}) + \alpha \epsilon_{t+1} \\ &= (1-\alpha) (g_t - \nabla f(x_t)) + (1-\alpha ) (\nabla f(x_t) - \nabla f(x_{t+1})) + \alpha \epsilon_{t+1}. \end{align*}\]展开递推式得到
\[\begin{align*} g_{t} - \nabla f(x_{t}) &= (1-\alpha)^t (g_ 0- \nabla f(x_0)) + \sum_{k=0}^{t-1}(1-\alpha)^{t-k} ( \nabla f(x_k) - \nabla f(x_{k+1})) + \alpha \sum_{k=0}^{t-1} (1-\alpha)^{t-k-1} \epsilon_{k+1}. \end{align*}\]取期望后得到
\[\begin{align*} \mathbb{E} \Vert g_t - \nabla f(x_t) \Vert &\le (1-\alpha)^t \sqrt{\mathbb{E} \Vert \epsilon_0 \Vert^2} + \sum_{k=0}^{t-1} (1-\alpha)^{t-k} \mathbb{E}\Vert \nabla f(x_k) - \nabla f(x_{k+1}) \Vert + \alpha \sqrt{ \mathbb{E} \left \Vert \sum_{k=0}^{t-1}(1-\alpha)^{t-k-1} \epsilon_{k+1} \right \Vert^2 } \\ &=(1-\alpha)^t \sqrt{\mathbb{E} \Vert \epsilon_0 \Vert^2} + \sum_{k=0}^{t-1} (1-\alpha)^{t-k} \mathbb{E}\Vert \nabla f(x_k) - \nabla f(x_{k+1}) \Vert +\alpha \sqrt{ \sum_{k=0}^{t-1}(1-\alpha)^{2(t-k-1)} \left \Vert \epsilon_{k+1} \right \Vert^2 } \\ &\le (1-\alpha)^t \sigma + \sum_{k=0}^{t-1} (1-\alpha)^{t-k} \eta L + \alpha \sigma \sqrt{ \sum_{k=0}^{t-1} (1-\alpha)^{2(t-k-1)}} \\ &\le (1-\alpha)^{t} \sigma + \frac{\eta L}{\alpha} + \sqrt{\alpha} \sigma. \end{align*}\]代入并且递推后得到
\[\begin{align*} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E} \Vert \nabla f(x_t) \Vert \le \frac{\Delta}{\eta T} + \frac{\eta L}{2}+ \frac{2 \sigma}{\alpha T} +\frac{2\eta L}{\alpha} + 2\sqrt{\alpha} \sigma = \mathcal{O} \left( T^{-1/4}\right) \end{align*}\]Second Order Smoothness
该文章有意思的结论是,当函数进一步满足Hessian Lipschitz连续的时候,可以得到比SGD更快的收敛率,记Hessian的连续性系数为 $\rho$.
考虑如下的更新方式,
\[\begin{align*} y_t &= \frac{1}{\alpha} x_t - \frac{1-\alpha}{\alpha} x_{t-1} \\ g_t &= (1-\alpha) g_{t-1} + \alpha \nabla f(y_t;\xi_t) \\ x_{t-1} &= x_t - \eta \frac{g_t}{ \Vert g_t \Vert } \end{align*}\]该更新方式的核心在于利用更高阶的连续性,
\[\begin{align*} &\quad g_{t+1} - \nabla f(x_{t+1}) \\ &= (1-\alpha) g_t + \alpha \nabla F(y_{t+1};\xi_t) - \nabla f(x_{t+1}) \\ &= (1-\alpha) (g_t - \nabla f(x_t)) + (1-\alpha) (\nabla f(x_t) - \nabla f(x_{t+1})) +\alpha (\nabla f(y_{t+1}) - \nabla f(x_{t+1})) + \alpha \epsilon_{t+1} \\ &= (1-\alpha) (g_t - \nabla f(x_t)) + (1-\alpha) (\nabla f(x_t) - \nabla f(x_{t+1}) - \nabla^2 f(x_{t+1}) (x_t - x_{t+1})) \\ &\quad +\alpha (\nabla f(y_{t+1}) - \nabla f(x_{t+1}) - \nabla^2 f(x_{t+1})(y_{t+1} - x_{t+1}) ) + \alpha \epsilon_{t+1}. \end{align*}\]展开递推式并取期望后得到
\[\begin{align*} &\quad \mathbb{E} \Vert g_t - \nabla f(x_t) \Vert \\ &\le(1-\alpha)^t \sqrt{\mathbb{E} \Vert \epsilon_0 \Vert^2} + \sum_{k=0}^{t-1} (1-\alpha)^{t-k} \mathbb{E}\Vert \nabla f(x_k) - \nabla f(x_{k+1}) - \nabla^2 f(x_{k+1}) (x_k - x_{k+1}) \Vert \\ &\quad + \alpha\sum_{k=0}^{t-1} (1-\alpha)^{t-k-1} \mathbb{E}\Vert \nabla f(y_{k+1}) - \nabla f(x_{k+1}) - \nabla^2 f(x_{k+1}) (y_{k+1} - x_{k+1}) \Vert +\alpha \sqrt{ \sum_{k=0}^{t-1}(1-\alpha)^{2(t-k-1)} \left \Vert \epsilon_{k+1} \right \Vert^2 } \\ &\le (1-\alpha)^t \sigma + 2\sum_{k=0}^{t-1} (1-\alpha)^{t-k} \eta^2 \rho + \alpha \sigma \sqrt{ \sum_{k=0}^{t-1} (1-\alpha)^{2(t-k-1)}} \\ &\le (1-\alpha)^{t} \sigma + \frac{\eta^2 \rho}{\alpha^2} + \sqrt{\alpha} \sigma. \end{align*}\]代入并且递推后得到
\[\begin{align*} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E} \Vert \nabla f(x_t) \Vert \le \frac{\Delta}{\eta T} + \frac{\eta L}{2 }+ \frac{2 \sigma}{\alpha T} +\frac{2\eta^2 \rho}{\alpha^2} + 2\sqrt{\alpha} \sigma = \mathcal{O} \left( T^{-2/7}\right) \end{align*}\]当取 $\alpha = \mathcal{O} \left( T^{-4/7} \right)$ 以及 $\eta = \mathcal{O} \left( T^{-5/7} \right)$ 的时候取到等号。