Least Square SGD with Tail Average

8 minute read

Published: August 18, 2022

Paper Reading: A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares)

Introduction

考虑 SGD 求解最小二乘问题，

\[\begin{align*} \min L(w) \triangleq \frac{1}{2} \mathbb{E}_{(x,y) \in \mathcal{D}} [(y - x^\top w )^2]. \end{align*}\]

根据最优点条件，成立

\[\begin{align*} \mathbb{E}[x (x^\top w_* -y) ] = 0 \end{align*}\]

SGD 的迭代格式如下，

\[\begin{align*} w_t = w_{t-1} + \gamma (y_t- x_t^\top w_{t-1}) x_t \end{align*}\]

定义

\[\begin{align*} H \triangleq \mathbb{E} [xx^\top], \quad \Sigma \triangleq \mathbb{E}[(y - x^\top w_*)^2 x x^\top], \quad \sigma^2 \triangleq \Vert H^{-1/2} \Sigma H^{-1/2} \Vert \end{align*}\]

上述定义推广了简单的加性噪声模型，也即

\[\begin{align*} y = x^\top w_* + \xi, \quad \xi \sim \mathcal{N}(0,\sigma^2). \end{align*}\]

通常假设

\[\begin{align*} \mu I \preceq \mathbb{E}[xx^\top] \preceq L I. \end{align*}\]

文章推导时在不等号的另一边使用了稍强一点的假设，

\[\begin{align*} \mathbb{E}[xx^\top x x^\top ] \preceq L \mathbb{E}[xx^\top]. \end{align*}\]

可以得到该模型下的泛化风险为，

\[\begin{align*} &\quad L(w) - L(w_*) \\ &= \frac{1}{2} \mathbb{E}[ (y - x^\top w)^2 - (y - x^\top w_*)^2 ] \\ &= \frac{1}{2} \mathbb{E}[ w^\top xx^\top w -w_*^\top xx^\top w_* - 2 (w - w_*)^\top xy] \\ &= \frac{1}{2} \mathbb{E}[ w^\top xx^\top w -w_*^\top xx^\top w_* - 2 (w - w_*)^\top xx^\top w_*] \\ &= \frac{1}{2} \mathbb{E}[ w^\top xx^\top w +w_*^\top xx^\top w_* - 2 w xx^\top w_*] \\ &= \frac{1}{2} \mathbb{E}[(w- w^*)^\top xx^\top (w - w^*)] \\ &= \frac{1}{2} \Vert w - w_* \Vert_H^2. \end{align*}\]

Risk Bound Decomposition

定义最优点处的梯度，

\[\begin{align*} \xi_t =\nabla L(w_*;x_t,y_t) = x_t ( x_t^\top w_* - y_t) . \end{align*}\]

根据最优性条件，其满足 $\mathbb{E}[\xi_t] = 0$, 定义其协方差矩阵

\[\begin{align*} \Sigma_t = \mathbb{E}[\xi_t \xi_t^\top]. \end{align*}\]

根据假设，成立，

\[\begin{align*} \Sigma \preceq \sigma^2 H. \end{align*}\]

SGD的更新公式为，

\[\begin{align*} w_{t} - w_* &= w_{t-1} - w_* -\gamma \nabla L(w_t; x_t,y_t) \\ &= w_{t-1} - w_* - \gamma x_t (x_t^\top w_{t-1} - y_t) \\ &= (I - \gamma x_t x_t^\top) (w_{t-1} - w_*) - \gamma \xi_t \end{align*}\]

定义 $B_t = I - \gamma x_t x_t^\top$, 得到其递推式

\[\begin{align*} w_{t} - w_* = B_t (w_{t-1} - w_*) - \gamma \xi_t. \end{align*}\]

展开后得到，

\[\begin{align*} w_t - w_* = B_t B_{t-1} \cdots B_0(w_0 - w_*) - \gamma (\xi_t + B_{t} \xi_{t-1} + \cdots +B_{t} B_{t-1} \cdots B_2 \xi_1) \end{align*}\]

利用Cauthy不等式，得到

\[\begin{align*} \mathbb{E} [u^\top Hv]^2 \le \mathbb{E}[u^\top Hu] \mathbb{E}[v^\top Hv], \end{align*}\]

因此有，

\[\begin{align*} \mathbb{E}[(u+v)^\top H (u+v)] &\le \left( \sqrt{\mathbb{E} [u^\top Hu]} + \sqrt{\mathbb{E} [v^\top Hv]}\right)^2 \\ &\le 2 \mathbb{E}[u^\top H u] + 2 \mathbb{E}[v^\top H v]. \end{align*}\]

据此可以得到泛化风险的上界，

\[\begin{align*} L(w_t) - L(w_*) \le \mathbb{E}_{\xi_t = 0} \Vert w_t - w_* \Vert_H^2 +\mathbb{E}_{w_0 = w_*}\Vert w_t - w_* \Vert_H^2. \end{align*}\]

上述第一项与噪声无关，称为确定性的偏差项，第二项和初始点无关，相当于初始与 $w_*$ 产生的误差，称为方差项。

下面分别控制偏差项以及方差项的界。

Bias Bound

可以得到在噪声项 $\xi_t = 0$的前提下更新公式为

\[\begin{align*} w_{t} - w_* &= B_t (w_{t-1} - w_*) =(I - \gamma x_t x_t^\top) (w_{t-1} - w_*) \end{align*}\]

据此当 $\gamma \le 1/ R^2$ 的时候，成立

\[\begin{align*} &\quad \mathbb{E}[ \Vert w_t - w_* \Vert^2] \\ &= \mathbb{E}[ \Vert (I - \gamma x_t x_t^\top ) (w_{t-1} - w_*) \Vert^2] \\ &= \mathbb{E}[ \Vert w_{t-1} - w_* \Vert^2] - 2 \gamma \mathbb{E}[ (w_{t-1} - w_*)x_t x_t^\top (w _{t-1} - w_*)] \\ &\quad + \gamma^2 \mathbb{E}[ (w_{t-1} -w_*)^\top x_t x_t^\top x_t x_t^\top (w_{t-1} - w_*)] \\ &= \mathbb{E}[ \Vert w_{t-1} - w_* \Vert^2] - 2 \gamma H \mathbb{E}[\Vert w_{t-1} - w_* \Vert^2] + \gamma^2 LH \mathbb{E}[\Vert w_{t-1} - w_* \Vert^2] \\ &\le (1- \gamma \mu) \mathbb{E}[ \Vert w_{t-1} - w_* \Vert^2]. \end{align*}\]

因此得到偏差项的上界为

\[\begin{align*} \mathbb{E}[\Vert w_t - w_* \Vert_H^2] \le L \mathbb{E}[\Vert w_t - w_* \Vert^2] \le (1- \gamma \mu)^t L \mathbb{E}[\Vert w_0 - w_* \Vert^2]. \end{align*}\]

Variance Bound

假设 $w_0 = w_*$, 因此也成立 $ \mathbb{E} [w_t] = \mathbb{E}[w_0] $, 根据更新公式

\[\begin{align*} w_{t} - w_* &= B_t (w_{t-1} - w_*) - \gamma \xi_t \\ &= (I - \gamma x_t x_t^\top) (w_{t-1} - w_*) - \gamma \xi_t \end{align*}\]

我们希望分析协方差矩阵

$\begin{align*} C_t = \mathbb{E}_{w_0 =w_*}[ (w_t - w_*)(w_t - w_*)^\top ] \end{align*}$ 根据 $\xi_t$ 的条件独立性，并且展开递推式可以得到，

\[\begin{align*} w_t - w_* = - \gamma (\xi_t + B_{t} \xi_{t-1} + \cdots +B_{t} B_{t-1} \cdots B_2 \xi_1). \end{align*}\]

进一步，

\[\begin{align*} C_t &= \gamma^2 \left(\Sigma_t + B_t \Sigma_{t-1} B_t^\top + \cdots + B_{t} B_{t-1} \cdots B_2 \Sigma_1 B_2^\top \cdots B_{t-1}^\top B_t \right) \\ &= C_{t-1} + \gamma^2 B_{t} B_{t-1} \cdots B_2 \Sigma_1 B_2^\top \cdots B_{t-1}^\top B_t \end{align*}\]

因此得到了$C_t$ 为单调递增序列，

\[\begin{align*} O = C_0 \preceq C_1 \preceq \cdots \preceq C_t \end{align*}\]

另一方面根据 $x_t$ 与 $w_{t-1}$ 的独立性，可以得到其递推式

\[\begin{align*} C_{t} &= \mathbb{E}[ (I - \gamma x_t x_t^\top )C_{t-1} (I - \gamma x_t x_t^\top )] + \gamma^2 \Sigma_t \\ &=C_{t-1} - \gamma H C_{t-1} - \gamma C_{t-1} H + \gamma^2 \mathbb{E}[ x_t x_t^\top C_tx_t x_t^\top] + \gamma^2 \Sigma_t. \end{align*}\]

利用该递推式可以发现，

\[\begin{align*} {\rm tr}(C_t) &= {\rm tr} (C_{t-1} - \gamma H C_{t-1} - \gamma C_{t-1} H + \gamma^2 \mathbb{E}[ x_t x_t^\top C_tx_t x_t^\top] + \gamma^2 \Sigma_t) \\ &= {\rm tr} (C_{t-1}) - 2\gamma {\rm tr}(HC_{t-1}) + \gamma^2 {\rm tr} \mathbb{E}[C_t x_t x_t^\top x_t x_t^\top ] + \gamma^2 {\rm tr}(\Sigma) \\ &\le {\rm tr} (C_{t-1}) - 2\gamma {\rm tr}(HC_{t-1}) + \gamma^2 L {\rm tr}( H C_t) + \gamma^2 {\rm tr}(\Sigma) \\ &\le (1- \gamma \mu) {\rm tr}(C_{t-1}) + \gamma^2 {\rm tr}(\Sigma). \end{align*}\]

因此 $C_t$ 存在上界，

\[\begin{align*} {\rm tr}(C_t) \le \frac{\gamma}{\mu} {\rm tr}(\Sigma). \end{align*}\]

根据单调序列收敛定理，$C_t$ 的极限存在，记为 $C_{\infty}$, 其满足如下关系式，

\[\begin{align*} C_{\infty} &=C_{\infty} - \gamma H C_{\infty} - \gamma C_{\infty} H + \gamma^2 \mathbb{E}[ x x^\top C_{\infty}x x^\top] + \gamma^2 \Sigma. \end{align*}\]

若算法采用均值作为输出，

\[\begin{align*} \bar w_T = \frac{1}{T} \sum_{t=0}^{T-1} w_t, \end{align*}\]

则可以得到，

\[\begin{align*} &\quad \frac{1}{2} \mathbb{E}_{w_0 = w_*}[(\bar w_T - w_*) (\bar w_T - w_*)^\top ] \\ &= \frac{1}{2 T^2} \sum_{t=0}^{T-1} \sum_{t' = 0}^{T-1} \mathbb{E}[ (w_t - w_*) (w_{t'} - w_*)^\top] \\ &\preceq \frac{1}{2 T^2} \sum_{t=0}^{T-1} \sum_{t' = t}^{T-1} \mathbb{E}[ (w_t - w_*) (w_{t'} - w_*)^\top + (w_{t'} - w_*) (w_{t} - w_*)^\top] \\ &= \frac{1}{2 T^2} \sum_{t=0}^{T-1} \sum_{\tau =0}^{T-t-1} \mathbb{E}[ (w_t - w_*) (w_t - w_*)^\top(I - \gamma H)^\tau + (I - \gamma H)^\tau (w_t - w_*) (w_t - w_*)^\top]. \end{align*}\]

进一步有

\[\begin{align*} &\quad \frac{1}{2} \mathbb{E}_{w_0 = w_*} [\Vert \bar w_T - w_* \Vert_H^2] \\ &= \frac{1}{2} {\rm tr} \mathbb{E}[H (\bar w_T - w_*) (\bar w_T - w_*)^\top ] \\ &\le \frac{1}{T^2} \sum_{t=0}^{T-1} \sum_{\tau = 0}^{T - t-1} {\rm tr} \mathbb{E}[ H C_t (I- \gamma H)^{\tau} ] \\ &\le \frac{1}{T^2} \sum_{t=0}^{T-1} \sum_{\tau = 0}^{\infty} {\rm tr} \mathbb{E}[ H C_t (I- \gamma H)^{\tau} ] \\ &= \frac{1}{T^2} \sum_{t=0}^{T-1} {\rm tr}\mathbb{E}[ C_t (\gamma H)^{-1} H ] \\ &= \frac{1}{\gamma T^2 }\sum_{t=0}^{T-1} {\rm tr}(C_t) \\ &\le \frac{1}{ \gamma T} {\rm tr}( C_{\infty}). \end{align*}\]

因此问题转化为对 ${\rm tr}( C_{\infty})$ 的上界的估计，我们知道其满足，

\[\begin{align*} H C_{\infty} + C_{\infty} H = \gamma \mathbb{E}[xx^\top C_{\infty} xx^\top] + \gamma \Sigma. \end{align*}\]

定义如下的矩阵算子，

\[\begin{align*} \mathcal{S} \circ M &= \mathbb{E}[ xx^\top M xx^\top] \\ \mathcal{T} \circ M &= HM + MH - \gamma HM H . \end{align*}\]

可以发现算子 $\mathcal{S}$为半正定算子, 也即

\[\begin{align*} M \preceq M' \Rightarrow \mathcal{S} \circ M \preceq \mathcal{S} \circ M'. \end{align*}\]

可以将方程写为，

\[\begin{align*} \mathcal{T} \circ C_{\infty} = \gamma S \circ C_{\infty} + \gamma \Sigma - \gamma H C_{\infty} H. \end{align*}\]

根据关系式，

\[\begin{align*} ( I - \gamma \mathcal{T} )\circ M = (I - \gamma H) M (I - \gamma H) \end{align*}\]

以及Neumann级数可以得到，

\[\begin{align*} \mathcal{T}^{-1} = \gamma \sum_{k=0}^{\infty} (I - \gamma \mathcal{T})^k = \gamma \sum_{k=0}^{\infty} (I- \gamma H)^k M (I - \gamma H)^k. \end{align*}\]

因此算子 $\mathcal{T}^{-1}$ 也为半正定算子，因此成立，

\[\begin{align*} C_{\infty} &\preceq \gamma \mathcal{T}^{-1} \circ S \circ C_{\infty} + \gamma \mathcal{T}^{-1} \circ\Sigma - \gamma \mathcal{T}^{-1} \circ (H C_{\infty} H) \\ &\preceq \gamma \mathcal{T}^{-1} \circ S \circ C_{\infty} + \gamma \mathcal{T}^{-1} \circ\Sigma \\ &\preceq \gamma \mathcal{T}^{-1} \circ S \circ C_{\infty} + \gamma \sigma^2\mathcal{T}^{-1} \circ H. \end{align*}\]

递推可以得到，

\[\begin{align*} C_{\infty} \preceq \gamma \sigma^2 \sum_{t=0}^{\infty }(\gamma \mathcal{T}^{-1} \circ \mathcal{S} )^{t} \circ \mathcal{T}^{-1} \circ H. \end{align*}\]

进一步利用

\[\begin{align*} \mathcal{T}^{-1} H &= \gamma \sum_{k=0}^{\infty} (I- \gamma H)^k H ( I - \gamma H) \\ &= \gamma \sum_{k=0}^{\infty} (I- \gamma H)^{2k} H \\ &\preceq \gamma \sum_{k=0}^{\infty} (I- \gamma H)^{k} H \\ &= I. \end{align*}\]

以及根据假设有，

\[\begin{align*} \mathcal{S} \circ I \preceq LH. \end{align*}\]

递推后可以得到,

\[\begin{align*} C_{\infty} \preceq \gamma \sigma^2\sum_{t=0}^{\infty} (\gamma L)^t I = \frac{\gamma \sigma^2}{1 - \gamma L} I \end{align*}\]

代入得到方差项的界为，

\[\begin{align*} \mathbb{E}_{w_0 = w_*}[\Vert \bar w_T - w_* \Vert_H^2] \le \frac{2 \sigma^2 d}{T(1- \gamma L)} \end{align*}\]

Conclusion

使用尾平均策略作为输出，

\[\begin{align*} \bar w_{t:T} = \frac{1}{T-t} \sum_{t'=t}^{T-1} w_t . \end{align*}\]

并且代入偏差项和方差项的界可以得到泛化风险的上界

\[\begin{align*} &\quad L(\bar w_{t:T}) - L(w_*) \\ &\le \mathbb{E}_{\xi_t = 0} \Vert \bar w_{t:T} - w_* \Vert_H^2 +\mathbb{E}_{w_0 = w_*}\Vert w_{t:T} - w_* \Vert_H^2 \\ &\le (1- \gamma \mu)^t L \mathbb{E}[\Vert w_0 - w_* \Vert^2] + \frac{1}{T-t} \times \frac{2 \sigma^2 d}{1- \gamma L}. \end{align*}\]

选择 $\gamma = 1/(2L)$, 可以发现为了得到 $\epsilon$-最优解所需要的复杂度为，

\[\begin{align*} \mathcal{O} \left( \kappa \log ( \epsilon^{-1}) + \sigma^2 d \epsilon^{-1} \right) \end{align*}\]

Share on

Twitter Facebook LinkedIn

Lesi Chen (陈乐偲)

Least Square SGD with Tail Average

Introduction

Risk Bound Decomposition

Bias Bound

Variance Bound

Conclusion

Share on

You May Also Enjoy

Momentum Improves Normalized SGD

Benign Overfitting of SGD for Least Square

PAGE