Least Square SGD with Tail Average
Published:
Paper Reading: A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares)
Introduction
考虑 SGD 求解最小二乘问题,
\[\begin{align*} \min L(w) \triangleq \frac{1}{2} \mathbb{E}_{(x,y) \in \mathcal{D}} [(y - x^\top w )^2]. \end{align*}\]根据最优点条件,成立
\[\begin{align*} \mathbb{E}[x (x^\top w_* -y) ] = 0 \end{align*}\]SGD 的迭代格式如下,
\[\begin{align*} w_t = w_{t-1} + \gamma (y_t- x_t^\top w_{t-1}) x_t \end{align*}\]定义
\[\begin{align*} H \triangleq \mathbb{E} [xx^\top], \quad \Sigma \triangleq \mathbb{E}[(y - x^\top w_*)^2 x x^\top], \quad \sigma^2 \triangleq \Vert H^{-1/2} \Sigma H^{-1/2} \Vert \end{align*}\]上述定义推广了简单的加性噪声模型,也即
\[\begin{align*} y = x^\top w_* + \xi, \quad \xi \sim \mathcal{N}(0,\sigma^2). \end{align*}\]通常假设
\[\begin{align*} \mu I \preceq \mathbb{E}[xx^\top] \preceq L I. \end{align*}\]文章推导时在不等号的另一边使用了稍强一点的假设,
\[\begin{align*} \mathbb{E}[xx^\top x x^\top ] \preceq L \mathbb{E}[xx^\top]. \end{align*}\]可以得到该模型下的泛化风险为,
\[\begin{align*} &\quad L(w) - L(w_*) \\ &= \frac{1}{2} \mathbb{E}[ (y - x^\top w)^2 - (y - x^\top w_*)^2 ] \\ &= \frac{1}{2} \mathbb{E}[ w^\top xx^\top w -w_*^\top xx^\top w_* - 2 (w - w_*)^\top xy] \\ &= \frac{1}{2} \mathbb{E}[ w^\top xx^\top w -w_*^\top xx^\top w_* - 2 (w - w_*)^\top xx^\top w_*] \\ &= \frac{1}{2} \mathbb{E}[ w^\top xx^\top w +w_*^\top xx^\top w_* - 2 w xx^\top w_*] \\ &= \frac{1}{2} \mathbb{E}[(w- w^*)^\top xx^\top (w - w^*)] \\ &= \frac{1}{2} \Vert w - w_* \Vert_H^2. \end{align*}\]Risk Bound Decomposition
定义最优点处的梯度,
\[\begin{align*} \xi_t =\nabla L(w_*;x_t,y_t) = x_t ( x_t^\top w_* - y_t) . \end{align*}\]根据最优性条件,其满足 $\mathbb{E}[\xi_t] = 0$, 定义其协方差矩阵
\[\begin{align*} \Sigma_t = \mathbb{E}[\xi_t \xi_t^\top]. \end{align*}\]根据假设,成立,
\[\begin{align*} \Sigma \preceq \sigma^2 H. \end{align*}\]SGD的更新公式为,
\[\begin{align*} w_{t} - w_* &= w_{t-1} - w_* -\gamma \nabla L(w_t; x_t,y_t) \\ &= w_{t-1} - w_* - \gamma x_t (x_t^\top w_{t-1} - y_t) \\ &= (I - \gamma x_t x_t^\top) (w_{t-1} - w_*) - \gamma \xi_t \end{align*}\]定义 $B_t = I - \gamma x_t x_t^\top$, 得到其递推式
\[\begin{align*} w_{t} - w_* = B_t (w_{t-1} - w_*) - \gamma \xi_t. \end{align*}\]展开后得到,
\[\begin{align*} w_t - w_* = B_t B_{t-1} \cdots B_0(w_0 - w_*) - \gamma (\xi_t + B_{t} \xi_{t-1} + \cdots +B_{t} B_{t-1} \cdots B_2 \xi_1) \end{align*}\]利用Cauthy不等式,得到
\[\begin{align*} \mathbb{E} [u^\top Hv]^2 \le \mathbb{E}[u^\top Hu] \mathbb{E}[v^\top Hv], \end{align*}\]因此有,
\[\begin{align*} \mathbb{E}[(u+v)^\top H (u+v)] &\le \left( \sqrt{\mathbb{E} [u^\top Hu]} + \sqrt{\mathbb{E} [v^\top Hv]}\right)^2 \\ &\le 2 \mathbb{E}[u^\top H u] + 2 \mathbb{E}[v^\top H v]. \end{align*}\]据此可以得到泛化风险的上界,
\[\begin{align*} L(w_t) - L(w_*) \le \mathbb{E}_{\xi_t = 0} \Vert w_t - w_* \Vert_H^2 +\mathbb{E}_{w_0 = w_*}\Vert w_t - w_* \Vert_H^2. \end{align*}\]上述第一项与噪声无关,称为确定性的偏差项,第二项和初始点无关,相当于初始与 $w_*$ 产生的误差,称为方差项。
下面分别控制偏差项以及方差项的界。
Bias Bound
可以得到在噪声项 $\xi_t = 0$的前提下更新公式为
\[\begin{align*} w_{t} - w_* &= B_t (w_{t-1} - w_*) =(I - \gamma x_t x_t^\top) (w_{t-1} - w_*) \end{align*}\]据此当 $\gamma \le 1/ R^2$ 的时候,成立
\[\begin{align*} &\quad \mathbb{E}[ \Vert w_t - w_* \Vert^2] \\ &= \mathbb{E}[ \Vert (I - \gamma x_t x_t^\top ) (w_{t-1} - w_*) \Vert^2] \\ &= \mathbb{E}[ \Vert w_{t-1} - w_* \Vert^2] - 2 \gamma \mathbb{E}[ (w_{t-1} - w_*)x_t x_t^\top (w _{t-1} - w_*)] \\ &\quad + \gamma^2 \mathbb{E}[ (w_{t-1} -w_*)^\top x_t x_t^\top x_t x_t^\top (w_{t-1} - w_*)] \\ &= \mathbb{E}[ \Vert w_{t-1} - w_* \Vert^2] - 2 \gamma H \mathbb{E}[\Vert w_{t-1} - w_* \Vert^2] + \gamma^2 LH \mathbb{E}[\Vert w_{t-1} - w_* \Vert^2] \\ &\le (1- \gamma \mu) \mathbb{E}[ \Vert w_{t-1} - w_* \Vert^2]. \end{align*}\]因此得到偏差项的上界为
\[\begin{align*} \mathbb{E}[\Vert w_t - w_* \Vert_H^2] \le L \mathbb{E}[\Vert w_t - w_* \Vert^2] \le (1- \gamma \mu)^t L \mathbb{E}[\Vert w_0 - w_* \Vert^2]. \end{align*}\]Variance Bound
假设 $w_0 = w_*$, 因此也成立 $ \mathbb{E} [w_t] = \mathbb{E}[w_0] $, 根据更新公式
\[\begin{align*} w_{t} - w_* &= B_t (w_{t-1} - w_*) - \gamma \xi_t \\ &= (I - \gamma x_t x_t^\top) (w_{t-1} - w_*) - \gamma \xi_t \end{align*}\]我们希望分析协方差矩阵
\(\begin{align*} C_t = \mathbb{E}_{w_0 =w_*}[ (w_t - w_*)(w_t - w_*)^\top ] \end{align*}\) 根据 $\xi_t$ 的条件独立性,并且展开递推式可以得到,
\[\begin{align*} w_t - w_* = - \gamma (\xi_t + B_{t} \xi_{t-1} + \cdots +B_{t} B_{t-1} \cdots B_2 \xi_1). \end{align*}\]进一步,
\[\begin{align*} C_t &= \gamma^2 \left(\Sigma_t + B_t \Sigma_{t-1} B_t^\top + \cdots + B_{t} B_{t-1} \cdots B_2 \Sigma_1 B_2^\top \cdots B_{t-1}^\top B_t \right) \\ &= C_{t-1} + \gamma^2 B_{t} B_{t-1} \cdots B_2 \Sigma_1 B_2^\top \cdots B_{t-1}^\top B_t \end{align*}\]因此得到了$C_t$ 为单调递增序列,
\[\begin{align*} O = C_0 \preceq C_1 \preceq \cdots \preceq C_t \end{align*}\]另一方面根据 $x_t$ 与 $w_{t-1}$ 的独立性,可以得到其递推式
\[\begin{align*} C_{t} &= \mathbb{E}[ (I - \gamma x_t x_t^\top )C_{t-1} (I - \gamma x_t x_t^\top )] + \gamma^2 \Sigma_t \\ &=C_{t-1} - \gamma H C_{t-1} - \gamma C_{t-1} H + \gamma^2 \mathbb{E}[ x_t x_t^\top C_tx_t x_t^\top] + \gamma^2 \Sigma_t. \end{align*}\]利用该递推式可以发现,
\[\begin{align*} {\rm tr}(C_t) &= {\rm tr} (C_{t-1} - \gamma H C_{t-1} - \gamma C_{t-1} H + \gamma^2 \mathbb{E}[ x_t x_t^\top C_tx_t x_t^\top] + \gamma^2 \Sigma_t) \\ &= {\rm tr} (C_{t-1}) - 2\gamma {\rm tr}(HC_{t-1}) + \gamma^2 {\rm tr} \mathbb{E}[C_t x_t x_t^\top x_t x_t^\top ] + \gamma^2 {\rm tr}(\Sigma) \\ &\le {\rm tr} (C_{t-1}) - 2\gamma {\rm tr}(HC_{t-1}) + \gamma^2 L {\rm tr}( H C_t) + \gamma^2 {\rm tr}(\Sigma) \\ &\le (1- \gamma \mu) {\rm tr}(C_{t-1}) + \gamma^2 {\rm tr}(\Sigma). \end{align*}\]因此 $C_t$ 存在上界,
\[\begin{align*} {\rm tr}(C_t) \le \frac{\gamma}{\mu} {\rm tr}(\Sigma). \end{align*}\]根据单调序列收敛定理,$C_t$ 的极限存在,记为 $C_{\infty}$, 其满足如下关系式,
\[\begin{align*} C_{\infty} &=C_{\infty} - \gamma H C_{\infty} - \gamma C_{\infty} H + \gamma^2 \mathbb{E}[ x x^\top C_{\infty}x x^\top] + \gamma^2 \Sigma. \end{align*}\]若算法采用均值作为输出,
\[\begin{align*} \bar w_T = \frac{1}{T} \sum_{t=0}^{T-1} w_t, \end{align*}\]则可以得到,
\[\begin{align*} &\quad \frac{1}{2} \mathbb{E}_{w_0 = w_*}[(\bar w_T - w_*) (\bar w_T - w_*)^\top ] \\ &= \frac{1}{2 T^2} \sum_{t=0}^{T-1} \sum_{t' = 0}^{T-1} \mathbb{E}[ (w_t - w_*) (w_{t'} - w_*)^\top] \\ &\preceq \frac{1}{2 T^2} \sum_{t=0}^{T-1} \sum_{t' = t}^{T-1} \mathbb{E}[ (w_t - w_*) (w_{t'} - w_*)^\top + (w_{t'} - w_*) (w_{t} - w_*)^\top] \\ &= \frac{1}{2 T^2} \sum_{t=0}^{T-1} \sum_{\tau =0}^{T-t-1} \mathbb{E}[ (w_t - w_*) (w_t - w_*)^\top(I - \gamma H)^\tau + (I - \gamma H)^\tau (w_t - w_*) (w_t - w_*)^\top]. \end{align*}\]进一步有
\[\begin{align*} &\quad \frac{1}{2} \mathbb{E}_{w_0 = w_*} [\Vert \bar w_T - w_* \Vert_H^2] \\ &= \frac{1}{2} {\rm tr} \mathbb{E}[H (\bar w_T - w_*) (\bar w_T - w_*)^\top ] \\ &\le \frac{1}{T^2} \sum_{t=0}^{T-1} \sum_{\tau = 0}^{T - t-1} {\rm tr} \mathbb{E}[ H C_t (I- \gamma H)^{\tau} ] \\ &\le \frac{1}{T^2} \sum_{t=0}^{T-1} \sum_{\tau = 0}^{\infty} {\rm tr} \mathbb{E}[ H C_t (I- \gamma H)^{\tau} ] \\ &= \frac{1}{T^2} \sum_{t=0}^{T-1} {\rm tr}\mathbb{E}[ C_t (\gamma H)^{-1} H ] \\ &= \frac{1}{\gamma T^2 }\sum_{t=0}^{T-1} {\rm tr}(C_t) \\ &\le \frac{1}{ \gamma T} {\rm tr}( C_{\infty}). \end{align*}\]因此问题转化为对 ${\rm tr}( C_{\infty})$ 的上界的估计,我们知道其满足,
\[\begin{align*} H C_{\infty} + C_{\infty} H = \gamma \mathbb{E}[xx^\top C_{\infty} xx^\top] + \gamma \Sigma. \end{align*}\]定义如下的矩阵算子,
\[\begin{align*} \mathcal{S} \circ M &= \mathbb{E}[ xx^\top M xx^\top] \\ \mathcal{T} \circ M &= HM + MH - \gamma HM H . \end{align*}\]可以发现算子 $\mathcal{S}$为半正定算子, 也即
\[\begin{align*} M \preceq M' \Rightarrow \mathcal{S} \circ M \preceq \mathcal{S} \circ M'. \end{align*}\]可以将方程写为,
\[\begin{align*} \mathcal{T} \circ C_{\infty} = \gamma S \circ C_{\infty} + \gamma \Sigma - \gamma H C_{\infty} H. \end{align*}\]根据关系式,
\[\begin{align*} ( I - \gamma \mathcal{T} )\circ M = (I - \gamma H) M (I - \gamma H) \end{align*}\]以及Neumann级数可以得到,
\[\begin{align*} \mathcal{T}^{-1} = \gamma \sum_{k=0}^{\infty} (I - \gamma \mathcal{T})^k = \gamma \sum_{k=0}^{\infty} (I- \gamma H)^k M (I - \gamma H)^k. \end{align*}\]因此算子 $\mathcal{T}^{-1}$ 也为半正定算子,因此成立,
\[\begin{align*} C_{\infty} &\preceq \gamma \mathcal{T}^{-1} \circ S \circ C_{\infty} + \gamma \mathcal{T}^{-1} \circ\Sigma - \gamma \mathcal{T}^{-1} \circ (H C_{\infty} H) \\ &\preceq \gamma \mathcal{T}^{-1} \circ S \circ C_{\infty} + \gamma \mathcal{T}^{-1} \circ\Sigma \\ &\preceq \gamma \mathcal{T}^{-1} \circ S \circ C_{\infty} + \gamma \sigma^2\mathcal{T}^{-1} \circ H. \end{align*}\]递推可以得到,
\[\begin{align*} C_{\infty} \preceq \gamma \sigma^2 \sum_{t=0}^{\infty }(\gamma \mathcal{T}^{-1} \circ \mathcal{S} )^{t} \circ \mathcal{T}^{-1} \circ H. \end{align*}\]进一步利用
\[\begin{align*} \mathcal{T}^{-1} H &= \gamma \sum_{k=0}^{\infty} (I- \gamma H)^k H ( I - \gamma H) \\ &= \gamma \sum_{k=0}^{\infty} (I- \gamma H)^{2k} H \\ &\preceq \gamma \sum_{k=0}^{\infty} (I- \gamma H)^{k} H \\ &= I. \end{align*}\]以及根据假设有,
\[\begin{align*} \mathcal{S} \circ I \preceq LH. \end{align*}\]递推后可以得到,
\[\begin{align*} C_{\infty} \preceq \gamma \sigma^2\sum_{t=0}^{\infty} (\gamma L)^t I = \frac{\gamma \sigma^2}{1 - \gamma L} I \end{align*}\]代入得到方差项的界为,
\[\begin{align*} \mathbb{E}_{w_0 = w_*}[\Vert \bar w_T - w_* \Vert_H^2] \le \frac{2 \sigma^2 d}{T(1- \gamma L)} \end{align*}\]Conclusion
使用尾平均策略作为输出,
\[\begin{align*} \bar w_{t:T} = \frac{1}{T-t} \sum_{t'=t}^{T-1} w_t . \end{align*}\]并且代入偏差项和方差项的界可以得到泛化风险的上界
\[\begin{align*} &\quad L(\bar w_{t:T}) - L(w_*) \\ &\le \mathbb{E}_{\xi_t = 0} \Vert \bar w_{t:T} - w_* \Vert_H^2 +\mathbb{E}_{w_0 = w_*}\Vert w_{t:T} - w_* \Vert_H^2 \\ &\le (1- \gamma \mu)^t L \mathbb{E}[\Vert w_0 - w_* \Vert^2] + \frac{1}{T-t} \times \frac{2 \sigma^2 d}{1- \gamma L}. \end{align*}\]选择 $\gamma = 1/(2L)$, 可以发现为了得到 $\epsilon$-最优解所需要的复杂度为,
\[\begin{align*} \mathcal{O} \left( \kappa \log ( \epsilon^{-1}) + \sigma^2 d \epsilon^{-1} \right) \end{align*}\]