SDCA: Stochastic Dual Coordinate Ascent
Paper reading notes: Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
Consider the structural risk minimization problem with $L_2$-norm regularization that arises in machine learning,
\[\min_w P(w) = \min_w \frac{1}{n} \sum_{i=1}^n \phi_i(w^\top x_i) + \frac{\lambda}{2} \Vert w \Vert^2\]For example, for the SVM, $\phi_i(a) = \max(0, 1- y_i a)$.
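As a quick concreteness check, here is a minimal NumPy sketch of this primal objective for the SVM instance; the function name `svm_primal` and the data layout (rows of `X` are the $x_i$) are illustrative choices, not from the paper.

```python
import numpy as np

def svm_primal(w, X, y, lam):
    """P(w) = (1/n) * sum_i max(0, 1 - y_i * w^T x_i) + (lam / 2) * ||w||^2."""
    margins = y * (X @ w)                      # y_i * w^T x_i, one entry per sample
    hinge = np.maximum(0.0, 1.0 - margins)     # phi_i for the SVM
    return hinge.mean() + 0.5 * lam * (w @ w)
```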
Reformulating the problem with the affine constraints $z_i = w^\top x_i$,
\[\begin{align} \min_{w,\mathbf{z}} \frac{1}{n} \sum_{i=1}^n \phi_i(z_i) + \frac{\lambda}{2} \Vert w \Vert^2, \text{ s.t. } z_i = w^\top x_i \end{align}\]and applying the method of Lagrange multipliers,
\[\begin{align} P(w_{\ast}) &= \min_{w,\mathbf{z}} \max_{\mathbf{\alpha}} L(w,\mathbf{z}, \mathbf{\alpha}) \\ &= \min_{w,\mathbf{z}} \max_{\mathbf{\alpha}}\frac{1}{n} \sum_{i=1}^n \phi_i(z_i) + \frac{\lambda}{2} \Vert w \Vert^2 + \frac{1}{n}\sum_{i=1}^n \alpha_i (z_i - w^\top x_i ) \end{align}\]The following derivation shows that the dual problem provides a lower bound on the primal:
\[\begin{align} D(\mathbf{\alpha}_{\ast}) &= \max_{\mathbf{\alpha}} \min_{w , \mathbf{z}} L(w, \mathbf{z}, \mathbf{\alpha}) \\ &\le \max_{\mathbf{\alpha}} L(w_{\ast},\mathbf{z}_{\ast}, \mathbf{\alpha}) \\ &= \max_{\mathbf{\alpha}} \frac{1}{n} \sum_{i=1}^n \phi_i(z_i^{\ast}) + \frac{\lambda}{2} \Vert w_{\ast} \Vert^2 + \frac{1}{n}\sum_{i=1}^n \alpha_i ( z_i^{\ast} - w_{\ast}^\top x_i) \\ &= \frac{1}{n} \sum_{i=1}^n \phi_i(z_i^{\ast}) + \frac{\lambda}{2} \Vert w_{\ast} \Vert^2, \text{ since } w_{\ast}^\top x_i = z_i^{\ast} \\ &=P(w_{\ast}) \end{align}\]For this structural risk minimization problem, the dual admits an explicit expression in terms of the convex conjugates $\phi_i^{\ast}$,
\[\begin{align} D(\mathbf{\alpha}_{\ast})&= \max_{\mathbf{\alpha}} \min_{w,\mathbf{z}} \frac{1}{n} \sum_{i=1}^n \phi_i(z_i) + \frac{\lambda}{2} \Vert w \Vert^2 + \frac{1}{n} \sum_{i=1}^n \alpha_i (z_i - w^\top x_i ) \\ &=\max_{\mathbf{\alpha}} \min_{\mathbf{z}} \frac{1}{n}\sum_{i=1}^n \phi_i(z_i) + \frac{1}{n}\sum_{i=1}^n \alpha_i z_i - \frac{\lambda}{2} \Vert w \Vert^2 ,\text{ by } w = \frac{1}{ \lambda n} \sum_{i=1}^n \alpha_i x_i \\ &= \max_{\mathbf{\alpha}} \frac{1}{n} \sum_{i=1}^n - \phi_i^{\ast}(-\alpha_i) - \frac{\lambda}{2} \Vert w \Vert^2 \end{align}\]where in the last two lines $w = w(\mathbf{\alpha}) = \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i x_i$ is the minimizer over $w$.
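As a concrete instance (a standard computation, included here for reference), the hinge loss $\phi_i(a) = \max(0, 1 - y_i a)$ with $y_i \in \{\pm 1\}$ has conjugate

\[\phi_i^{\ast}(u) = \begin{cases} u y_i, & u y_i \in [-1, 0] \\ +\infty, & \text{otherwise,} \end{cases}\]

so $-\phi_i^{\ast}(-\alpha_i) = \alpha_i y_i$ subject to $0 \le \alpha_i y_i \le 1$, and the dual becomes the familiar SVM dual

\[D(\mathbf{\alpha}) = \frac{1}{n}\sum_{i=1}^n \alpha_i y_i - \frac{\lambda}{2} \Big\Vert \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i \Big\Vert^2, \quad \text{s.t. } 0 \le \alpha_i y_i \le 1.\]

Viewing $\mathbf{\alpha}$ as a vector, each optimization step only needs to consider one of its coordinates; this gives the stochastic coordinate ascent method: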
\[\begin{align} \text{Pick } i &\sim \mathcal{U}(1,n) \\ \Delta \alpha_i &= \text{argmax}_{\Delta \alpha_i} - \phi_i^{\ast}(-\alpha_i^{(k)} - \Delta \alpha_i) - \frac{\lambda n}{2} \Vert w^{(k)} + \frac{1}{\lambda n} \Delta \alpha_i x_i \Vert^2 \\ \alpha_i^{(k+1)} &= \alpha_i^{(k)} + \Delta \alpha_i \\ w^{(k+1) } &= w^{(k)} + \frac{1}{\lambda n} \Delta \alpha_i x_i \end{align}\]
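The inner argmax depends on the loss. For the squared loss $\phi_i(a) = \frac{1}{2}(a - y_i)^2$, whose conjugate is $\phi_i^{\ast}(u) = \frac{u^2}{2} + u y_i$, a short calculation gives the closed form $\Delta \alpha_i = \frac{y_i - w^{(k)\top} x_i - \alpha_i^{(k)}}{1 + \Vert x_i \Vert^2 / (\lambda n)}$. The sketch below implements this case; the function names, hyperparameters, and synthetic data are illustrative assumptions, not from the paper.

```python
import numpy as np

def sdca(X, y, lam, epochs=50, seed=0):
    """SDCA for the squared loss phi_i(a) = (a - y_i)^2 / 2.

    Maintains the primal-dual link w = (1 / (lam * n)) * sum_i alpha_i * x_i,
    so each coordinate update costs O(d).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    sq_norms = np.einsum("ij,ij->i", X, X)    # ||x_i||^2 for every sample
    for _ in range(epochs):
        for i in rng.integers(0, n, size=n):  # pick i uniformly at random
            # closed-form argmax of the per-coordinate dual objective
            delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + sq_norms[i] / (lam * n))
            alpha[i] += delta
            w += (delta / (lam * n)) * X[i]
    return w, alpha

def duality_gap(X, y, lam, w, alpha):
    """P(w) - D(alpha) for the squared loss; nonnegative, -> 0 at the optimum."""
    primal = 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w)
    dual = np.mean(-0.5 * alpha**2 + alpha * y) - 0.5 * lam * (w @ w)
    return primal - dual

# tiny synthetic sanity check: the gap should shrink geometrically
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(200)
w, alpha = sdca(X, y, lam=0.1)
print(duality_gap(X, y, 0.1, w, alpha))       # should print a value near 0
```

Strongly Convex Case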
This section assumes each loss $\phi_i$ is $\frac{1}{\mu}$-smooth, or equivalently that its conjugate $\phi_i^{\ast}$ is $\mu$-strongly convex.
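For example (a standard pairing, stated as a sanity check), the squared loss is $1$-smooth, so $\mu = 1$, and its conjugate is indeed $1$-strongly convex:

\[\phi_i(a) = \frac{1}{2}(a - y_i)^2 \quad\Longleftrightarrow\quad \phi_i^{\ast}(u) = \frac{u^2}{2} + u y_i .\]

The hinge loss, by contrast, is not smooth, so this section's analysis does not directly cover the plain SVM; the paper treats merely Lipschitz losses separately.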
Consider the improvement in the lower bound provided by the dual problem.
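The chain of inequalities below rests on two standard facts: the quadratic strengthening of convexity enjoyed by the $\mu$-strongly convex $\phi_i^{\ast}$, and the Fenchel-Young inequality, which holds with equality exactly on subgradients,

\[\begin{align} \phi_i^{\ast}\big(s x + (1-s) y\big) &\le s \phi_i^{\ast}(x) + (1-s) \phi_i^{\ast}(y) - \frac{\mu}{2} s (1-s) \Vert x - y \Vert^2 , \quad s \in [0,1] \\ \phi_i^{\ast}(v) + \phi_i(a) &\ge v a , \quad \text{with equality iff } v \in \partial \phi_i(a) \end{align}\]With these in hand, bound the improvement from updating a single coordinate $i$, scaled by $n$ for convenience,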
\[\begin{align} n\big(D(\alpha^{(k+1)}) - D(\alpha^{(k)})\big) &= \big(-\phi_i^{\ast}(-\alpha_i^{(k+1)}) - \frac{\lambda n}{2 }\Vert w^{(k+1)} \Vert^2\big) - \big(-\phi_i^{\ast}(-\alpha_i^{(k)}) - \frac{\lambda n }{2 }\Vert w^{(k)} \Vert^2\big) \\ &= \max_{\Delta \alpha_i} \big(- \phi_i^{\ast}(-\alpha_i^{(k)} - \Delta \alpha_i) - \frac{\lambda n}{2} \Vert w^{(k)} + \frac{1}{\lambda n} \Delta \alpha_i x_i \Vert^2\big) - \big(-\phi_i^{\ast}(-\alpha_i^{(k)}) - \frac{\lambda n }{2 }\Vert w^{(k)} \Vert^2\big) \\ &\ge - \phi_i^{\ast}\big(-\alpha_i^{(k)} - s( u -\alpha_i^{(k)})\big) - \frac{\lambda n}{2} \Big\Vert w^{(k)} + \frac{s( u -\alpha_i^{(k)})}{\lambda n} x_i \Big\Vert^2 + \phi_i^{\ast}(-\alpha_i^{(k)}) + \frac{\lambda n }{2 }\Vert w^{(k)} \Vert^2 , \quad \forall u, \, s \in [0,1] \\ &= -\phi_i^{\ast}\big(-su - (1-s) \alpha_i^{(k)}\big) - \frac{\lambda n}{2} \Big\Vert w^{(k)} + \frac{s( u -\alpha_i^{(k)})}{\lambda n} x_i \Big\Vert^2 + \phi_i^{\ast}(-\alpha_i^{(k)}) + \frac{\lambda n }{2 }\Vert w^{(k)} \Vert^2 \\ &\ge - s \phi_i^{\ast} (-u) - (1-s) \phi_i^{\ast}(-\alpha_i^{(k)}) +\frac{\mu}{2} s(1-s) \Vert u - \alpha_i^{(k) } \Vert^2 \\ &\quad - s(u - \alpha_i^{(k)})\, w^{(k)\top} x_i-\frac{1}{2\lambda n} \Vert s( u -\alpha_i^{(k)}) x_i \Vert^2+ \phi_i^{\ast}(-\alpha_i^{(k)}) , \text{ by } \mu\text{-strong convexity of } \phi_i^{\ast} \\ &= s \big[\phi_i^{\ast}(-\alpha_i^{(k)}) +\alpha_i^{(k) } w^{(k)\top} x_i\big] -s \big[\phi_i^{\ast}(-u) + u\, w^{(k)\top} x_i\big] +\Big(\frac{\mu s(1-s)}{2}- \frac{s^2 \Vert x_i \Vert^2}{2 \lambda n}\Big) \Vert u - \alpha_i^{(k)} \Vert^2 \\ &= s \big[\phi_i^{\ast}(-\alpha_i^{(k)}) + \phi_i(w^{(k)\top} x_i) + \alpha_i^{(k) } w^{(k)\top} x_i\big] +\Big(\frac{\mu s(1-s)}{2}- \frac{s^2 \Vert x_i \Vert^2}{2 \lambda n}\Big) \Vert u - \alpha_i^{(k)} \Vert^2 , \text{ choosing } - u \in \partial \phi_i(w^{(k)\top} x_i) \end{align}\]The right-hand side now contains exactly the terms of the duality gap, since
\[\begin{align} P(w) - D(\alpha) &= \Big(\frac{1}{n}\sum_{i=1}^n \phi_i(w^\top x_i) + \frac{\lambda }{2} \Vert w \Vert^2\Big) - \Big(\frac{1}{n}\sum_{i=1}^n -\phi_i^{\ast}(-\alpha_i) - \frac{\lambda }{2} \Vert w \Vert^2\Big) \\ &=\frac{1}{n}\sum_{i=1}^n \big[\phi_i(w^\top x_i) + \phi_i^{\ast}(-\alpha_i)\big] + \lambda \Vert w \Vert^2 \\ &= \frac{1}{n}\sum_{i=1}^n \big[\phi_i(w^\top x_i) + \phi_i^{\ast}(-\alpha_i) + \alpha_i w^\top x_i\big] , \text{ since } \lambda \Vert w \Vert^2 = \frac{1}{n}\sum_{i=1}^n \alpha_i w^\top x_i \text{ for } w = w(\mathbf{\alpha}) \end{align}\]Taking the expectation of the per-coordinate bound over the uniform choice of $i$,
\[\begin{align} \mathbb{E} \big[D(\alpha^{(k+1)}) - D(\alpha^{(k)}) \mid \alpha^{(k)}\big] &= \frac{1}{n}\sum_{i=1}^n \big(D(\alpha^{(k)} + \Delta \alpha_i e_i) - D(\alpha^{(k)})\big) \\ &= \frac{1}{n^2 } \sum_{i=1}^n n\big(D(\alpha^{(k)} + \Delta \alpha_i e_i) - D(\alpha^{(k)})\big) \\ &\ge \frac{s}{n}\big(P(w^{(k)}) - D(\alpha^{(k)})\big) +\Big(\frac{s}{n}\Big)^2 \sum_{i=1}^n\Big(\frac{\mu (1-s)}{2s}- \frac{\Vert x_i \Vert^2}{2 \lambda n}\Big) \Vert u_i - \alpha_i^{(k)} \Vert^2 \end{align}\]where $e_i$ is the $i$-th standard basis vector and $-u_i \in \partial \phi_i(w^{(k)\top} x_i)$. From this the linear convergence rate follows:
\[\begin{align} \mathbb{E}[ D(\alpha^{(k+1)}) - D(\alpha^{(k)})] &\ge \frac{s}{n}\big(P(w^{(k)}) - D(\alpha^{(k)})\big) , \quad\text{with } s = \frac{\lambda n\mu}{\max_i \Vert x_i \Vert^2 + \lambda n\mu} \text{ so that } \frac{\mu(1-s)}{2s} = \frac{\max_i \Vert x_i \Vert^2}{2\lambda n} \ge \frac{\Vert x_i \Vert^2}{2 \lambda n} \\ &\ge \frac{s}{n}\big(D(\alpha^{\ast}) - D(\alpha^{(k)})\big) , \quad\text{since } P(w^{(k)}) \ge P(w_{\ast}) \ge D(\alpha^{\ast}) \\ \mathbb{E}[D(\alpha^{\ast}) - D(\alpha^{(k+1)}) ] &\le \Big(1- \frac{s}{n}\Big) \big(D(\alpha^{\ast}) - D(\alpha^{(k)})\big) \end{align}\]Hence, to reach an $\epsilon$-accurate solution of the dual problem, the required number of iterations is
\[\begin{align} T = \mathcal{O}\Big(\frac{n}{s} \log \frac{1}{\epsilon}\Big) = \mathcal{O}\Big(\big( n + \frac{1}{\lambda \mu}\big) \log \frac{1}{\epsilon}\Big) \end{align}\]since $\frac{n}{s} = n + \frac{\max_i \Vert x_i \Vert^2}{\lambda \mu}$ and the features are assumed normalized so that $\Vert x_i \Vert \le 1$. Moreover, from
\[\begin{align} P(w^{(k)}) - D(\alpha^{(k)}) &\le \frac{n}{s} \mathbb{E}[ D(\alpha^{(k+1)}) - D(\alpha^{(k)})] \le \frac{n}{s} \mathbb{E}[ D(\alpha^{\ast}) - D(\alpha^{(k)})] \end{align}\]reaching an $\epsilon$-accurate duality gap costs at most an extra $\log$ factor,
\[\begin{align} T = \mathcal{O}\Big(\big(n+ \frac{1}{\lambda\mu}\big) \log \frac{n}{s\epsilon}\Big) = \tilde{\mathcal{O}}\Big(\big(n + \frac{1}{\lambda \mu}\big) \log \frac{1}{\epsilon}\Big) \end{align}\]Recall that for an $L$-smooth and $c$-strongly convex objective with condition number $\kappa = \frac{L}{c}$, SVRG achieves a convergence rate of $\mathcal{O}((n + \kappa) \log \frac{1}{\epsilon})$.
Here the objective with regularization coefficient $\lambda$ is $\lambda$-strongly convex, while by assumption the losses are $\frac{1}{\mu}$-smooth, so its condition number is $\kappa = \frac{1/\mu}{\lambda} = \frac{1}{\lambda \mu}$.
Therefore SDCA and SVRG achieve the same convergence rate $\mathcal{O}((n + \kappa) \log \frac{1}{\epsilon})$ on this problem.
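As a rough numeric illustration (the numbers are chosen for concreteness and are not from the paper): with $n = 10^5$, $\mu = 1$, and the common choice $\lambda = 1/n$, the condition number is $\kappa = \frac{1}{\lambda\mu} = 10^5$, so

\[T = \mathcal{O}\Big(\big(n + \frac{1}{\lambda \mu}\big) \log \frac{1}{\epsilon}\Big) = \mathcal{O}\Big(2 \times 10^5 \log \frac{1}{\epsilon}\Big),\]

a constant number of passes over the data per digit of accuracy, compared with the roughly $\mathcal{O}(n \kappa \log \frac{1}{\epsilon}) = \mathcal{O}(10^{10} \log \frac{1}{\epsilon})$ gradient evaluations of full-batch gradient descent.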