PAGE

4 minute read

Published:

Paper Reading: PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization.

PAGE is a simple method leads to optimal rate in stochastic nonconvex optimization.

We are instrested in the four settings, using PAGE the SFO complexity will be

 OfflineOnline
NonConvex$\mathcal{O}(n + \sqrt{n} \epsilon^{-2} )$$\mathcal{O}(\sigma^2 \epsilon^{-2} + \sigma\epsilon^{-3} )$
Polyak-Łojasiewicz$\mathcal{O}((n + \sqrt n \kappa) \log (\epsilon^{-1} )$$\mathcal{O}( (\sigma^2 \mu^{-1} \epsilon^{-1} + \sigma \mu^{-0.5}\epsilon^{-0.5} \kappa) \log(\epsilon^{-1}) )$

We distinguish them as Off-NC, On-NC, Off-PL and On-PL settings.

PAGE using the following update

\[\begin{align*} x_{k+1} &= x_k - \eta g_k \\ g_{k+1} &= \begin{cases} \frac{1}{\vert S_1 \vert} \sum_{i \in S_1} \nabla f (x_{k+1}; \xi_i) , & \text{with prob. } p \\ g_k + \frac{1}{ \vert S_2 \vert} \sum_{i \in S_2} (\nabla f(x_{k+1}; \xi_i) - \nabla f(x_k; \xi_i)), & \text{with prob.} 1-p \end{cases} \end{align*}\]

Off-NC

The estimators gives equation (1) as

\[\begin{align*} \mathbb{E}[\Vert g_{k+1} - \nabla f(x_{k+1}) \Vert^2] &\le (1-p) \Vert g_k - \nabla f(x_k) \Vert^2 + \frac{(1-p) \eta^2 L^2}{\vert S_2\vert} \Vert g_k \Vert^2 + \mathbb{I}[\vert S_1 \vert<n] \frac{p \sigma^2}{\vert S_1 \vert}. \end{align*}\]

GD step gives equation (2) as

\[\begin{align*} \mathbb{E}[f(x_{k+1})] &\le f(x_k) + \nabla f(x_k)^\top (x_{k+1} - x_k) + \frac{L}{2} \Vert x_{k+1} - x_k \Vert^2 \\ &= f(x_k) - \eta\nabla f(x_k)^\top g_k + \frac{L \eta^2}{2} \Vert g_k \Vert^2 \\ &\le f(x_k ) - \frac{\eta}{2} \Vert \nabla f(x_k) \Vert^2 + \frac{\eta}{2} \Vert \nabla f(x_k) - g_k \Vert^2 + \left( \frac{L \eta^2}{2} - \frac{\eta}{2} \right)\Vert g_k \Vert^2 \end{align*}\]

Adding (1) with $\eta / 2p$ times to (2) yields

\(\begin{align*} \mathbb{E}[\Phi(x_{k+1})] &\le \Phi(x_k) - \frac{\eta}{2} \Vert \nabla f(x_k) \Vert^2 - \left( \frac{\eta}{2} -\frac{L \eta^2}{2} - \frac{(1-p) \eta^3 L^2}{2p\vert S_2 \vert} \right)\Vert g_k \Vert^2 + \mathbb{I}[\vert S_1 \vert<n] \frac{\eta \sigma^2}{2\vert S_1 \vert} \end{align*}\) where

\[\begin{align*} \Phi(x_k) \triangleq f(x_k) - f(x^*) + \frac{\eta}{2p} \Vert g_k - \nabla f(x_k) \Vert^2. \end{align*}\]

Set \(\begin{align*} \eta = \frac{1}{2L}, \quad \vert S_1 \vert = n,\quad \vert S_2 \vert = \sqrt{\vert S_1 \vert}, \quad p = \frac{1}{\vert S_2 \vert}, \quad K = \frac{4\Delta L}{\epsilon^2} \end{align*}\)

where $\Delta = \Phi(x_0)$.

Then PAGE can gurantee $\Vert \nabla f(x) \Vert \le \epsilon$ within a SFO complexity of $\mathcal{O}(n + \sqrt{n} \epsilon^{-2} )$.

On-NC

For online case, we let

\[\begin{align*} \eta = \frac{1}{2L}, \quad \vert S_1 \vert = \frac{\sigma^2}{\epsilon^2},\quad \vert S_2 \vert = \sqrt{\vert S_1 \vert}, \quad p = \frac{1}{\vert S_2 \vert}, \quad K = \frac{4\Delta L}{\epsilon^2} \end{align*}\]

We can gurantee $\Vert \nabla f(x) \Vert \le \epsilon$ within a SFO complexity of $\mathcal{O}(\sigma^2 \epsilon^{-2} + \sigma\epsilon^{-3} )$.

Off-PL

PL condition assumes \(\begin{align*} \Vert \nabla f(x) \Vert^2 \ge 2 \mu (f(x) - f(x^*)), \forall x. \end{align*}\) Recall the estimators gives equation (1) as

\[\begin{align*} \mathbb{E}[\Vert g_{k+1} - \nabla f(x_{k+1}) \Vert^2] &\le (1-p) \Vert g_k - \nabla f(x_k) \Vert^2 + \frac{(1-p) \eta^2 L^2}{\vert S_2\vert} \Vert g_k \Vert^2 + \mathbb{I}[\vert S_1 \vert<n] \frac{p \sigma^2}{\vert S_1 \vert}. \end{align*}\]

Using PL condition, GD step implies equation (2) as

\[\begin{align*} \mathbb{E}[f(x_{k+1}) - f(x^*)] &\le (1- \mu \eta) (f(x_k) - f(x^*))+ \frac{\eta}{2} \Vert \nabla f(x_k) - g_k \Vert^2 + \left( \frac{L \eta^2}{2} - \frac{\eta}{2} \right)\Vert g_k \Vert^2 \end{align*}\]

Assume $ \eta \le p / (2\mu)$ which will be verified later.

Adding (1) with $\eta / p$ times to (2) yields

\[\begin{align*} \mathbb{E}[\Phi(x_{k+1})] &\le (1- \mu \eta)\Phi(x_k) - \left( \frac{\eta}{2} -\frac{L \eta^2}{2} - \frac{(1-p) \eta^3 L^2}{2p\vert S_2 \vert} \right)\Vert g_k \Vert^2 + \mathbb{I}[\vert S_1 \vert<n] \frac{\eta \sigma^2}{2\vert S_1 \vert} \end{align*}\]

where

\[\begin{align*} \Phi(x_k) \triangleq f(x_k) - f(x^\ast) + \frac{\eta}{p} \Vert g_k - \nabla f(x_k) \Vert^2. \end{align*}\]

Set \(\begin{align*} \eta = \min \left\{ \frac{1}{2L}, \frac{p}{2\mu} \right\}, \quad \vert S_1 \vert = n,\quad \vert S_2 \vert = \sqrt{\vert S_1 \vert}, \quad p = \frac{1}{\vert S_2 \vert}, \quad K = \frac{1}{\mu \eta} \log \left(\frac{\Delta}{\epsilon}\right) \end{align*}\) We can gurantee $\Vert f(x) - f(x^\ast) \Vert \le \epsilon$ within a SFO complexity of $\mathcal{O}((n + \sqrt n \kappa) \log (\Delta/\epsilon) )$.

On-PL

For online case, we let

\[\begin{align*} \eta = \min \left\{\frac{1}{2L}, \frac{p}{2\mu}\right\}, \quad \vert S_1 \vert = \frac{\sigma^2}{\mu \epsilon},\quad \vert S_2 \vert = \sqrt{\vert S_1 \vert}, \quad p = \frac{1}{\vert S_2 \vert}, \quad K = \frac{4\Delta L}{\epsilon^2} \end{align*}\]

We can gurantee $\Vert f(x) - f(x^\ast) \Vert \le \epsilon$ within a SFO complexity of $\mathcal{O}( (\sigma^2 \mu^{-1} \epsilon^{-1} + \sigma \mu^{-0.5}\epsilon^{-0.5} \kappa) \log(\epsilon^{-1}) )$.