
PlayGround

\begin{table*}
\caption{Convergence rates of stochastic optimization algorithms for convex and nonconvex optimization}
\centering
\begin{tabular}{lcccc}
\hline
Algorithms & \multicolumn{2}{c}{Convex Optimization} & \multicolumn{2}{c}{Nonconvex Optimization} \\
 & Constant learning rate & Diminishing learning rate & Constant learning rate & Diminishing learning rate \\
\hline
SGD \cite{sca2020} & $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right) + C}$ & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ & $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right) + C}$ & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{n}} \right)}$ \\
SGD with SPS \cite{loizou2021} & --- & $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right) + C}$ & --- & $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right) + C}$ \\
Minibatch SGD \cite{chen2020} & --- & $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right) + C}$ & --- & $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right) + C}$ \\
Adam \cite{adam} & --- & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}^{(*)}$ & --- & --- \\
AMSGrad \cite{reddi2018} & --- & $\displaystyle{\mathcal{O}\left( \sqrt{\frac{1 + \ln T}{T}} \right)}$ & --- & --- \\
GWDC \cite{liang2020} & --- & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ & --- & --- \\
AMSGWDC \cite{liang2020} & --- & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ & --- & --- \\
AMSGrad \cite{chen2019} & --- & $\displaystyle{\mathcal{O}\left( \frac{\ln T}{\sqrt{T}} \right)}$ & --- & $\displaystyle{\mathcal{O}\left( \frac{\ln n}{\sqrt{n}} \right)}$ \\
AdaBelief \cite{adab} & --- & $\displaystyle{\mathcal{O}\left( \frac{\ln T}{\sqrt{T}} \right)}$ & --- & $\displaystyle{\mathcal{O}\left( \frac{\ln n}{\sqrt{n}} \right)}$ \\
Proposed & $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right)} + C_1 \alpha + C_2 \beta$ & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ & $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right)} + C_1 \alpha + C_2 \beta$ & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{n}} \right)}$ \\
\hline
\end{tabular}

Note: $C$, $C_1$, and $C_2$ are constants independent of the learning rates $\alpha, \beta$, the number of training examples $T$, and the number of iterations $n$. The convergence rate for convex optimization is measured in terms of the average regret $R(T)/T$, and the convergence rate for nonconvex optimization is measured as the minimum expected squared gradient norm $\min_{k\in [n]} \mathbb{E}[\|\nabla f(x_k)\|^2]$. With constant learning rates, SGD \cite{sca2020} and Proposed can be applied not only to convex but also to nonconvex optimization. With diminishing learning rates, SGD \cite{sca2020} and Proposed achieve the best convergence rates, $\mathcal{O}(1/\sqrt{T})$ (convex) and $\mathcal{O}(1/\sqrt{n})$ (nonconvex). (*) Theorem 1 in \cite{reddi2018} provides a counterexample to the convergence results in \cite{adam}. \end{table*}
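
For reference, the two performance measures quoted in the note follow the usual conventions of online/stochastic optimization. The sketch below states them explicitly under standard assumptions; the symbols $x_t$ (resp. $x_k$) for the iterate at step $t$ (resp. $k$), $f_t$ for the loss observed at round $t$, and $X$ for the feasible set are not defined above, and the precise definitions in the cited papers may differ in details:
\[
\frac{R(T)}{T} = \frac{1}{T}\left( \sum_{t=1}^{T} f_t(x_t) - \min_{x \in X} \sum_{t=1}^{T} f_t(x) \right),
\qquad
\min_{k \in [n]} \mathbb{E}\left[ \| \nabla f(x_k) \|^2 \right].
\]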

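The distinction between the constant and diminishing learning-rate columns can already be seen in plain SGD. The following is a minimal, hypothetical Python sketch on a toy quadratic objective with a noisy gradient oracle (it is not the Proposed algorithm or any implementation from the cited papers): with a constant rate the iterates only settle into a noise-dominated neighborhood, matching the $+C$ terms, whereas a rate of $\alpha/\sqrt{t+1}$ damps the noise over time.

<code python>
# Hypothetical illustration: constant vs. diminishing learning rates in plain SGD
# on f(x) = x^2 / 2 with a noisy gradient oracle.
import math
import random

def stochastic_gradient(x):
    # Gradient of f(x) = x^2 / 2 plus Gaussian noise simulating a stochastic oracle.
    return x + random.gauss(0.0, 0.1)

def sgd(schedule, steps=1000, alpha=0.1, x0=5.0):
    x = x0
    for t in range(steps):
        # Constant rate alpha, or diminishing rate alpha / sqrt(t + 1).
        lr = alpha if schedule == "constant" else alpha / math.sqrt(t + 1)
        x -= lr * stochastic_gradient(x)
    return x

print("constant    :", sgd("constant"))     # stays in a noise-level neighborhood of 0 (the "+C" term)
print("diminishing :", sgd("diminishing"))  # noise is damped as the learning rate shrinks
</code>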