PlayGround

\begin{table*}
\caption{Convergence rates of stochastic optimization algorithms for convex and nonconvex optimization}
\begin{tabular}{l|cc|cc} \hline
 & \multicolumn{2}{c|}{Convex Optimization} & \multicolumn{2}{c}{Nonconvex Optimization} \\
Algorithm & Constant learning rate & Diminishing learning rate & Constant learning rate & Diminishing learning rate \\ \hline
SGD \cite{sca2020} & $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right) + C}$ & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ & $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right) + C}$ & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{n}} \right)}$ \\
SGD with SPS \cite{loizou2021} & ——— & $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right) + C}$ & ——— & $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right) + C}$ \\
Minibatch SGD \cite{chen2020} & ——— & $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right) + C}$ & ——— & $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right) + C}$ \\
\hline\hline
Adam \cite{adam} & ——— & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}^{(*)}$ & ——— & ——— \\
AMSGrad \cite{reddi2018} & ——— & $\displaystyle{\mathcal{O}\left( \sqrt{\frac{1 + \ln T}{T}} \right)}$ & ——— & ——— \\
GWDC \cite{liang2020} & ——— & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ & ——— & ——— \\
AMSGWDC \cite{liang2020} & ——— & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ & ——— & ——— \\
AMSGrad \cite{chen2019} & ——— & $\displaystyle{\mathcal{O}\left( \frac{\ln T}{\sqrt{T}} \right)}$ & ——— & $\displaystyle{\mathcal{O}\left( \frac{\ln n}{\sqrt{n}} \right)}$ \\
AdaBelief \cite{adab} & ——— & $\displaystyle{\mathcal{O}\left( \frac{\ln T}{\sqrt{T}} \right)}$ & ——— & $\displaystyle{\mathcal{O}\left( \frac{\ln n}{\sqrt{n}} \right)}$ \\
Algorithm \ref{algo:1} (presented herein) & $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right)} + C_1 \alpha + C_2 \beta$ & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ & $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right)} + C_1 \alpha + C_2 \beta$ & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{n}} \right)}$ \\
\hline
\end{tabular}

Note: $C$, $C_1$, and $C_2$ are constants that are independent of the learning rates $\alpha$ and $\beta$, the number of training examples $T$, and the number of iterations $n$. The convergence rate for convex optimization is measured by the regret $R(T)/T$, and the convergence rate for nonconvex optimization is measured by the expected squared gradient norm $\min_{k\in [n]} \mathbb{E}[\|\nabla f(\bm{x}_k)\|^2]$. When constant learning rates are used, SGD \cite{sca2020} and Algorithm \ref{algo:1} can be applied not only to convex but also to nonconvex optimization. When diminishing learning rates are used, SGD \cite{sca2020} and Algorithm \ref{algo:1} achieve the best convergence rate, $\mathcal{O}(1/\sqrt{n})$. (*) Theorem 1 in \cite{reddi2018} gives a counter-example to the convergence results claimed in \cite{adam}.
\end{table*}
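
For reference, the display below sketches the standard SGD update together with the two learning-rate regimes compared in the column headings. The mini-batch notation $B_k$ and the particular diminishing schedule $\alpha_k = \alpha/\sqrt{k+1}$ are introduced here only for illustration and are not taken from the cited analyses.

\begin{align*}
\bm{x}_{k+1} &= \bm{x}_{k} - \alpha_k \nabla f_{B_k}(\bm{x}_{k}), \qquad k = 0, 1, \ldots, \\
\text{constant learning rate:} \quad \alpha_k &= \alpha > 0 \ \text{for all } k, \\
\text{diminishing learning rate (for example):} \quad \alpha_k &= \frac{\alpha}{\sqrt{k+1}},
\end{align*}
where $\nabla f_{B_k}(\bm{x}_{k})$ denotes the stochastic gradient of $f$ evaluated on the mini-batch $B_k$ used at step $k$. As the table indicates for SGD \cite{sca2020} and Algorithm \ref{algo:1}, constant learning rates yield convergence only up to an additive constant term ($C$ or $C_1 \alpha + C_2 \beta$), whereas diminishing learning rates drive the error measures to zero at the rates $\mathcal{O}(1/\sqrt{T})$ and $\mathcal{O}(1/\sqrt{n})$.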

  • Last updated: 2021/08/22 14:24 by Hideaki IIDUKA