
PlayGround

\begin{table*}
\caption{Convergence rates of stochastic optimization algorithms for convex and nonconvex optimization}
\centering
\begin{tabular}{lcccc}
\hline
Algorithms & \multicolumn{2}{c}{Convex Optimization} & \multicolumn{2}{c}{Nonconvex Optimization} \\
 & Constant learning rate & Diminishing learning rate & Constant learning rate & Diminishing learning rate \\
\hline
SGD \cite{sca2020} & $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right) + C}$ & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ & $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right) + C}$ & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{n}} \right)}$ \\
SGD with SPS \cite{loizou2021} & --- & $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right) + C}$ & --- & $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right) + C}$ \\
Minibatch SGD \cite{chen2020} & --- & $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right) + C}$ & --- & $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right) + C}$ \\
Adam \cite{adam} & --- & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}^{(*)}$ & --- & --- \\
AMSGrad \cite{reddi2018} & --- & $\displaystyle{\mathcal{O}\left( \sqrt{\frac{1 + \ln T}{T}} \right)}$ & --- & --- \\
GWDC \cite{liang2020} & --- & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ & --- & --- \\
AMSGWDC \cite{liang2020} & --- & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ & --- & --- \\
AMSGrad \cite{chen2019} & --- & $\displaystyle{\mathcal{O}\left( \frac{\ln T}{\sqrt{T}} \right)}$ & --- & $\displaystyle{\mathcal{O}\left( \frac{\ln n}{\sqrt{n}} \right)}$ \\
AdaBelief \cite{adab} & --- & $\displaystyle{\mathcal{O}\left( \frac{\ln T}{\sqrt{T}} \right)}$ & --- & $\displaystyle{\mathcal{O}\left( \frac{\ln n}{\sqrt{n}} \right)}$ \\
Proposed & $\displaystyle{\mathcal{O}\left( \frac{1}{T} \right)} + C_1 \alpha + C_2 \beta$ & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$ & $\displaystyle{\mathcal{O}\left( \frac{1}{n} \right)} + C_1 \alpha + C_2 \beta$ & $\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{n}} \right)}$ \\
\hline
\end{tabular}

Note: $C$, $C_1$, and $C_2$ are constants independent of the learning rates $\alpha, \beta$, the number of training examples $T$, and the number of iterations $n$. The convergence rate for convex optimization is measured in terms of the average regret $R(T)/T$, and the convergence rate for nonconvex optimization is measured as the minimum expected squared gradient norm $\min_{k\in [n]} \mathbb{E}[\|\nabla f(x_k)\|^2]$. With constant learning rates, SGD \cite{sca2020} and Proposed can be applied not only to convex but also to nonconvex optimization. With diminishing learning rates, SGD \cite{sca2020} and Proposed achieve the best convergence rates, $\mathcal{O}(1/\sqrt{T})$ (convex) and $\mathcal{O}(1/\sqrt{n})$ (nonconvex). (*) Theorem 1 in \cite{reddi2018} provides a counterexample to the convergence results in \cite{adam}. \end{table*}
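
For reference, the two performance measures quoted in the note follow the usual conventions of online/stochastic optimization. The sketch below states them explicitly under standard assumptions; the symbols $x_t$ (resp. $x_k$) for the iterate at step $t$ (resp. $k$), $f_t$ for the loss observed at round $t$, and $X$ for the feasible set are not defined above, and the precise definitions in the cited papers may differ in details:
\[
\frac{R(T)}{T} = \frac{1}{T}\left( \sum_{t=1}^{T} f_t(x_t) - \min_{x \in X} \sum_{t=1}^{T} f_t(x) \right),
\qquad
\min_{k \in [n]} \mathbb{E}\left[ \| \nabla f(x_k) \|^2 \right].
\]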

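The distinction between the constant and diminishing learning-rate columns can already be seen in plain SGD. The following is a minimal, hypothetical Python sketch on a toy quadratic objective with a noisy gradient oracle (it is not the Proposed algorithm or any implementation from the cited papers): with a constant rate the iterates only settle into a noise-dominated neighborhood, matching the $+C$ terms, whereas a rate of $\alpha/\sqrt{t+1}$ damps the noise over time.

<code python>
# Hypothetical illustration: constant vs. diminishing learning rates in plain SGD
# on f(x) = x^2 / 2 with a noisy gradient oracle.
import math
import random

def stochastic_gradient(x):
    # Gradient of f(x) = x^2 / 2 plus Gaussian noise simulating a stochastic oracle.
    return x + random.gauss(0.0, 0.1)

def sgd(schedule, steps=1000, alpha=0.1, x0=5.0):
    x = x0
    for t in range(steps):
        # Constant rate alpha, or diminishing rate alpha / sqrt(t + 1).
        lr = alpha if schedule == "constant" else alpha / math.sqrt(t + 1)
        x -= lr * stochastic_gradient(x)
    return x

print("constant    :", sgd("constant"))     # stays in a noise-level neighborhood of 0 (the "+C" term)
print("diminishing :", sgd("diminishing"))  # noise is damped as the learning rate shrinks
</code>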