| Adam \cite{adam}|---------|$\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}^{(*)}$|---------|---------|
| AMSGrad \cite{reddi2018}|---------|$\displaystyle{\mathcal{O}\left( \sqrt{\frac{1 + \ln T}{T}} \right)}$|---------|---------|
| GWDC \cite{liang2020}|---------|$\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$|---------|---------|
| AMSGWDC \cite{liang2020}|---------|$\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$|---------|---------|
| AMSGrad \cite{chen2019}|---------|$\displaystyle{\mathcal{O}\left( \frac{\ln T}{\sqrt{T}} \right)}$|---------|$\displaystyle{\mathcal{O}\left( \frac{\ln n}{\sqrt{n}} \right)}$|
| AdaBelief \cite{adab}|---------|$\displaystyle{\mathcal{O}\left( \frac{\ln T}{\sqrt{T}} \right)}$|---------|$\displaystyle{\mathcal{O}\left( \frac{\ln n}{\sqrt{n}} \right)}$|
| Proposed |$\displaystyle{\mathcal{O}\left( \frac{1}{T} \right)} + C_1 \alpha + C_2 \beta$|$\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$|$\displaystyle{\mathcal{O}\left( \frac{1}{n} \right)} + C_1 \alpha + C_2 \beta$|$\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{n}} \right)}$|
Note: $C$, $C_1$, and $C_2$ are constants independent of the learning rates $\alpha, \beta$, the number of training examples $T$, and the number of iterations $n$. The convergence rate for convex optimization is measured in terms of regret as $R(T)/T$, and the convergence rate for nonconvex optimization is measured as the minimum, over iterations, of the expected squared gradient norm $\min_{k\in [n]} \mathbb{E}[\|\nabla f(\bm{x}_k)\|^2]$. In the case of using constant learning rates, SGD \cite{sca2020} and Proposed
\end{table*}
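
For reference, the two measures mentioned in the note are, under the standard conventions, the average regret (convex case) and the minimum expected squared gradient norm (nonconvex case); the loss functions $f_t$, the iterates $\bm{x}_t, \bm{x}_k$, and the feasible set $X$ below are notation introduced here for illustration only:

$$
\frac{R(T)}{T} = \frac{1}{T}\left( \sum_{t=1}^{T} f_t(\bm{x}_t) - \min_{\bm{x} \in X} \sum_{t=1}^{T} f_t(\bm{x}) \right),
\qquad
\min_{k \in [n]} \mathbb{E}\left[ \left\| \nabla f(\bm{x}_k) \right\|^2 \right].
$$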