| Adam \cite{adam}|---------|$\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}^{(*)}$|---------|---------|
| AMSGrad \cite{reddi2018}|---------|$\displaystyle{\mathcal{O}\left( \sqrt{\frac{1 + \ln T}{T}} \right)}$|---------|---------|
| GWDC \cite{liang2020}|---------|$\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$|---------|---------|
| AMSGWDC \cite{liang2020}|---------|$\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$|---------|---------|
| AMSGrad \cite{chen2019}|---------|$\displaystyle{\mathcal{O}\left( \frac{\ln T}{\sqrt{T}} \right)}$|---------|$\displaystyle{\mathcal{O}\left( \frac{\ln n}{\sqrt{n}} \right)}$|
| AdaBelief \cite{adab}|---------|$\displaystyle{\mathcal{O}\left( \frac{\ln T}{\sqrt{T}} \right)}$|---------|$\displaystyle{\mathcal{O}\left( \frac{\ln n}{\sqrt{n}} \right)}$|
| Proposed |$\displaystyle{\mathcal{O}\left( \frac{1}{T} \right)} + C_1 \alpha + C_2 \beta$|$\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{T}} \right)}$|$\displaystyle{\mathcal{O}\left( \frac{1}{n} \right)} + C_1 \alpha + C_2 \beta$|$\displaystyle{\mathcal{O}\left( \frac{1}{\sqrt{n}} \right)}$|
Note: $C$, $C_1$, and $C_2$ are constants independent of the learning rates $\alpha, \beta$, the number of training examples $T$, and the number of iterations $n$. The convergence rate for convex optimization is measured in terms of regret as $R(T)/T$, and the convergence rate for nonconvex optimization is measured as the minimum, over iterations, of the expected squared gradient norm $\min_{k\in [n]} \mathbb{E}[\|\nabla f(\bm{x}_k)\|^2]$. In the case of using constant learning rates, SGD \cite{sca2020} and Proposed
\end{table*}
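
For reference, the two measures mentioned in the note are, under the standard conventions, the average regret (convex case) and the minimum expected squared gradient norm (nonconvex case); the loss functions $f_t$, the iterates $\bm{x}_t, \bm{x}_k$, and the feasible set $X$ below are notation introduced here for illustration only:

$$
\frac{R(T)}{T} = \frac{1}{T}\left( \sum_{t=1}^{T} f_t(\bm{x}_t) - \min_{\bm{x} \in X} \sum_{t=1}^{T} f_t(\bm{x}) \right),
\qquad
\min_{k \in [n]} \mathbb{E}\left[ \left\| \nabla f(\bm{x}_k) \right\|^2 \right].
$$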