DL0048 Adam Optimizer

Can you explain how the Adam optimizer works?

Answer

The Adam (Adaptive Moment Estimation) optimizer is a widely used algorithm for training deep learning models. It combines the ideas of Momentum and RMSprop to compute an adaptive learning rate for each parameter.
Adam updates parameters in the following steps:
(1) First Moment Calculation (Mean/Momentum)
It computes an exponentially decaying average of past gradients, which serves as an estimate of the first moment (mean) of the gradient. This introduces a momentum effect that smooths out the updates.
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
Where:
m_t is the 1st moment (mean of gradients).
g_t is the gradient at step t.
\beta_1 controls momentum (default: 0.9).
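As a minimal sketch, step (1) is just an exponential moving average; the gradient sequence below is illustrative, not from the text:

```python
# Exponential moving average of gradients (first moment m_t).
# beta1 = 0.9 is the default from the text; the gradients are made up.
beta1 = 0.9
m = 0.0  # first moment, initialized to zero
for g in [1.0, 1.0, 1.0]:  # pretend the gradient is constant at 1.0
    m = beta1 * m + (1 - beta1) * g
# After t steps with constant gradient g, m_t = (1 - beta1**t) * g,
# so here m = 1 - 0.9**3 = 0.271 -- visibly biased below the true mean of 1.0.
```

Note how, with a constant gradient of 1.0, the average sits well below 1.0 after a few steps; this is exactly the initialization bias that step (3) corrects.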

(2) Second Moment Calculation (Variance)
Adam also computes an exponentially decaying average of past squared gradients, which is the estimate of the second moment (uncentered variance) of the gradient. This provides a measure of the scale (magnitude) of the gradients.
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
Where:
v_t is the 2nd moment (variance of gradients).
\beta_2 controls smoothing of squared gradients (default: 0.999).
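Step (2) has the same form, applied to squared gradients; again, the gradient values are illustrative:

```python
# Exponential moving average of squared gradients (second moment v_t).
# beta2 = 0.999 is the default from the text; the gradients are made up.
beta2 = 0.999
v = 0.0  # second moment, initialized to zero
for g in [2.0, -2.0, 2.0]:  # the sign doesn't matter: only g**2 is used
    v = beta2 * v + (1 - beta2) * g ** 2
# With constant |g| = 2, v_t = (1 - beta2**t) * 4, a measure of gradient scale.
```

Because the gradient is squared, v tracks the magnitude of the gradient regardless of its sign, which is what makes the per-parameter scaling in step (4) possible.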

(3) Bias Correction
Since m_t and v_t are initialized as zero vectors, they are biased towards zero, especially during the initial steps. Adam applies bias correction to these estimates.
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
Where:
\hat{m}_t is the bias-corrected 1st moment.
\hat{v}_t is the bias-corrected 2nd moment.
\beta_1^t, \beta_2^t are the exponential decay raised to step t, correcting bias from initialization.
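A quick sketch shows why the correction works: with a constant gradient g, the raw moment is m_t = (1 - \beta_1^t) g, so dividing by (1 - \beta_1^t) recovers g. The values here are illustrative:

```python
# Bias correction for the first moment. With a constant gradient g = 5.0,
# the raw EMA m is biased toward zero, but m / (1 - beta1**t) recovers g.
beta1, g = 0.9, 5.0
m = 0.0
for t in range(1, 4):  # t = 1, 2, 3
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected estimate, ~5.0 at every step
```

The same argument applies to v_t with \beta_2; the correction matters most for small t, since 1 - \beta^t approaches 1 as t grows.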

(4) Parameter Update
The final parameter update scales the bias-corrected first moment (\hat{m}_t) by the learning rate (\alpha) and divides by the square root of the bias-corrected second moment (\hat{v}_t).
\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
Where:
\theta_t are model parameters.
\alpha is the learning rate.
\epsilon is a small constant (typically 10^{-8}) that prevents division by zero.
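Putting steps (1)-(4) together, one Adam update for a single scalar parameter can be sketched as follows; the function name and argument layout are my own, but the defaults match the values given above:

```python
import math

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta, given gradient g at step t."""
    m = beta1 * m + (1 - beta1) * g               # (1) first moment
    v = beta2 * v + (1 - beta2) * g ** 2          # (2) second moment
    m_hat = m / (1 - beta1 ** t)                  # (3) bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)  # (4) update
    return theta, m, v

# First step (t = 1) from theta = 0 with gradient g = 2:
theta, m, v = adam_step(0.0, 2.0, 0.0, 0.0, 1)
```

Note that at t = 1 the bias-corrected ratio \hat{m}_1 / \sqrt{\hat{v}_1} equals g / |g| = ±1, so the very first step has magnitude close to \alpha regardless of the gradient's scale.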

As an illustration, on a simple quadratic bowl the Adam optimizer moves efficiently from the starting point to the minimum, taking adaptive steps that quickly converge to the origin.
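That behavior can be reproduced numerically. The sketch below runs Adam on the bowl f(\theta) = \theta_1^2 + \theta_2^2 (an assumed loss, since the original plot is not reproduced here); the starting point and the larger learning rate are illustrative choices so convergence is visible in a few hundred steps:

```python
import math

# Minimize f(x, y) = x**2 + y**2 with Adam applied per coordinate.
# The gradient of f is (2x, 2y). beta1, beta2, eps are the defaults from
# the text; alpha = 0.1 is enlarged for this toy problem (assumption).
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta = [2.0, -1.5]        # illustrative starting point
m = [0.0, 0.0]
v = [0.0, 0.0]
for t in range(1, 201):
    g = [2 * theta[0], 2 * theta[1]]   # gradient of the quadratic bowl
    for i in range(2):
        m[i] = beta1 * m[i] + (1 - beta1) * g[i]
        v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
        m_hat = m[i] / (1 - beta1 ** t)
        v_hat = v[i] / (1 - beta2 ** t)
        theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
# theta ends much closer to the origin than the start (2.0, -1.5).
```

Tracking theta at each iteration would reproduce the trajectory the article plots: large, nearly constant-magnitude steps early on, then smaller adaptive steps near the minimum as the momentum term averages out.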

