DL0048 Adam Optimizer

Can you explain how the Adam optimizer works?

Answer

The Adam (Adaptive Moment Estimation) optimizer is a widely used algorithm for training deep learning models. It combines the ideas of Momentum and RMSprop to compute an adaptive learning rate for each parameter.
Adam updates parameters in the following steps:
(1) First Moment Calculation (Mean/Momentum)
It computes an exponentially decaying average of past gradients, which serves as an estimate of the first moment (mean) of the gradient. This introduces a momentum effect that smooths out the updates.
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
Where:
m_t is the 1st moment (mean of gradients).
g_t is the gradient at step t.
\beta_1 controls momentum (default: 0.9).
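As a minimal sketch, step (1) is just an exponential moving average; the gradient sequence below is illustrative, not from the text:

```python
# Exponential moving average of gradients (first moment m_t).
# beta1 = 0.9 is the default from the text; the gradients are made up.
beta1 = 0.9
m = 0.0  # first moment, initialized to zero
for g in [1.0, 1.0, 1.0]:  # pretend the gradient is constant at 1.0
    m = beta1 * m + (1 - beta1) * g
# After t steps with constant gradient g, m_t = (1 - beta1**t) * g,
# so here m = 1 - 0.9**3 = 0.271 -- visibly biased below the true mean of 1.0.
```

Note how, with a constant gradient of 1.0, the average sits well below 1.0 after a few steps; this is exactly the initialization bias that step (3) corrects.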

(2) Second Moment Calculation (Variance)
Adam also computes an exponentially decaying average of past squared gradients, which is the estimate of the second moment (uncentered variance) of the gradient. This provides a measure of the scale (magnitude) of the gradients.
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
Where:
v_t is the 2nd moment (variance of gradients).
\beta_2 controls smoothing of squared gradients (default: 0.999).
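Step (2) has the same form, applied to squared gradients; again, the gradient values are illustrative:

```python
# Exponential moving average of squared gradients (second moment v_t).
# beta2 = 0.999 is the default from the text; the gradients are made up.
beta2 = 0.999
v = 0.0  # second moment, initialized to zero
for g in [2.0, -2.0, 2.0]:  # the sign doesn't matter: only g**2 is used
    v = beta2 * v + (1 - beta2) * g ** 2
# With constant |g| = 2, v_t = (1 - beta2**t) * 4, a measure of gradient scale.
```

Because the gradient is squared, v tracks the magnitude of the gradient regardless of its sign, which is what makes the per-parameter scaling in step (4) possible.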

(3) Bias Correction
Since m_t and v_t are initialized as zero vectors, they are biased towards zero, especially during the initial steps. Adam applies bias correction to these estimates.
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
Where:
\hat{m}_t is the bias-corrected 1st moment.
\hat{v}_t is the bias-corrected 2nd moment.
\beta_1^t, \beta_2^t are the exponential decay raised to step t, correcting bias from initialization.
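A quick sketch shows why the correction works: with a constant gradient g, the raw moment is m_t = (1 - \beta_1^t) g, so dividing by (1 - \beta_1^t) recovers g. The values here are illustrative:

```python
# Bias correction for the first moment. With a constant gradient g = 5.0,
# the raw EMA m is biased toward zero, but m / (1 - beta1**t) recovers g.
beta1, g = 0.9, 5.0
m = 0.0
for t in range(1, 4):  # t = 1, 2, 3
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected estimate, ~5.0 at every step
```

The same argument applies to v_t with \beta_2; the correction matters most for small t, since 1 - \beta^t approaches 1 as t grows.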

(4) Parameter Update
The final parameter update scales the bias-corrected first moment (\hat{m}_t) by the learning rate (\alpha) and divides by the square root of the bias-corrected second moment (\hat{v}_t).
\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
Where:
\theta_t are model parameters.
\alpha is the learning rate.
\epsilon is a small constant (typically 10^{-8}) that prevents division by zero.
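Putting steps (1)-(4) together, one Adam update for a single scalar parameter can be sketched as follows; the function name and argument layout are my own, but the defaults match the values given above:

```python
import math

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta, given gradient g at step t."""
    m = beta1 * m + (1 - beta1) * g               # (1) first moment
    v = beta2 * v + (1 - beta2) * g ** 2          # (2) second moment
    m_hat = m / (1 - beta1 ** t)                  # (3) bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)  # (4) update
    return theta, m, v

# First step (t = 1) from theta = 0 with gradient g = 2:
theta, m, v = adam_step(0.0, 2.0, 0.0, 0.0, 1)
```

Note that at t = 1 the bias-corrected ratio \hat{m}_1 / \sqrt{\hat{v}_1} equals g / |g| = ±1, so the very first step has magnitude close to \alpha regardless of the gradient's scale.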

As an illustration, on a simple quadratic bowl the Adam optimizer moves efficiently from the starting point to the minimum, taking adaptive steps that quickly converge to the origin.
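That behavior can be reproduced numerically. The sketch below runs Adam on the bowl f(\theta) = \theta_1^2 + \theta_2^2 (an assumed loss, since the original plot is not reproduced here); the starting point and the larger learning rate are illustrative choices so convergence is visible in a few hundred steps:

```python
import math

# Minimize f(x, y) = x**2 + y**2 with Adam applied per coordinate.
# The gradient of f is (2x, 2y). beta1, beta2, eps are the defaults from
# the text; alpha = 0.1 is enlarged for this toy problem (assumption).
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta = [2.0, -1.5]        # illustrative starting point
m = [0.0, 0.0]
v = [0.0, 0.0]
for t in range(1, 201):
    g = [2 * theta[0], 2 * theta[1]]   # gradient of the quadratic bowl
    for i in range(2):
        m[i] = beta1 * m[i] + (1 - beta1) * g[i]
        v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
        m_hat = m[i] / (1 - beta1 ** t)
        v_hat = v[i] / (1 - beta2 ** t)
        theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
# theta ends much closer to the origin than the start (2.0, -1.5).
```

Tracking theta at each iteration would reproduce the trajectory the article plots: large, nearly constant-magnitude steps early on, then smaller adaptive steps near the minimum as the momentum term averages out.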

