Adam vs. AdamW: Understanding Weight Decay and Its Impact on Model Performance

Ahmed Yassin
4 min read · Nov 8, 2024


As machine learning engineers, we’re constantly seeking ways to improve our models’ performance. Two popular optimization algorithms, Adam and AdamW, have become staples in our toolkit.

While both are extensions of SGD (Stochastic Gradient Descent) with adaptive learning rates, they handle weight decay differently, impacting regularization, generalization, and convergence stability.

In this post, we’ll dive into the mathematics behind these optimizers, explore their differences through a code example, and close with practical recommendations for when to use each.

Understanding Adam Optimizer

Adam is an optimization algorithm that adapts the learning rate for each parameter. It does this by tracking two quantities:

  • Momentum (first moment): a running average of past gradients.
  • An RMSprop-style second moment: a running average of past squared gradients.

The Math:

1. Gradient Calculation:

  • gt: Gradient of the loss function with respect to the parameters at time step t.

2. Momentum Calculation:

  • mt: This is the momentum at time step t. It’s calculated as a weighted average of the current gradient and the previous momentum.
mt = β1 * mt-1 + (1 - β1) * gt
  • β1 is a hyperparameter, typically set to 0.9. It controls how much weight we give to past gradients.

3. RMSprop Calculation:

  • vt: This is the RMSprop-style second-moment estimate at time step t. It’s calculated as a weighted average of the squared current gradient and the previous estimate:
vt = β2 * vt-1 + (1 - β2) * gt^2
  • β2 is another hyperparameter, typically set to 0.999. It controls how much weight we give to past squared gradients.

4. Bias Correction:

  • Because mt and vt are initialized to 0, the initial estimates are biased towards 0. To correct for this bias, we use:
mt_hat = mt / (1 - β1^t) 
vt_hat = vt / (1 - β2^t)

5. Updating Parameters:

  • Finally, we update the parameters using the following formula:
θt+1 = θt - η * mt_hat / (sqrt(vt_hat) + ε)
  • θt is the parameter at time step t.
  • η is the learning rate.
  • ε is a small number (usually 1e-8) to prevent division by zero.

So Adam combines the benefits of momentum and RMSprop: momentum accelerates the optimization process, while the second-moment estimate adapts each parameter's learning rate.
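To make the recipe concrete, here is a minimal NumPy sketch of a single Adam step following steps 1–5 above (the function name and interface are my own, for illustration only, not taken from any particular library):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, mirroring steps 1-5 above (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad          # momentum (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2     # RMSprop-style second moment
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: one step on a made-up gradient
theta, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
theta, m, v = adam_step(theta, np.array([0.1, -0.3]), m, v, t=1)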

The Problem with Regular Adam:

In Adam, weight decay is often added as an L2 regularization term to the loss function: L(θ) + (λ/2) * ||θ||². Because this term is differentiated together with the loss, its gradient (λ * θ) flows through Adam's adaptive learning-rate scaling, so the effective decay varies from parameter to parameter and can hinder optimal convergence.
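To see the interference concretely, here is a sketch (my own illustration, not part of the original post or its linked code) of Adam with L2 regularization folded into the gradient:

import numpy as np

def adam_l2_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=1e-2):
    """Adam with L2 regularization in the loss (illustrative sketch)."""
    g = grad + weight_decay * theta             # decay enters the gradient...
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # ...so lambda * theta is also divided by sqrt(v_hat) + eps: parameters
    # with a large gradient history (large v_hat) receive less effective decay.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v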

Understanding AdamW Optimizer

AdamW is a smarter version of Adam as it decouples weight decay from the gradient update step. Instead of adding weight decay to the loss function, it applies weight decay directly during the parameter update, leading to more consistent regularization and better generalization.

The Math:

  1. Calculations (Same as Adam):
mt = β1 * mt-1 + (1 - β1) * gt 
vt = β2 * vt-1 + (1 - β2) * gt^2
mt_hat = mt / (1 - β1^t)
vt_hat = vt / (1 - β2^t)

2. Parameter Update (AdamW):

θt+1 = θt - η * (mt_hat / (sqrt(vt_hat) + ε) + λ * θt)
  • λ is the weight decay coefficient, applied directly in the parameter update.

This formulation prevents weight decay from affecting the adaptive learning rates and allows for a more consistent regularization effect.

By separating weight decay from the gradient-based updates, AdamW achieves better generalization performance without interfering with the learning rate dynamics.
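Continuing the NumPy sketch from earlier (again, an illustrative implementation rather than library code), the only change from Adam is where the decay term enters:

import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update: decay is applied to theta directly, not via the gradient."""
    m = beta1 * m + (1 - beta1) * grad          # same moment estimates as Adam
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: lambda * theta bypasses the adaptive scaling.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v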

Key Differences Between Adam and AdamW


+================+============================================+===========================================+
| Aspect         | Adam                                       | AdamW                                     |
+================+============================================+===========================================+
| Weight Decay   | Applied via L2 regularization to the loss  | Applied directly during parameter updates |
+----------------+--------------------------------------------+-------------------------------------------+
| Impact         | Can interfere with adaptive learning rates | Decoupled, leading to consistent updates  |
+----------------+--------------------------------------------+-------------------------------------------+
| Generalization | May lead to overfitting in complex models  | Better regularization and generalization  |
+----------------+--------------------------------------------+-------------------------------------------+

Weight Decay in Adam:

In Adam, weight decay is added to the loss function itself, so the loss becomes L(θ) + (λ/2) * ||θ||².

Here, λ is the weight decay coefficient and ||θ||² is the squared L2 norm of the parameters.

This approach can interfere with the adaptive learning rates calculated by Adam. The reason is that the gradient of the weight decay term is proportional to the parameters themselves, yet it is scaled by the same adaptive term as the loss gradient. This can lead to suboptimal updates, especially when the parameters take large values.

Weight Decay in AdamW:

In AdamW, weight decay is applied directly to the parameter update step, rather than being added to the loss function. This means the parameter update equation becomes:

θt+1 = θt - η * (mt_hat / (sqrt(vt_hat) + ε) + λ * θt)

Here, the weight decay term λ * θt is added directly to the update, without affecting the adaptive learning rates. This ensures that weight decay can be applied consistently without interfering with the optimization process.
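In practice you rarely hand-roll these updates. In PyTorch, the difference comes down to which optimizer class you construct (the toy model and hyperparameters below are my own, chosen just for illustration):

import torch
import torch.nn as nn

model = nn.Linear(128, 10)   # toy model for illustration

# Adam: weight_decay is implemented as L2 regularization added to the gradient,
# so it passes through the adaptive learning-rate scaling.
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# AdamW: weight_decay is decoupled and applied directly to the parameters.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)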

Code Example (Find the code here)
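The full notebook is behind the link above; as a stand-in, here is a compact, self-contained comparison in the same spirit (the synthetic data, architecture, and hyperparameters are my own choices, not the linked code):

import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)
y = X @ torch.randn(20, 1) + 0.1 * torch.randn(512, 1)   # noisy linear target

def train(use_adamw):
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
    opt_cls = torch.optim.AdamW if use_adamw else torch.optim.Adam
    opt = opt_cls(model.parameters(), lr=1e-3, weight_decay=1e-2)
    loss_fn = nn.MSELoss()
    for _ in range(200):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

print("Adam  final loss:", train(use_adamw=False))
print("AdamW final loss:", train(use_adamw=True))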

(Figure: code visualization.)

Practical Recommendations:

  • When to Choose Adam: Use Adam for quick prototyping or simpler tasks where regularization is not crucial. It may converge faster initially, but it can generalize poorly because weight decay interferes with its adaptive learning rates.
  • When to Choose AdamW: For larger models, or when training on complex, high-dimensional data, prefer AdamW: its decoupled weight decay gives better generalization and more stable convergence.

Thanks for reading!

