DL0049 Weight Init

Why is “weight initialization” important in deep neural networks?

Answer

Weight initialization is crucial for stabilizing activations and gradients, enabling deep neural networks to train efficiently and converge faster without numerical instability.
(1) Prevents vanishing/exploding gradients: Proper initialization keeps activations and gradients within reasonable ranges during forward and backward passes.
(2) Ensures faster convergence: Good initialization allows the optimizer to reach a good solution more quickly.
(3) Breaks symmetry: Different initial weights ensure neurons learn unique features rather than identical outputs.
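Point (3) can be seen directly in a minimal NumPy sketch (NumPy is an assumption here, not part of the original): two neurons initialized with identical weights produce identical outputs, and since they would also receive identical gradients, they can never diverge into distinct features.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)  # one input sample with 4 features

# Constant initialization: both hidden neurons start with the same weights...
W_const = np.zeros((2, 4))
h_const = np.tanh(W_const @ x)
# ...so their outputs (and hence their gradients) are identical,
# and gradient descent updates them in lockstep forever.
print(np.allclose(h_const[0], h_const[1]))  # True

# Random initialization breaks the symmetry: the two rows differ,
# so the neurons respond differently and can specialize.
W_rand = rng.normal(0.0, 0.1, size=(2, 4))
h_rand = np.tanh(W_rand @ x)
print(np.allclose(h_rand[0], h_rand[1]))  # False
```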

Xavier Initialization is used for symmetric activations like Sigmoid or Tanh:
Xavier aims to keep the variance of activations consistent across layers by scaling the weight variance with both the fan-in and fan-out.
 W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)
or (uniform version):
 W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},  \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)
Where:
 n_{\text{in}} = number of input units
 n_{\text{out}} = number of output units
 \mathcal{N}(\mu, \sigma^2) means weights are sampled from a normal (Gaussian) distribution with mean  \mu and variance  \sigma^2 .
 \mathcal{U}(a, b) means weights are sampled from a uniform distribution in the range  [a, b] .
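The formulas above can be sketched in a few lines of NumPy (NumPy is an assumption, not from the original). The check at the end passes a unit-variance batch through 10 tanh layers: with Xavier-scaled weights the activation scale stays in a healthy range instead of collapsing toward zero.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    # Var(W) = 2 / (n_in + n_out): same variance as the uniform version,
    # since Var(U(-a, a)) = a^2 / 3 with a = sqrt(6 / (n_in + n_out)).
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def xavier_uniform(n_in, n_out, rng):
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(42)
h = rng.normal(size=(1000, 256))  # batch of unit-variance inputs

# Push the batch through 10 tanh layers; the activation standard
# deviation should remain well away from zero.
for _ in range(10):
    W = xavier_normal(256, 256, rng)
    h = np.tanh(h @ W.T)

print(f"activation std after 10 tanh layers: {h.std():.2f}")
```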

He Initialization is used for activations like ReLU or Leaky ReLU:
He initialization scales up the weight variance for ReLU: since ReLU zeroes out roughly half of its inputs, the doubled variance compensates and prevents activations from vanishing in deep networks.
 W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)
or (uniform version):
 W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}}}},  \sqrt{\frac{6}{n_{\text{in}}}}\right)
Where:
 n_{\text{in}} = number of input units

He initialization is therefore the recommended default for ReLU networks.
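A small NumPy experiment (NumPy assumed; the helper `forward_relu` is a hypothetical name) illustrates why the factor of 2 matters for ReLU: with Var(W) = 2/n_in the activation scale survives 20 layers, while Var(W) = 1/n_in (no ReLU correction) loses roughly half its variance at every layer.

```python
import numpy as np

def forward_relu(h, weight_var_numerator, depth, rng):
    """Pass h through `depth` ReLU layers with Var(W) = weight_var_numerator / n_in."""
    n = h.shape[1]
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(weight_var_numerator / n), size=(n, n))
        h = np.maximum(0.0, h @ W.T)  # ReLU
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 256))

he = forward_relu(x, 2.0, 20, rng)     # He: Var(W) = 2 / n_in
naive = forward_relu(x, 1.0, 20, rng)  # Var(W) = 1 / n_in, no ReLU correction

print(f"He init      std: {he.std():.3f}")     # stays well away from zero
print(f"1/n_in init  std: {naive.std():.6f}")  # variance shrinks ~2x per layer
```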

