DL0049 Weight Init

Why is “weight initialization” important in deep neural networks?

Answer

Weight initialization is crucial for stabilizing activations and gradients, enabling deep neural networks to train efficiently and converge faster without numerical instability.
(1) Prevents vanishing/exploding gradients: Proper initialization keeps activations and gradients within reasonable ranges during forward and backward passes.
(2) Ensures faster convergence: Good initialization allows the optimizer to reach a good solution more quickly.
(3) Breaks symmetry: Different initial weights ensure neurons learn unique features rather than identical outputs.
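Point (3) can be seen directly in a minimal NumPy sketch (NumPy is an assumption here, not part of the original): two neurons initialized with identical weights produce identical outputs, and since they would also receive identical gradients, they can never diverge into distinct features.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)  # one input sample with 4 features

# Constant initialization: both hidden neurons start with the same weights...
W_const = np.zeros((2, 4))
h_const = np.tanh(W_const @ x)
# ...so their outputs (and hence their gradients) are identical,
# and gradient descent updates them in lockstep forever.
print(np.allclose(h_const[0], h_const[1]))  # True

# Random initialization breaks the symmetry: the two rows differ,
# so the neurons respond differently and can specialize.
W_rand = rng.normal(0.0, 0.1, size=(2, 4))
h_rand = np.tanh(W_rand @ x)
print(np.allclose(h_rand[0], h_rand[1]))  # False
```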

Xavier Initialization is used for symmetric activations like Sigmoid or Tanh:
Xavier aims to keep the variance of activations consistent across layers by scaling the weight variance with both the fan-in and fan-out.
 W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)
or (uniform version):
 W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},  \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)
Where:
 n_{\text{in}} = number of input units
 n_{\text{out}} = number of output units
 \mathcal{N}(\mu, \sigma^2) means weights are sampled from a normal (Gaussian) distribution with mean  \mu and variance  \sigma^2 .
 \mathcal{U}(a, b) means weights are sampled from a uniform distribution in the range  [a, b] .
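The formulas above can be sketched in a few lines of NumPy (NumPy is an assumption, not from the original). The check at the end passes a unit-variance batch through 10 tanh layers: with Xavier-scaled weights the activation scale stays in a healthy range instead of collapsing toward zero.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    # Var(W) = 2 / (n_in + n_out): same variance as the uniform version,
    # since Var(U(-a, a)) = a^2 / 3 with a = sqrt(6 / (n_in + n_out)).
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def xavier_uniform(n_in, n_out, rng):
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(42)
h = rng.normal(size=(1000, 256))  # batch of unit-variance inputs

# Push the batch through 10 tanh layers; the activation standard
# deviation should remain well away from zero.
for _ in range(10):
    W = xavier_normal(256, 256, rng)
    h = np.tanh(h @ W.T)

print(f"activation std after 10 tanh layers: {h.std():.2f}")
```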

He Initialization is used for activations like ReLU or Leaky ReLU:
He initialization scales up the weight variance for ReLU: since ReLU zeroes out roughly half of its inputs, the doubled variance compensates and prevents activations from vanishing in deep networks.
 W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)
or (uniform version):
 W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}}}},  \sqrt{\frac{6}{n_{\text{in}}}}\right)
Where:
 n_{\text{in}} = number of input units

He initialization is therefore the recommended default for ReLU networks.
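A small NumPy experiment (NumPy assumed; the helper `forward_relu` is a hypothetical name) illustrates why the factor of 2 matters for ReLU: with Var(W) = 2/n_in the activation scale survives 20 layers, while Var(W) = 1/n_in (no ReLU correction) loses roughly half its variance at every layer.

```python
import numpy as np

def forward_relu(h, weight_var_numerator, depth, rng):
    """Pass h through `depth` ReLU layers with Var(W) = weight_var_numerator / n_in."""
    n = h.shape[1]
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(weight_var_numerator / n), size=(n, n))
        h = np.maximum(0.0, h @ W.T)  # ReLU
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 256))

he = forward_relu(x, 2.0, 20, rng)     # He: Var(W) = 2 / n_in
naive = forward_relu(x, 1.0, 20, rng)  # Var(W) = 1 / n_in, no ReLU correction

print(f"He init      std: {he.std():.3f}")     # stays well away from zero
print(f"1/n_in init  std: {naive.std():.6f}")  # variance shrinks ~2x per layer
```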

