Why is “weight initialization” important in deep neural networks?
Answer
Weight initialization is crucial for stabilizing activations and gradients, enabling deep neural networks to train efficiently and converge faster without numerical instability.
(1) Prevents vanishing/exploding gradients: Proper initialization keeps activations and gradients within reasonable ranges during forward and backward passes.
(2) Ensures faster convergence: Good initialization allows the optimizer to reach a good solution more quickly.
(3) Breaks symmetry: Different initial weights ensure neurons learn unique features rather than identical outputs.
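Point (3) can be seen directly in a minimal sketch (using numpy; the layer shapes are illustrative): when every weight starts at the same value, every hidden unit computes the same output and receives the same gradient, so they can never learn distinct features.

```python
import numpy as np

# Sketch: an all-zero (symmetric) initialization gives every hidden unit
# the identical activation and the identical gradient.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # 4 samples, 3 input features
W = np.zeros((3, 5))                 # symmetric init: all hidden units identical
h = np.tanh(x @ W)                   # every activation is tanh(0) = 0

# Gradient of a dummy loss (upstream gradient of ones) w.r.t. W:
g = x.T @ (np.ones_like(h) * (1 - h**2))
# Every column (one per hidden unit) is identical -> identical updates forever.
print(np.allclose(g, g[:, [0]]))
```

Random initial weights give each column of `g` a different value, which is exactly what "breaking symmetry" means.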
Xavier Initialization is used for activations like Sigmoid or Tanh:
Xavier aims to keep the variance of activations consistent across layers for symmetric activations (tanh/sigmoid):

$$W \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$$

or (uniform version):

$$W \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\ \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)$$

Where: $n_{\text{in}}$ = number of input units, $n_{\text{out}}$ = number of output units. $W \sim \mathcal{N}(0, \sigma^2)$ means weights are sampled from a normal (Gaussian) distribution with mean $0$ and variance $\sigma^2$; $W \sim \mathcal{U}(-a, a)$ means weights are sampled from a uniform distribution in the range $[-a, a]$.
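Both Xavier variants are straightforward to sketch in numpy (the function names here are illustrative, not from any particular library):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng=None):
    # Normal variant: mean 0, variance 2 / (n_in + n_out).
    rng = rng or np.random.default_rng()
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def xavier_uniform(n_in, n_out, rng=None):
    # Uniform variant: U(-limit, limit) with limit = sqrt(6 / (n_in + n_out)),
    # which has the same variance as the normal variant.
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = xavier_normal(256, 128, np.random.default_rng(0))
# Empirical std of the sampled weights matches the theoretical value closely.
print(W.std(), np.sqrt(2.0 / (256 + 128)))
```

Deep-learning frameworks ship equivalents (e.g. `torch.nn.init.xavier_normal_`), so in practice you would use those rather than rolling your own.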
He Initialization is used for activations like ReLU or Leaky ReLU:
He initialization scales up the variance for ReLU, since half of its outputs are zero, preventing vanishing activations:

$$W \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{\text{in}}}\right)$$

or (uniform version):

$$W \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{n_{\text{in}}}},\ \sqrt{\frac{6}{n_{\text{in}}}}\right)$$

Where: $n_{\text{in}}$ = number of input units.
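A small numpy sketch (layer sizes and depth chosen arbitrarily for illustration) shows why the extra factor matters: propagating activations through a stack of ReLU layers with He-initialized weights keeps the activation scale roughly constant rather than letting it collapse toward zero.

```python
import numpy as np

def he_normal(n_in, n_out, rng=None):
    # He initialization: mean 0, variance 2 / n_in.
    rng = rng or np.random.default_rng()
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
x = rng.normal(size=(512, 256))  # batch of 512 activations, 256 features

# Push activations through 10 ReLU layers with fresh He-initialized weights.
for _ in range(10):
    x = np.maximum(0.0, x @ he_normal(256, 256, rng))

# The activation scale stays on the order of the input scale instead of
# shrinking toward zero (as it would with Xavier under ReLU).
print(x.std())
```

Framework equivalents exist as well (e.g. `torch.nn.init.kaiming_normal_`).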
He initialization is recommended for ReLU networks, while Xavier is the standard choice for tanh/sigmoid networks.