Tag: Basics

  • DL0050 Knowledge Distillation

    Describe the process and benefits of knowledge distillation.

    Answer

    Knowledge Distillation (KD) transfers “dark knowledge” about inter-class relationships from a large, accurate teacher model to a smaller student model. The student learns via a temperature-scaled softmax and a combined distillation plus supervised loss, enabling substantial model compression and faster inference while retaining high accuracy, provided that the teacher quality, student capacity, and hyperparameters are well chosen.

    Definition: Knowledge distillation is a process where a smaller model (student) learns to mimic the behavior of a larger, well-trained model (teacher).

    Soft Targets: The student is trained not only on hard labels (one-hot) but also on the soft output probabilities of the teacher.

    Temperature Scaling: Teacher logits are softened using a temperature T to reveal more information about class similarities:
    \mbox{Softmax}(z_i / T) = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}
    Where:
    z_i: Raw score (logit) for the i-th class.
    K: Total number of classes in the classification problem.
    T: Temperature parameter (> 0) used to soften the probabilities. A higher T produces a smoother distribution, revealing relationships between classes (“dark knowledge”).

    For a fixed set of teacher logits, raising the temperature smooths the softmax distribution over the classes.
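    A minimal NumPy sketch of this effect (the teacher logits are arbitrary example values):
    import numpy as np

    def softmax_with_temperature(logits, T):
        z = (logits - logits.max()) / T   # shift by max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    logits = np.array([8.0, 3.0, 1.0])    # arbitrary example teacher logits
    for T in (1.0, 2.0, 5.0):
        print(T, softmax_with_temperature(logits, T).round(3))
    # T=1 -> [0.992 0.007 0.001]; T=5 -> [0.619 0.228 0.153]:
    # higher T spreads probability mass across classes.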

    Loss Function: Typically combines a distillation loss (the divergence between the teacher’s and student’s soft outputs) with the standard cross-entropy loss on the true labels.
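    One common formulation, as a minimal PyTorch sketch (the values of T and alpha are hypothetical hyperparameter choices; the T*T factor compensates for the gradient scaling introduced by the temperature):
    import torch
    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Distillation term: KL divergence between temperature-softened
        # teacher and student distributions.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Supervised term: ordinary cross-entropy with the hard labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard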

    Key Benefits of KD:
    (1) Model Compression: The student is much smaller than the teacher while retaining most of its performance.
    (2) Inference Speed: The smaller student has significantly lower latency, making it suitable for edge devices and real-time applications.
    (3) Improved Generalization: The teacher’s smooth soft targets act as a form of regularization, often leading the student to generalize better than if it were trained on hard labels alone.




  • DL0049 Weight Init

    Why is “weight initialization” important in deep neural networks?

    Answer

    Weight initialization is crucial for stabilizing activations and gradients, enabling deep neural networks to train efficiently and converge faster without numerical instability.
    (1) Prevents vanishing/exploding gradients: Proper initialization keeps activations and gradients within reasonable ranges during the forward and backward passes.
    (2) Ensures faster convergence: Good initialization allows the optimizer to reach a good solution more quickly.
    (3) Breaks symmetry: Different initial weights ensure neurons learn unique features rather than identical outputs.

    Xavier Initialization is used for symmetric activations like Sigmoid or Tanh:
    Xavier aims to keep the variance of activations consistent across layers.
     W \sim \mathcal{N}\left(0, \frac{1}{n_{\text{in}} + n_{\text{out}}}\right)
    or (uniform version):
     W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},  \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)
    Where:
    n_{\text{in}} = number of input units
    n_{\text{out}} = number of output units
    \mathcal{N}(\mu, \sigma^2) means weights are sampled from a normal (Gaussian) distribution with mean \mu and variance \sigma^2.
    \mathcal{U}(a, b) means weights are sampled from a uniform distribution in the range [a, b].

    He Initialization is used for activations like ReLU or Leaky ReLU:
    He initialization scales up the variance for ReLU, which zeroes out roughly half of its inputs, preventing the activations from vanishing.
     W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)
    or (uniform version):
     W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}}}},  \sqrt{\frac{6}{n_{\text{in}}}}\right)
    Where:
     n_{\text{in}} = number of input units

    He initialization is recommended for ReLU networks.
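    Both schemes are available in PyTorch’s torch.nn.init module; a minimal sketch (the layer sizes are arbitrary):
    import torch.nn as nn

    tanh_layer = nn.Linear(256, 128)
    relu_layer = nn.Linear(256, 128)

    # Xavier/Glorot: variance ~ 1/(n_in + n_out), suited to tanh/sigmoid.
    nn.init.xavier_normal_(tanh_layer.weight)
    # He/Kaiming: variance ~ 2/n_in, compensating for ReLU zeroing half the inputs.
    nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")
    nn.init.zeros_(tanh_layer.bias)
    nn.init.zeros_(relu_layer.bias)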


  • DL0048 Adam Optimizer

    Can you explain how the Adam optimizer works?

    Answer

    The Adam (Adaptive Moment Estimation) optimizer is a powerful algorithm for training deep learning models, combining the principles of Momentum and RMSprop to compute adaptive learning rates for each parameter.
    Adam updates parameters in the following steps:
    (1) First Moment Calculation (Mean/Momentum)
    It computes an exponentially decaying average of past gradients, which is the estimate of the first moment (mean) of the gradient. This introduces a momentum effect to smooth out the updates.
    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
    Where:
    m_t is the 1st moment (mean of gradients).
    g_t is the gradient at step t.
    \beta_1 controls momentum (default: 0.9).

    (2) Second Moment Calculation (Variance)
    Adam also computes an exponentially decaying average of past squared gradients, which is the estimate of the second moment (uncentered variance) of the gradient. This provides a measure of the scale (magnitude) of the gradients.
    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
    Where:
    v_t is the 2nd moment (variance of gradients).
    \beta_2 controls smoothing of squared gradients (default: 0.999).

    (3) Bias Correction
    Since m_t and v_t are initialized as zero vectors, they are biased towards zero, especially during the initial steps. Adam applies bias correction to these estimates.
    \hat{m}_t = \frac{m_t}{1 - \beta_1^t}
    \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
    Where
    \hat{m}_t is the bias-corrected 1st moment.
    \hat{v}_t is the bias-corrected 2nd moment.
    \beta_1^t, \beta_2^t are the decay rates raised to the power t, correcting the bias from zero initialization.

    (4) Parameter Update
    The final parameter update scales the bias-corrected first moment \hat{m}_t by the learning rate \alpha and divides by the square root of the bias-corrected second moment \hat{v}_t.
    \theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
    Where:
    \theta_t are model parameters.
    \alpha is the learning rate.
    \epsilon prevents division by zero.
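    These four steps map directly onto code. Below is a minimal NumPy sketch (hyperparameters set to the common defaults) that runs Adam on a simple quadratic bowl, where it converges quickly to the minimum at the origin:
    import numpy as np

    def adam(grad_fn, theta, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
        m = np.zeros_like(theta)  # first moment
        v = np.zeros_like(theta)  # second moment
        for t in range(1, steps + 1):
            g = grad_fn(theta)
            m = beta1 * m + (1 - beta1) * g          # (1) first moment
            v = beta2 * v + (1 - beta2) * g**2       # (2) second moment
            m_hat = m / (1 - beta1**t)               # (3) bias correction
            v_hat = v / (1 - beta2**t)
            theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # (4) update
        return theta

    # Quadratic bowl: f(theta) = theta_1^2 + theta_2^2, gradient = 2 * theta.
    print(adam(lambda th: 2 * th, np.array([3.0, -2.0])))  # approaches [0, 0]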




  • ML0050 Logistic Regression III

    Why is Mean Squared Error (L2 Loss) an unsuitable loss function for logistic regression compared to cross-entropy?

    Answer

    Mean Squared Error (MSE) is unsuitable for logistic regression primarily because, when combined with the sigmoid function, it can lead to a non-convex loss landscape, making optimization harder and increasing the risk of poor convergence. Additionally, it provides weaker gradients when predictions are confidently incorrect, slowing down learning. Cross-entropy loss is better suited as it aligns with the Bernoulli distribution assumption, produces stronger gradients, and leads to a well-behaved convex loss in the single-neuron binary classification setting.

    (1) Wrong Assumption: MSE assumes a Gaussian distribution of errors, while logistic regression assumes a Bernoulli (binary) distribution.
    (2) Non-convex Optimization: MSE with sigmoid can create a non-convex loss surface, making optimization harder and less stable.
    (3) Gradient Issues: MSE leads to smaller gradients for confident wrong predictions, slowing down learning compared to cross-entropy (see the sketch after this list).
    (4) Interpretation: Cross-entropy directly compares predicted probabilities to true labels, which is more appropriate for classification.
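    A minimal NumPy sketch of point (3): at a confidently wrong prediction, the MSE gradient with respect to the logit is crushed by the sigmoid derivative, while the cross-entropy gradient remains large (the logit value is an arbitrary example):
    import numpy as np

    z, y = -5.0, 1.0                # confidently wrong: true label 1, logit -5
    p = 1 / (1 + np.exp(-z))        # sigmoid output, about 0.0067

    # d/dz of cross-entropy -[y*log(p) + (1-y)*log(1-p)] is (p - y).
    grad_ce = p - y                 # about -0.99: strong learning signal
    # d/dz of MSE (p - y)^2 picks up the sigmoid derivative p*(1 - p).
    grad_mse = 2 * (p - y) * p * (1 - p)  # about -0.013: nearly vanished

    print(grad_ce, grad_mse)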



  • ML0049 Logistic Regression II

    Please compare Logistic Regression and Neural Networks.

    Answer

    Logistic Regression is a straightforward, linear model suitable for linearly separable data and offers good interpretability. In contrast, Neural Networks are powerful, non-linear models capable of capturing intricate patterns in large datasets, often at the expense of interpretability and higher computational demands.



  • ML0048 Logistic Regression

    Can you explain logistic regression and how it contrasts with linear regression?

    Answer

    Logistic regression maps inputs to a probability space for classification, while linear regression estimates continuous outcomes through a direct linear relationship.

    The logistic regression model estimates the probability that a binary outcome (y = 1) occurs, given an input vector x:
    \Pr(y=1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x} + b)}}
    Where:
    \mathbf{x} is the input feature vector,
    \mathbf{w} is the weight vector, and
    b is the bias term.
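    A minimal NumPy sketch of this computation (the weights, bias, and input are arbitrary example values):
    import numpy as np

    w = np.array([0.8, -1.2])   # weight vector
    b = 0.3                     # bias term
    x = np.array([2.0, 1.0])    # input features

    p = 1 / (1 + np.exp(-(w @ x + b)))  # Pr(y = 1 | x)
    print(p)                            # about 0.67; classify as 1 if p >= 0.5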

    Logistic Regression vs. Linear Regression:
    Linear Regression:
    Purpose: Predicts a continuous output (e.g., price, height).
    Output: Any real number (unbounded; it can be negative or greater than 1).
    Assumes: Linearity between input features and output.

    Logistic Regression:
    Purpose: Predicts a probability for classification (e.g., spam or not).
    Output: A value between 0 and 1, produced by the sigmoid function.
    Interpreted as: Probability of class membership.



  • DL0018 NaN Values

    What are the common causes for a deep learning model to output NaN values?

    Answer

    NaN outputs in deep learning usually stem from unstable math operations, gradient issues, bad hyperparameters, or data problems. Prevent them with proper initialization, input normalization, stable activation functions, and well-tuned hyperparameters.

    Here are the common causes for a deep learning model to output NaN values:
    (1) Exploding Gradients: Gradients become excessively large during training, leading to NaN weight updates.
    (2) Numerical Instability: Operations like log(0), division by zero, or square roots of negative numbers. Without a small constant (epsilon) in its denominator, batch normalization will divide by zero on a batch with zero variance (see the sketch after this list).
    (3) Improper Learning Rate: Too high a learning rate can cause parameter updates to diverge and push model parameters to extreme values.
    (4) Incorrect Weight Initialization: Incorrectly initializing all weights to very large positive numbers can cause activations to overflow immediately.
    (5) Data Issues: Input data contains NaN or extreme values.
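    A minimal NumPy sketch of point (2), showing how log(0) and a zero-variance batch produce non-finite values, and how a small epsilon avoids both:
    import numpy as np

    # log(0) -> -inf; downstream arithmetic can turn it into NaN.
    p = np.array([0.0, 0.5, 1.0])
    print(np.log(p))            # [-inf, -0.69, 0.]
    print(np.log(p + 1e-12))    # finite everywhere with a small epsilon

    # Batch norm without epsilon: a zero-variance batch divides by zero.
    batch = np.array([2.0, 2.0, 2.0])
    print((batch - batch.mean()) / np.sqrt(batch.var()))         # [nan nan nan]
    print((batch - batch.mean()) / np.sqrt(batch.var() + 1e-5))  # [0. 0. 0.]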



  • DL0017 Reproducibility

    How do you ensure the reproducibility of deep learning experiments?

    Answer

    Reproducibility in deep learning is achieved by controlling randomness via fixed seeds and deterministic operations, maintaining strict code and dependency versioning, managing datasets carefully, and keeping comprehensive logs of all experiment settings. These practices ensure that experiments can be reliably repeated and validated, regardless of external factors.

    (1) Seed Control and Deterministic Operations:
    Set random seeds for all libraries (Python, NumPy, TensorFlow/PyTorch).
    Enable deterministic settings in your deep learning framework to reduce nondeterminism (see the sketch after this list).
    (2) Code Versioning and Configuration Management:
    Use version control systems like Git.
    Maintain detailed configuration files (using YAML or JSON) that log hyperparameters and settings for each experiment.
    (3) Environment and Dependency Control:
    Use virtual environments (e.g., Conda) or containerize your projects with Docker.
    Freeze library versions to ensure consistency in the software environment.
    (4) Dataset Management:
    Fix train-test splits and document data preprocessing steps.
    Use versioned or static datasets to prevent unintentional changes over time.
    (5) Logging and Documentation:
    Log hardware details, random seeds, and experiment configurations.
    Utilize experiment tracking tools (like MLflow or Weights & Biases) to archive training runs and parameters.
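    A minimal PyTorch sketch of point (1) (the seed value is arbitrary; determinism may cost some speed):
    import random
    import numpy as np
    import torch

    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)

    # Deterministic cuDNN kernels (may be slower than the default).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Error out if an operation has no deterministic implementation.
    torch.use_deterministic_algorithms(True)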



  • ML0047 Parameters

    What are the differences between parameters and hyperparameters?

    Answer

    Parameters are the values that a model learns from its training data, while hyperparameters are settings defined by the user that guide the training process and model architecture.

    Parameters:
    (1) Internal variables learned from data (e.g., weights and biases).
    (2) Adjusted during training using optimization algorithms.
    (3) Capture the model’s learned patterns and information.

    Hyperparameters:
    (1) External configurations set before training (e.g., learning rate, batch size, number of layers).
    (2) Remain fixed during training and are not updated by the learning process.
    (3) Influence how the model learns and its overall structure.
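    A minimal PyTorch sketch of the distinction (the model shape and values are arbitrary):
    import torch
    import torch.nn as nn

    # Hyperparameters: set by the user before training; never updated by it.
    learning_rate = 1e-3
    hidden_units = 64

    # Parameters: the weights and biases inside the model, learned from data.
    model = nn.Sequential(
        nn.Linear(10, hidden_units), nn.ReLU(), nn.Linear(hidden_units, 1)
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    print(sum(p.numel() for p in model.parameters()))  # 769 learned parameters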


  • DL0016 Learning Rate Warmup

    What is Learning Rate Warmup? What is the purpose of using Learning Rate Warmup?

    Answer

    Learning Rate Warmup is a training technique where the learning rate starts from a small value and gradually increases to a target (base) learning rate over the first few steps or epochs of training.

    Purpose of Using Learning Rate Warmup:
    (1) Stabilizes Early Training: At the beginning of training, weights are randomly initialized, making the model sensitive to large updates. A warmup gradually increases the learning rate, preventing unstable behavior.
    (2) Allows Optimizers to Adapt: Optimizers like Adam and AdamW rely on gradient statistics that can be unstable at the start. Warmup allows these optimizers to accumulate more accurate estimates before a high learning rate is used.
    (3) Enables Large Batch Training: Mitigates issues that can arise when combining a large batch size with a high initial learning rate.

    For illustration, here is a minimal sketch of linear warmup followed by cosine decay (the step counts and base learning rate are arbitrary example values):
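    import math

    def lr_at_step(step, base_lr=1e-3, warmup_steps=500, total_steps=10_000):
        if step < warmup_steps:
            # Linear warmup from 0 up to the base learning rate.
            return base_lr * step / warmup_steps
        # Cosine decay from the base learning rate down to 0.
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

    for s in (0, 250, 500, 5_250, 10_000):
        print(s, lr_at_step(s))  # 0 -> 0.0, 500 -> 1e-3, 10_000 -> 0.0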

