Tag: Basics

  • DL0050 Knowledge Distillation

    Describe the process and benefits of knowledge distillation.

    Answer

    Knowledge Distillation (KD) transfers “dark knowledge” about inter-class relationships from a large, accurate teacher model to a smaller student model. The student learns via a temperature-scaled softmax and a combined distillation plus supervised loss, enabling substantial model compression and faster inference while retaining high accuracy, provided that the teacher quality, student capacity, and hyperparameters are well chosen.

    Definition: Knowledge distillation is a process where a smaller model (student) learns to mimic the behavior of a larger, well-trained model (teacher).

    Soft Targets: The student is trained not only on hard labels (one-hot) but also on the soft output probabilities of the teacher.

    Temperature Scaling: Teacher logits are softened using a temperature T to reveal more information about class similarities:
    \mbox{Softmax}(z_i / T) = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}
    Where:
    z_i: Raw score (logit) for the i-th class.
    K: Total number of classes in the classification problem.
    T: Temperature parameter (> 0) used to soften the probabilities. A higher T produces a smoother distribution, revealing relationships between classes (“dark knowledge”).

    For a fixed set of teacher logits, raising the temperature smooths the softmax distribution over the classes.
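    A minimal NumPy sketch of this effect (the teacher logits are arbitrary example values):
    import numpy as np

    def softmax_with_temperature(logits, T):
        z = (logits - logits.max()) / T   # shift by max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    logits = np.array([8.0, 3.0, 1.0])    # arbitrary example teacher logits
    for T in (1.0, 2.0, 5.0):
        print(T, softmax_with_temperature(logits, T).round(3))
    # T=1 -> [0.992 0.007 0.001]; T=5 -> [0.619 0.228 0.153]:
    # higher T spreads probability mass across classes.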

    Loss Function: Typically combines a distillation loss (the divergence between the teacher’s and student’s soft outputs) with the standard cross-entropy loss on the true labels.
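    One common formulation, as a minimal PyTorch sketch (the values of T and alpha are hypothetical hyperparameter choices; the T*T factor compensates for the gradient scaling introduced by the temperature):
    import torch
    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Distillation term: KL divergence between temperature-softened
        # teacher and student distributions.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Supervised term: ordinary cross-entropy with the hard labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard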

    Key Benefits of KD:
    (1) Model Compression: The student is much smaller than the teacher while retaining most of its performance.
    (2) Inference Speed: The smaller student has significantly lower latency, making it suitable for edge devices and real-time applications.
    (3) Improved Generalization: The teacher’s smooth soft targets act as a form of regularization, often leading the student to generalize better than if it were trained on hard labels alone.




  • DL0049 Weight Init

    Why is “weight initialization” important in deep neural networks?

    Answer

    Weight initialization is crucial for stabilizing activations and gradients, enabling deep neural networks to train efficiently and converge faster without numerical instability.
    (1) Prevents vanishing/exploding gradients: Proper initialization keeps activations and gradients within reasonable ranges during the forward and backward passes.
    (2) Ensures faster convergence: Good initialization allows the optimizer to reach a good solution more quickly.
    (3) Breaks symmetry: Different initial weights ensure neurons learn unique features rather than identical outputs.

    Xavier Initialization is used for symmetric activations like Sigmoid or Tanh:
    Xavier aims to keep the variance of activations consistent across layers.
     W \sim \mathcal{N}\left(0, \frac{1}{n_{\text{in}} + n_{\text{out}}}\right)
    or (uniform version):
     W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},  \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)
    Where:
    n_{\text{in}} = number of input units
    n_{\text{out}} = number of output units
    \mathcal{N}(\mu, \sigma^2) means weights are sampled from a normal (Gaussian) distribution with mean \mu and variance \sigma^2.
    \mathcal{U}(a, b) means weights are sampled from a uniform distribution in the range [a, b].

    He Initialization is used for activations like ReLU or Leaky ReLU:
    He initialization scales up the variance for ReLU, which zeroes out roughly half of its inputs, preventing the activations from vanishing.
     W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)
    or (uniform version):
     W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}}}},  \sqrt{\frac{6}{n_{\text{in}}}}\right)
    Where:
     n_{\text{in}} = number of input units

    He initialization is recommended for ReLU networks.
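    Both schemes are available in PyTorch’s torch.nn.init module; a minimal sketch (the layer sizes are arbitrary):
    import torch.nn as nn

    tanh_layer = nn.Linear(256, 128)
    relu_layer = nn.Linear(256, 128)

    # Xavier/Glorot: variance ~ 1/(n_in + n_out), suited to tanh/sigmoid.
    nn.init.xavier_normal_(tanh_layer.weight)
    # He/Kaiming: variance ~ 2/n_in, compensating for ReLU zeroing half the inputs.
    nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")
    nn.init.zeros_(tanh_layer.bias)
    nn.init.zeros_(relu_layer.bias)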


  • DL0048 Adam Optimizer

    Can you explain how the Adam optimizer works?

    Answer

    The Adam (Adaptive Moment Estimation) optimizer is a powerful algorithm for training deep learning models, combining the principles of Momentum and RMSprop to compute adaptive learning rates for each parameter.
    Adam updates parameters in the following steps:
    (1) First Moment Calculation (Mean/Momentum)
    It computes an exponentially decaying average of past gradients, which is the estimate of the first moment (mean) of the gradient. This introduces a momentum effect to smooth out the updates.
    m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
    Where:
    m_t is the 1st moment (mean of gradients).
    g_t is the gradient at step t.
    \beta_1 controls momentum (default: 0.9).

    (2) Second Moment Calculation (Variance)
    Adam also computes an exponentially decaying average of past squared gradients, which is the estimate of the second moment (uncentered variance) of the gradient. This provides a measure of the scale (magnitude) of the gradients.
    v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
    Where:
    v_t is the 2nd moment (variance of gradients).
    \beta_2 controls smoothing of squared gradients (default: 0.999).

    (3) Bias Correction
    Since m_t and v_t are initialized as zero vectors, they are biased towards zero, especially during the initial steps. Adam applies bias correction to these estimates.
    \hat{m}_t = \frac{m_t}{1 - \beta_1^t}
    \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
    Where
    \hat{m}_t is the bias-corrected 1st moment.
    \hat{v}_t is the bias-corrected 2nd moment.
    \beta_1^t, \beta_2^t are the decay rates raised to the power t, correcting the bias from zero initialization.

    (4) Parameter Update
    The final parameter update scales the bias-corrected first moment \hat{m}_t by the learning rate \alpha and divides by the square root of the bias-corrected second moment \hat{v}_t.
    \theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
    Where:
    \theta_t are model parameters.
    \alpha is the learning rate.
    \epsilon prevents division by zero.
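    These four steps map directly onto code. Below is a minimal NumPy sketch (hyperparameters set to the common defaults) that runs Adam on a simple quadratic bowl, where it converges quickly to the minimum at the origin:
    import numpy as np

    def adam(grad_fn, theta, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
        m = np.zeros_like(theta)  # first moment
        v = np.zeros_like(theta)  # second moment
        for t in range(1, steps + 1):
            g = grad_fn(theta)
            m = beta1 * m + (1 - beta1) * g          # (1) first moment
            v = beta2 * v + (1 - beta2) * g**2       # (2) second moment
            m_hat = m / (1 - beta1**t)               # (3) bias correction
            v_hat = v / (1 - beta2**t)
            theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # (4) update
        return theta

    # Quadratic bowl: f(theta) = theta_1^2 + theta_2^2, gradient = 2 * theta.
    print(adam(lambda th: 2 * th, np.array([3.0, -2.0])))  # approaches [0, 0]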




  • ML0050 Logistic Regression III

    Why is Mean Squared Error (L2 Loss) an unsuitable loss function for logistic regression compared to cross-entropy?

    Answer

    Mean Squared Error (MSE) is unsuitable for logistic regression primarily because, when combined with the sigmoid function, it can lead to a non-convex loss landscape, making optimization harder and increasing the risk of poor convergence. Additionally, it provides weaker gradients when predictions are confidently incorrect, slowing down learning. Cross-entropy loss is better suited as it aligns with the Bernoulli distribution assumption, produces stronger gradients, and leads to a well-behaved convex loss in the single-neuron binary classification setting.

    (1) Wrong Assumption: MSE assumes a Gaussian distribution of errors, while logistic regression assumes a Bernoulli (binary) distribution.
    (2) Non-convex Optimization: MSE with sigmoid can create a non-convex loss surface, making optimization harder and less stable.
    (3) Gradient Issues: MSE leads to smaller gradients for confident wrong predictions, slowing down learning compared to cross-entropy (see the sketch after this list).
    (4) Interpretation: Cross-entropy directly compares predicted probabilities to true labels, which is more appropriate for classification.
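    A minimal NumPy sketch of point (3): at a confidently wrong prediction, the MSE gradient with respect to the logit is crushed by the sigmoid derivative, while the cross-entropy gradient remains large (the logit value is an arbitrary example):
    import numpy as np

    z, y = -5.0, 1.0                # confidently wrong: true label 1, logit -5
    p = 1 / (1 + np.exp(-z))        # sigmoid output, about 0.0067

    # d/dz of cross-entropy -[y*log(p) + (1-y)*log(1-p)] is (p - y).
    grad_ce = p - y                 # about -0.99: strong learning signal
    # d/dz of MSE (p - y)^2 picks up the sigmoid derivative p*(1 - p).
    grad_mse = 2 * (p - y) * p * (1 - p)  # about -0.013: nearly vanished

    print(grad_ce, grad_mse)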



  • ML0049 Logistic Regression II

    Please compare Logistic Regression and Neural Networks.

    Answer

    Logistic Regression is a straightforward, linear model suitable for linearly separable data and offers good interpretability. In contrast, Neural Networks are powerful, non-linear models capable of capturing intricate patterns in large datasets, often at the expense of interpretability and higher computational demands.



  • ML0048 Logistic Regression

    Can you explain logistic regression and how it contrasts with linear regression?

    Answer

    Logistic regression maps inputs to a probability space for classification, while linear regression estimates continuous outcomes through a direct linear relationship.

    The logistic regression model estimates the probability that a binary outcome (y = 1) occurs, given an input vector x:
    \Pr(y=1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x} + b)}}
    Where:
    \mathbf{x} is the input feature vector,
    \mathbf{w} is the weight vector, and
    b is the bias term.
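    A minimal NumPy sketch of this computation (the weights, bias, and input are arbitrary example values):
    import numpy as np

    w = np.array([0.8, -1.2])   # weight vector
    b = 0.3                     # bias term
    x = np.array([2.0, 1.0])    # input features

    p = 1 / (1 + np.exp(-(w @ x + b)))  # Pr(y = 1 | x)
    print(p)                            # about 0.67; classify as 1 if p >= 0.5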

    Logistic Regression vs. Linear Regression:
    Linear Regression:
    Purpose: Predicts a continuous output (e.g., price, height).
    Output: Any real number (unbounded; it can be negative or greater than 1).
    Assumes: Linearity between input features and output.

    Logistic Regression:
    Purpose: Predicts a probability for classification (e.g., spam or not).
    Output: A value between 0 and 1, produced by the sigmoid function.
    Interpreted as: Probability of class membership.



  • DL0018 NaN Values

    What are the common causes for a deep learning model to output NaN values?

    Answer

    NaN outputs in deep learning usually stem from unstable math operations, gradient issues, bad hyperparameters, or data problems. Prevent them with proper initialization, input normalization, stable activation functions, and well-tuned hyperparameters.

    Here are the common causes for a deep learning model to output NaN values:
    (1) Exploding Gradients: Gradients become excessively large during training, leading to NaN weight updates.
    (2) Numerical Instability: Operations like log(0), division by zero, or square roots of negative numbers. Without a small constant (epsilon) in its denominator, batch normalization will divide by zero on a batch with zero variance (see the sketch after this list).
    (3) Improper Learning Rate: Too high a learning rate can cause parameter updates to diverge and push model parameters to extreme values.
    (4) Incorrect Weight Initialization: Incorrectly initializing all weights to very large positive numbers can cause activations to overflow immediately.
    (5) Data Issues: Input data contains NaN or extreme values.
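    A minimal NumPy sketch of point (2), showing how log(0) and a zero-variance batch produce non-finite values, and how a small epsilon avoids both:
    import numpy as np

    # log(0) -> -inf; downstream arithmetic can turn it into NaN.
    p = np.array([0.0, 0.5, 1.0])
    print(np.log(p))            # [-inf, -0.69, 0.]
    print(np.log(p + 1e-12))    # finite everywhere with a small epsilon

    # Batch norm without epsilon: a zero-variance batch divides by zero.
    batch = np.array([2.0, 2.0, 2.0])
    print((batch - batch.mean()) / np.sqrt(batch.var()))         # [nan nan nan]
    print((batch - batch.mean()) / np.sqrt(batch.var() + 1e-5))  # [0. 0. 0.]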



  • DL0017 Reproducibility

    How do you ensure the reproducibility of deep learning experiments?

    Answer

    Reproducibility in deep learning is achieved by controlling randomness via fixed seeds and deterministic operations, maintaining strict code and dependency versioning, managing datasets carefully, and keeping comprehensive logs of all experiment settings. These practices ensure that experiments can be reliably repeated and validated, regardless of external factors.

    (1) Seed Control and Deterministic Operations:
    Set random seeds for all libraries (Python, NumPy, TensorFlow/PyTorch).
    Enable deterministic settings in your deep learning framework to reduce nondeterminism (see the sketch after this list).
    (2) Code Versioning and Configuration Management:
    Use version control systems like Git.
    Maintain detailed configuration files (using YAML or JSON) that log hyperparameters and settings for each experiment.
    (3) Environment and Dependency Control:
    Use virtual environments (e.g., Conda) or containerize your projects with Docker.
    Freeze library versions to ensure consistency in the software environment.
    (4) Dataset Management:
    Fix train-test splits and document data preprocessing steps.
    Use versioned or static datasets to prevent unintentional changes over time.
    (5) Logging and Documentation:
    Log hardware details, random seeds, and experiment configurations.
    Utilize experiment tracking tools (like MLflow or Weights & Biases) to archive training runs and parameters.
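    A minimal PyTorch sketch of point (1) (the seed value is arbitrary; determinism may cost some speed):
    import random
    import numpy as np
    import torch

    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)

    # Deterministic cuDNN kernels (may be slower than the default).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Error out if an operation has no deterministic implementation.
    torch.use_deterministic_algorithms(True)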



  • ML0047 Parameters

    What are the differences between parameters and hyperparameters?

    Answer

    Parameters are the values that a model learns from its training data, while hyperparameters are settings defined by the user that guide the training process and model architecture.

    Parameters:
    (1) Internal variables learned from data (e.g., weights and biases).
    (2) Adjusted during training using optimization algorithms.
    (3) Capture the model’s learned patterns and information.

    Hyperparameters:
    (1) External configurations set before training (e.g., learning rate, batch size, number of layers).
    (2) Remain fixed during training and are not updated by the learning process.
    (3) Influence how the model learns and its overall structure.
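    A minimal PyTorch sketch of the distinction (the model shape and values are arbitrary):
    import torch
    import torch.nn as nn

    # Hyperparameters: set by the user before training; never updated by it.
    learning_rate = 1e-3
    hidden_units = 64

    # Parameters: the weights and biases inside the model, learned from data.
    model = nn.Sequential(
        nn.Linear(10, hidden_units), nn.ReLU(), nn.Linear(hidden_units, 1)
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    print(sum(p.numel() for p in model.parameters()))  # 769 learned parameters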


  • DL0016 Learning Rate Warmup

    What is Learning Rate Warmup? What is the purpose of using Learning Rate Warmup?

    Answer

    Learning Rate Warmup is a training technique where the learning rate starts from a small value and gradually increases to a target (base) learning rate over the first few steps or epochs of training.

    Purpose of Using Learning Rate Warmup:
    (1) Stabilizes Early Training: At the beginning of training, weights are randomly initialized, making the model sensitive to large updates. A warmup gradually increases the learning rate, preventing unstable behavior.
    (2) Allows Optimizers to Adapt: Optimizers like Adam and AdamW rely on gradient statistics that can be unstable at the start. Warmup allows these optimizers to accumulate more accurate estimates before a high learning rate is used.
    (3) Enables Large Batch Training: Mitigates issues that can arise when combining a large batch size with a high initial learning rate.

    For illustration, here is a minimal sketch of linear warmup followed by cosine decay (the step counts and base learning rate are arbitrary example values):
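    import math

    def lr_at_step(step, base_lr=1e-3, warmup_steps=500, total_steps=10_000):
        if step < warmup_steps:
            # Linear warmup from 0 up to the base learning rate.
            return base_lr * step / warmup_steps
        # Cosine decay from the base learning rate down to 0.
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

    for s in (0, 250, 500, 5_250, 10_000):
        print(s, lr_at_step(s))  # 0 -> 0.0, 500 -> 1e-3, 10_000 -> 0.0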

