Author: admin

  • ML0041 Concept of NN

    Please explain the concept of a Neural Network.

    Answer

    A neural network (NN) is a machine learning model composed of layers of interconnected neurons. It learns patterns in data by adjusting weights through training, enabling it to perform tasks like classification, regression, and more.

    (1) Inspired by Biology: Neural networks are computer systems modeled after the human brain’s network of neurons.
    (2) Layered Structure: Neural networks consist of an input layer, one or more hidden layers, and an output layer.
    (3) Neurons and Activation: Each neuron performs a weighted sum of its inputs, adds a bias, and applies an activation function to produce an output. Weights and Biases are learnable parameters adjusted during training. Activation Functions can introduce non-linearity (e.g., ReLU, Sigmoid).
    (4) Learning Process: Neural networks learn by adjusting the weights and biases through training algorithms such as backpropagation, minimizing errors between predictions and actual results.
    (5) Versatility in Applications: Neural networks can identify complex patterns, making them suitable for tasks like image recognition, natural language processing, and data classification.

    An example of an NN is shown below.
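    As a minimal sketch of the layered structure and neuron computation described above, here is a tiny feed-forward network in NumPy. The 2-input, 3-hidden, 1-output architecture and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed architecture: 2 inputs -> 3 hidden (ReLU) -> 1 output (sigmoid)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # hidden-layer weights and biases
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # output-layer weights and biases

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """One forward pass: weighted sum + bias, then activation, per layer."""
    h = relu(x @ W1 + b1)          # hidden layer
    return sigmoid(h @ W2 + b2)    # output layer

x = np.array([[0.5, -1.2]])        # one example with 2 features
y = forward(x)
print(y.shape)                     # a single probability-like output
```

    Training would then adjust W1, b1, W2, b2 via backpropagation; only the forward pass is shown here.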


  • DL0014 Mixed Precision Training

    Can you explain the primary benefits of using mixed precision training in deep learning?

    Answer

    Mixed precision training accelerates deep learning by using both FP32 and FP16 operations, which reduces memory and computational requirements while maintaining model accuracy, resulting in faster and more efficient training.

    (1) Faster Training: Uses lower-precision (e.g., FP16) operations on supported hardware (like GPUs/TPUs), which are faster than FP32.
    (2) Reduced Memory Usage: Lower-bit representations decrease memory footprint, allowing larger batch sizes or models.
    (3) Higher Throughput: More computations per second due to reduced precision, which can speed up training time.
    (4) Supports Large Models: Enables training of models that wouldn’t fit in memory with full precision.
    (5) Maintains Accuracy: With proper scaling (e.g., loss scaling), training stability and final model accuracy can typically be preserved.
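    The loss scaling mentioned in point (5) can be illustrated directly with NumPy's float16 type; real mixed precision relies on framework support (e.g., PyTorch's automatic mixed precision), and the gradient value and scale factor below are made up for illustration.

```python
import numpy as np

# A gradient too small for FP16: values below roughly 6e-8 round to zero.
grad_fp32 = np.float32(1e-8)

naive = np.float16(grad_fp32)                 # direct FP32 -> FP16 cast underflows
print(naive)                                  # the update is lost entirely

# Loss scaling: multiply before the cast, divide after converting back to FP32.
scale = np.float32(2 ** 14)                   # 16384, a typical power-of-two scale
scaled_fp16 = np.float16(grad_fp32 * scale)   # now well inside the FP16 range
recovered = np.float32(scaled_fp16) / scale   # unscale in FP32
print(recovered > 0)                          # the gradient survives
```

    Scaling the loss (and hence all gradients) by a power of two before the FP16 cast, then unscaling in FP32, is exactly what framework loss scalers automate.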


  • DL0013 Instance Normalization

    Can you explain what Instance Normalization is in the context of deep learning?

    Answer

    Instance Normalization (IN) normalizes each individual data sample (often per channel) by subtracting its own mean and dividing by its standard deviation, then applying a learnable scale and shift. This makes it ideal for applications where per-instance adjustment is needed, such as artistic style transfer, ensuring that the normalization is not affected by the mini-batch composition.

    Here are the equations for calculating the Instance Normalization output y_{nchw} for input x_{nchw}:

    \mu_{nc} = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{nchw}
    \sigma_{nc}^2 = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} (x_{nchw} - \mu_{nc})^2
    \hat{x}_{nchw} = \frac{x_{nchw} - \mu_{nc}}{\sqrt{\sigma_{nc}^2 + \epsilon}}
    y_{nchw} = \gamma_c \hat{x}_{nchw} + \beta_c

    Where:
    x_{nchw} is the input feature at batch n, channel c, height h, and width w.
    H is the height of the feature map (number of rows per channel).
    W is the width of the feature map (number of columns per channel).
    \mu_{nc} is the mean of all spatial values in channel c of instance n.
    \sigma_{nc}^2 is the variance of the spatial values in channel c of instance n.
    \hat{x}_{nchw} is the normalized value after subtracting the mean and dividing by the standard deviation.
    \epsilon is a small constant added to the denominator to prevent division by zero and improve numerical stability.
    y_{nchw} is the final output after applying normalization and scaling.
    \gamma_c is a learnable scale parameter for channel c.
    \beta_c is a learnable shift parameter for channel c.
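    The equations above can be sketched directly in NumPy for an NCHW tensor; the tensor shape and parameter values below are illustrative.

```python
import numpy as np

def instance_norm(x, gamma, beta, eps=1e-5):
    """Instance Normalization for an NCHW tensor.

    Mean and variance are computed per sample n and channel c over the
    spatial dimensions H and W, so each instance is normalized
    independently of the rest of the mini-batch.
    """
    mu = x.mean(axis=(2, 3), keepdims=True)      # mu_{nc}
    var = x.var(axis=(2, 3), keepdims=True)      # sigma^2_{nc}
    x_hat = (x - mu) / np.sqrt(var + eps)        # normalized x
    # gamma_c and beta_c broadcast over N, H, and W
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 8, 8))  # N=2, C=4, H=W=8
y = instance_norm(x, gamma=np.ones(4), beta=np.zeros(4))

# With gamma=1 and beta=0, each (n, c) slice now has ~zero mean, ~unit variance.
print(np.allclose(y.mean(axis=(2, 3)), 0.0, atol=1e-6))
```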


  • ML0040 Bias and Variance

    Can you explain the bias-variance tradeoff?

    Answer

    Bias:
    Error due to overly simplified assumptions in the model.
    High bias may lead to underfitting, where the model misses key patterns in the data.

    Variance:
    Error due to high sensitivity to variations in the training data.
    High variance may result in overfitting, where the model captures noise in addition to the underlying patterns.

    Bias-Variance Tradeoff:
    Increasing model complexity typically decreases bias but increases variance, while a simpler model increases bias but decreases variance.
    The goal is to balance both to minimize the total error on unseen data.

    The bias-variance tradeoff illustrates that there’s a delicate balance to strike when building a machine learning model. A simpler model tends to have high bias and low variance, underfitting the data. A more complex model tends to have low bias and high variance, overfitting the data. The goal is to find the right level of model complexity to minimize the total prediction error, which is the sum of squared bias, variance, and irreducible error.

    The example below shows scenarios of high bias (underfitting), high variance (overfitting), and a good balance.
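    The tradeoff can be sketched with NumPy polynomial fits; the sine target, noise level, and degrees 1/3/9 (underfit, balanced, overfit) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a smooth underlying function
x_train = np.sort(rng.uniform(0, 1, 30))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 30)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

def fit_and_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    return mse(x_train, y_train), mse(x_test, y_test)

# Train error keeps falling as complexity grows, but test error reveals
# the high-bias (degree 1) and high-variance (degree 9) regimes.
for degree in (1, 3, 9):
    train_mse, test_mse = fit_and_mse(degree)
    print(f"degree {degree}: train={train_mse:.3f}  test={test_mse:.3f}")
```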



  • ML0039 Distributed Training

    What are the two main distributed training approaches for machine learning?

    Answer

    The two main distributed training approaches for machine learning are Data Parallelism and Model Parallelism.

    Data Parallelism: In this approach, the training dataset is divided and distributed among multiple computing devices, with each device holding a complete copy of the machine learning model. Each device then trains its model on its assigned subset of the data in parallel. After each training step, the updates (gradients or new parameters) from all devices are aggregated and synchronized to maintain a consistent model across the system. This method is highly effective for scaling training when the dataset is large and the model can fit within a single device’s memory.

    Model Parallelism: This approach is used when the machine learning model itself is too large to fit into the memory of a single computing device. In model parallelism, different parts of the model (e.g., specific layers of a neural network) are distributed across multiple devices. Data typically flows sequentially through these distributed model parts. This necessitates more complex communication between devices as intermediate computations and activations must be passed along. Model parallelism is crucial for training extremely large models that would otherwise be computationally intractable on a single machine.
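    The aggregation step in data parallelism can be sketched in NumPy; the all-reduce is simulated by a plain average, and the linear model, shapes, and data are made up. With equal shard sizes, averaging per-shard gradients reproduces the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Full batch for a linear model y = X w, with squared-error loss
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)

def grad(X, y, w):
    """Gradient of the mean squared error 0.5 * mean((Xw - y)^2) w.r.t. w."""
    return X.T @ (X @ w - y) / len(y)

# Data parallelism: shard the batch across 4 "devices", each holding a full
# copy of w, then average the per-shard gradients (a simulated all-reduce).
shards = np.array_split(np.arange(8), 4)
local_grads = [grad(X[idx], y[idx], w) for idx in shards]
averaged = np.mean(local_grads, axis=0)

# Matches the gradient a single device would compute on the full batch.
print(np.allclose(averaged, grad(X, y, w)))
```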


  • ML0038 Validation and Test


    What are the key purposes of using both a validation and a test set when building machine learning models?

    Answer

    Using a validation set separates model tuning from final evaluation, enabling informed hyperparameter decisions and overfitting control, while reserving a test set ensures a completely unbiased, final assessment of how the model will perform in real-world, unseen scenarios.

    Validation Set:
    (1) Tune Hyperparameters: Optimize model settings without test set bias.
    (2) Select Best Model: Compare different models objectively during development.
    (3) Prevent Overfitting (During Training): Monitor performance on unseen data to stop training early if needed.

    Test Set:
    (1) Final, Unbiased Evaluation: Assess the truly generalized performance of the final model.
    (2) Simulate Real-World Performance: Estimate how the model will perform on completely new data.
    (3) Avoid Data Leakage: Ensure no information from the test set influences model building.
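    A minimal NumPy sketch of the three-way split; the 60/20/20 ratio is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100
indices = rng.permutation(n)        # shuffle once, up front

# Assumed 60/20/20 split: train for fitting, validation for tuning,
# test held out for the final, unbiased evaluation.
train_idx = indices[:60]
val_idx = indices[60:80]
test_idx = indices[80:]

# The three sets are disjoint, so no test information leaks into training.
assert set(train_idx) & set(val_idx) == set()
assert (set(train_idx) | set(val_idx)) & set(test_idx) == set()
print(len(train_idx), len(val_idx), len(test_idx))
```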


  • ML0037 Bias in NN

    Why is bias used in neural networks?

    Answer

    Bias in Neural Networks is used to introduce flexibility and adaptability in learning.
    (1) Shifts Activation Threshold: Allows a neuron’s activation function to move left or right, so it can fire even when inputs sum to zero.
    (2) Avoids Origin Constraint: Lets decision boundaries and fitted functions not be forced through the origin (0,0).
    (3) Increases Flexibility: Provides an extra learnable parameter for better approximation of complex functions.
    (4) Compensates for Imbalance: Helps adjust for biases in data or features.
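    A one-neuron NumPy sketch of points (1) and (2): without a bias, a sigmoid neuron is pinned to 0.5 at zero input no matter what the weights are; a bias shifts the activation threshold. The weight and bias values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.5, -0.8])
x_zero = np.zeros(2)                 # all-zero input

# Without a bias, the pre-activation at x = 0 is always 0, so the sigmoid
# neuron outputs exactly 0.5 there regardless of the learned weights.
no_bias = sigmoid(w @ x_zero)
print(no_bias)

# A bias shifts the threshold, freeing the output at x = 0.
b = -2.0
with_bias = sigmoid(w @ x_zero + b)
print(with_bias < 0.5)               # the neuron can now stay "off" at zero input
```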



  • ML0036 Confusion Matrix

    In which scenarios is a Confusion Matrix most useful for evaluating machine learning models, and why?

    Answer

    A Confusion Matrix is a table that visualizes the performance of a classification model by comparing the predicted and actual class labels. It displays the counts of True Positives (correctly predicted positives), True Negatives (correctly predicted negatives), False Positives (incorrectly predicted positives), and False Negatives (incorrectly predicted negatives). While its form is simple, it becomes indispensable whenever you need more insight than overall accuracy. Below are the key scenarios where a confusion matrix shines.

    (1) Imbalanced Datasets: Reveals if the minority class is being predicted well, unlike overall accuracy.
    (2) Understanding Error Types: Shows True Positives, True Negatives, False Positives, and False Negatives, which is crucial when different errors have different costs (e.g., medical tests, fraud detection).
    (3) Multi-Class Classification: Identifies which specific classes are being confused.
    (4) Comparing Models: A detailed comparison of model strengths and weaknesses beyond overall accuracy.

    Here is an example binary class Confusion Matrix.
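    The four counts can be computed with a few NumPy comparisons; the labels below are made up to show an imbalanced case where accuracy looks fine but the matrix exposes the missed positives.

```python
import numpy as np

def binary_confusion_matrix(y_true, y_pred):
    """Return (TP, FP, FN, TN) counts for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fp, fn, tn

# Imbalanced example: 3 positives out of 10, only 1 of them caught.
y_true = [1, 0, 0, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
tp, fp, fn, tn = binary_confusion_matrix(y_true, y_pred)
print(tp, fp, fn, tn)     # 1 0 2 7

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
print(accuracy, recall)   # 0.8 accuracy hides a recall of only 1/3
```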



  • ML0035 Model Comparison

    How to compare different machine learning models?

    Answer

    Compare machine learning models by defining clear objectives and metrics, using consistent data splits, training and tuning each model, and evaluating them through robust metrics and statistical tests. Finally, consider trade-offs like model complexity and interpretability to make an informed choice.
    (1) Choose Relevant Metrics: Select evaluation metrics that align with your task (e.g., accuracy or F1 for classification)
    An example using ROC curves for model comparison is shown below.

    (2) Use Consistent Data Splits: Evaluate all models on the same train/validation/test splits—or identical cross-validation folds—to ensure fairness
    (3) Apply Cross-Validation: Employ k-fold or nested cross-validation to reduce variance in performance estimates, especially with limited data
    (4) Control Randomness: Run each model multiple times with different random seeds (data shuffles, weight initializations) and average the results to gauge stability
    (5) Perform Statistical Tests: Use paired tests to determine if observed differences are statistically significant
    (6) Measure Efficiency: Record training time, inference latency, and resource usage (CPU/GPU and memory) to assess practical deployability
    (7) Evaluate Robustness & Interpretability: Test models under data perturbations or adversarial noise, and compare explainability
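    Points (2) and (3) can be sketched in NumPy: two toy models evaluated on identical 5-fold splits. The data, fold count, and models (a mean-predictor baseline versus a least-squares line) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data with a linear signal
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

# Build the folds once so every model sees identical splits.
folds = np.array_split(rng.permutation(100), 5)

def kfold_mse(fit_predict):
    """Mean squared error averaged over the shared 5 folds."""
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        errors.append(np.mean((pred - y[test_idx]) ** 2))
    return float(np.mean(errors))

def mean_model(X_tr, y_tr, X_te):
    """Baseline: always predict the training mean."""
    return np.full(len(X_te), y_tr.mean())

def linear_model(X_tr, y_tr, X_te):
    """Least-squares line with an intercept."""
    A = np.column_stack([X_tr[:, 0], np.ones(len(X_tr))])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return coef[0] * X_te[:, 0] + coef[1]

print("baseline MSE:", kfold_mse(mean_model))
print("linear   MSE:", kfold_mse(linear_model))
```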


  • DL0012 Zero Padding

    Why is zero padding used in deep learning?

    Answer

    Zero padding in deep learning, particularly in CNNs, is a technique of adding layers of zeros around the input to convolutional layers. This is crucial for maintaining the spatial dimensions of feature maps, preventing loss of information at the image or feature map borders, enabling the use of larger receptive fields via larger kernels, and providing control over the output size of convolutional operations. Ultimately, it helps in building deeper and more effective neural networks by preserving important spatial information throughout the network.
    Here are the benefits of using zero padding in CNNs.
    (1) Preserves Spatial Dimensions: Prevents feature maps from shrinking after convolution.

    Below is a 2D image example of zero padding in a CNN.

    (2) Retains Boundary Information: Ensures edge pixels are processed adequately.
    (3) Enables Larger Kernels: Allows using bigger filters without excessive size reduction.
    (4) Controls Output Size: Provides a mechanism to manage the dimensions of output feature maps.
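    Points (1) and (4) follow from the standard output-size formula out = (n + 2p - k) / s + 1; a NumPy sketch with an illustrative 5x5 map and 3x3 kernel:

```python
import numpy as np

x = np.arange(25, dtype=float).reshape(5, 5)   # a 5x5 feature map
k = 3                                          # 3x3 kernel, stride 1

# "Valid" convolution (no padding) shrinks the map: output side = n - k + 1
valid_out = x.shape[0] - k + 1
print(valid_out)            # the map shrinks from 5 to 3

# Zero padding with p = (k - 1) // 2 keeps the spatial size unchanged:
# output side = n + 2p - k + 1 = n
p = (k - 1) // 2
padded = np.pad(x, p, mode="constant", constant_values=0.0)
print(padded.shape)         # (7, 7) after padding
same_out = padded.shape[0] - k + 1
print(same_out)             # back to 5: same as the input, borders preserved
```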

    Beyond CNNs, zero padding plays a vital role in deep learning by standardizing variable-length sequences in tasks like NLP and time-series modeling. It ensures inputs have uniform dimensions for efficient batching and computation, enhances frequency resolution in spectral analyses, and allows for effective loss masking to focus learning on actual data.
    (1) Standardizing Variable-Length Inputs: In NLP and time-series analysis, zero padding ensures that sequences of varying lengths have a uniform size. This uniformity is crucial for batch processing and for models like recurrent neural networks (RNNs) or transformers.
    (2) Attention Masking in Transformers: Padding tokens in Transformer inputs are assigned zero values and then excluded via padding masks in self-attention layers, preventing the model from attending to irrelevant positions in the sequence.
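    Points (1) and (2) can be sketched in NumPy: right-padding variable-length sequences into a rectangular batch and building the corresponding padding mask. The token ids are made up, and id 0 is assumed to be reserved for padding.

```python
import numpy as np

# Variable-length token-id sequences (0 is reserved as the padding id)
sequences = [[5, 2, 9], [7, 1], [4, 8, 3, 6]]
max_len = max(len(s) for s in sequences)

batch = np.zeros((len(sequences), max_len), dtype=int)
for i, seq in enumerate(sequences):
    batch[i, : len(seq)] = seq              # right-pad with zeros

# A padding mask marks real tokens (True) vs padding (False); attention
# layers and losses use it to ignore the padded positions.
mask = batch != 0
print(batch.shape)        # one rectangular batch
print(mask.sum(axis=1))   # recovers the original sequence lengths
```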

