Author: admin

  • ML0051 Linear SVM

    Can you explain the key concepts behind a Linear Support Vector Machine?

    Answer

    A Linear Support Vector Machine (Linear SVM) is a classifier that finds the optimal straight-line (or hyperplane) separating two classes by maximizing the margin between them. It relies on a few critical points (support vectors) and offers strong generalization, especially for linearly separable data.

    Key Concepts of a Linear Support Vector Machine:
    (1) Hyperplane: A decision boundary that separates data points of different classes.
    (2) Margin: The distance between the hyperplane and the nearest data points from each class.
    (3) Support Vectors: Data points that lie closest to the hyperplane and define the margin.
    (4) Objective: Maximize the margin while minimizing classification errors.

    Here is the Linear SVM Decision Function:
     f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b
    Where:
     \mathbf{x} is the input feature vector.
     \mathbf{w} is the weight vector.
     b is the bias term.

    Here is the Linear SVM Classification Rule:
     \hat{y} = \mbox{sign}(\mathbf{w}^\top \mathbf{x} + b) = \mbox{sign}(f(\mathbf{x}))
    Where:
     \hat{y} is the predicted class label.
     \mbox{sign}(\cdot) returns +1 if the argument is ≥ 0, and −1 otherwise.

    For Hard Margin SVM, here is the Optimization Objective:
     \min_{\mathbf{w}, b} \quad \frac{1}{2} \|\mathbf{w}\|^2
    Subject to:
     y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \quad \text{for all } i
    Where:
     y_i \in \{-1, 1\} is the class label for the i-th data point.
     \mathbf{x}_i is the i-th feature vector.
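    As a minimal sketch, the decision function and classification rule above can be applied in plain Python. The weight vector and bias below are made-up placeholders standing in for values a trained SVM would produce:

    ```python
    # Sketch of the linear SVM decision function f(x) = w^T x + b
    # and the classification rule y_hat = sign(f(x)).
    # The weights and bias are illustrative, not trained values.

    def decision_function(w, x, b):
        """f(x) = w^T x + b"""
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    def predict(w, x, b):
        """y_hat = sign(f(x)): +1 if f(x) >= 0, -1 otherwise."""
        return 1 if decision_function(w, x, b) >= 0 else -1

    w = [2.0, -1.0]   # hypothetical weight vector
    b = -0.5          # hypothetical bias

    print(predict(w, [1.0, 0.5], b))   # f = 2.0 - 0.5 - 0.5 = 1.0 -> +1
    print(predict(w, [0.0, 2.0], b))   # f = 0.0 - 2.0 - 0.5 = -2.5 -> -1
    ```

    Points satisfying the hard-margin constraint y_i f(x_i) ≥ 1 lie on or outside the margin; the support vectors are exactly those with y_i f(x_i) = 1.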

    The example below shows Hard Margin SVM for solving a classification task.


  • ML0050 Logistic Regression III

    Why is Mean Squared Error (L2 Loss) an unsuitable loss function for logistic regression compared to cross-entropy?

    Answer

    Mean Squared Error (MSE) is unsuitable for logistic regression primarily because, when combined with the sigmoid function, it can lead to a non-convex loss landscape, making optimization harder and increasing the risk of poor convergence. Additionally, it provides weaker gradients when predictions are confidently incorrect, slowing down learning. Cross-entropy loss is better suited as it aligns with the Bernoulli distribution assumption, produces stronger gradients, and leads to a well-behaved convex loss for a single neuron binary classification setting.

    (1) Wrong Assumption: MSE assumes a Gaussian distribution of errors, while logistic regression assumes a Bernoulli (binary) distribution.
    (2) Non-convex Optimization: MSE with sigmoid can create a non-convex loss surface, making optimization harder and less stable.
    (3) Gradient Issues: MSE leads to smaller gradients for confident wrong predictions, slowing down learning compared to cross-entropy.
    (4) Interpretation: Cross-entropy directly compares predicted probabilities to true labels, which is more appropriate for classification.
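    The gradient issue in point (3) can be checked directly. For a single sigmoid unit with output p and label y, the gradient of cross-entropy with respect to the pre-activation z is (p − y), while the gradient of 0.5·(p − y)² is (p − y)·p·(1 − p), which is crushed by the sigmoid derivative when the prediction is saturated. A small sketch (the input z = −6 is an arbitrary "confidently wrong" example):

    ```python
    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def grad_ce(z, y):
        # d/dz of cross-entropy loss through a sigmoid: p - y
        return sigmoid(z) - y

    def grad_mse(z, y):
        # d/dz of 0.5 * (p - y)^2 through a sigmoid: (p - y) * p * (1 - p)
        p = sigmoid(z)
        return (p - y) * p * (1.0 - p)

    z, y = -6.0, 1.0   # p = sigmoid(-6) ~ 0.0025, but the true label is 1
    print(abs(grad_ce(z, y)))   # ~1.0: strong learning signal
    print(abs(grad_mse(z, y)))  # ~0.0025: nearly vanished signal
    ```

    The more confidently wrong the prediction, the larger the cross-entropy gradient, whereas the MSE gradient shrinks toward zero.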

    The figure below shows the non-convex loss surface when MSE is used for logistic regression.


  • ML0049 Logistic Regression II

    Please compare Logistic Regression and Neural Networks.

    Answer

    Logistic Regression is a straightforward, linear model suitable for linearly separable data and offers good interpretability. In contrast, Neural Networks are powerful, non-linear models capable of capturing intricate patterns in large datasets, often at the expense of interpretability and higher computational demands.

    The table below compares Logistic Regression and Neural Networks in more detail.


  • ML0048 Logistic Regression

    Can you explain logistic regression and how it contrasts with linear regression?

    Answer

    Logistic regression maps inputs to a probability space for classification, while linear regression estimates continuous outcomes through a direct linear relationship.

    The logistic regression model estimates the probability that a binary outcome (y = 1) occurs, given an input vector \mathbf{x}:
    \Pr(y=1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x} + b)}}
    Where:
    \mathbf{x} is the input feature vector,
    \mathbf{w} is the weight vector, and
    b is the bias term.
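    The probability formula above can be sketched in a few lines of plain Python; the weights and bias here are arbitrary placeholders for values that training would produce:

    ```python
    import math

    def logistic_prob(w, x, b):
        """Pr(y=1 | x) = 1 / (1 + exp(-(w^T x + b)))."""
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))

    w = [1.5, -2.0]  # hypothetical weight vector
    b = 0.25         # hypothetical bias

    p = logistic_prob(w, [1.0, 0.5], b)
    print(round(p, 3))  # always strictly between 0 and 1
    ```

    Note the contrast with linear regression: the linear score w^T x + b is unbounded, but the sigmoid squashes it into (0, 1) so it can be read as a class probability.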

    Logistic Regression vs. Linear Regression:
    Linear Regression:
    Purpose: Predicts a continuous output (e.g., price, height).
    Output: Real number (can be negative or >1).
    Assumes: Linearity between input features and output.

    Logistic Regression:
    Purpose: Predicts a probability for classification (e.g., spam or not).
    Output: Value between 0 and 1 using sigmoid function.
    Interpreted as: Probability of class membership.

    Here is a table comparing Logistic Regression with Linear Regression.


  • DL0024 Fixed-size Input in CNN

    What is the “dilemma of fixed-size input” for CNNs? How is it typically resolved?

    Answer

    The “dilemma of fixed-size input” for Convolutional Neural Networks (CNNs) refers to the requirement that traditional CNN architectures demand input images of a predetermined, fixed size. This presents a challenge because real-world images often vary widely in dimensions.

    Fixed Input Requirement: Traditional CNN architectures (like VGG or ResNet) require inputs of a fixed size due to the structure of fully connected layers at the end.
    Data Preprocessing Constraint: Real-world images vary in size, so they must be resized or cropped, which may distort or lose important features.
    Inefficiency & Information Loss: Resizing may stretch or compress content unnaturally, affecting model performance.

    The figure below shows an example of information loss during resizing or cropping.

    Common Solutions for the dilemma of fixed-size input:
    (1) Global Average Pooling (GAP): Replaces fully connected layers, allowing input of variable size and reducing overfitting.
    (2) Fully Convolutional Networks (FCNs): Use only convolutional and pooling layers, which can handle variable-sized inputs.
    (3) Adaptive Pooling (e.g., in PyTorch): Pools features to a fixed size regardless of input dimensions.
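    To make solution (3) concrete, here is a pure-Python sketch of adaptive average pooling using the usual binning scheme (bin i covers input rows floor(i·H/out) to ceil((i+1)·H/out)); inputs of any size come out at the same fixed size:

    ```python
    import math

    def adaptive_avg_pool2d(x, out_h, out_w):
        """Average-pool a 2D grid of any size down to a fixed (out_h, out_w)
        grid, mirroring the binning used by adaptive pooling layers."""
        in_h, in_w = len(x), len(x[0])
        out = []
        for i in range(out_h):
            h0, h1 = (i * in_h) // out_h, math.ceil((i + 1) * in_h / out_h)
            row = []
            for j in range(out_w):
                w0, w1 = (j * in_w) // out_w, math.ceil((j + 1) * in_w / out_w)
                vals = [x[r][c] for r in range(h0, h1) for c in range(w0, w1)]
                row.append(sum(vals) / len(vals))
            out.append(row)
        return out

    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]      # a 3x3 "feature map"
    b = [[1] * 5 for _ in range(7)]            # a 7x5 "feature map"

    # Both inputs are reduced to the same fixed 2x2 output.
    print(adaptive_avg_pool2d(a, 2, 2))
    print(adaptive_avg_pool2d(b, 2, 2))
    ```

    Because the output size is fixed regardless of the input dimensions, the fully connected layers that follow always see a vector of the same length.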


  • DL0023 Dilated Convolution

    What are dilated convolutions? When would you use them?

    Answer

    Dilated convolutions enhance standard convolution by inserting gaps between filter elements, thereby allowing the network to gather more context (a larger receptive field) without an increase in parameters or a reduction in resolution.

    Dilated convolutions (also known as atrous convolutions) modify standard convolution by inserting gaps (zeros) between kernel elements. A “dilation rate” dictates the spacing of these gaps. A dilation rate of 1 is a standard convolution.

    Contrast with Pooling:
    Pooling reduces spatial resolution (downsamples) while increasing the receptive field.
    Dilated convolutions increase the receptive field without reducing resolution.

    Multi-Scale Feature Extraction:
    By adjusting the dilation rate, these convolutions can aggregate features from both local neighborhoods and larger regions, making it easier for the network to learn from multi-scale context.

    Common Use Cases: Any task needing large receptive fields without downsampling.
    (1) Semantic segmentation (e.g., DeepLab): Expand the receptive field and capture multi-scale context.
    (2) Audio processing (e.g., WaveNet): Model long-range temporal dependencies.
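    A minimal 1D sketch shows the receptive-field effect: the same three kernel weights cover a span of 3 at dilation 1 but a span of 5 at dilation 2, with no extra parameters:

    ```python
    def dilated_conv1d(signal, kernel, dilation=1):
        """'Valid' 1D convolution (cross-correlation) with gaps of size
        (dilation - 1) between kernel taps; dilation=1 is standard."""
        span = (len(kernel) - 1) * dilation + 1   # effective receptive field
        return [sum(k * signal[start + i * dilation]
                    for i, k in enumerate(kernel))
                for start in range(len(signal) - span + 1)]

    x = [1, 2, 3, 4, 5, 6, 7]
    k = [1, 0, -1]   # a simple difference kernel

    print(dilated_conv1d(x, k, dilation=1))  # receptive field 3 -> [-2, -2, -2, -2, -2]
    print(dilated_conv1d(x, k, dilation=2))  # receptive field 5 -> [-4, -4, -4]
    ```

    Stacking layers with dilation rates 1, 2, 4, ... grows the receptive field exponentially while the parameter count grows only linearly.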

    Here is a 1D Dilated Convolution illustration.

    Here is a 2D Dilated Convolution illustration.



  • DL0022 CNN Architecture

    Describe the typical architecture of a CNN.

    Answer

    A Convolutional Neural Network (CNN) is structured to efficiently recognize complex patterns in data. It begins with an input layer that feeds in raw data. Convolutional layers then extract key features using filters, which are enhanced through non-linear activation functions like ReLU. Pooling layers are used to reduce the size or dimensions of these features, thereby improving computational efficiency and promoting invariance to small shifts. The extracted features are flattened and passed through fully connected layers that culminate in an output layer for final predictions, typically employing a softmax function for classification tasks. Optional techniques, such as dropout and batch normalization, further refine learning and help prevent overfitting.

    (1) Input Layer: Accepts raw data as multi-dimensional arrays.
    (2) Convolutional Layers: Use learnable filters (kernels) to scan the input and extract local features.
    (3) Activation Functions: Apply non-linearity (commonly ReLU) after each convolution operation.
    (4) Pooling Layers: Downsample feature maps using techniques like max or average pooling to reduce spatial dimensions and computations.
    (5) Stacked Convolutional and Pooling Blocks: Multiple iterations to progressively extract intricate hierarchical features.
    (6) Flattening: Converts feature maps into one-dimensional vectors.
    (7) Fully Connected Layers: Learn complex patterns and perform decision-making.
    (8) Output Layer: Produces final predictions using appropriate activation functions (e.g., softmax for classification).
    (9) Additional Components (Optional): Dropout for regularization, batch normalization for training stability, and skip connections in more advanced models.
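    One way to see how the stacked blocks fit together is to trace the spatial size through the layers with the standard output-size formula floor((n + 2p − k)/s) + 1. The layer stack below is a small hypothetical example, not a specific published architecture:

    ```python
    def conv_out(size, kernel, stride=1, padding=0):
        """Spatial output size of a conv or pool layer:
        floor((size + 2*padding - kernel) / stride) + 1."""
        return (size + 2 * padding - kernel) // stride + 1

    size = 32                                    # e.g., a 32x32 input image
    size = conv_out(size, kernel=3, padding=1)   # conv 3x3, pad 1 -> 32
    size = conv_out(size, kernel=2, stride=2)    # max pool 2x2    -> 16
    size = conv_out(size, kernel=3, padding=1)   # conv 3x3, pad 1 -> 16
    size = conv_out(size, kernel=2, stride=2)    # max pool 2x2    -> 8
    print(size)  # 8; flattening yields a vector of length 8 * 8 * num_channels
    ```

    This is why "same" padding (p = 1 for a 3x3 kernel) preserves dimensions through the convolutions, while each 2x2 pooling step halves them.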

    Below is a visual representation of a typical CNN architecture. Padding is used in convolution to maintain dimensions.



  • DL0021 Feature Map

    What is the feature map in Convolutional Neural Networks?

    Answer

    A feature map is the output of a convolution operation in a Convolutional Neural Network (CNN) that highlights where specific features appear in the input, enabling the network to understand patterns and structures in input data.

    Feature Map in CNNs:
    (1) Output of a Filter: It’s the 2D (or 3D) output generated when a single convolutional filter slides across the input data.
    (2) Highlighting a Specific Feature: Each feature map represents the spatial locations and strengths where a particular pattern or characteristic (e.g., a vertical edge, a specific texture, a corner) is detected in the input.
    (3) Multiple Feature Maps per Layer: A convolutional layer typically uses multiple filters, with each filter producing its unique feature map.
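    The points above can be sketched with a tiny "valid" convolution: sliding one filter over an image produces one feature map, and the map lights up exactly where the filter's pattern occurs. The vertical-edge image and filter below are made up for illustration:

    ```python
    def conv2d_valid(image, kernel):
        """Slide a single filter over the image ('valid' cross-correlation);
        the resulting grid of responses is one feature map."""
        kh, kw = len(kernel), len(kernel[0])
        oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
        return [[sum(kernel[i][j] * image[r + i][c + j]
                     for i in range(kh) for j in range(kw))
                 for c in range(ow)]
                for r in range(oh)]

    # Image with a vertical edge: dark left half, bright right half.
    image = [[0, 0, 1, 1]] * 4
    vertical_edge = [[-1, 1]] * 2   # responds where intensity jumps left-to-right

    fmap = conv2d_valid(image, vertical_edge)
    print(fmap)  # strong response (2) only in the edge column
    ```

    A different filter (e.g., a horizontal-edge kernel) applied to the same image would produce its own, different feature map, which is why a layer with many filters outputs many maps.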

    The following example shows feature map examples calculated with different filters on the original image.



  • DL0020 CNN Parameter Sharing

    How do Convolutional Neural Networks achieve parameter sharing? Why is it beneficial?

    Answer

    Convolutional Neural Networks (CNNs) share parameters by using the same convolutional filter across different spatial locations, enabling them to learn location-independent features efficiently with fewer parameters and better generalization.

    How CNNs Achieve Parameter Sharing:
    (1) Convolutional Filters/Kernels: A small matrix of learnable weights (the filter) is defined.
    (2) Sliding Window Operation: This filter slides across the entire input image (or feature map).
    (3) Weight Reuse: The same weights within that filter are used to compute outputs at every spatial location where the filter is applied.

    Why Parameter Sharing is Beneficial:
    (1) Reduced Parameters: Significantly fewer learnable parameters compared to fully connected networks.
    (2) Translation equivariance: Detects features regardless of their position in the image.
    The following example demonstrates translation equivariance using a CNN-like convolution with a shared filter.
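    As a minimal sketch in plain Python (a 1D signal and a two-weight shared filter, both made up for illustration): because the same weights are applied at every position, shifting the input simply shifts the response.

    ```python
    def conv1d_shared(signal, kernel):
        """'Valid' 1D convolution: the SAME kernel weights are reused at
        every position (parameter sharing)."""
        k = len(kernel)
        return [sum(w * signal[i + j] for j, w in enumerate(kernel))
                for i in range(len(signal) - k + 1)]

    def shift_right(signal, n):
        return [0] * n + signal[:-n]

    pattern = [0, 1, 3, 1, 0, 0, 0, 0]   # a "feature" near the start
    shifted = shift_right(pattern, 3)    # the same feature, 3 steps later
    edge = [1, -1]                       # one shared 2-weight filter

    print(conv1d_shared(pattern, edge))  # [-1, -2, 2, 1, 0, 0, 0]
    print(conv1d_shared(shifted, edge))  # [0, 0, 0, -1, -2, 2, 1]
    ```

    The filter detects the feature with the same response values wherever it appears; only the position of the response moves. A fully connected layer, by contrast, would need separate weights for every input position.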

    (3) Improved Generalization: Less prone to overfitting due to fewer parameters.
    (4) Computational Efficiency: Faster training and inference.


  • DL0019 Go Deep

    How does increasing network depth impact the learning process?

    Answer

    Increasing network depth enhances feature learning and model power, but brings training instability, higher cost, and design complexity.

    Increasing network depth can bring benefits:
    (1) Improved Feature Hierarchy: Deeper layers can learn more abstract, high-level features. In image classification, early layers learn edges, deeper ones learn shapes and objects.
    (2) Increased Model Capacity: More layers allow the network to model more complex functions and patterns.
    (3) Improved Efficiency for Complex Functions: For certain complex functions, deep networks can represent them more efficiently with fewer neurons compared to shallow ones.

    Increasing network depth can bring challenges:
    (1) Vanishing/Exploding Gradients: Gradients can become extremely small or large as they propagate through many layers, hindering effective training. For example, without techniques like skip connections, a 100-layer network may struggle to learn because gradients vanish before reaching the early layers.
    (2) Increased Computational Cost: Training deeper networks requires significantly more computational resources and time.
    (3) Higher Data Requirements: Deeper models have more parameters and are more prone to overfitting if not trained on large datasets.
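    Challenge (1) can be made concrete with a back-of-the-envelope calculation: backpropagation multiplies one Jacobian factor per layer, and with sigmoid activations each factor is bounded by the sigmoid's maximum derivative, 0.25. A rough upper-bound sketch (the 0.25 factor assumes sigmoid activations and ignores the weight matrices):

    ```python
    # Rough sketch of gradient vanishing: backprop through L layers multiplies
    # L per-layer factors; with sigmoid activations each factor is <= 0.25.
    def gradient_scale(num_layers, factor=0.25):
        """Upper bound on the gradient magnitude reaching the first layer,
        assuming each layer contributes at most `factor`."""
        return factor ** num_layers

    for depth in (5, 20, 100):
        print(depth, gradient_scale(depth))
    ```

    Already at 20 layers the bound is below 1e-12, and at 100 layers it is effectively zero, which is why skip connections, careful initialization, and normalization are needed to train very deep networks.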

    The following example visually compares a shallow and a deep neural network on learning a complex function.

