Author: admin

  • DL0018 NaN Values

    What are the common causes for a deep learning model to output NaN values?

    Answer

    NaN outputs in deep learning usually stem from unstable math operations, gradient issues, bad hyperparameters, or data problems. Prevent this with proper initialization, proper normalization, stable activation functions, and well-tuned hyperparameters.

    Here are the common causes for a deep learning model to output NaN values:
    (1) Exploding Gradients: Gradients become excessively large during training, leading to NaN weight updates
    (2) Numerical Instability: Operations like log(0), division by zero, or square roots of negative numbers. Without a small constant (epsilon) in its denominator, batch normalization will suffer from division by zero if a batch has zero variance.
    (3) Improper Learning Rate: Too high a learning rate can cause parameter updates to diverge and push model parameters to extreme values.
    (4) Incorrect Weight Initialization: Incorrectly initializing all weights to very large positive numbers can cause activations to overflow immediately.
    (5) Data Issues: Input data contains NaN or extreme values.
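The numerical-instability case in (2) can be demonstrated with a minimal, hypothetical sketch in pure Python (the function name `batch_norm_1d` and the epsilon value are illustrative, not from any particular framework):

```python
import math

def batch_norm_1d(values, eps=1e-5):
    """Normalize a batch of scalars; eps guards against zero variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

batch = [2.0, 2.0, 2.0]        # a zero-variance batch
safe = batch_norm_1d(batch)    # finite outputs thanks to eps

# Without eps, math.sqrt(var) is 0.0 and the division blows up.
try:
    unsafe = [(v - 2.0) / math.sqrt(0.0) for v in batch]
except ZeroDivisionError:
    pass
```

In floating-point tensor code the same mistake typically yields `inf` or NaN rather than an exception, which then propagates through every later layer.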



  • DL0017 Reproducibility

    How to ensure the reproducibility of the deep learning experiments?

    Answer

    Reproducibility in deep learning is achieved by controlling randomness via fixed seeds and deterministic operations, maintaining strict code and dependency versioning, managing datasets carefully, and keeping comprehensive logs of all experiment settings. These practices ensure that experiments can be reliably repeated and validated, regardless of external factors.

    (1) Seed Control and Deterministic Operations:
    Set random seeds for all libraries (Python, NumPy, TensorFlow/PyTorch).
    Enable deterministic settings in your deep learning framework to reduce nondeterminism.
    (2) Code Versioning and Configuration Management:
    Use version control systems like Git.
    Maintain detailed configuration files (using YAML or JSON) that log hyperparameters and settings for each experiment.
    (3) Environment and Dependency Control:
    Use virtual environments (e.g., Conda) or containerize your projects with Docker.
    Freeze library versions to ensure consistency in the software environment.
    (4) Dataset Management:
    Fix train-test splits and document data preprocessing steps.
    Use versioned or static datasets to prevent unintentional changes over time.
    (5) Logging and Documentation:
    Log hardware details, random seeds, and experiment configurations.
    Utilize experiment tracking tools (like MLflow or Weights & Biases) to archive training runs and parameters.

    Below is an example illustrating how experiments can fail to be reproducible.
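As a minimal sketch (hypothetical toy "training run" in pure Python; `simulate_training` stands in for a real training loop), fixing the seed makes repeated runs identical, while omitting it lets each run draw fresh entropy:

```python
import random

def simulate_training(seed=None):
    """Toy 'training run': the final score depends only on random draws."""
    rng = random.Random(seed)
    return round(sum(rng.uniform(-1, 1) for _ in range(100)), 6)

# With a fixed seed, two runs agree exactly.
print(simulate_training(seed=42) == simulate_training(seed=42))  # True

# Without a seed, each run is seeded from system entropy,
# so the two results will almost always differ.
run_a, run_b = simulate_training(), simulate_training()
print(run_a == run_b)
```

Real frameworks add further sources of nondeterminism (GPU kernels, data-loader ordering), which is why the deterministic-operation settings in (1) are needed in addition to seeds.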


  • ML0047 Parameters

    What are the differences between parameters and hyperparameters?

    Answer

    Parameters are the values that a model learns from its training data, while hyperparameters are settings defined by the user that guide the training process and model architecture.

    Parameters:
    (1) Internal variables learned from data (e.g., weights and biases).
    (2) Adjusted during training using optimization algorithms.
    (3) Capture the model’s learned patterns and information.

    Hyperparameters:
    (1) External configurations set before training (e.g., learning rate, batch size, number of layers).
    (2) Remain fixed during training and are not updated by the learning process.
    (3) Influence how the model learns and its overall structure.
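The distinction can be made concrete with a hypothetical sketch: fitting y = 2x by gradient descent, where `w` is the parameter the model learns and `lr`/`epochs` are hyperparameters fixed by the user (all values here are illustrative):

```python
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

lr = 0.05        # hyperparameter: learning rate, set before training
epochs = 200     # hyperparameter: number of passes over the data

w = 0.0          # parameter: initialized, then updated during training
for _ in range(epochs):
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))  # converges near 2.0; lr and epochs never changed
```

Note that the optimizer only ever touches `w`; `lr` and `epochs` remain exactly as the user set them, which is the defining difference.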


  • DL0016 Learning Rate Warmup

    What is Learning Rate Warmup? What is the purpose of using Learning Rate Warmup?

    Answer

    Learning Rate Warmup is a training technique where the learning rate starts from a small value and gradually increases to a target (base) learning rate over the first few steps or epochs of training.

    Purpose of Using Learning Rate Warmup:
    (1) Stabilizes Early Training: At the beginning of training, weights are randomly initialized, making the model sensitive to large updates. A warmup gradually increases the learning rate, preventing unstable behavior.
    (2) Allows Optimizers to Adapt: Optimizers like Adam and AdamW rely on gradient statistics that can be unstable at the start. Warmup allows these optimizers to accumulate more accurate estimates before using a high learning rate.
    (3) Enables Large Batch Training: Mitigates issues that can arise when combining a large batch size with a high initial learning rate.

    Below is an example using Learning Rate Warmup followed by Cosine Decay.
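Such a schedule can be sketched as a plain function of the step count (the function name and the specific values `base_lr`, `warmup_steps`, and `total_steps` below are illustrative assumptions):

```python
import math

def lr_schedule(step, base_lr=1e-3, warmup_steps=100, total_steps=1000):
    """Linear warmup from near zero to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        # warmup phase: learning rate grows linearly with the step
        return base_lr * (step + 1) / warmup_steps
    # decay phase: cosine curve from base_lr down to zero
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_schedule(0))    # small initial learning rate
print(lr_schedule(99))   # reaches base_lr at the end of warmup
print(lr_schedule(999))  # decayed almost to zero
```

Most frameworks ship equivalents (e.g., schedulers in PyTorch or Keras), but the underlying arithmetic is just this.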


  • ML0046 Forward Propagation

    Please explain the process of Forward Propagation.

    Answer

    Forward propagation is the process by which a neural network transforms an input into a prediction. It involves systematically passing the input data through each layer of the network. At each neuron, a weighted sum of the inputs from the previous layer is calculated, and then a nonlinear activation function is applied. This process is repeated layer by layer until the data reaches the output layer, where the final prediction is generated.

    Here is the process of Forward Propagation:
    (1) Input Layer: The network receives the raw input data.
    (2) Layer-wise Processing:
    Linear Combination: Each neuron calculates a weighted sum of its inputs and adds a bias.
    Non-linear Activation: The resulting value is passed through an activation function (e.g., ReLU, sigmoid, tanh) to introduce non-linearity.
    (3) Propagation Through Layers: The output from one layer becomes the input for the next layer, progressing through all hidden layers.
    (4) Output Generation: The final layer applies a function (like softmax for classification or a linear function for regression) to produce the network’s prediction.
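The four steps above can be sketched end-to-end for a tiny, hypothetical 2-3-1 network with hand-picked weights (all numbers are illustrative):

```python
import math

def relu(z):
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One layer: weighted sum of inputs plus bias, then an activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

x = [1.0, 2.0]                                # (1) input layer
hidden = dense(x,                             # (2)-(3) hidden layer
               [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]],
               [0.0, 0.1, -0.1], relu)
output = dense(hidden,                        # (4) output layer (sigmoid)
               [[1.0, -1.0, 0.5]], [0.2],
               lambda z: 1 / (1 + math.exp(-z)))
print(output)
```

Each `dense` call is exactly the "linear combination + non-linear activation" step, and the hidden layer's output becomes the next layer's input.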



  • ML0045 Multi-Layer Perceptron

    What is a Multi-Layer Perceptron (MLP)? How does it overcome Perceptron limitations?

    Answer

    A Multi-Layer Perceptron (MLP) is a feedforward neural network with one or more hidden layers between the input and output layers. Hidden layers in MLP use non-linear activation functions (like ReLU, sigmoid, or tanh) to model complex relationships. MLP can be used for classification, regression, and function approximation. MLP is trained using backpropagation, which adjusts the weights to minimize errors.

    Overcoming Limitations:
    (1) Learns Non-linear Boundaries: Unlike a single-layer perceptron, which can only solve linearly separable problems, an MLP can learn non-linear decision boundaries, handling problems such as XOR.
    (2) Universal Approximation: With enough neurons and layers, an MLP can approximate any continuous function, making it a powerful model for various applications.
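The XOR claim can be verified directly with a hand-crafted, hypothetical 2-2-1 MLP using step activations (the weights below are chosen by hand for illustration, not learned):

```python
def step(z):
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    """A 2-2-1 MLP computing XOR, which no single perceptron can represent."""
    h_or  = step(x1 + x2 - 0.5)       # hidden unit 1 fires on OR
    h_and = step(x1 + x2 - 1.5)       # hidden unit 2 fires on AND
    return step(h_or - h_and - 0.5)   # output: OR but not AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))
```

The hidden layer re-represents the inputs so that the final unit only needs a linear cut, which is precisely how depth overcomes the perceptron's limitation.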

    The plot below illustrates an example of a Multi-Layer Perceptron (MLP) applied to a classification problem.


  • ML0044 Perceptron

    Describe the Perceptron and its limitations.

    Answer

    The perceptron is a simple linear classifier that computes a weighted sum of input features, adds a bias, and applies a step function to produce a binary decision. The perceptron works well only for data sets that are linearly separable, where a straight line (or hyperplane in higher dimensions) can separate the classes.

    The perceptron output can be calculated by
     y = f(w^T x + b)
    Where:
     y is the predicted output (0 or 1)
     w is the weight vector
     x is the input vector
     b is the bias term
     f(\cdot) is the activation function (typically a step function)

    Below is a perceptron diagram.

    Limitations of the perceptron:
    (1) Linearly Separable Data Only: Cannot solve problems like XOR, which are not linearly separable.
    (2) Single-Layer Only: Cannot model complex or non-linear patterns.
    (3) No Probabilistic Output: Outputs only binary values, not confidence or probabilities.
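The formula y = f(w^T x + b) translates directly into code. Below is a minimal sketch using hand-picked (not learned) weights that implement logical AND, a linearly separable function:

```python
def perceptron(x, w, b):
    """y = f(w^T x + b) with a step activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Weights implementing logical AND: fires only when both inputs are 1.
w, b = [1.0, 1.0], -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w, b))
# No choice of w and b reproduces XOR: limitation (1) above,
# since XOR's positive cases cannot be separated by a single line.
```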


  • DL0015 Cold Start

    What is a “cold start” problem in deep learning?

    Answer

    The cold start problem is the difficulty of making reliable predictions for new entities (such as users, items, or contexts) lacking historical data.
    Many deep learning models, especially in recommendation systems, rely on abundant past data to learn meaningful patterns. When a new user or item is introduced, the model struggles because it doesn’t have enough information to produce accurate predictions.

    Mitigation Strategies for the Cold Start Problem:
    (1) Transfer Learning / Pretrained Models: Use embeddings or models pre-trained on similar tasks to provide a starting point.
    (2) Hybrid Recommendation Models: Combine collaborative filtering (CF) and content-based methods.
    (3) Active Learning / User Onboarding: Actively gather more data for new entities through user interactions.
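Strategy (2) can be sketched as a simple fallback rule (everything here is a hypothetical illustration: the `score` function, the interaction threshold, and the use of Jaccard tag similarity as the content-based signal):

```python
def score(item, user_history, content_sim):
    """Use collaborative filtering when an item has enough interactions;
    otherwise fall back to a content-based similarity score."""
    MIN_INTERACTIONS = 5
    if item["num_interactions"] >= MIN_INTERACTIONS:
        return item["cf_score"]              # warm item: trust CF
    return content_sim(item, user_history)   # cold item: use content

def jaccard(item, history_tags):
    tags = item["tags"]
    return len(tags & history_tags) / len(tags | history_tags)

new_item = {"num_interactions": 0, "cf_score": 0.0,
            "tags": {"sci-fi", "action"}}
history_tags = {"sci-fi", "drama"}

print(score(new_item, history_tags, jaccard))  # content-based fallback
```

The point is only the branching logic: cold entities are routed to a signal that does not require interaction history.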


  • ML0043 Feature Scaling

    Walk me through the rationale behind Feature Scaling in machine learning.

    Answer

    Feature scaling is a fundamental data preprocessing step that normalizes or standardizes the range of numerical features. It is essential for many machine learning algorithms to ensure that all features contribute equally to the model, leading to faster convergence, improved accuracy, and better overall model performance, especially for algorithms sensitive to the magnitude of feature values or those based on distance calculations.

    Definition: Process of normalizing or standardizing input features so they’re on a similar scale.
    Why Needed: Many ML models (e.g., SVM, KNN) are sensitive to feature magnitude. Prevents dominant features from overpowering others due to scale.

    Common Methods:
    Min-Max Scaling: Scales features to a range (usually [0, 1]).
     X_{\text{normalized}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
    Where:
     X represents the original value of the feature.
     X_{\text{min}} represents the minimum value of the feature in the dataset.
     X_{\text{max}} represents the maximum value of the feature in the dataset.

    Standardization (Z-score Normalization, centers data to mean 0, standard deviation to 1):
     X_{\text{standardized}} = \frac{X - \mu}{\sigma}
    Where:
     X represents the original value of the feature.
     \mu represents the mean of the feature in the dataset.
     \sigma represents the standard deviation of the feature in the dataset.
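Both formulas are a few lines of code. A minimal sketch (the sample feature values are illustrative; this uses population standard deviation, matching the formula above):

```python
def min_max_scale(xs):
    """Map values into [0, 1] using the feature's min and max."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Center to mean 0 and scale to standard deviation 1."""
    mu = sum(xs) / len(xs)
    sigma = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mu) / sigma for x in xs]

feature = [10.0, 20.0, 30.0, 40.0, 50.0]
print(min_max_scale(feature))   # values lie in [0, 1]
print(standardize(feature))     # mean 0, standard deviation 1
```

In practice one fits the scaler's statistics (min/max or mean/std) on the training set only and reuses them on the test set, to avoid data leakage.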

    Below is an example plot of the original, min-max scaled, and standardized data.



  • ML0042 Early Stopping

    What is Early Stopping? How is it implemented?

    Answer

    Early Stopping is a regularization technique used to halt training when a model’s performance on a validation set stops improving, thus avoiding overfitting. It monitors metrics like validation loss or validation accuracy and stops after a defined number of stagnant epochs (patience). This ensures efficient training and better generalization.

    Implementation:
    Split data into training and validation sets.
    After each epoch, evaluate on the validation set.
    If performance improves, save the model and reset the patience counter.
    If there is no improvement, increment the counter; if the counter reaches the patience limit, stop training.
    After stopping, restore the best weights by loading the model weights from the epoch that yielded the best validation performance.
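The steps above can be condensed into a small sketch that operates on a list of per-epoch validation losses (the function, the patience value, and the loss sequence are all hypothetical):

```python
def early_stop(val_losses, patience=2):
    """Return (epoch training stopped at, epoch with the best loss)."""
    best_loss, best_epoch, counter = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # improvement: save the "checkpoint" and reset the counter
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1                 # no improvement this epoch
            if counter >= patience:
                return epoch, best_epoch # stop; restore best weights
    return len(val_losses) - 1, best_epoch

stopped_at, best = early_stop([0.9, 0.7, 0.6, 0.65, 0.66, 0.5])
print(stopped_at, best)
```

Note the trade-off visible in the example: with patience 2, training stops before the later dip to 0.5 is ever seen, which is why patience is itself a hyperparameter to tune.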

    Below is an example loss plot when using early stopping.

