Author: admin

  • DL0004 Small Kernels

    What are the key advantages of using small convolutional kernels, such as 3×3, over utilizing a few larger kernels in deep learning architectures?

    Answer

    Using small convolutional kernels instead of a few larger kernels offers several significant advantages in deep learning architectures:

(1) Deeper Networks & More Non-Linearity: Stacking multiple 3×3 layers (e.g., three 3×3 layers, which together cover the receptive field of a single 7×7 kernel) yields a deeper network with more non-linear activation functions than a single large kernel.
(2) Reduced Parameters: Multiple small kernels can achieve the same receptive field as a larger one, but with fewer parameters.
Example: Two stacked 3×3 layers cover the same 5×5 receptive field as a single 5×5 layer, but (assuming constant channel width) cost 18 \cdot C_{in} \cdot C_{out} parameters versus 25 \cdot C_{in} \cdot C_{out}.

    (3) Computational Efficiency: Fewer parameters in smaller kernels generally lead to lower computation costs during training and inference.
    (4) Gradual Receptive Field Expansion: Successive 3×3 convolutions progressively build a larger receptive field while maintaining fine detail. (3×3 filters focus on local detail capture with pixel neighborhoods, ideal for textures or edges.)
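The parameter comparison in point (2) can be checked directly. A minimal sketch, assuming equal input/output channel width for every layer and ignoring biases:

```python
# Parameter-count comparison: stacked 3x3 kernels vs. one larger kernel.
# Assumes C_in == C_out == C for every layer and ignores bias terms.

def conv_params(k, c_in, c_out):
    """Weight count of a single k x k convolution layer."""
    return k * k * c_in * c_out

C = 64
two_3x3 = 2 * conv_params(3, C, C)   # receptive field 5x5
one_5x5 = conv_params(5, C, C)       # same receptive field

print(two_3x3, one_5x5)  # 18*C*C vs 25*C*C -- the stack is cheaper
```

With C = 64 the stacked pair uses 73,728 weights versus 102,400 for the single 5×5 layer, matching the 18 vs 25 ratio above.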


  • DL0003 1×1 Convolution

    What are the benefits of using 1×1 convolutional layers in deep learning architectures?

    Answer

    A 1×1 convolution, also known as a pointwise convolution, is a convolutional operation with a kernel size of 1×1; it plays several crucial roles in deep learning architectures.
    (1) Dimensionality control: 1×1 convolution can reduce or expand the number of feature maps, trading off representational capacity and computational cost.

    For example, Bottleneck designs: In architectures like ResNet’s bottleneck block, a 1×1 conv first reduces channels (e.g., 256→64), then a 3×3 conv processes those, and finally another 1×1 conv expands back (64→256) to restore capacity while keeping compute manageable.

    (2) Increased Network Depth with Controlled Cost: Allows for the design of deeper networks by reducing channel dimensionality before computationally expensive spatial convolutions.
    (3) Cross-Channel Feature Fusion: Enables interaction and combination of information across different feature channels at the same spatial location.
    (4) Non-linear mixing: When followed by activations (ReLU, etc.), 1×1 convolutions introduce non-linear channel mixing that enhances model expressiveness.
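Point (3) is easy to see in code: at a single spatial location, a 1×1 convolution is just a matrix-vector product over the channel dimension. A toy sketch with plain lists (a real layer would of course use a tensor library):

```python
# A 1x1 convolution at one pixel is a matrix-vector product over channels:
# out[c_out] = sum over c_in of W[c_out][c_in] * x[c_in].

def pointwise_conv(x, W):
    """x: C_in channel values at one pixel; W: C_out x C_in weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

x = [1.0, 2.0, 3.0]            # C_in = 3 channels at one spatial location
W = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5]]          # C_out = 2: channel reduction 3 -> 2
print(pointwise_conv(x, W))    # [1.0, 3.0]
```

No spatial neighborhood is touched, which is exactly why a 1×1 layer fuses information across channels while leaving spatial resolution unchanged.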


  • DL0002 All Ones Init

    What are the potential consequences of initializing all weights to one in a deep learning model?

    Answer

    Below are the key consequences of initializing all weights in a deep-learning model to one (a constant non-zero value), illustrating why random, scaled initializations (e.g., Xavier/He) are essential.
    (1) Symmetry Problem: Neurons receive identical gradients, causing them to learn the same features rather than developing distinct representations.
    (2) Limited Representational Capacity: The network cannot capture complex, varied patterns because all neurons behave identically.
    (3) Slow/No Convergence: The lack of representational capacity further makes it difficult for the model to update toward the optimal weights. (Figure: training-loss comparison of all-ones vs. random initialization.)

    (4) Activation Saturation: Can push neurons into saturated regions of activation functions (e.g., sigmoid, tanh), leading to vanishing gradients.
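The symmetry problem in point (1) can be demonstrated in a few lines. A minimal sketch with a tiny 2-2-1 linear network and one manual backpropagation step (squared loss; the specific numbers are illustrative):

```python
# Symmetry demo: with all-ones init, two hidden units compute identical
# outputs and receive identical gradients, so they can never differentiate.

x = [0.5, -1.0]                    # a single input sample
W1 = [[1.0, 1.0], [1.0, 1.0]]      # both hidden units start identical
W2 = [1.0, 1.0]                    # output weights, also all ones
y_true = 1.0

h = [sum(w * xi for w, xi in zip(row, x)) for row in W1]  # hidden outputs
y = sum(w * hi for w, hi in zip(W2, h))                   # prediction
d_y = 2 * (y - y_true)                                    # dLoss/dy

# Gradient for each hidden unit's weights: dLoss/dW1[i][j] = d_y * W2[i] * x[j]
grad_W1 = [[d_y * W2[i] * xj for xj in x] for i in range(2)]
print(grad_W1[0] == grad_W1[1])  # True: identical gradients, identical updates
```

Because the two rows of the gradient are equal, both hidden units stay equal after every update, so the network behaves as if it had a single hidden unit.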


  • DL0001 Residual Connection

    Why are residual connections important in deep neural networks?

    Answer

    Residual connections, also known as skip connections, are vital in deep neural networks primarily because they tackle the infamous vanishing gradient problem and help with the related issue of network degradation as the network depth increases.

    A residual connection is often expressed by the following equation:
     y = F(x) + x
    Where:
     F(x) represents the residual mapping that the network learns (i.e., what needs to be added to the input  x to achieve the desired output).
     x is the input to the residual block.

    (1) Tackle vanishing gradient problem:
    Residual connections create a direct shortcut for gradient flow by adding an identity mapping to the learned transformation. Even if the gradient through the learned branch F(x) is small, the identity path contributes a direct gradient component during backpropagation, so useful gradients still reach early layers and very deep networks remain trainable.

    (2) Address network degradation:
    Residual connections mitigate the degradation problem often seen in deep networks. Without these connections, simply stacking more layers can result in higher training errors, as the network struggles to update its weights effectively. With residual connections, any layer that doesn’t contribute useful information can effectively learn to output zeros in the residual branch, letting the network default to an identity mapping.
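The identity-fallback behavior in point (2) can be sketched directly from the equation y = F(x) + x. A minimal illustration (plain lists stand in for tensors):

```python
# Residual block sketch: y = F(x) + x, where F is the learned residual branch.
# If F's output collapses to zero, the block reduces to the identity map.

def residual_block(x, F):
    """Element-wise skip connection around an arbitrary transform F."""
    fx = F(x)
    return [a + b for a, b in zip(fx, x)]

x = [1.0, 2.0, 3.0]
zero_branch = lambda v: [0.0] * len(v)   # a layer that learned "nothing useful"
print(residual_block(x, zero_branch))    # [1.0, 2.0, 3.0] -- pure identity
```

This is why adding such a block can never make it harder to represent the identity mapping, which is exactly the failure mode of plain stacked layers.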


  • ML0031 Linear Regression

    What are the advantages and disadvantages of linear regression?

    Answer

    Linear regression aims to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.

    \displaystyle h_\theta(x) = \theta_0 + \sum_{j=1}^{p} \theta_j x_j
    Where:
    h_\theta(x) represents the hypothesis (predicted value) for input feature vector x.
    \theta_0 is the bias (intercept) parameter, shifting the prediction up or down independent of features.
    \theta_j are the weight parameters multiplying each feature.
    x_j denotes the j‑th feature of the input vector x.
    p is the total number of features (excluding the bias) used in the model.
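The hypothesis above can be evaluated directly. A minimal sketch with illustrative parameter values:

```python
# Linear regression hypothesis: h(x) = theta_0 + sum_j theta_j * x_j.
# theta[0] is the bias; theta[1:] are the feature weights (toy values).

def h(theta, x):
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))

theta = [1.0, 2.0, -0.5]       # bias, then two feature weights
print(h(theta, [3.0, 4.0]))    # 1 + 2*3 - 0.5*4 = 5.0
```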

    Advantages:
    (1) Simplicity & Interpretability: Linear regression is easy to understand and implement. The coefficients of the model directly indicate the strength and direction of the relationship between the features and the target variable, making it highly interpretable.
    (2) Computational Efficiency: Its low computational cost makes linear regression fast to train, even on large datasets.
    (3) Effective for Linearly Separable Data: It performs well when the relationship between the independent and dependent variables is approximately linear.

    Disadvantages:
    (1) Assumes Linearity: The primary limitation is the assumption that the relationship between the variables is linear. It will perform poorly if the underlying relationship is nonlinear.
    (2) Sensitivity to Outliers: Extreme values can disproportionately affect the model, distorting the results.
    (3) Multicollinearity Issues: When predictors are highly correlated, it becomes difficult to isolate individual effects, leading to unreliable coefficient estimates.
    (4) Potential for Underfitting: The simplicity of the model may fail to capture the nuances and complexities of more intricate datasets.



  • ML0030 Sigmoid

    What are the advantages and disadvantages of using a sigmoid activation function?

    Answer

    The sigmoid activation function transforms input values into a range between 0 and 1, making it useful in various applications like binary classification.
    \mbox{Sigmoid}(x) = \frac{1}{1 + e^{-x}}

    Advantages:
    (1) Smooth, Bounded Gradient: The sigmoid’s S-shape yields a continuous, bounded derivative, avoiding abrupt changes during backpropagation and aiding stable training in shallow networks.
    (2) Probability interpretation: Since the output is between 0 and 1, it can be useful for problems where predictions need to represent probabilities.

    Disadvantages:
    (1) Vanishing gradient problem: For very large or small inputs, the gradient becomes almost zero, slowing down training in deep networks.
    (2) Not zero-centered: The outputs are always positive, which can lead to inefficient weight updates and slower convergence.
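Both the probability interpretation and the vanishing-gradient issue are visible numerically. A minimal sketch using the identity σ′(x) = σ(x)(1 − σ(x)):

```python
import math

# Sigmoid and its derivative. The gradient peaks at 0.25 (at x = 0)
# and is nearly zero for large |x| -- the saturation regions.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))        # 0.5
print(sigmoid_grad(0.0))   # 0.25 -- the maximum possible gradient
print(sigmoid_grad(10.0))  # ~4.5e-05 -- effectively vanished
```

Since every layer's gradient is multiplied through the chain rule, repeatedly scaling by at most 0.25 is what makes sigmoid problematic in deep stacks.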


  • ML0029 Tanh

    What are the advantages and disadvantages of using the tanh activation function?

    Answer

    In machine learning, the hyperbolic tangent (tanh) activation function is defined as
    \mbox{tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
    This function transforms input values into a range between -1 and 1, helping with faster convergence in neural networks.

    Advantages:
    (1) Zero-centered outputs: Unlike sigmoid, which outputs values between 0 and 1, tanh produces values between -1 and 1, making optimization easier and reducing bias in gradient updates.
    (2) Smooth and Differentiable: The function is infinitely differentiable, supporting stable gradient‑based methods.

    Disadvantages:
    (1) Vanishing gradient problem: For very large or very small input values, the derivative of tanh approaches zero, leading to slow weight updates and potentially hindering deep network training.
    (2) Computationally expensive: Compared to ReLU, tanh requires evaluating exponential functions, which can slow down training and inference.
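Both properties can be checked numerically via the derivative identity tanh′(x) = 1 − tanh(x)². A minimal sketch:

```python
import math

# tanh output is zero-centered, and its gradient 1 - tanh(x)^2 peaks at 1.0
# (vs. sigmoid's 0.25) but still vanishes for large |x|.

def tanh_grad(x):
    t = math.tanh(x)
    return 1.0 - t * t

print(math.tanh(0.0))   # 0.0 -- zero-centered output
print(tanh_grad(0.0))   # 1.0 -- maximum gradient
print(tanh_grad(5.0))   # ~1.8e-04 -- vanishing in the saturated region
```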



  • ML0028 Softmax

    What is the Softmax activation function, and what is its purpose?

    Answer

    Softmax is an activation function typically used in the output layer of a neural network for multi-class classification problems. Its purpose is to convert a vector of raw scores (logits) into a probability distribution over the possible output classes. The output of Softmax is a vector where each element represents the probability of the input belonging to a specific class, and the sum of these probabilities is always 1.
    \mbox{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
    Where:
     z_i represents the raw score (also known as a “logit”) for the i-th class.
     K represents the total number of classes in the classification problem.

    The combination of the softmax function with the cross-entropy loss function is standard for multi-class classification problems. The softmax function provides a probability distribution over classes, and the cross-entropy loss measures how well this predicted distribution aligns with the true distribution (typically a one-hot encoded vector).
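The formula above translates directly to code. One practical detail worth showing: implementations conventionally subtract max(z) before exponentiating, which leaves the result unchanged (the factor cancels in the ratio) but prevents overflow for large logits.

```python
import math

# Numerically stable softmax: subtracting max(z) changes nothing
# mathematically but keeps exp() from overflowing.

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # roughly [0.659, 0.242, 0.099]
print(sum(probs))  # 1.0 -- a valid probability distribution
```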


  • ML0027 Leaky ReLU

    What are the benefits of the Leaky ReLU activation function?

    Answer

    Leaky ReLU modifies the standard ReLU by allowing a small, non-zero gradient for negative inputs. Its formula is typically written as:
    {\large \text{Leaky ReLU}(x) = \begin{cases} x & \text{if } x \ge 0 \\ \alpha x & \text{if } x < 0 \end{cases}}
    where \alpha is a small positive slope (commonly 0.01).

    Advantages of Leaky ReLU:
    1. Addresses the dying ReLU problem: By having a small non-zero slope for negative inputs, Leaky ReLU allows a small gradient to flow even when the neuron is not active in the positive region. This prevents neurons from getting stuck in a permanently inactive state and potentially helps them recover during training.
    2. Retains the benefits of ReLU for positive inputs: Maintains the linearity and non-saturation for positive values, contributing to efficient computation and gradient propagation.  
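The piecewise definition is a one-liner in code. A minimal sketch using α = 0.01, a common default rather than part of the definition:

```python
# Leaky ReLU: identity for x >= 0, small slope alpha for x < 0,
# so negative inputs still produce a non-zero output and gradient.

def leaky_relu(x, alpha=0.01):
    return x if x >= 0 else alpha * x

print(leaky_relu(3.0))    # 3.0   -- same as ReLU for positive inputs
print(leaky_relu(-2.0))   # -0.02 -- small but non-zero, so the unit can recover
```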



  • ML0026 ReLU

    What are the benefits and limitations of the ReLU activation function?

    Answer

    ReLU offers substantial benefits in terms of computational efficiency, gradient propagation, and sparsity, which have made it a popular choice for activation functions in deep learning.
    {\large \text{ReLU}(x) = \max(0, x)}

    Advantages of ReLU:
    1. Mitigation of the Vanishing Gradient Problem: In the positive region (x>0), ReLU has a constant gradient of 1. This helps to alleviate the vanishing gradient problem that plagues sigmoid and tanh functions, especially in deep networks. A constant gradient allows for more effective backpropagation of the error signal to earlier layers.
    2. Sparse Activation:
    By outputting zero for all negative input values, ReLU naturally induces sparsity in the network. This means that, at any given time, only a subset of neurons are active. Sparse activations can lead to more efficient representations and can help the network learn more robust features.
    3. Computational Efficiency:
    ReLU is computationally simple, requiring only a threshold operation, which accelerates both training and inference processes compared to functions like sigmoid or tanh that involve more complex calculations.

    Drawbacks of ReLU:
    1. Dying ReLU Problem:
    Neurons can become inactive if they consistently receive negative inputs, leading them to output zero and potentially never recover, thus reducing the model’s capacity.
    2. Unbounded Output:
    The unbounded nature of ReLU’s positive outputs can lead to large activation values, potentially causing issues like exploding gradients if not properly managed.
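The advantages and drawbacks above all follow from ReLU's simple piecewise form. A minimal sketch of the function and its gradient:

```python
# ReLU and its gradient: constant gradient 1 for x > 0 (no vanishing),
# zero output AND zero gradient for x < 0 (sparsity, but also dying-ReLU risk).

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

print(relu(2.5), relu_grad(2.5))    # 2.5 1.0 -- active unit, full gradient
print(relu(-1.0), relu_grad(-1.0))  # 0.0 0.0 -- inactive unit, no gradient
```

The zero-gradient branch is both the source of sparse activation and the mechanism behind the dying-ReLU problem: a unit whose inputs stay negative receives no gradient signal to escape.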

