Tag: Data

  • ML0043 Feature Scaling

    Walk me through the rationale behind Feature Scaling in machine learning.

    Answer

    Feature scaling is a fundamental data preprocessing step that normalizes or standardizes the range of numerical features. Many machine learning algorithms are sensitive to the magnitude of feature values, particularly those based on distance calculations or gradient descent; scaling ensures all features contribute comparably, which speeds convergence and often improves model performance.

    Definition: Process of normalizing or standardizing input features so they’re on a similar scale.
    Why Needed: Many ML models (e.g., SVM, KNN) are sensitive to feature magnitude. Prevents dominant features from overpowering others due to scale.

    Common Methods:
    Min-Max Scaling: Scales features to a range (usually [0, 1]).
    X_{\text{normalized}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
    Where:
     X represents the original value of the feature.
     X_{\text{min}} represents the minimum value of the feature in the dataset.
     X_{\text{max}} represents the maximum value of the feature in the dataset.

    Standardization (Z-score Normalization): centers data to mean 0 and scales it to standard deviation 1.
    X_{\text{standardized}} = \frac{X - \mu}{\sigma}
    Where:
     X represents the original value of the feature.
     \mu represents the mean of the feature in the dataset.
     \sigma represents the standard deviation of the feature in the dataset.
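As a quick illustration, both formulas can be applied per feature (per column) with plain NumPy; the small sample matrix below is made up for demonstration:

```python
import numpy as np

# Hypothetical feature matrix: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max scaling to [0, 1], computed per feature (column)
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: each feature ends up with mean 0 and std 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

In practice libraries such as scikit-learn provide `MinMaxScaler` and `StandardScaler` for the same transformations, with the added benefit of fitting statistics on the training set only.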

    [Figure: example plots of the same data in its original, min-max scaled, and standardized forms]



  • ML0038 Validation and Test


    What are the key purposes of using both a validation and a test set when building machine learning models?

    Answer

    A validation set keeps hyperparameter tuning and model selection separate from final evaluation, enabling informed tuning decisions and overfitting control, while a held-out test set provides a completely unbiased, final estimate of how the model will perform on real-world, unseen data.

    Validation Set:
    (1) Tune Hyperparameters: Optimize model settings without test set bias.
    (2) Select Best Model: Compare different models objectively during development.
    (3) Prevent Overfitting (During Training): Monitor performance on unseen data to stop training early if needed.

    Test Set:
    (1) Final, Unbiased Evaluation: Assess the truly generalized performance of the final model.
    (2) Simulate Real-World Performance: Estimate how the model will perform on completely new data.
    (3) Avoid Data Leakage: Ensure no information from the test set influences model building.
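A minimal NumPy sketch of this workflow, assuming closed-form ridge regression as the model and MSE as the metric (all data here is synthetic): the validation set picks the hyperparameter, and the test set is touched exactly once at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=300)

# 60/20/20 split: train for fitting, validation for tuning, test for the final report
X_tr, y_tr = X[:180], y[:180]
X_val, y_val = X[180:240], y[180:240]
X_te, y_te = X[240:], y[240:]

def ridge_fit(X, y, alpha):
    # Closed-form ridge: w = (X^T X + alpha I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# Tune the regularization strength on the validation set only
alphas = [0.01, 0.1, 1.0, 10.0]
best_alpha = min(alphas, key=lambda a: mse(ridge_fit(X_tr, y_tr, a), X_val, y_val))

# Final, unbiased evaluation: the test set is used exactly once
w = ridge_fit(X_tr, y_tr, best_alpha)
test_mse = mse(w, X_te, y_te)
```

Because the test set never influenced the choice of `best_alpha`, `test_mse` is an honest estimate of generalization error.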


  • ML0020 Data Split

    How do you split a dataset?

    Answer

    A good data split in machine learning ensures that the model is trained, validated, and tested effectively to generalize well on unseen data.
    The typical approach involves dividing the dataset into three sets: Training Set, Validation Set, and Test Set.

    Training Set: Used to train the machine learning model. The model learns patterns and relationships in the data from this set.
    Validation Set: Used to tune hyperparameters of the model and evaluate its performance during training. This helps prevent overfitting to the training data and allows you to select the best model configuration.  
    Test Set: Used for a final, unbiased evaluation of the trained model’s performance on completely unseen data. This provides an estimate of how well the model will generalize to new, real-world data.

    Stratification for Imbalanced Data: For imbalanced datasets, consider using stratified splits to maintain the same proportion of classes across the training and test sets.
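The stratified split described above can be sketched in plain NumPy; the `stratified_split` helper and the 90/10 toy labels below are illustrative, not a standard API (scikit-learn's `train_test_split` with `stratify=y` does this in practice):

```python
import numpy as np

def stratified_split(y, test_frac=0.2, seed=0):
    """Return train/test index arrays that preserve per-class proportions."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)   # all samples of this class
        rng.shuffle(idx)
        n_test = int(round(test_frac * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

y = np.array([0] * 90 + [1] * 10)  # imbalanced: 90% class 0, 10% class 1
train_idx, test_idx = stratified_split(y)
# Both splits keep roughly the original 90/10 class ratio
```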


  • ML0019 Imbalanced Data

    How do you handle imbalanced data in machine learning?

    Answer

    Handling imbalanced data in machine learning involves addressing scenarios where one class significantly outnumbers the other, which can skew model performance. Here are common techniques:

    Dataset Resampling:
    Oversampling: Increase the minority class samples (e.g., using SMOTE or ADASYN to generate synthetic data points).
    Undersampling: Reduce the majority class samples to balance the dataset.

    Data Augmentation:
    Create synthetic data for the minority class with data augmentation techniques.

    Class Weights Adjustment:
    Assign higher weights to the minority class during training to penalize misclassifications more heavily.

    Metrics Selection:
    Use evaluation metrics like Precision, Recall, F1 Score, or AUC-ROC rather than accuracy.
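The resampling and class-weight ideas can be sketched with NumPy alone; random oversampling and the inverse-frequency "balanced" weight heuristic shown here are simple stand-ins for library tools such as SMOTE or scikit-learn's `class_weight='balanced'`:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 95 + [1] * 5)     # heavily imbalanced toy labels
X = rng.normal(size=(100, 3))

# Random oversampling: draw minority samples with replacement
# until both classes are the same size
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])
X_bal, y_bal = X[idx], y[idx]

# Alternative: keep the data as-is and weight classes inversely
# to their frequency ("balanced" heuristic)
counts = np.bincount(y)
weights = len(y) / (len(counts) * counts)  # rare class gets the larger weight
```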


  • ML0018 Data Normalization

    Why is data normalization used in Machine Learning?

    Answer

    Data normalization is the process of scaling data to fit within a specific range or distribution, often between 0 and 1 or with a mean of 0 and standard deviation of 1. It’s used in machine learning and statistical modeling to ensure that features contribute equally to the model’s learning process.


  • ML0017 Data Augmentation

    What are the common data augmentation techniques?

    Answer

    Data augmentation refers to techniques used to increase the diversity and size of a training dataset by creating modified versions of the existing data. It’s especially popular in applications like computer vision and natural language processing, where collecting large datasets can be expensive or time-consuming.

    Common Techniques:
    Computer Vision:
    Geometric Transformations: Rotate, flip, crop, or scale images.
    Color Adjustments: Change brightness, contrast, saturation, or apply color jittering.
    Noise Injection: Add random noise or blur to images.
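A minimal NumPy sketch of the image techniques above, using a random array as a stand-in for a real RGB image:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))  # hypothetical RGB image with values in [0, 1]

# Geometric transformation: horizontal flip (reverse the width axis)
flipped = img[:, ::-1, :]

# Color adjustment: increase brightness, clipped back into [0, 1]
brighter = np.clip(img * 1.2, 0.0, 1.0)

# Noise injection: add small Gaussian noise
noisy = np.clip(img + rng.normal(scale=0.05, size=img.shape), 0.0, 1.0)
```

Each variant keeps the original label, effectively multiplying the size of the training set; frameworks such as torchvision or Albumentations offer the same transformations as composable pipelines.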

    Natural Language Processing:
    Synonym Replacement: Replace words with their synonyms.
    Back Translation: Translate text to another language and back.
    Random Insertion/Deletion: Add/remove words randomly.

    Tabular Data:
    SMOTE (Synthetic Minority Oversampling Technique): Generate synthetic data points for minority classes.
    Noise Injection: Add small random noise to numeric features.

