
  • MSD0007 Demand Forecasting System for Retailer

    Design a demand forecasting system for a large retail company like Costco, Walmart, or Target. The system should predict future product demand across stores and time to support inventory planning, replenishment, and promotions.

    Answer

    The demand forecasting system ingests diverse data from sales, inventory, weather, and promotions and predicts product demand using ML approaches such as time series, tree-based, or deep learning models.
    The system features a scalable architecture with data pipelines, real-time processing, and integration with inventory management systems.
    Key benefits include reducing stockouts, optimizing supply chains, and improving accuracy through iterative model training.

    Problem Definition & Success Metrics:
    Define the forecast granularity (e.g., SKU-store-day), horizon (e.g., 2-week operational, 3-month tactical), and objective (e.g., minimize out-of-stocks and waste).
    Key success metrics would be Weighted Mean Absolute Percentage Error (WMAPE) for overall accuracy and forecast bias to detect systematic over/under-prediction.
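    To make these metrics concrete, here is a minimal sketch of how WMAPE and forecast bias might be computed, assuming NumPy arrays of actuals and forecasts; the numbers are purely illustrative.

    ```python
    import numpy as np

    def wmape(actual: np.ndarray, forecast: np.ndarray) -> float:
        """Weighted MAPE: total absolute error divided by total actual demand."""
        return np.abs(actual - forecast).sum() / np.abs(actual).sum()

    def forecast_bias(actual: np.ndarray, forecast: np.ndarray) -> float:
        """Positive values indicate systematic over-forecasting, negative under-forecasting."""
        return (forecast - actual).sum() / np.abs(actual).sum()

    # Illustrative weekly demand for one SKU-store series
    actual = np.array([120.0, 80.0, 150.0, 95.0])
    forecast = np.array([110.0, 90.0, 160.0, 100.0])
    print(f"WMAPE: {wmape(actual, forecast):.3f}, bias: {forecast_bias(actual, forecast):+.3f}")
    ```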

    Data Strategy & Feature Engineering:
    Integrate diverse data sources into a unified feature store:
    (1) Internal: Historical sales, product hierarchies, pricing, promotional calendars, inventory levels, and online search/click data.
    (2) External: Calendar events (holidays, paydays), weather, local events, competitor activity (scraped), and macroeconomic trends.

    System Architecture:
    (1) Data Ingestion Layer: Batch and real-time streams.
    (2) Processing & Feature Store: Clean, validate, and compute features.
    (3) Modeling Layer: A repository for multiple models, allowing experimentation.
    (4) Serving Layer: Exposes forecasts via APIs to downstream systems (replenishment, pricing).
    (5) Monitoring & Feedback: Tracks model performance, data drift, and incorporates actual sales as ground truth for retraining.



    Modeling Approach with Hierarchical Ensemble Strategy:

    Use a “Top-Down, Bottom-Up” approach. Forecast at the aggregate level (Category/Region) to capture macro trends, then reconcile these with granular SKU (Stock Keeping Unit)-level predictions to ensure total inventory alignment.
    (1) Base Layer (Interpretability): Implement Prophet or Exponential Smoothing for high-level aggregates. This captures clear seasonalities (holidays, paydays) in a way that is easily explainable to business stakeholders.
    (2) Granular Layer (The “Workhorse”): Use Global LightGBM or XGBoost models trained across entire product categories. This allows the model to learn shared patterns across similar items while efficiently handling categorical metadata like Store ID and Brand.
    (3) High-Volatility Layer (Deep Learning): Deploy Temporal Fusion Transformers (TFT) or DeepAR specifically for high-volume or volatile items. These models capture complex, non-linear dependencies and multi-horizon temporal patterns that tree-based models might miss.
    (4) Probabilistic Forecasting: Instead of a single point estimate, generate Quantile Forecasts (e.g., P10, P50, P90). This provides a range of uncertainty, allowing the logistics team to make data-driven decisions on safety stock levels.
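    As an illustration of the granular and probabilistic layers, below is a minimal sketch of quantile forecasting with a global LightGBM model. It assumes LightGBM is installed; the feature frame, column names, and synthetic data are hypothetical stand-ins for the engineered features described above.

    ```python
    import lightgbm as lgb
    import numpy as np
    import pandas as pd

    # Hypothetical feature frame: one row per SKU-store-day with engineered features.
    rng = np.random.default_rng(0)
    X = pd.DataFrame({
        "store_id": rng.integers(0, 50, 5000),
        "sku_id": rng.integers(0, 200, 5000),
        "price": rng.uniform(1, 20, 5000),
        "promo_flag": rng.integers(0, 2, 5000),
        "lag_7_sales": rng.poisson(20, 5000),
    })
    y = X["lag_7_sales"] * rng.uniform(0.8, 1.2, 5000)  # synthetic target

    # One model per quantile (P10/P50/P90) gives an uncertainty band for safety-stock decisions.
    models = {
        q: lgb.LGBMRegressor(objective="quantile", alpha=q, n_estimators=200, learning_rate=0.05)
              .fit(X, y, categorical_feature=["store_id", "sku_id"])
        for q in (0.1, 0.5, 0.9)
    }
    preds = {q: m.predict(X.head(3)) for q, m in models.items()}
    print(preds)  # P10 / P50 / P90 forecasts for the first three rows
    ```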


  • MSD0006 Video Recommendation System

    How would you design a scalable and personalized video recommendation system for a platform like YouTube, Netflix, or TikTok that can recommend relevant videos in real time to billions of users?

    Answer

    A modern recommendation system uses a multi-stage pipeline to narrow down billions of videos to a top-20 list for a user in milliseconds.
    It typically consists of:
    (1) Candidate Generation (filtering down to hundreds),
    (2) Ranking (scoring those hundreds using deep learning), and
    (3) Re-ranking (applying business logic to ensure diversity, freshness, and safety, or to insert ads).

    The pipeline is: Data Logging -> Candidate Generation -> Ranking -> Re-ranking -> Serving.

    Data & Feature Preparation:
    (1) User: watch history, watch time, likes, skips, follows
    (2) Video: visual/audio/text embeddings, popularity, freshness
    (3) Context: time of day, device, network

    Candidate Generation (Retrieval):
    This stage quickly reduces billions of videos to a manageable set (~100-500) drawn from several sources; the candidates are merged, deduplicated, and passed to the ranking stage:
    (1) Collaborative Filtering (CF): Use matrix factorization or two-tower neural networks to create user and video embeddings, then retrieve videos similar to those the user has engaged with. This is the primary source (see the sketch after this list).
    (2) Content-Based: Use video title, description, audio, and frame embeddings to find videos similar to those the user likes.
    (3) Seed-Based (Graph): For a “Watch Next” scenario, use the current video as a seed and find co-watched videos (e.g., “users who watched X also watched Y”).
    (4) Trending/Global: Inject popular videos in the user’s region/language to promote freshness and viral content.
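    Referring to the collaborative-filtering source in (1), here is a minimal PyTorch sketch of a two-tower retrieval model trained with an in-batch softmax loss. The embedding sizes, ID vocabularies, and temperature are illustrative assumptions, not a production configuration.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoTower(nn.Module):
        """Minimal two-tower retrieval model: user and video towers share one embedding space."""
        def __init__(self, n_users: int, n_videos: int, dim: int = 64):
            super().__init__()
            self.user_emb = nn.Embedding(n_users, dim)
            self.video_emb = nn.Embedding(n_videos, dim)
            self.user_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            self.video_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, user_ids, video_ids):
            u = F.normalize(self.user_mlp(self.user_emb(user_ids)), dim=-1)
            v = F.normalize(self.video_mlp(self.video_emb(video_ids)), dim=-1)
            return u, v

    model = TwoTower(n_users=1000, n_videos=5000)
    users, videos = torch.randint(0, 1000, (32,)), torch.randint(0, 5000, (32,))
    u, v = model(users, videos)
    # In-batch softmax: each user's watched video is the positive, the rest of the batch are negatives.
    logits = u @ v.T / 0.07                        # temperature-scaled cosine similarities
    loss = F.cross_entropy(logits, torch.arange(32))
    # At serving time, video embeddings go into an ANN index (e.g., FAISS) for fast retrieval.
    ```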

    Ranking (Scoring):
    The goal is to precisely order the ~500 candidates from retrieval. This model can be more complex and slower.
    (1) Deep Neural Networks (DNNs): The industry standard. Hundreds of concatenated features (user, video, cross-features) are passed through multiple fully-connected layers to output a single score (e.g., predicted watch time), capturing complex, non-linear interactions.
    (2) Multi-Task Learning (MTL): A key advancement. Instead of predicting just one objective (e.g., click), a single model with shared hidden layers has multiple output heads (e.g., for click, watch time, like, share). This improves generalization by sharing signals between tasks and helps balance engagement with satisfaction (a sketch follows this list).
    (3) Sequence/Transformer Models: To model the user’s immediate session context, models can treat the sequence of recently watched videos as input (using RNNs or Transformers). This helps predict the “next best video” in the context of the current viewing mood.
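    As a sketch of the multi-task ranker in (2), the snippet below shows a shared-bottom network with separate heads for click, watch time, and like. The feature dimension, head set, and blending weights are hypothetical.

    ```python
    import torch
    import torch.nn as nn

    class MultiTaskRanker(nn.Module):
        """Shared-bottom MTL ranker: one trunk, separate heads for click, watch time, and like."""
        def __init__(self, feature_dim: int = 256):
            super().__init__()
            self.shared = nn.Sequential(
                nn.Linear(feature_dim, 512), nn.ReLU(),
                nn.Linear(512, 128), nn.ReLU(),
            )
            self.click_head = nn.Linear(128, 1)    # probability of click
            self.watch_head = nn.Linear(128, 1)    # expected watch time (regression)
            self.like_head = nn.Linear(128, 1)     # probability of like

        def forward(self, x):
            h = self.shared(x)
            return {
                "click": torch.sigmoid(self.click_head(h)),
                "watch_time": self.watch_head(h),
                "like": torch.sigmoid(self.like_head(h)),
            }

    ranker = MultiTaskRanker()
    scores = ranker(torch.randn(500, 256))         # ~500 candidates from retrieval
    # A weighted blend of the heads yields the final ranking score (weights are a product decision).
    final = 0.3 * scores["click"] + 0.5 * scores["watch_time"] + 0.2 * scores["like"]
    order = final.squeeze(-1).argsort(descending=True)
    ```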

    Re-ranking & Post-Processing:
    Final polish of the list. Apply business and quality constraints such as diversity, freshness, safety filters, and exploration strategies before producing the final feed.
    (1) Filters: Remove videos the user has already seen, filter out “shadow-banned” or inappropriate content.
    (2) Diversity: Ensure the top 10 isn’t just one creator; inject different categories to avoid “filter bubbles.”


  • MSD0005 Surveillance Video Anomaly Detection

    How would you design an end-to-end surveillance system that automatically detects and alerts security personnel to ‘anomalous events’ (e.g., break-ins, fainting, or prohibited movements) in a large shopping mall?

    Answer

    A surveillance anomaly detection system captures video streams, preprocesses them into clips, and uses a deep learning model, typically a pretrained video backbone plus a lightweight anomaly scoring head, to identify unusual behavior.
    It operates in a semi-supervised setup, trained on normal data only, and runs in real time with sliding windows and temporal smoothing.
    The system also includes alerting, monitoring, and a human-in-the-loop feedback loop for calibration and retraining.

    Data Ingestion & Preprocessing: Capture real-time video streams from multiple cameras. Preprocess by resizing frames and normalizing pixel values.

    Model architecture:
    (1) Feature Extraction: A 2D CNN (e.g., EfficientNet) extracts spatial features. To capture motion, use Optical Flow, a 3D CNN (e.g., I3D), or a Video Transformer (e.g., Video Swin Transformer or TimeSformer) that processes blocks of frames together.
    (2) The “Normal” Model: We train an Autoencoder or a Generative Adversarial Network (GAN) on months of “normal” mall activity.
    (3) Detection Logic: When the model sees something unlike the normal data, its reconstruction error will be high. If the error exceeds a set threshold, the clip is flagged as an anomaly. Calibrate the threshold on a validation dataset.
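    A minimal sketch of this autoencoder-based detection logic, assuming clip-level feature vectors (e.g., pooled from a pretrained video backbone); the feature dimension and threshold are placeholders, not tuned values.

    ```python
    import torch
    import torch.nn as nn

    class ClipAutoencoder(nn.Module):
        """Autoencoder over clip-level feature vectors (e.g., pooled video-backbone features)."""
        def __init__(self, feat_dim: int = 1024, bottleneck: int = 64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, bottleneck))
            self.decoder = nn.Sequential(nn.Linear(bottleneck, 256), nn.ReLU(), nn.Linear(256, feat_dim))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = ClipAutoencoder()
    # Training (on "normal" clips only): minimize reconstruction error, e.g.
    #   loss = nn.functional.mse_loss(model(normal_feats), normal_feats)

    # Inference: per-clip reconstruction error is the anomaly score.
    clip_feats = torch.randn(8, 1024)              # 8 incoming clips (placeholder features)
    with torch.no_grad():
        errors = ((model(clip_feats) - clip_feats) ** 2).mean(dim=1)
    threshold = 0.95                               # in practice, calibrated on a validation set
    alerts = errors > threshold                    # clips to forward to human operators
    ```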

    Alerting & Visualization: Generate real-time alerts. Send anomalous frames for human operators to review. Implement a Human-in-the-Loop system where guards can click “Not an Anomaly.”

    System Considerations:

    (1) Scalability: Use edge devices for preliminary processing to reduce bandwidth; cloud processing for heavy computation.
    (2) Latency: Optimize frame rate and model inference time to enable near real-time detection.
    (3) Evaluation: Test using precision, recall, F1-score, and monitor false positives/negatives.


  • DL0052 Rotary Positional Embedding

    What is Rotary Positional Embedding (RoPE)?

    Answer

    Rotary Positional Embedding (RoPE) is a positional encoding method that rotates query and key vectors in multi‑head attention by position‑dependent angles. This rotation naturally encodes relative positional information, improves generalization to longer contexts, and avoids the limitations of fixed or learned absolute positional embeddings. It is used in GPT-NeoX, LLaMA, PaLM, Qwen, etc.
    It has the following characteristics:
    (1) Relative position encoding method for Transformers
    (2) Applies rotation to query (Q) and key (K) vectors using position-dependent angles
    (3) Encodes position via geometry, not by adding vectors
    (4) Preserves relative distance naturally in dot-product attention
    (5) Extrapolates well to longer sequences than the training length

    RoPE rotates each 2D pair of hidden dimensions:
    f(x, m)=\begin{pmatrix}\cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta)\end{pmatrix}\begin{pmatrix}x_1 \\x_2\end{pmatrix}
    Where:
     m represents the absolute position of the token in the sequence.
     \theta represents the rotation frequency for the dimension pair (commonly \theta_i = 10000^{-2i/d}).
     x_1, x_2 represent the components of the embedding vector.
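    A minimal NumPy sketch of this rotation, assuming the commonly used per-pair frequencies \theta_i = 10000^{-2i/d}; it also illustrates that dot products between rotated queries and keys depend only on the relative offset.

    ```python
    import numpy as np

    def rope_rotate(x: np.ndarray, m: int, base: float = 10000.0) -> np.ndarray:
        """Apply RoPE to one token vector x at position m by rotating each 2D dimension pair."""
        d = x.shape[-1]
        theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per dimension pair
        angles = m * theta
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[0::2], x[1::2]
        out = np.empty_like(x)
        out[0::2] = x1 * cos - x2 * sin
        out[1::2] = x1 * sin + x2 * cos
        return out

    q, k = np.random.randn(8), np.random.randn(8)
    # The attention dot product depends only on the relative offset m - n:
    print(np.dot(rope_rotate(q, 5), rope_rotate(k, 3)))
    print(np.dot(rope_rotate(q, 7), rope_rotate(k, 5)))  # same offset of 2 -> same value
    ```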

    The plot below visualizes how RoPE makes attention decay smoothly with relative distance, while standard sinusoidal PE reflects absolute-position similarity.


  • DL0051 Sparsity in NN

    Explain the concept of “Sparsity” in neural networks.

    Answer

    Sparsity in neural networks refers to the property that many parameters (weights) or activations are exactly zero (or very close to zero).
    This leads to lighter, faster, and more interpretable models. Techniques such as L1 regularization, pruning, and ReLU activations help enforce sparsity, making networks more efficient without compromising performance.

    Common techniques and their equations:
    (1) L1 Regularization (encourages sparse weights)
     L = L_{\text{task}} + \lambda \sum_i |w_i|
    Where:
     w_i represents the i-th model weight
     \lambda controls the strength of sparsity

    (2) ReLU Activation (induces sparse activations)
     \mathrm{ReLU}(x) = \max(0, x)
    Where:
     x is the neuron input.
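    A short PyTorch sketch combining both techniques: an L1 penalty added to the task loss, and the sparse activations induced by ReLU. The model, data, and \lambda are illustrative placeholders.

    ```python
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))
    x, y = torch.randn(32, 100), torch.randint(0, 10, (32,))
    lam = 1e-3                                     # lambda: strength of the sparsity penalty

    task_loss = nn.functional.cross_entropy(model(x), y)
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    loss = task_loss + lam * l1_penalty            # L = L_task + lambda * sum_i |w_i|
    loss.backward()

    # ReLU already yields sparse activations: count the zeros in the hidden layer.
    with torch.no_grad():
        hidden = torch.relu(model[0](x))
        print(f"fraction of zero activations: {(hidden == 0).float().mean().item():.2f}")
    ```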

    The plot below compares the weight distributions of a model trained without L1 and one trained with L1-induced sparsity.


  • MSD0004 Long Document Attention Scalability

    The standard Transformer’s self-attention mechanism has a computational and memory complexity of O(N^2), where N is the sequence length. For long document classification (e.g., thousands of tokens), this quadratic scaling becomes prohibitive.

    Describe one or more attention modifications you would design or choose to enable efficient and effective long document classification.

    Answer

    To handle long documents, the quadratic complexity of full self-attention (O(N^2)) must be reduced. The primary approaches involve Sparse Attention (like Longformer or BigBird) and Hierarchical Attention (like HATN).
    Sparse attention constrains each token to only attend to a limited, relevant subset of tokens (local window and global tokens), while hierarchical attention segments the document and applies attention at both the sentence/segment level and the document level.
    (1) Sparse Attention (Mechanism):
    Replaces the full attention matrix with a sparse design.
    Local Window Attention: Each token attends to its immediate neighbors, which is crucial for local context.
    Global Attention: A few special tokens (e.g., [CLS]) act as global connectors that exchange information with all tokens, preserving long-range dependencies. (Models: Longformer, BigBird).

    The figure below shows a side-by-side comparison of Global Attention and Sliding Window Attention.
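    To complement that comparison, here is a minimal PyTorch sketch of how a combined local-window plus global-token attention mask could be built; the helper and its parameters are hypothetical, not the Longformer/BigBird implementation.

    ```python
    import torch

    def sparse_attention_mask(seq_len: int, window: int, global_idx: list) -> torch.Tensor:
        """Boolean mask (True = attend) combining a local sliding window with global tokens."""
        pos = torch.arange(seq_len)
        # Local window: token i attends to tokens within +/- window positions.
        mask = (pos[:, None] - pos[None, :]).abs() <= window
        # Global tokens (e.g., [CLS]) attend to everything, and everything attends to them.
        mask[global_idx, :] = True
        mask[:, global_idx] = True
        return mask

    mask = sparse_attention_mask(seq_len=4096, window=256, global_idx=[0])
    # Fraction of the full N^2 attention matrix that is actually computed:
    print(f"density: {mask.float().mean().item():.3%}")
    ```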

    (2) Hierarchical Attention (Structure):
    Splits the long document into smaller, manageable segments (sentences or paragraphs).
    Segment-level: Applies a standard Transformer token-level attention within each segment.
    Document-level: Applies a separate document attention over the segment-level representations (e.g., the [CLS] token of each segment) to capture global dependencies. (Models: HATN, LNLF-BERT).

    The figure below shows Hierarchical Attention used in the document classification use case.
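    In addition to the figure, here is a minimal PyTorch sketch of the hierarchical structure, assuming precomputed token embeddings and illustrative dimensions (not the HATN/LNLF-BERT implementation).

    ```python
    import torch
    import torch.nn as nn

    class HierarchicalClassifier(nn.Module):
        """Token-level attention within segments, then attention across segment representations."""
        def __init__(self, d_model: int = 256, n_classes: int = 2):
            super().__init__()
            seg_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            doc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.segment_encoder = nn.TransformerEncoder(seg_layer, num_layers=2)   # within segments
            self.document_encoder = nn.TransformerEncoder(doc_layer, num_layers=2)  # across segments
            self.classifier = nn.Linear(d_model, n_classes)

        def forward(self, x):                       # x: (batch, n_segments, seg_len, d_model)
            b, s, t, d = x.shape
            seg = self.segment_encoder(x.reshape(b * s, t, d))[:, 0]   # first token of each segment
            doc = self.document_encoder(seg.reshape(b, s, d))          # attention over segments only
            return self.classifier(doc.mean(dim=1))

    logits = HierarchicalClassifier()(torch.randn(2, 16, 128, 256))    # 16 segments x 128 tokens each
    ```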


  • MSD0003 Spam Email Detection

    Design an end-to-end Machine Learning system to effectively detect and filter spam emails in a high-volume email service.

    Describe how you would design, train, and deploy this system.

    Answer

    The ML system for spam detection is a real-time classification pipeline. It begins with data collection and preprocessing (features extracted from text and metadata). A Supervised Learning model (e.g., Logistic Regression, Gradient Boosting, or a Neural Network) is trained on labeled data. The model is deployed as a real-time prediction service that intercepts incoming emails. Performance is monitored using metrics like precision and recall, and the model is continuously retrained to adapt to new spamming techniques (concept drift).

    (1) Objectives and Metrics:
    The goal is to classify incoming emails as spam or not spam in real time, minimizing false positives (mislabeling important emails as spam) while maintaining high recall (catching most spam).
    (a) Primary Metric: Precision is critical. A high False Positive rate (marking legitimate emails as spam) is highly detrimental to user experience.
    (b) Secondary Metric: Recall is also important to ensure most spam is caught (low False Negative rate).
    (c) Evaluation Metric: The F1-score or Area Under the ROC Curve (AUC) provides a good balance.

    (2) Data Collection:
    (a) Sources: Historical emails labeled by users (spam / not spam). External spam datasets (e.g., Enron spam dataset).
    (b) Features to collect:
    Email text: subject and body.
    Metadata: sender address, domain reputation, number of recipients.
    Other: Embedded links, presence of attachments, message frequency.

    (3) Data Preprocessing and Feature Engineering:
    (a) Text cleaning: Remove HTML tags, URLs, punctuation.
    (b) Tokenization: Split text into words/subwords (WordPiece).
    (c) Textual Features Vectorization:
    Classical: Term Frequency-Inverse Document Frequency (TF-IDF) or bag-of-words.
    Modern: Pretrained embeddings (BERT, DistilBERT).
    (d) Metadata Features Engineering:
    Sender reputation score.
    Ratio of uppercase words or spam keywords.
    Number of links or suspicious domains.

    (4) Model Selection and Training:
    (a) Baseline Models: Start with Naive Bayes or Logistic Regression.
    (b) Advanced Models: Ensemble methods like Random Forest or XGBoost, deep learning with CNNs/RNNs on text sequences, or pre-trained transformers like BERT for state-of-the-art performance.
    (c) Training Process: Split the dataset into train/test sets; use K-Fold Cross-Validation on the historical dataset and maintain a separate held-out test set for final evaluation. For large-scale data, use distributed training with TensorFlow or PyTorch on GPUs.
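    Tying parts (3) and (4) together, here is a minimal scikit-learn sketch of a TF-IDF plus Logistic Regression baseline; the corpus, labels, and threshold are illustrative placeholders, not real training data.

    ```python
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Tiny illustrative corpus; real training data would come from user-labeled emails.
    emails = [
        "WIN a FREE prize now, click this link!!!",
        "Meeting moved to 3pm, see agenda attached",
        "Cheap meds, limited offer, act now",
        "Can you review the Q3 budget draft?",
    ]
    labels = [1, 0, 1, 0]                            # 1 = spam, 0 = not spam

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ])
    scores = cross_val_score(pipeline, emails, labels, cv=2, scoring="f1")
    pipeline.fit(emails, labels)

    spam_prob = pipeline.predict_proba(["Claim your free prize now"])[0, 1]
    is_spam = spam_prob > 0.8                        # threshold tuned for high precision
    ```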

    (5) Deployment & Inference:
    (a) Deployment Architecture: The trained model is saved (e.g., in a model registry) and loaded into a low-latency prediction service.
    (b) Inference Flow:
    The inference flow is shown in the figure below.

    Step 1: The mail server receives incoming email.
    Step 2: The email content and headers are passed to the Spam Prediction Service API.
    Step 3: The service performs real-time feature extraction and feeds the feature vector to the loaded model.
    Step 4: The model returns a spam probability score (e.g., 0.95).
    Step 5: A threshold is applied (e.g., a score greater than 0.8 is classified as spam).
    Final Action: If classified as spam, the email is moved to the user’s spam folder; otherwise, it goes to the inbox.

    (6) Maintenance and Monitoring:
    One critical part of a spam system is its ability to adapt to Concept Drift—spammers constantly change their tactics.
    (a) Performance Monitoring: Track and alert on key metrics.
    User Feedback: Explicit ‘Mark as Spam’ or ‘Not Spam’ actions are the best source of new labeled data.
    Model Accuracy: Monitor Precision, Recall, and F1-score daily.
    Prediction Drift: Monitor the distribution of prediction scores. A sudden drop in the average predicted spam score might indicate the model is no longer effective.
    (b) Retraining Pipeline: Implement a Continuous Training pipeline.



  • MSD0002 Image to Video Classification

    You are given a pretrained image classification network, such as ResNet. How would you adapt it to perform video classification, ensuring that both spatial and temporal information are captured?

    Please discuss possible architectural modifications and trade-offs between different approaches.

    Answer

    Adding temporal modeling is essential when adapting an image-classification CNN for video classification. Options include:
    (1) 3D CNNs (C3D/I3D):
    Extend 2D convolutions to 3D so the network learns motion directly, typically by inflating pretrained 2D filters into 3D. (Example: I3D inflates 2D ResNet filters into 3D filters using pretrained ImageNet weights.)
    Pros: Superior capability to capture fine-grained motion and spatio-temporal features.
    Cons: High computational cost and a high demand for video training data (can be mitigated by I3D-style inflation).

    (2) Combining frame-level CNN features with RNN/LSTM/TCN/Transformer:
    Use a CNN for spatial feature extraction and a sequence model for temporal modeling on the extracted frame features.
    Pre-training the CNN for image classification on the target dataset may further improve performance.
    Pros: Leverages powerful 2D CNN pre-training easily, lower computational cost. Flexible, handles variable-length sequences and can be good at modeling long-term sequence dependencies.
    Cons: Less effective at modeling local and subtle motion features without a further specialized temporal-modeling design.

    The figure below illustrates the CNN combined with RNN/LSTM/TCN/Transformer modeling process.
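    As a complement to the figure, here is a minimal PyTorch sketch of the CNN + LSTM variant, assuming a pretrained ResNet-18 backbone (weights download on first use) and illustrative clip dimensions.

    ```python
    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class CNNLSTMClassifier(nn.Module):
        """Per-frame ResNet features fed to an LSTM for temporal modeling."""
        def __init__(self, n_classes: int = 10, hidden: int = 256):
            super().__init__()
            backbone = resnet18(weights="IMAGENET1K_V1")
            self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the final FC layer
            self.lstm = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, video):                    # video: (batch, frames, 3, H, W)
            b, t = video.shape[:2]
            feats = self.cnn(video.flatten(0, 1)).flatten(1)   # (batch*frames, 512) spatial features
            _, (h_n, _) = self.lstm(feats.view(b, t, -1))      # temporal modeling over the frames
            return self.head(h_n[-1])                          # classify from the last hidden state

    logits = CNNLSTMClassifier()(torch.randn(2, 16, 3, 224, 224))  # 2 clips of 16 frames each
    ```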

    (3) Temporal pooling/attention:
    Simple frame aggregation with average/max pooling or attention.
    Pros: Lightweight, efficient. Useful when frame order is less critical or resources are limited.
    Cons: May lose fine-grained motion cues.


  • MSD0001 Real-Time Factory Product Inspection

    You are tasked with designing and deploying a deep learning-based computer vision system for real-time quality control on a high-speed manufacturing assembly line. The system must classify each product as ‘Pass’ or ‘Fail’ due to surface defects (scratches, cracks, misalignments).

    Describe the complete end-to-end system design, from data acquisition and model selection to deployment and post-deployment maintenance.

    Crucially, how would you address the challenges of real-time inference speed and the severe class imbalance given that defects are rare?

    Answer

    The solution is an Edge-AI Computer Vision Pipeline. It starts with a controlled imaging setup to capture high-quality, consistent images. The core is a lightweight CNN (e.g., MobileNet) leveraging Transfer Learning, with a specialized loss function (e.g., Focal Loss) to handle class imbalance. Deployment occurs on a local Edge GPU to guarantee low-latency inference. A continuous MLOps loop monitors performance and facilitates model retraining against new or subtle defects (concept drift).

    (1) Data & Setup: Controlled environment (lighting/staging), high-resolution cameras, and conduct Transfer Learning to reduce the need for large-scale data collection.
    (2) Imbalance Handling: Use Focal Loss or weighted loss functions, combined with heavy data augmentation and oversampling of the ‘Fail’ class (a focal-loss sketch follows this list).
    (3) Model Architecture: Choose a lightweight CNN (e.g., MobileNetV2, EfficientNet-B0) optimized for speed over a very large, deep network.
    (4) Real-Time Deployment: Edge deployment on an industrial GPU (e.g., NVIDIA Jetson) using model optimization/quantization (e.g., ONNX, TensorRT) to ensure sub-100ms inference.
    (5) Post-Deployment MLOps: Implement a feedback loop for logging all classifications (especially False Negatives) and trigger periodic retraining to combat model drift.
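    Referring to the imbalance handling in (2), below is a minimal sketch of a binary focal loss in PyTorch; the \alpha, \gamma, and defect rate are illustrative defaults, not tuned values.

    ```python
    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, alpha: float = 0.25, gamma: float = 2.0):
        """Binary focal loss: down-weights easy examples so the rare 'Fail' class is not ignored."""
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-bce)                        # probability assigned to the true class
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** gamma * bce).mean()

    # 1000 inspected products with roughly 1% defects ('Fail' = 1, 'Pass' = 0).
    logits = torch.randn(1000)
    targets = (torch.rand(1000) < 0.01).float()
    print(focal_loss(logits, targets))
    ```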


  • DL0050 Knowledge Distillation

    Describe the process and benefits of knowledge distillation.

    Answer

    Knowledge Distillation (KD) transfers “dark knowledge” about inter-class relationships from a large, accurate teacher model to a smaller student model. The student learns via a temperature-scaled softmax and a combined distillation plus supervised loss, enabling substantial model compression and faster inference while retaining high accuracy, provided that the teacher quality, student capacity, and hyperparameters are well chosen.

    Definition: Knowledge distillation is a process where a smaller model (student) learns to mimic the behavior of a larger, well-trained model (teacher).

    Soft Targets: The student is trained not only on hard labels (one-hot) but also on the soft output probabilities of the teacher.

    Temperature Scaling: Teacher logits are softened using a temperature T to reveal more information about class similarities:
    \mbox{Softmax}(z_i / T) = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}
    Where:
     z_i : Raw score (logit) for the i-th class.
     K : Total number of classes in the classification problem.
     T : Temperature parameter (>0) used to soften the probabilities. Higher  T produces a smoother distribution, revealing relationships between classes (“dark knowledge”).

    The plot below shows the Softmax probabilities for a fixed set of Teacher logits under three different temperatures. Increasing the temperature smooths the distribution.

    Loss Function: Typically combines distillation loss (difference between teacher and student soft outputs) and standard cross-entropy loss with true labels.
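    A minimal PyTorch sketch of this combined loss, pairing a temperature-scaled KL-divergence term with standard cross-entropy; the temperature T, weight \alpha, and tensor shapes are illustrative.

    ```python
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.5):
        """Weighted sum of KL divergence on temperature-softened outputs and hard-label cross-entropy."""
        soft_student = F.log_softmax(student_logits / T, dim=-1)
        soft_teacher = F.softmax(teacher_logits / T, dim=-1)
        # Multiply by T^2 so gradients of the soft term stay comparable to the hard term.
        kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T ** 2)
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1 - alpha) * ce

    student_logits = torch.randn(32, 10, requires_grad=True)
    teacher_logits = torch.randn(32, 10)             # produced by the frozen teacher
    labels = torch.randint(0, 10, (32,))
    distillation_loss(student_logits, teacher_logits, labels).backward()
    ```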

    Key Benefits of KD:
    (1) Model compression: the student is smaller and faster while retaining much of the teacher’s performance, enabling deployment on resource-constrained devices.
    (2) Inference Speed: Significantly decreases latency, making the model suitable for deployment on edge devices or real-time applications.
    (3) Improved Generalization: The Teacher’s smooth soft targets act as a form of powerful regularization, often leading the Student to generalize better than if it were trained only on hard labels.

    The plot below demonstrates the Knowledge Distillation (KD) process.

