Interview for Machine Learning

Category: ML

MSD0007 Demand Forecasting System for Retailer
Design a demand forecasting system for a large retail company like Costco, Walmart, or Target. The system should predict future product demand across stores and time to support inventory planning, replenishment, and promotions.
Answer
The system ingests sales, inventory, weather, and promotion data and predicts product demand with ML models such as time-series, tree-based, or deep learning methods.
It runs on a scalable architecture with batch and real-time data pipelines and integrates with inventory management.
Key benefits include fewer stockouts, a better-optimized supply chain, and accuracy that improves through iterative model retraining.
Problem Definition & Success Metrics:
Define the forecast granularity (e.g., SKU-store-day), horizon (e.g., 2-week operational, 3-month tactical), and objective (e.g., minimize out-of-stocks and waste).
Key success metrics would be Weighted Mean Absolute Percentage Error (WMAPE) for overall accuracy and forecast bias to detect systematic over/under-prediction.
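The two metrics above can be computed as in the following minimal sketch. The function names and the sample numbers are illustrative assumptions; the bias shown is one common relative definition (total forecast minus total actual, divided by total actual), not the only one.

```python
def wmape(actuals, forecasts):
    """Weighted MAPE: total absolute error divided by total actual demand."""
    total_actual = sum(abs(a) for a in actuals)
    if total_actual == 0:
        raise ValueError("WMAPE is undefined when total actual demand is zero")
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / total_actual


def forecast_bias(actuals, forecasts):
    """Relative bias: positive means over-forecasting, negative means under-forecasting."""
    total_actual = sum(actuals)
    if total_actual == 0:
        raise ValueError("bias is undefined when total actual demand is zero")
    return (sum(forecasts) - total_actual) / total_actual


# Hypothetical SKU-store-day observations (units sold vs. units forecast).
actuals = [100, 50, 0, 10]
forecasts = [90, 60, 5, 10]
print(wmape(actuals, forecasts))          # share of total demand that was mis-forecast
print(forecast_bias(actuals, forecasts))  # sign shows the direction of systematic error
```

Unlike plain MAPE, WMAPE stays defined when individual SKU-days have zero sales, which is common at SKU-store-day granularity.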
Data Strategy & Feature Engineering:
Integrate diverse data sources into a unified feature store:
(1) Internal: Historical sales, product hierarchies, pricing, promotional calendars, inventory levels, and online search/click data.
(2) External: Calendar events (holidays, paydays), weather, local events, competitor activity (scraped), and macroeconomic trends.
System Architecture:
(1) Data Ingestion Layer: Batch and real-time streams.
(2) Processing & Feature Store: Clean, validate, and compute features.
(3) Modeling Layer: A repository for multiple models, allowing experimentation.
(4) Serving Layer: Exposes forecasts via APIs to downstream systems (replenishment, pricing).
(5) Monitoring & Feedback: Tracks model performance, data drift, and incorporates actual sales as ground truth for retraining.

Modeling Approach with Hierarchical Ensemble Strategy:
Use a “Top-Down, Bottom-Up” approach. Forecast at the aggregate level (Category/Region) to capture macro-trends, then reconcile these with granular SKU-level (Stock Keeping Unit) predictions so that the granular forecasts add up to the aggregate and total inventory stays aligned.
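As a rough illustration of the reconciliation step, the sketch below rescales SKU-level forecasts proportionally so that they sum to the category-level forecast. Proportional scaling is only one simple reconciliation scheme, chosen here for brevity; the function name and the numbers are assumptions.

```python
def reconcile_top_down(category_forecast, sku_forecasts):
    """Scale SKU-level forecasts so that they sum to the category-level forecast."""
    bottom_up_total = sum(sku_forecasts.values())
    if bottom_up_total == 0:
        # No signal at SKU level: split the aggregate evenly across SKUs.
        share = category_forecast / len(sku_forecasts)
        return {sku: share for sku in sku_forecasts}
    scale = category_forecast / bottom_up_total
    return {sku: value * scale for sku, value in sku_forecasts.items()}


# Hypothetical numbers: SKU models sum to 100 units, the category model predicts 120.
sku_forecasts = {"sku_a": 60.0, "sku_b": 30.0, "sku_c": 10.0}
reconciled = reconcile_top_down(120.0, sku_forecasts)
print(reconciled)
```

Each SKU keeps its relative share of the category while the total matches the aggregate forecast.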
(1) Base Layer (Interpretability): Implement Prophet or Exponential Smoothing for high-level aggregates. This captures clear seasonalities (holidays, paydays) in a way that is easily explainable to business stakeholders.
(2) Granular Layer (The “Workhorse”): Use Global LightGBM or XGBoost models trained across entire product categories. This allows the model to learn shared patterns across similar items while efficiently handling categorical metadata like Store ID and Brand.
(3) High-Volatility Layer (Deep Learning): Deploy Temporal Fusion Transformers (TFT) or DeepAR specifically for high-volume or volatile items. These models capture complex, non-linear dependencies and multi-horizon temporal patterns that tree-based models might miss.
(4) Probabilistic Forecasting: Instead of a single point estimate, generate Quantile Forecasts (e.g., P10, P50, P90). This provides a range of uncertainty, allowing the logistics team to make data-driven decisions on safety stock levels.
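Quantile forecasts such as P10/P50/P90 are typically trained and evaluated with the quantile (pinball) loss. A minimal sketch follows; the function names and sample values are illustrative assumptions.

```python
def pinball_loss(actual, predicted, quantile):
    """Quantile (pinball) loss for a single observation."""
    error = actual - predicted
    # Under-prediction is weighted by `quantile`, over-prediction by `1 - quantile`.
    return max(quantile * error, (quantile - 1) * error)


def mean_pinball_loss(actuals, predictions, quantile):
    """Average pinball loss over a set of observations."""
    losses = [pinball_loss(a, p, quantile) for a, p in zip(actuals, predictions)]
    return sum(losses) / len(losses)


# For a P90 forecast, falling short by 20 units costs far more than overshooting by 20.
print(pinball_loss(actual=100, predicted=80, quantile=0.9))
print(pinball_loss(actual=100, predicted=120, quantile=0.9))
```

The asymmetry is what pushes a P90 model toward the upper end of the demand distribution, which is the level a planner might use when sizing safety stock.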
January 3, 2026
MSD0006 Video Recommendation System
How would you design a scalable and personalized video recommendation system for a platform like YouTube, Netflix, or TikTok that can recommend relevant videos in real time to billions of users?
Answer
A modern recommendation system uses a multi-stage pipeline to narrow down billions of videos to a top-20 list for a user in milliseconds.
It typically consists of:
(1) Candidate Generation (filtering down to hundreds),
(2) Ranking (scoring those hundreds using deep learning), and
(3) Re-ranking (applying business logic for diversity, freshness, and safety, plus ad insertion).

The pipeline is: Data Logging -> Candidate Generation -> Ranking -> Re-ranking -> Serving.
Preparation for Data & Features:
(1) User: watch history, watch time, likes, skips, follows
(2) Video: visual/audio/text embeddings, popularity, freshness
(3) Context: time of day, device, network
Candidate Generation (Retrieval):
This stage quickly reduces billions of videos to a manageable set (~100-500) drawn from several sources; their outputs are merged, deduplicated, and passed to the ranking stage:
(1) Collaborative Filtering (CF): Use matrix factorization or two-tower neural networks to create user and video embeddings. Retrieve videos similar to those the user has engaged with. This is the primary source.
(2) Content-Based: Use video title, description, audio, and frame embeddings to find videos similar to those the user likes.
(3) Seed-Based (Graph): For a “Watch Next” scenario, use the current video as a seed and find co-watched videos (e.g., “users who watched X also watched Y”).
(4) Trending/Global: Inject popular videos in the user’s region/language to promote freshness and viral content.
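The retrieval and merge steps above can be sketched as follows. This is a toy, exhaustive version that assumes tiny in-memory embeddings; a production system would typically use an approximate nearest-neighbor index instead of scoring every video. All names and vectors are hypothetical.

```python
import heapq


def top_k_by_dot_product(user_vec, video_vecs, k):
    """Return the k video ids whose embeddings score highest against the user embedding."""
    def score(item):
        _, vec = item
        return sum(u * v for u, v in zip(user_vec, vec))

    best = heapq.nlargest(k, video_vecs.items(), key=score)
    return [video_id for video_id, _ in best]


def merge_candidates(*sources):
    """Merge candidate lists from several sources, dropping duplicates in first-seen order."""
    seen = set()
    merged = []
    for source in sources:
        for video_id in source:
            if video_id not in seen:
                seen.add(video_id)
                merged.append(video_id)
    return merged


user_vec = [1.0, 0.0]
video_vecs = {"v1": [0.9, 0.1], "v2": [0.1, 0.9], "v3": [0.5, 0.5]}
cf_candidates = top_k_by_dot_product(user_vec, video_vecs, k=2)
trending = ["v3", "v4"]  # hypothetical output of the trending source
candidates = merge_candidates(cf_candidates, trending)
print(candidates)
```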
Ranking (Scoring):
The goal is to precisely order the ~500 candidates from retrieval. Because it scores only hundreds of items rather than billions, this model can be more complex and slower per item.
(1) Deep Neural Networks (DNNs): The industry standard. Takes in hundreds of concatenated features (user, video, cross-features) through multiple fully-connected layers to output a single score (e.g., predicted watch time). Captures complex, non-linear interactions.
(2) Multi-Task Learning (MTL): A key advancement. Instead of predicting just one objective (e.g., click), a single model with shared hidden layers has multiple output heads (e.g., for click, watch time, like, share). This improves generalization by sharing signals between tasks and helps balance engagement with satisfaction.
(3) Sequence/Transformer Models: To model the user’s immediate session context, models can treat the sequence of recently watched videos as input (using RNNs or Transformers). This helps predict the “next best video” in the context of the current viewing mood.
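With a multi-task model, the per-head predictions still have to be collapsed into one ranking score. One simple option, shown here as an assumption rather than a prescribed method, is a weighted sum with hand-tuned weights; the weights and predictions below are made up for illustration.

```python
def combined_score(head_predictions, weights):
    """Collapse per-task predictions into one ranking score via a weighted sum."""
    return sum(weights[task] * head_predictions[task] for task in weights)


# Hypothetical weights: watch time counts more than a click, a like counts less.
weights = {"click": 1.0, "watch_time": 2.0, "like": 0.5}

# Hypothetical outputs of the model's heads for two candidates.
head_outputs = {
    "v1": {"click": 0.9, "watch_time": 0.1, "like": 0.0},  # clicky, but barely watched
    "v2": {"click": 0.4, "watch_time": 0.6, "like": 0.2},  # fewer clicks, better watched
}

ranked = sorted(
    head_outputs,
    key=lambda video_id: combined_score(head_outputs[video_id], weights),
    reverse=True,
)
print(ranked)
```

Raising the weight on watch time or likes relative to clicks is one way to trade raw engagement against satisfaction.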

Re-ranking & Post-Processing:
Final polish of the list. Apply business and quality constraints such as diversity, freshness, safety filters, and exploration strategies before producing the final feed.
(1) Filters: Remove videos the user has already seen, and filter out “shadow-banned” or inappropriate content.
(2) Diversity: Ensure the top 10 isn’t just one creator; inject different categories to avoid “filter bubbles.”
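The filter and diversity rules above can be sketched as a single pass over the ranked list. The per-creator cap, the function name, and the sample data are illustrative assumptions.

```python
def rerank(ranked_videos, seen_ids, max_per_creator, list_size):
    """Walk the ranked list, skipping seen videos and capping videos per creator."""
    result = []
    per_creator = {}
    for video_id, creator in ranked_videos:
        if video_id in seen_ids:
            continue  # filter: already watched
        if per_creator.get(creator, 0) >= max_per_creator:
            continue  # diversity: this creator already fills its quota
        per_creator[creator] = per_creator.get(creator, 0) + 1
        result.append(video_id)
        if len(result) == list_size:
            break
    return result


# (video_id, creator) pairs in ranking order; creator "a" dominates the top.
ranked_videos = [("v1", "a"), ("v2", "a"), ("v3", "a"), ("v4", "b"), ("v5", "c")]
feed = rerank(ranked_videos, seen_ids={"v2"}, max_per_creator=1, list_size=3)
print(feed)
```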
January 2, 2026
MSD0004 Long Document Attention Scalability
The standard Transformer’s self-attention mechanism has a computational and memory complexity of $O(N^2)$, where $N$ is the sequence length. For long document classification (e.g., thousands of tokens), this quadratic scaling becomes prohibitive.
Describe one or more attention modifications you would design or choose to enable efficient and effective long document classification.
Answer
To handle long documents, the quadratic complexity of full self-attention ($O(N^2)$) must be reduced. The primary approaches are Sparse Attention (like Longformer or BigBird) and Hierarchical Attention (like HATN).
Sparse attention constrains each token to only attend to a limited, relevant subset of tokens (local window and global tokens), while hierarchical attention segments the document and applies attention at both the sentence/segment level and the document level.
(1) Sparse Attention (Mechanism):
Replaces the full attention matrix with a sparse design.
Local Window Attention: Each token attends to its immediate neighbors, which is crucial for local context.
Global Attention: A few special tokens (e.g., [CLS]) act as global connectors that exchange information with all tokens, preserving long-range dependencies. (Models: Longformer, BigBird).
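The combined local-window and global pattern can be sketched as a boolean mask. This version materializes the full $N \times N$ mask purely for illustration, so it does not itself save memory; an efficient implementation would compute only the allowed entries. The function name and parameters are assumptions.

```python
def sparse_attention_mask(seq_len, window, global_positions):
    """mask[i][j] is True when token i may attend to token j."""
    # Local window: each token sees neighbors within `window` positions on either side.
    mask = [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]
    for g in global_positions:
        for i in range(seq_len):
            mask[g][i] = True  # the global token attends to every token
            mask[i][g] = True  # every token attends to the global token
    return mask


# Position 0 plays the role of a [CLS]-style global token.
mask = sparse_attention_mask(seq_len=8, window=1, global_positions=[0])
allowed = sum(sum(row) for row in mask)
print(allowed, "allowed pairs out of", 8 * 8)
```

For a fixed window size and a fixed number of global tokens, the count of allowed pairs grows linearly with $N$ rather than quadratically.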
[Figure: side-by-side comparison of Global Attention and Sliding Window Attention]

(2) Hierarchical Attention (Structure):
Splits the long document into smaller, manageable segments (sentences or paragraphs).
Segment-level: Applies standard Transformer token-level attention within each segment.
Document-level: Applies a separate document attention over the segment-level representations (e.g., the [CLS] token of each segment) to capture global dependencies. (Models: HATN, LNLF-BERT).
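A quick back-of-the-envelope comparison shows why the two-level scheme helps. The sketch below splits a document into fixed-length segments and counts query-key pairs for full attention versus segment-level plus document-level attention. The segment length of 512 and the helper names are assumptions, and segments are treated as padded to full length.

```python
def split_into_segments(tokens, segment_len):
    """Chunk a token list into consecutive segments of at most `segment_len` tokens."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def attention_pair_counts(num_tokens, segment_len):
    """Count query-key pairs for full attention vs. the two-level hierarchical scheme."""
    num_segments = -(-num_tokens // segment_len)   # ceiling division
    full = num_tokens ** 2
    segment_level = num_segments * segment_len ** 2  # attention inside each segment
    document_level = num_segments ** 2               # attention over segment summaries
    return full, segment_level + document_level


tokens = list(range(4096))  # stand-in for a 4096-token document
segments = split_into_segments(tokens, 512)
full, hierarchical = attention_pair_counts(len(tokens), 512)
print(len(segments), full, hierarchical)
```

With these numbers the hierarchical scheme needs roughly one eighth of the pairs that full attention does.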
[Figure: Hierarchical Attention used in the document classification use case]
November 4, 2025
MSD0003 Spam Email Detection
Design an end-to-end Machine Learning system to effectively detect and filter spam emails in a high-volume email service.
Describe how you would design, train, and deploy this system.
Answer
The ML system for spam detection is a real-time classification pipeline. It begins with data collection and preprocessing (features extracted from text and metadata). A Supervised Learning model (e.g., Logistic Regression, Gradient Boosting, or a Neural Network) is trained on labeled data. The model is deployed as a real-time prediction service that intercepts incoming emails. Performance is monitored using metrics like precision and recall, and the model is continuously retrained to adapt to new spamming techniques (concept drift).
(1) Objectives and Metrics:
The goal is to classify incoming emails as spam or not spam in real time, minimizing false positives (mislabeling important emails as spam) while maintaining high recall (catching most spam).
(a) Primary Metric: Precision is critical. A high False Positive rate (marking legitimate emails as spam) is highly detrimental to user experience.
(b) Secondary Metric: Recall is also important to ensure most spam is caught (low False Negative rate).
(c) Evaluation Metric: The F1-score or Area Under the ROC Curve (AUC) provides a good balance.
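The precision, recall, and F1 definitions used above can be computed directly from predictions and labels, as in this minimal sketch (label 1 means spam; the sample labels are made up).

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive (spam = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # one missed spam, one legitimate email flagged
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```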
(2) Data Collection:
(a) Sources: Historical emails labeled by users (spam / not spam). External spam datasets (e.g., Enron spam dataset).
(b) Features to collect:
Email text: subject and body.
Metadata: sender address, domain reputation, number of recipients.
Other: Embedded links, presence of attachments, message frequency.
(3) Data Preprocessing and Feature Engineering:
(a) Text cleaning: Remove HTML tags and punctuation; strip URLs from the text after counting them for the link features in (d).
(b) Tokenization: Split text into words/subwords (e.g., WordPiece).
(c) Textual Features Vectorization:
Classical: Term Frequency-Inverse Document Frequency (TF-IDF) or bag-of-words.
Modern: Pretrained embeddings (BERT, DistilBERT).
(d) Metadata Features Engineering:
Sender reputation score.
Ratio of uppercase words or spam keywords.
Number of links or suspicious domains.
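A few of the hand-crafted features in (d) can be sketched with the standard library alone. The keyword list, the URL pattern, and the feature names are hypothetical placeholders; a sender reputation score would come from a separate service and is not shown.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
SPAM_KEYWORDS = {"free", "winner", "urgent"}  # hypothetical keyword list


def metadata_features(subject, body):
    """Compute simple hand-crafted features from the subject and body text."""
    text = f"{subject} {body}"
    num_links = len(URL_PATTERN.findall(text))
    # Strip URLs before splitting into words so they do not count as words.
    words = re.findall(r"[A-Za-z']+", URL_PATTERN.sub(" ", text))
    n = len(words) or 1  # avoid division by zero on empty emails
    uppercase = sum(1 for w in words if len(w) > 1 and w.isupper())
    keywords = sum(1 for w in words if w.lower() in SPAM_KEYWORDS)
    return {
        "uppercase_ratio": uppercase / n,
        "spam_keyword_ratio": keywords / n,
        "num_links": num_links,
    }


features = metadata_features(
    "URGENT offer", "You are a WINNER claim at http://example.test/win now"
)
print(features)
```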
(4) Model Selection and Training:
(a) Baseline Models: Start with Naive Bayes or Logistic Regression.
(b) Advanced Models: Ensemble methods like Random Forest or XGBoost, deep learning with CNNs/RNNs on text sequences, or pre-trained transformers like BERT for state-of-the-art performance.
(c) Training Process: Split the historical dataset into training and test sets; use K-Fold Cross-Validation on the training portion, and keep the held-out test set for final evaluation only. For large-scale training, use distributed training with TensorFlow or PyTorch on GPUs.
(5) Deployment & Inference:
(a) Deployment Architecture: The trained model is saved (e.g., in a model registry) and loaded into a low-latency prediction service.
(b) Inference Flow:
[Figure: inference flow]

Step 1: The mail server receives incoming email.
Step 2: The email content and headers are passed to the Spam Prediction Service API.
Step 3: The service performs real-time feature extraction and feeds the feature vector to the loaded model.
Step 4: The model returns a spam probability score (e.g., 0.95).
Step 5: A threshold is applied (e.g., a score greater than 0.8 is classified as spam).
Final Action: If classified as spam, the email is moved to the user’s spam folder; otherwise, it goes to the inbox.
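Steps 3 to 5 and the final action can be sketched as one function. The feature extractor and the model here are toy stand-ins passed in as callables, not a real service API, and the 0.8 threshold follows the example in Step 5.

```python
def classify_email(email, extract_features, predict_proba, threshold=0.8):
    """Run feature extraction, scoring, and thresholding for one incoming email."""
    features = extract_features(email)                 # Step 3: real-time features
    score = predict_proba(features)                    # Step 4: spam probability
    folder = "spam" if score > threshold else "inbox"  # Step 5 and final action
    return folder, score


# Toy stand-ins for the real feature extractor and the loaded model (assumptions).
def toy_extract_features(email):
    return {"num_links": email["body"].count("http")}


def toy_predict_proba(features):
    return 0.95 if features["num_links"] >= 2 else 0.05


spam_result = classify_email(
    {"body": "win http://a.test http://b.test"}, toy_extract_features, toy_predict_proba
)
ham_result = classify_email(
    {"body": "lunch at noon?"}, toy_extract_features, toy_predict_proba
)
print(spam_result, ham_result)
```

Keeping the threshold as a parameter makes it easy to trade precision against recall without retraining the model.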
(6) Maintenance and Monitoring
One critical part of a spam system is its ability to adapt to Concept Drift: spammers constantly change their tactics.
(a) Performance Monitoring: Track and alert on key metrics.
User Feedback: Explicit ‘Mark as Spam’ or ‘Not Spam’ actions are the best source of new labeled data.
Model Accuracy: Monitor Precision, Recall, and F1-score daily.
Prediction Drift: Monitor the distribution of prediction scores. A sudden drop in the average predicted spam score might indicate the model is no longer effective.
(b) Retraining Pipeline: Implement a Continuous Training pipeline.
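The prediction-drift check described under (a) could be sketched as a comparison between a baseline window of scores and a recent window. The 0.1 tolerance and the function names are illustrative assumptions; a real monitor would likely compare full score distributions rather than means alone.

```python
from statistics import mean


def mean_score_drop(baseline_scores, recent_scores):
    """How far the average predicted spam score fell relative to the baseline window."""
    return mean(baseline_scores) - mean(recent_scores)


def should_alert(baseline_scores, recent_scores, max_drop=0.1):
    """Flag a possible drift when the average score drops by more than `max_drop`."""
    return mean_score_drop(baseline_scores, recent_scores) > max_drop


baseline = [0.2, 0.9, 0.1, 0.8]  # hypothetical scores from a healthy period
recent = [0.1, 0.3, 0.2, 0.2]    # hypothetical scores from the last day
print(should_alert(baseline, recent))
```

An alert like this can serve as one trigger for the retraining pipeline, alongside the user-feedback labels.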
November 2, 2025