Category: CV

  • MSD0006 Video Recommendation System

    How would you design a scalable and personalized video recommendation system for a platform like YouTube, Netflix, or TikTok that can recommend relevant videos in real time to billions of users?

    Answer

    A modern recommendation system uses a multi-stage pipeline to narrow down billions of videos to a top-20 list for a user in milliseconds.
    It typically consists of:
    (1) Candidate Generation (filtering down to hundreds),
    (2) Ranking (scoring those hundreds using deep learning), and
    (3) Re-ranking (applying business logic such as diversity, freshness, safety filtering, and ad insertion).

    The pipeline is: Data Logging -> Candidate Generation -> Ranking -> Re-ranking -> Serving.

    Preparation for Data & Features
    (1) User: watch history, watch time, likes, skips, follows
    (2) Video: visual/audio/text embeddings, popularity, freshness
    (3) Context: time of day, device, network

    Candidate Generation (Retrieval):
    This stage quickly reduces billions of videos to a manageable set (~100-500) from several sources; these sources are merged, deduplicated, and passed to the ranking stage:
    (1) Collaborative Filtering (CF): Use matrix factorization or two-tower neural networks to create user and video embeddings. Retrieve videos similar to those the user has engaged with. This is the primary source.
    (2) Content-Based: Use video title, description, audio, and frame embeddings to find videos similar to those the user likes.
    (3) Seed-Based (Graph): For a “Watch Next” scenario, use the current video as a seed and find co-watched videos (e.g., “users who watched X also watched Y”).
    (4) Trending/Global: Inject popular videos in the user’s region/language to promote freshness and viral content.
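    As a minimal sketch of the two-tower retrieval idea (the embedding dimension, corpus size, and brute-force scan are illustrative assumptions; production systems precompute video embeddings and serve them from an approximate nearest-neighbor index such as FAISS or ScaNN):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained tower outputs: one user embedding, a corpus of
# video embeddings (a 1,000-video stand-in for billions).
EMB_DIM = 16
user_emb = rng.normal(size=EMB_DIM)
video_embs = rng.normal(size=(1000, EMB_DIM))

def normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve_top_k(user_emb, video_embs, k=100):
    """Score every video by cosine similarity to the user embedding and
    return the top-k indices. Real systems replace this brute-force scan
    with an ANN index over the precomputed video embeddings."""
    scores = normalize(video_embs) @ normalize(user_emb)
    return np.argsort(-scores)[:k]

candidates = retrieve_top_k(user_emb, video_embs, k=100)
```

    The retrieved candidate ids (here, row indices) are what gets merged with the other sources and handed to the ranking stage.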

    Ranking (Scoring):
    The goal is to precisely order the ~500 candidates from retrieval. Because it scores only a few hundred items rather than billions, this model can afford to be larger and slower.
    (1) Deep Neural Networks (DNNs): The industry standard. Takes in hundreds of concatenated features (user, video, cross-features) through multiple fully-connected layers to output a single score (e.g., predicted watch time). Captures complex, non-linear interactions.
    (2) Multi-Task Learning (MTL): A key advancement. Instead of predicting just one objective (e.g., click), a single model with shared hidden layers has multiple output heads (e.g., for click, watch time, like, share). This improves generalization by sharing signals between tasks and helps balance engagement with satisfaction.
    (3) Sequence/Transformer Models: To model the user’s immediate session context, models can treat the sequence of recently watched videos as input (using RNNs or Transformers). This helps predict the “next best video” in the context of the current viewing mood.
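    A toy forward pass illustrating the multi-task idea (dimensions, task names, and blend weights are assumptions; the random weights stand in for a trained model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes; real rankers use hundreds of features and deeper towers.
N_FEATURES, HIDDEN = 32, 16
TASKS = ["click", "watch_time", "like"]

W_shared = rng.normal(size=(N_FEATURES, HIDDEN)) * 0.1
heads = {t: rng.normal(size=HIDDEN) * 0.1 for t in TASKS}

def rank_candidates(features):
    """Shared hidden layer + one sigmoid head per task; the final score is a
    weighted blend of the task predictions (the weights are a tuning choice
    that trades off engagement vs. satisfaction objectives)."""
    hidden = np.maximum(features @ W_shared, 0.0)           # shared ReLU layer
    preds = {t: 1 / (1 + np.exp(-(hidden @ w))) for t, w in heads.items()}
    blend = {"click": 0.2, "watch_time": 0.6, "like": 0.2}  # assumed weights
    return sum(blend[t] * preds[t] for t in TASKS)

cand_features = rng.normal(size=(500, N_FEATURES))  # ~500 retrieval candidates
scores = rank_candidates(cand_features)
ranked = np.argsort(-scores)                        # candidate ids, best first
```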

    Re-ranking & Post-Processing:
    Final polish of the list. Apply business and quality constraints such as diversity, freshness, safety filters, and exploration strategies before producing the final feed.
    (1) Filters: Remove videos the user has already seen, filter out “shadow-banned” or inappropriate content.
    (2) Diversity: Ensure the top 10 isn’t just one creator; inject different categories to avoid “filter bubbles.”
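    The seen-video filter and creator-diversity cap can be sketched as one greedy pass (the per-creator cap of 2 and the (video_id, creator_id) input format are illustrative assumptions):

```python
def rerank(candidates, seen_ids, max_per_creator=2, top_n=10):
    """Filter already-watched videos, then greedily build the feed while
    capping how many videos any single creator contributes.
    `candidates` is a ranked list of (video_id, creator_id) pairs."""
    per_creator, feed = {}, []
    for video_id, creator_id in candidates:
        if video_id in seen_ids:
            continue  # filter: already watched
        if per_creator.get(creator_id, 0) >= max_per_creator:
            continue  # diversity: creator cap reached
        per_creator[creator_id] = per_creator.get(creator_id, 0) + 1
        feed.append(video_id)
        if len(feed) == top_n:
            break
    return feed

ranked = [(i, f"creator_{i % 3}") for i in range(30)]  # toy ranked candidates
print(rerank(ranked, seen_ids={0, 1}, max_per_creator=2, top_n=6))
# → [2, 3, 4, 5, 6, 7]
```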


  • MSD0005 Surveillance Video Anomaly Detection

    How would you design an end-to-end surveillance system that automatically detects and alerts security personnel to ‘anomalous events’ (e.g., break-ins, fainting, or prohibited movements) in a large shopping mall?

    Answer

    A surveillance anomaly detection system captures video streams, preprocesses them into clips, and uses a deep learning model, typically a pretrained video backbone plus a lightweight anomaly scoring head, to identify unusual behavior.
    It operates in a semi-supervised setup (trained only on normal data) and runs in real time using sliding windows and temporal smoothing.
    The system also includes alerting, monitoring, and a human-in-the-loop feedback loop for calibration and retraining.
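    The temporal smoothing mentioned above can be as simple as a moving average over per-clip anomaly scores, so one noisy spike does not page security while a sustained anomaly does (the scores, window size, and threshold here are illustrative; in practice the threshold is calibrated on validation data):

```python
import numpy as np

def smooth_scores(clip_scores, window=5):
    """Moving-average smoothing over a sliding window of anomaly scores."""
    kernel = np.ones(window) / window
    return np.convolve(clip_scores, kernel, mode="same")

# Toy score stream: one isolated spike, then a sustained anomalous run.
scores = np.array([0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.85, 0.9, 0.1])
smoothed = smooth_scores(scores)
alerts = smoothed > 0.5  # threshold chosen on held-out validation data
```

    The isolated spike at index 2 is averaged away, while the sustained run around indices 5-8 still crosses the threshold.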

    Data Ingestion & Preprocessing: Capture real-time video streams from multiple cameras. Preprocess by resizing frames and normalizing pixel values.

    Model architecture:
    (1) Feature Extraction: A 2D CNN (e.g., EfficientNet) extracts spatial features. To capture motion, add optical flow, a 3D CNN (e.g., I3D), or a video transformer (e.g., Video Swin Transformer, TimeSformer) that processes blocks of frames together.
    (2) The “Normal” Model: We train an Autoencoder or a Generative Adversarial Network (GAN) on months of “normal” mall activity.
    (3) Detection Logic: When the model sees something new, its “reconstruction error” will be high. If the error exceeds a set threshold, it is flagged as an anomaly. Use the validation dataset for threshold calibration.
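    A minimal sketch of the reconstruction-error logic, using a linear autoencoder (equivalent to PCA) as a stand-in for the deep model; the synthetic "clip features", subspace rank, and 99th-percentile threshold are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in features for normal clips: low-dimensional structure + small noise.
normal = (rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))
          + 0.05 * rng.normal(size=(500, 10)))

# "Train" the autoencoder: keep the top principal components of normal data.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:2]  # learned "normal" subspace

def reconstruction_error(x):
    z = (x - mean) @ components.T   # encode
    x_hat = z @ components + mean   # decode
    return np.linalg.norm(x - x_hat, axis=-1)

# Calibrate the threshold on held-out normal data (here, 99th percentile).
threshold = np.percentile(reconstruction_error(normal), 99)

anomaly = rng.normal(size=(1, 10)) * 3.0  # something unlike training data
is_anomalous = reconstruction_error(anomaly) > threshold
```

    Normal clips reconstruct well (low error); anything off the learned "normal" subspace reconstructs poorly and is flagged.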

    Alerting & Visualization: Generate real-time alerts. Send anomalous frames for human operators to review. Implement a Human-in-the-Loop system where guards can click “Not an Anomaly.”

    System Considerations:

    (1) Scalability: Use edge devices for preliminary processing to reduce bandwidth; cloud processing for heavy computation.
    (2) Latency: Optimize frame rate and model inference time to enable near real-time detection.
    (3) Evaluation: Test using precision, recall, F1-score, and monitor false positives/negatives.


  • MSD0002 Image to Video Classification

    You are given a pretrained image classification network, such as ResNet. How would you adapt it to perform video classification, ensuring that both spatial and temporal information are captured?

    Please discuss possible architectural modifications and trade-offs between different approaches.

    Answer

    Adding temporal modeling is essential when adapting an image-classification CNN for video classification. Options include:
    (1) 3D CNNs (C3D/I3D):
    Extend 2D convolutions to 3D so the network learns motion directly, by inflating 2D kernels into 3D. (Example: I3D inflates pretrained 2D filters into 3D filters, reusing ImageNet weights.)
    Pros: Superior capability to capture fine-grained motion and spatio-temporal features.
    Cons: High computational cost and high demand for video training data (can be mitigated by I3D-style inflation).
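    The inflation trick itself is simple (a toy 3x3 kernel stands in for a real pretrained filter):

```python
import numpy as np

def inflate_2d_kernel(kernel_2d, t):
    """I3D-style inflation: repeat a pretrained 2D kernel t times along a new
    temporal axis and divide by t, so a 'boring' video of identical frames
    produces the same activations as the original image model did."""
    return np.repeat(kernel_2d[np.newaxis, ...], t, axis=0) / t

# Toy pretrained 3x3 kernel (values are illustrative).
k2d = np.arange(9, dtype=float).reshape(3, 3)
k3d = inflate_2d_kernel(k2d, t=3)  # shape (3, 3, 3)

# Sanity check: summing the inflated kernel over time recovers the 2D kernel,
# which is what preserves the ImageNet-pretrained response on static video.
assert np.allclose(k3d.sum(axis=0), k2d)
```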

    (2) Combining frame-level CNN features with RNN/LSTM/TCN/Transformer:
    Use CNN for spatial feature extraction, sequence model for temporal modeling with the extracted spatial features.
    Pre-training the CNN on the target dataset as an image classifier can further improve performance.
    Pros: Leverages powerful 2D CNN pre-training easily, lower computational cost. Flexible, handles variable-length sequences and can be good at modeling long-term sequence dependencies.
    Cons: Less effective at modeling local and subtle motion features without a specialized temporal-modeling design.

    (Figure: the CNN combined with RNN/LSTM/TCN/Transformer modeling process; not reproduced here.)
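    The CNN-then-sequence-model pipeline can be sketched with a vanilla RNN over per-frame features (feature sizes and the random weights are illustrative stand-ins for a trained model):

```python
import numpy as np

rng = np.random.default_rng(42)

FEAT, HIDDEN, N_CLASSES = 64, 32, 10  # illustrative sizes

# Randomly initialized weights stand in for trained parameters.
Wx = rng.normal(size=(FEAT, HIDDEN)) * 0.1
Wh = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1
Wo = rng.normal(size=(HIDDEN, N_CLASSES)) * 0.1

def classify_video(frame_features):
    """frame_features: (T, FEAT) array, one pretrained-CNN feature per frame.
    A vanilla RNN consumes the frames in order; the final hidden state is
    fed to a linear classifier. T may vary between videos."""
    h = np.zeros(HIDDEN)
    for x in frame_features:       # temporal modeling, one step per frame
        h = np.tanh(x @ Wx + h @ Wh)
    logits = h @ Wo
    return int(np.argmax(logits))

video = rng.normal(size=(16, FEAT))  # 16 frames of CNN features
label = classify_video(video)        # class index in [0, N_CLASSES)
```

    Swapping the RNN loop for an LSTM, TCN, or Transformer encoder changes only the temporal block; the frozen-or-finetuned 2D CNN front end stays the same.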

    (3) Temporal pooling/attention:
    Simple frame aggregation with average/max pooling or attention.
    Pros: Lightweight, efficient. Useful when frame order is less critical or resources are limited.
    Cons: May lose fine-grained motion cues.
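    Both aggregation variants fit in a few lines (the feature-norm score used for the attention weights is an illustrative stand-in for a small learned scoring head):

```python
import numpy as np

def average_pool(frame_features):
    """Simplest aggregation: mean over the time axis, order-agnostic."""
    return frame_features.mean(axis=0)

def attention_pool(frame_features):
    """Attention-style pooling: weight each frame by a softmax over scalar
    scores (here the feature norm, standing in for a learned scorer), then
    take the weighted average of the frame features."""
    scores = np.linalg.norm(frame_features, axis=1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ frame_features

frames = np.random.default_rng(0).normal(size=(16, 64))  # (T, FEAT)
avg, att = average_pool(frames), attention_pool(frames)  # each shape (64,)
```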


  • MSD0001 Real-Time Factory Product Inspection

    You are tasked with designing and deploying a deep learning-based computer vision system for real-time quality control on a high-speed manufacturing assembly line. The system must classify each product as ‘Pass’ or ‘Fail’ due to surface defects (scratches, cracks, misalignments).

    Describe the complete end-to-end system design, from data acquisition and model selection to deployment and post-deployment maintenance.

    Crucially, how would you address the challenges of real-time inference speed and the severe class imbalance due to the fact that defects are rare?

    Answer

    The solution is an Edge-AI Computer Vision Pipeline. It starts with a controlled imaging setup to capture high-quality, consistent images. The core is a lightweight CNN (e.g., MobileNet) leveraging Transfer Learning, with a specialized loss function (e.g., Focal Loss) to handle class imbalance. Deployment occurs on a local Edge GPU to guarantee low-latency inference. A continuous MLOps loop monitors performance and facilitates model retraining against new or subtle defects (concept drift).

    (1) Data & Setup: Controlled environment (lighting/staging), high-resolution cameras, and transfer learning to reduce the need for large-scale data collection.
    (2) Imbalance Handling: Use Focal Loss or weighted loss functions, combined with heavy data augmentation and oversampling of the ‘Fail’ class.
    (3) Model Architecture: Choose a lightweight CNN (e.g., MobileNetV2, EfficientNet-B0) optimized for speed over a very large, deep network.
    (4) Real-Time Deployment: Edge deployment on an industrial GPU (e.g., NVIDIA Jetson) using model optimization/quantization (e.g., ONNX, TensorRT) to ensure sub-100ms inference.
    (5) Post-Deployment MLOps: Implement a feedback loop for logging all classifications (especially False Negatives) and trigger periodic retraining to combat model drift.
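    The focal loss from step (2) can be sketched as follows (gamma = 2 is the common default; the alpha weighting and example probabilities are illustrative choices):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.75):
    """Binary focal loss. p: predicted P(Fail); y: 1 for 'Fail', 0 for 'Pass'.
    The (1 - p_t)**gamma factor down-weights easy, well-classified examples
    (the abundant 'Pass' class), focusing training on rare, hard defects;
    alpha additionally up-weights the positive (defect) class."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -(alpha_t * (1 - p_t) ** gamma * np.log(np.clip(p_t, 1e-7, 1.0)))

# A confidently correct 'Pass' contributes almost no loss, while a missed
# defect (a False Negative, the costly case here) dominates the gradient.
easy_pass = focal_loss(np.array([0.02]), np.array([0]))    # confident, correct
missed_fail = focal_loss(np.array([0.02]), np.array([1]))  # confident, wrong
```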

