How would you design a scalable and personalized video recommendation system for a platform like YouTube, Netflix, or TikTok that can recommend relevant videos in real time to billions of users?

Answer
A modern recommendation system uses a multi-stage pipeline to narrow down billions of videos to a top-20 list for a user in milliseconds.
It typically consists of:
(1) Candidate Generation (filtering down to hundreds),
(2) Ranking (scoring those hundreds using deep learning), and
(3) Re-ranking (applying business logic to ensure diversity, freshness, and safety, and optionally inserting ads).
The pipeline is: Data Logging -> Candidate Generation -> Ranking -> Re-ranking -> Serving.
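The stages above can be sketched as a chain of functions. This is a minimal illustration with placeholder logic (popularity stands in for real retrieval and ranking models; all names are hypothetical), not a production implementation:

```python
def candidate_generation(user_id, catalog):
    # Narrow the full catalog down to at most a few hundred candidates.
    return [v for v in catalog if v["popularity"] > 0.5][:500]

def ranking(user_id, candidates):
    # Score and order each candidate; here popularity stands in for a DNN score.
    return sorted(candidates, key=lambda v: v["popularity"], reverse=True)

def re_ranking(user_id, ranked, seen):
    # Business logic: drop already-seen videos, keep the top 20.
    return [v for v in ranked if v["id"] not in seen][:20]

def recommend(user_id, catalog, seen):
    candidates = candidate_generation(user_id, catalog)
    ranked = ranking(user_id, candidates)
    return re_ranking(user_id, ranked, seen)
```

In a real system each stage is a separate service with its own latency budget; the function boundaries here mirror those service boundaries.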
Data & Feature Preparation:
(1) User: watch history, watch time, likes, skips, follows
(2) Video: visual/audio/text embeddings, popularity, freshness
(3) Context: time of day, device, network
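Concretely, these signals are flattened into a single feature vector per (user, video, context) triple before being fed to the ranking model. A sketch of such feature assembly (field names and transforms are illustrative assumptions):

```python
import math

def build_features(user, video, context):
    """Combine user, video, and context signals into one flat feature dict."""
    return {
        # User features: aggregate engagement statistics
        "avg_watch_time": user["total_watch_sec"] / max(user["videos_watched"], 1),
        "like_rate": user["likes"] / max(user["videos_watched"], 1),
        # Video features: log-transform heavy-tailed counts, compute freshness
        "log_views": math.log1p(video["views"]),
        "age_hours": (context["now"] - video["published_at"]) / 3600.0,
        # Context features
        "hour_of_day": context["hour"],
        "is_mobile": 1.0 if context["device"] == "mobile" else 0.0,
    }
```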
Candidate Generation (Retrieval):
This stage quickly reduces billions of videos to a manageable set (~100-500) from several sources; these sources are merged, deduplicated, and passed to the ranking stage:
(1) Collaborative Filtering (CF): Use matrix factorization or two-tower neural networks to create user and video embeddings. Retrieve videos similar to those the user has engaged with. This is the primary source.
(2) Content-Based: Use video title, description, audio, and frame embeddings to find videos similar to those the user likes.
(3) Seed-Based (Graph): For a “Watch Next” scenario, use the current video as a seed and find co-watched videos (e.g., “users who watched X also watched Y”).
(4) Trending/Global: Inject popular videos in the user’s region/language to promote freshness and viral content.
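The two-tower retrieval in (1) can be sketched with NumPy: each tower maps its input to a shared embedding space, and retrieval is a nearest-neighbor search by dot product. In production this search would use an approximate nearest-neighbor index (e.g., Faiss or ScaNN) rather than the exhaustive scan shown here; the random embeddings are placeholders for learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 32
# Placeholder video-tower embeddings (learned by the model in practice)
video_embs = rng.normal(size=(10_000, DIM)).astype(np.float32)

def retrieve(user_emb, k=500):
    """Return the indices of the top-k videos by dot-product similarity."""
    scores = video_embs @ user_emb            # similarity to every video
    top_k = np.argpartition(-scores, k)[:k]   # unordered top-k (O(n))
    return top_k[np.argsort(-scores[top_k])]  # sort only the k winners

user_emb = rng.normal(size=DIM).astype(np.float32)
candidates = retrieve(user_emb, k=500)
```

The argpartition-then-sort pattern avoids sorting the full catalog, which matters when the brute-force scan is used as a fallback or for small shards.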
Ranking (Scoring):
The goal is to precisely order the ~500 candidates from retrieval. Because it scores only a few hundred items rather than billions, this model can afford to be larger and slower than the retrieval stage.
(1) Deep Neural Networks (DNNs): The industry standard. Takes in hundreds of concatenated features (user, video, cross-features) through multiple fully-connected layers to output a single score (e.g., predicted watch time). Captures complex, non-linear interactions.
(2) Multi-Task Learning (MTL): A key advancement. Instead of predicting just one objective (e.g., click), a single model with shared hidden layers has multiple output heads (e.g., for click, watch time, like, share). This improves generalization by sharing signals between tasks and helps balance engagement with satisfaction.
(3) Sequence/Transformer Models: To model the user’s immediate session context, models can treat the sequence of recently watched videos as input (using RNNs or Transformers). This helps predict the “next best video” in the context of the current viewing mood.
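A toy forward pass of the multi-task setup in (2): one shared hidden layer feeds a separate output head per objective, and the per-task predictions are combined into a single ranking score. The weights, layer sizes, and task mix are illustrative assumptions, not a real trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
N_FEATURES, HIDDEN = 64, 16
# Shared representation layer, plus one linear head per task
W_shared = rng.normal(scale=0.1, size=(N_FEATURES, HIDDEN))
heads = {t: rng.normal(scale=0.1, size=HIDDEN)
         for t in ("click", "watch_time", "like")}
# Hand-tuned combination weights (a real system learns or A/B-tests these)
TASK_WEIGHTS = {"click": 0.2, "watch_time": 0.6, "like": 0.2}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(features):
    """Return (combined ranking score, per-task predictions)."""
    h = np.maximum(features @ W_shared, 0.0)               # shared ReLU layer
    preds = {t: sigmoid(h @ w) for t, w in heads.items()}  # per-task heads
    combined = sum(TASK_WEIGHTS[t] * preds[t] for t in preds)
    return combined, preds
```

The shared layer is what lets sparse signals (e.g., likes) borrow statistical strength from dense ones (e.g., watch time).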
Re-ranking & Post-Processing:
This stage is the final polish of the list: apply business and quality constraints such as diversity, freshness, safety filters, and exploration strategies before producing the final feed.
(1) Filters: Remove videos the user has already seen, filter out “shadow-banned” or inappropriate content.
(2) Diversity: Ensure the top 10 isn’t just one creator; inject different categories to avoid “filter bubbles.”
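A common and simple diversity heuristic is to cap how many videos a single creator can occupy in the final feed. A minimal sketch (the cap and field names are illustrative):

```python
def diversify(ranked, max_per_creator=2, k=10):
    """Greedily take videos in rank order, capping each creator's slots."""
    counts, feed = {}, []
    for video in ranked:
        creator = video["creator"]
        if counts.get(creator, 0) < max_per_creator:
            feed.append(video)
            counts[creator] = counts.get(creator, 0) + 1
        if len(feed) == k:
            break
    return feed
```

More sophisticated approaches (e.g., determinantal point processes or sliding-window category constraints) optimize diversity jointly with relevance rather than greedily.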