Reverse Engineering YouTube's Statistics Generation Algorithm
While surfing YouTube a while back, I noticed that you could view the statistics for a given video. Most of the videos I view are quite boring and have low view counts, so I thought popularity might be the trigger: only popular videos have stats. However, while surfing YouTube today to see how they handled some statistics, I saw some patterns emerge that tossed that theory out the window. Videos with even a few hundred views had statistics.
We can assume that Google has kept track of every view and every statistic possible since YouTube was merged into its platform. Even old videos have data going back to late 2007, as evidenced by many different videos; some mention 30 Nov 2007 as the earliest data collection date.
So, we face a quandary. We have videos from 2005 through today, stats from late 2007 through today, and stats displayed on the video page that have been rolled out since mid 2010. Old videos that don’t currently display stats are obviously still gathering them, but they must carry a flag saying that the old data hasn’t been imported, since the page will only mention Honors for this Video. How do you approach the problem?
We know that the data is collected and applied in batches, and it appears that every video has statistics from a particular date forward. Recent videos all have full statistics, even with a few hundred views, no comments and no favorites. The catalyst doesn’t appear to be an interaction with a video; merely viewing a video must signal the system to backfill statistics. There is probably some weight given to popular videos, though those videos would have a lot more history. One must balance the time required to import a very popular video against importing the history of hundreds of less popular videos. One of the benefits of Bigtable, if architected properly, would be the ability to process each video’s history in one shot, set the stats-processed flag and move on to the next video. One might surmise that Google knew to collect the view data but may not have thought about how the data would be used.
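To make the "one shot, then set the flag" idea concrete, here is a minimal sketch of what a per-video backfill step might look like. Everything in it is my own assumption for illustration: the `backfill_video` function, the `stats_processed` and `daily_views` field names, and the shape of the view log as `(video_id, day)` pairs. It runs on plain Python structures and says nothing about how Google actually stores or processes this data.

```python
from collections import Counter
from datetime import date


def backfill_video(video, view_log):
    """Aggregate one video's historical views in a single pass.

    video    -- dict with a 'video_id' key and a 'stats_processed' flag
                (hypothetical field names)
    view_log -- iterable of (video_id, day) tuples, one tuple per view
                (hypothetical log shape)
    """
    # A video whose flag is already set needs no further work.
    if video.get("stats_processed"):
        return video

    # Count views per day for this video only.
    daily = Counter(day for vid, day in view_log if vid == video["video_id"])

    video["daily_views"] = dict(daily)
    # Setting the flag is what would let the video page show full stats.
    video["stats_processed"] = True
    return video
```

Calling `backfill_video` a second time on the same video returns it untouched, which is the property that would let a queue safely contain duplicate entries.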
How do you choose videos to be processed? When you process the live views, you might decide to put a video into a queue for backfill processing. But queuing a very lightly viewed video might delay backfilling another video whose statistics might be more interesting or provocative. We can assume that there is a fixed date after which a video doesn’t require backfilling, which makes our backfill decision a little easier.
As the logs are being processed, we might keep a list of the video_id, creation date and number of daily views. That data would be inserted into a backfill queue for our backfill process. In the backfill process, we would look at the creation date, the number of daily views and the number of mentions in the backfill queue. To build a priority list of the items to process, we might look at the velocity of hits from one day to the next, triggering a job queue entry for a video that is suddenly getting popular. We might also weigh the views and the distance between the creation date and the fixed point in time where stat displays started. This would allow us to take a lightly viewed video that was created just before our fixed point and prioritize it in the backfill queue. Now we’ve got a dual priority system that would allow us to tackle two problems at the same time and meet in the middle. Each day, new entries are inserted into the queue, altering the priority of existing entries, which would allow the stats to be backfilled in a manner that would appear to be very proactive.
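The dual priority idea can be sketched in a few lines. This is only my guess at how such a score might be put together. The `QueueEntry` fields, the `FIXED_POINT` date (I picked 1 June 2010 as a stand-in for "mid 2010") and all three weights are hypothetical values chosen for illustration, not anything observed from YouTube.

```python
import heapq
from dataclasses import dataclass
from datetime import date

# Hypothetical stand-in for the date stat displays started ("mid 2010").
FIXED_POINT = date(2010, 6, 1)


@dataclass
class QueueEntry:
    video_id: str
    created: date
    views_yesterday: int
    views_today: int
    mentions: int  # how many times the video has appeared in the backfill queue


def priority(entry, velocity_weight=1.0, proximity_weight=1000.0,
             mention_weight=10.0):
    """Combine the two signals from the text into one score (weights are guesses)."""
    # Signal 1: day-over-day growth in views; a drop counts as zero.
    velocity = max(entry.views_today - entry.views_yesterday, 0)
    # Signal 2: how close the creation date sits to the fixed point.
    days_before = max((FIXED_POINT - entry.created).days, 0)
    proximity = 1.0 / (1 + days_before)
    return (velocity_weight * velocity
            + proximity_weight * proximity
            + mention_weight * entry.mentions)


def backfill_order(entries):
    """Return video ids from highest to lowest priority."""
    # heapq is a min-heap, so scores are negated to pop the largest first.
    heap = [(-priority(e), e.video_id) for e in entries]
    heapq.heapify(heap)
    order = []
    while heap:
        _, video_id = heapq.heappop(heap)
        order.append(video_id)
    return order
```

With these weights, a video created the day before the fixed point outranks an equally quiet video from 2006, while a sudden spike of several thousand views outranks both. Re-running `backfill_order` each day with fresh counts is what would reshuffle the existing entries.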
At some point, videos that were created prior to the fixed point in time but haven’t been viewed could be added to a cleanup queue. Since they weren’t viewed, generating statistics for them isn’t as important. And if a video has been viewed, it is already in the backfill queue. Since the queue could dispatch the jobs to as many machines as Google wanted, stats could be added to YouTube videos as the load on their distributed computers allowed.
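Here is a rough sketch of how the cleanup queue and load-based dispatch might fit together. The function names, the `FIXED_POINT` date and the notion of "spare capacity" as a fraction between 0 and 1 are all assumptions of mine, made up to illustrate the idea.

```python
from datetime import date

# Hypothetical stand-in for the date stat displays started ("mid 2010").
FIXED_POINT = date(2010, 6, 1)


def cleanup_candidates(catalog, viewed_ids, processed_ids):
    """Pick old videos that never showed up in the view logs.

    catalog       -- dict mapping video_id to creation date
    viewed_ids    -- ids seen in the logs (already in the backfill queue)
    processed_ids -- ids whose stats have already been backfilled
    """
    return sorted(
        vid for vid, created in catalog.items()
        if created < FIXED_POINT
        and vid not in viewed_ids
        and vid not in processed_ids
    )


def batch_size(spare_capacity, max_batch=100):
    """Scale the number of jobs to dispatch by the spare capacity (0.0 to 1.0)."""
    clamped = min(max(spare_capacity, 0.0), 1.0)
    return int(max_batch * clamped)


def next_batch(queue, spare_capacity, max_batch=100):
    """Split the queue into (jobs to dispatch now, jobs left waiting)."""
    n = batch_size(spare_capacity, max_batch)
    return queue[:n], queue[n:]
```

When the machines are busy, `spare_capacity` drops toward zero and almost nothing is dispatched; when they are idle, the cleanup queue drains at full speed.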
What do you think?
How would you backfill log data from an estimated 900 million videos, serving 2 billion video views a week?