Using A Sharper Stick
Posted by Trixter on November 22, 2007
I met with Brian the other day to help out with some MobyGames infrastructure activities and we rekindled the idea of allowing video submissions. These could be actual videos of gameplay, or video reviews, or commercials… but before we work all that out, we have to work out how feasible it is from a technical standpoint. Which gives me yet another excuse to toy with video :-)
So we’ve got a few questions to answer here: What video and audio codecs is YouTube using? What parameters (approximately) is YouTube using to encode video?
Thankfully for me, my friend Mike Melanson has already answered most of these questions: It’s Flash video, using Sorensen Spark (h.263 with some limitations) for video and MP3 for audio. To determine the encoding parameters, I grabbed a few .flvs from YouTube and ran “ffmpeg -i myvideo.flv” to see what ffmpeg could glean from the file format, and it identified the audio stream as 22050Hz audio, 1 channel (mono), at 56kbps. But it couldn’t identify the video stream other than to tell me it was indeed using flv’s limited h.263. I asked Mike how he thought it was being encoded, and also what best practices were.
“No idea; I’m not really that much of an encoder guy.” Luckily, I am.
The first thing I wanted to determine was how the bitrate control was. Was it quality-based or bandwidth-based? I encoded two test videos, a simple one (few changes per frame) and a complex one (many changes per frame), both one minute in length. (By the way, I noticed a few people other than Mike and myself doing this, with names from the innocent “video quality test” to the unabashedly-named “ffmpeg FLV encoding raw audio mux test”.) After letting YouTube encode them, I sucked the end results back and immediately it was apparent that the encoder was bandwidth-based because both file sizes were identical. If it were quality-based (“constant quantization” for the encoding nerds out there), the simple file would have been substantially smaller.
After figuring this out, of course I had to determine what that constant bitrate setting was. After some manual binary partitioning trials, I narrowed it down to 240kb/s. Matching youtube’s 56kbps mp3 audio and encoding my originals, a setting of 240kb/s for video resulted in nearly the same filesize as the encoded YouTube .flvs.
I was happy with this until I actually viewed my encoding results, which looked substantially worse than YouTube’s files. It took a few more viewings of the “simple” example before I noticed that the YouTube video had a very long GOP (group of pictures). Meaning, a very long time went by before the entire frame was repainted with a keyframe. In my trials, my keyframe was changing every 15 frames; that’s twice a second @ 30fps. Since intraframes (keyframes) are much larger than interframes (frames where only the differences are stored), this was eating up my bandwidth, and visual quality suffered. I was able to determine YouTube’s intraframe interval by staring at a static section of the “simple” file, hitting a stopwatch when it changed, then hitting the stopwatch again when it changed again. I measured 8 seconds, which is 240 frames @ 30fps. Setting the GOP to an interval of 240 frames, my encoded files now matched YouTubes’ results. For the video encoding nerds out there, I can even see some of the same DCTs being used in the same places :-)
If you want to reproduce YouTube encoding yourself, possibly to see how your encoded video will look on YouTube without waiting for the upload+processing, grab yourself a command-line version of FFMPEG and use this syntax:
ffmpeg -i (inputfilename.whatever) -s 320x240 -b 240kb -ab 56kb -ar 22050 -ac 1 -g 240 (outputfilename.flv)
I’m sure it’s not 100% complete, but it certainly gets you 98% of the way there.
Will MobyGames improve on this? Most likely. For one thing, I didn’t agree with the decision of limiting everything to mono sound. For the typical YouTube talking head over laptop speakers, it makes perfect sense, but for stereo video game music or positional sound effects, it doesn’t. I was also not happy with the video quality at 240kb/s because, for fast-moving sources like FPS games, everything becomes a blocky mess. We will also probably come up with some convention (read: Encode It Yourself Dammit) for frame sizes over 320×240, like games that run 640×480 and are simple (ie. have very little motion or changes between frames). One of the “obvious” optimizations that turns out to be not so obvious is 2-pass encoding to spread out the bits more optimally, but this was only a benefit (in my tests) for complex sources. It actually made simple/static sources more blocky in the static parts.
After all this discovery, I was bothered by something. Check YouTube’s parameters again:
- 240 lines (framesize is 320×240)
- 240 kbps (video bitrate)
- 240 frames in a GOP
I’m a big fan of Occam’s Razor, but I was depressed to apply the Razor here because, after doing so, it sure seems like the YouTube guys were taking blind stabs at encoding parameters (ie. “Hey, let’s set everything to 240 and see what happens!”).