A case study for content adaptive encoding

Jina Jiayang Liu
May 3, 2020 · 6 min read

If you have been wondering about why “content adaptive encoding” is such a favorite subject of the video streaming industry, this post will show you a concrete example to help you “see” why it matters and where the challenge lies.

What is content adaptive encoding (CAE)?

Content adaptive encoding, or CAE, is an overloaded concept: people can mean quite different things while using the same term. At a very high level, it refers to adapting the encoding algorithm or system to the characteristics of the video content in order to optimize compression efficiency.

The promise of CAE comes from the observation that encoding efficiency varies a lot for different video content.

Some properties can make videos hard to compress, e.g. random noise, fine details, and dynamic textures.

Natural scenes often contain complex and dynamic details.

Some properties can make videos easy to compress, e.g. simple color and shapes, static scenes, and linear movements.

Man-made content is usually much easier to compress.

And for some properties, it is less obvious how the content will turn out after being compressed by different encoders.

In order to adapt the encoding system, the video content is typically classified into classes with distinct encoding properties and adaptation decisions are made at different scales, e.g. per video category, per title, per group of frames, per frame, or per frame tile.

First an example of why CAE could help

I picked two 1080p source videos: CrowdRun and FoxBird. CrowdRun has an outdoor scene with many moving objects and small human faces. FoxBird is a cartoon with mostly solid colored simple shapes and lines.

I compressed the two videos with the same target bitrate of 4Mbps using FFmpeg/H.264. For easier visual comparison, I placed the original video and the compressed video side by side in a split view and slowed playback down by 5x to make the differences easier to track.
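If you want to reproduce a similar setup, the sketch below drives FFmpeg from Python: a single-pass libx264 encode at the target bitrate, then an hstack/setpts filter graph for the slowed-down split view. The file names and exact flags are illustrative assumptions, not the precise commands behind the videos in this post.

```python
# Sketch of the encode + split-view comparison. File names and flags are
# illustrative; they are not necessarily the exact commands used for the
# videos embedded in this post.
import subprocess

def encode_h264(src, dst, target_bitrate="4M"):
    """Single-pass encode with libx264 at a given target bitrate."""
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-c:v", "libx264", "-b:v", target_bitrate,
        dst,
    ], check=True)

def split_view(src, compressed, dst, slowdown=5):
    """Source on the left, compressed on the right, slowed down `slowdown`x."""
    filt = f"[0:v][1:v]hstack=inputs=2,setpts={slowdown}*PTS"
    subprocess.run([
        "ffmpeg", "-y", "-i", src, "-i", compressed,
        "-filter_complex", filt,
        "-c:v", "libx264", "-crf", "18", dst,
    ], check=True)

encode_h264("CrowdRun_1080p.y4m", "CrowdRun_4M.mp4")
split_view("CrowdRun_1080p.y4m", "CrowdRun_4M.mp4", "CrowdRun_compare.mp4")
```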

The setup is motivated by a question commonly faced by video streaming services: for a given resolution, say 1080p, what is the right bitrate to stream at? In other words, will a certain bitrate, say 4Mbps, deliver satisfactory video quality?

Play the comparison video below for CrowdRun. You should notice the significant quality degradation on the right side: blurriness, blocking, malformed edges, and other compression artifacts. (If you are watching on a small screen, focus on the quality change of the runners’ faces as they move across the black line.)

CrowdRun source (left) and compressed (right)

Play the video below for FoxBird. You will probably hardly perceive any quality difference between the original and the compressed version.

FoxBird source (left) and compressed (right)

Same resolution, same bitrate, but dramatically different quality.

I actually lied when I said “same bitrate”. I set the same “target” bitrate for both videos, but the actual bitrates after encoding are different: 4.5Mbps for CrowdRun, and 3.7Mbps for FoxBird.
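You can check the actual bitrate of an encoded file with ffprobe. Here is a small helper along those lines, assuming a container (such as MP4) that reports the format-level bit_rate field:

```python
# Read the container-level bitrate of an encoded file with ffprobe.
import subprocess

def actual_bitrate_kbps(path):
    out = subprocess.run([
        "ffprobe", "-v", "error",
        "-show_entries", "format=bit_rate",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ], capture_output=True, text=True, check=True)
    return int(out.stdout.strip()) / 1000

print(actual_bitrate_kbps("CrowdRun_4M.mp4"))  # ~4500 in my run
print(actual_bitrate_kbps("FoxBird_4M.mp4"))   # ~3700 in my run
```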

So if your goal is to deliver videos with minimal bandwidth cost while maintaining consistently high quality across the whole category, setting a single global bitrate target will fail: it either causes quality issues for some videos or forces a higher bitrate that is often wasteful.

In a basic form of CAE, we adapt the target bitrate for each video, or each category of videos, so that bitrate can be saved on content that is easier to encode.
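As a toy illustration of that basic form, the sketch below probes a few candidate bitrates and keeps the cheapest one that clears a quality floor. The ladder, the threshold, and the helper functions are made-up placeholders, not a production CAE algorithm.

```python
# Toy per-title bitrate selection: encode at candidate bitrates (ascending)
# and keep the cheapest one that meets a quality floor. The ladder, the
# floor, and the encode/quality helpers are hypothetical placeholders.
CANDIDATE_BITRATES_KBPS = [1500, 2500, 4000, 6000]
QUALITY_FLOOR = 90  # e.g. a target mean quality score for the title

def pick_bitrate(source, encode_fn, quality_fn):
    for kbps in CANDIDATE_BITRATES_KBPS:
        encoded = encode_fn(source, kbps)             # produce a trial encode
        if quality_fn(source, encoded) >= QUALITY_FLOOR:
            return kbps                               # "good enough" at the lowest cost
    return CANDIDATE_BITRATES_KBPS[-1]                # give up and use the top rung
```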

This concludes my simple experiment to motivate CAE. Next I want to dive a little deeper into a question that is usually the first critical design decision for CAE algorithms:

What quality metric(s) should I use?

Quality metrics: the key and the challenge

Since most CAE approaches use some algorithm to optimize toward a goal balancing bitrate and video quality, we need a video quality metric to guide the optimization algorithm. An accurate and reliable metric is an essential part of making sensible encoding adaptation decisions.

It’s undoubtedly a complicated matter and is worth a lot more discussion than a single blog post can cover. So again I’ll approach it in the form of a simple case study, knowing that what I’m revealing is only the tip of the iceberg.

I took the same compressed videos used in the previous experiment and plotted the per frame quality scores for VMAF, PSNR, and SSIM for CrowdRun.

Per frame quality metrics for CrowdRun. Top to bottom: VMAF, PSNR, SSIM.
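If you want to generate per-frame scores like these yourself, one common route is FFmpeg’s libvmaf, psnr and ssim filters (this requires an FFmpeg build with libvmaf enabled; it is one possible tooling choice, not necessarily the one behind these plots):

```python
# Produce per-frame VMAF / PSNR / SSIM logs with FFmpeg filters.
# The distorted file goes first (the "main" input), the reference second.
import subprocess

def run_metric(distorted, reference, lavfi):
    subprocess.run([
        "ffmpeg", "-i", distorted, "-i", reference,
        "-lavfi", lavfi, "-f", "null", "-",
    ], check=True)

run_metric("CrowdRun_4M.mp4", "CrowdRun_1080p.y4m",
           "libvmaf=log_fmt=json:log_path=vmaf.json")   # per-frame VMAF
run_metric("CrowdRun_4M.mp4", "CrowdRun_1080p.y4m",
           "psnr=stats_file=psnr.log")                   # per-frame PSNR
run_metric("CrowdRun_4M.mp4", "CrowdRun_1080p.y4m",
           "ssim=stats_file=ssim.log")                   # per-frame SSIM
```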

The first observation that comes to me is:

PSNR and SSIM agree with each other most of the time. They have different ranges of value and so different scales, but the relative ranking of a frame within a video is almost always the same as rated by PSNR and SSIM.

VMAF behaves quite differently. In general, VMAF tends to have larger frame-to-frame variation, i.e. its curve is less smooth and appears noisier.
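One way to make the “agree on relative ranking” observation concrete is a rank correlation over the per-frame scores. A sketch, assuming the three score series have already been parsed from the logs into Python lists:

```python
# Spearman rank correlation between two per-frame score series: values close
# to 1.0 mean the two metrics rank frames in nearly the same order.
from scipy.stats import spearmanr

def rank_agreement(scores_a, scores_b):
    rho, _ = spearmanr(scores_a, scores_b)
    return rho

# e.g. compare rank_agreement(psnr_per_frame, ssim_per_frame)
#      against rank_agreement(vmaf_per_frame, psnr_per_frame)
```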

Does the larger VMAF variation predict actual visual quality differences, and is thus a strength of VMAF, or is it the opposite?

Let’s look at an example at frame 27 and frame 28:

Top: visual comparison for frame 27 and 28; Bottom: frame 27 to 28 metric score changes as circled.

From frame 27 to frame 28, there is a sharp VMAF jump from 64 to 72. An increase of 8 VMAF points should ideally predict a perceivable quality improvement. But as shown in the image above, I couldn’t find any visual evidence to support that prediction, even after looking carefully through the whole frame. PSNR and SSIM also rate frame 28 higher than frame 27, but to a much smaller extent.

This example leads me to believe that VMAF is more sensitive to content “noise” that can be falsely reported as a quality change.
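If you want to inspect individual frames like this yourself, FFmpeg’s select filter can dump them as images. A sketch (note that the filter counts frames from 0, which may be off by one relative to the plot’s indexing):

```python
# Dump a single frame from a video as a PNG for side-by-side inspection.
import subprocess

def dump_frame(video, frame_index, out_png):
    subprocess.run([
        "ffmpeg", "-y", "-i", video,
        "-vf", f"select=eq(n\\,{frame_index})",
        "-vframes", "1", out_png,
    ], check=True)

for n in (27, 28):
    dump_frame("CrowdRun_1080p.y4m", n, f"src_{n}.png")
    dump_frame("CrowdRun_4M.mp4", n, f"enc_{n}.png")
```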

What about FoxBird? Take a look at the per frame quality metrics below:

Per frame quality metrics for FoxBird. Top to bottom: VMAF, PSNR, SSIM.

The observations are quite similar to CrowdRun:

In terms of the relative ranking of frame quality, PSNR and SSIM agree more often with each other than with VMAF.

This is an example where VMAF doesn’t agree with PSNR and SSIM, at frame 1 and frame 20:

Visual comparison for FoxBird frame 1 (top) and frame 20 (bottom).

The VMAF score for frame 1 is 94, and for frame 20 it is 100 (perfect!). On the contrary, PSNR and SSIM both rate frame 20 considerably lower than frame 1.

I actually see more compression artifacts, like mosquito noise around sharp edges, in frame 20. You may need to zoom in to see the difference. In this case PSNR and SSIM seem to be more accurate than VMAF.

Although I happened to pick two examples that seem to favor VMAF less, the sample size is too small to draw any conclusion on the general accuracy of these metrics. However, it’s clear that:

Even though VMAF is widely believed to correlate better with perceptual quality than PSNR/SSIM, naively using VMAF for CAE optimization could lead to sub-optimal results.

Considering VMAF’s large variation at frame level, it’ll be particularly tricky to use VMAF to guide per frame encoding optimization.

To conclude the post, my key takeaways are:

  • CAE is a promising technique for video encoding system optimization, but its exact purpose and benefits have to be put in the context of the application.
  • One cannot overstate the importance of the right quality metric to the effectiveness of CAE. Moreover, the answer is highly application specific and encoder specific; a general purpose solution may fail to deliver in practice.
  • The key to success is lots of experiments on real world videos in a real world context. Decide the user engagement metrics essential to your business, measure them thoroughly, and let the data tell you the truth.

Jina Jiayang Liu

Founder of Curioasis. Software engineer and entrepreneur with 10+ years of experience in WebRTC, video, mobile and Web. Ex Visionular, Google, Microsoft.