Post a Comment On: cbloom rants

"10-02-10 - WebP"

17 Comments
Blogger ryg said...

"That is, it lets the encoder choose what the error looks like, and if your encoder knows what kinds of errors look better, that is very strong."
That's a useful way to think about lossy coders in general. Ultimately, improvements to the actual coding stages of a lossy encoder won't make a big difference; we got improvements in the double-digit percents by going from the very basic DPCM + RLE + Huffman in early standards (JPEG, MP3 etc.) to something more efficient (arithmetic coder, rudimentary context modeling). That's maybe 15-20% improvement for a >2x complexity increase in that stage. If we really brute-forced the heck out of that stage (throw something of PAQ-level complexity at it), we might get another 15% out of it, at >1000x the runtime cost. Barring fundamental breakthroughs in lossless coding (something at LZ+Huffman complexity levels that comes within a few percent of PAQ), we're just not going to see big improvements on the coding side (researchers seem to agree; entropy coder research on H.265 is focused on making the coders easier to parallelize and more HW-friendly, not compress better).

In the lossy stage, there's way bigger gains to expect - we could throw away a lot more information if we made sure it wasn't missed. It's a noise shaping problem. For components on the lossy side, the type of noise they produce is very important.

That's why lapped transforms suck for image coding - they're more complex than block-based transforms and they don't really fix blocking artifacts: they get rid of sharp edges and replace them with smooth basis functions that decay into the next block. The blocks are still there, they just overlap (that's the whole point after all), and the end result looks the part - as if you took a blocky image and blurred it. It doesn't have the sharp edges but it still looks crap. Not only does it not fix the problem, it spreads the error around so it's harder to fix. That's an important point: it's okay to make errors if you can fix them elsewhere.

Block-based coders have blocking artifacts, but we know where to expect them and how to detect them if they occur, so we can fix them with deblocking filters. You can get away with quantizing DC coefficients strongly if you go easier on the first AC coefs so you don't mess up gradients (and if you have gradient-like prediction filters, you can heavily quantize ACs too). And so on.
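The DC-vs-AC tradeoff described above can be sketched as a per-coefficient step table. This is a toy illustration, not any real codec's quantization matrix; all step values here are made up:

```python
# Toy per-coefficient quantization for an 8x8 DCT block, illustrating
# "quantize DC strongly, go easier on the first ACs". Step values are
# hypothetical, chosen only to show the shape of the idea.

def make_step_table(dc_step=24, first_ac_step=8, hf_step=32):
    """Coarse DC (a deblocking filter can clean up the block edges),
    fine first ACs (they carry gradients), coarse high-frequency ACs."""
    steps = [[hf_step] * 8 for _ in range(8)]
    steps[0][0] = dc_step
    for (u, v) in ((0, 1), (1, 0), (1, 1)):
        steps[u][v] = first_ac_step
    return steps

def quantize_block(coeffs, steps):
    return [[round(coeffs[u][v] / steps[u][v]) for v in range(8)] for u in range(8)]

def dequantize_block(q, steps):
    return [[q[u][v] * steps[u][v] for v in range(8)] for u in range(8)]
```

Reconstruction error per coefficient is bounded by half its step size, so the table shapes the error budget toward frequencies where the damage is either less visible or easier to repair afterwards.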

October 2, 2010 at 12:58 PM

Blogger ryg said...

A nice thing about deblocking filters in a block-based mocomp algorithm is that you have the DCT blocks and the mocomp blocks and you can fix both with the same postprocess - two birds with one stone. Wavelets have the problem that the low-frequency stuff (which you usually spend more bits on) is less localized than the high-frequency content, so errors are spread over wider areas and harder to detect (e.g. ringing can occur everywhere in an image). Stuff like OBMC has the same problem as lapped transforms: the blocks are still there, they just blend smoothly into each other, and the visible artifacts are now all over the place instead of nicely concentrated at a few known locations in the image.

Interestingly, there's not as much research into deringing filters as into deblocking filters, although there are a few papers. That would presumably help wavelet coders a lot in terms of subjective performance (of course it also helps on block-based coders, and once you go 16x16, it's definitely something to think about). As for lapped transforms / OBMC, you'd need to do any post-processing per NxN block (you need the block coefficients to determine thresholds etc.), but then look at 2Nx2N pixels (or whatever your overlap is) to determine regions to fix. There's complexity problems (even if you have a fast lapped filter that's <2x the work of a block DCT, you still have 4x the work for deblocking now!) and the more thorny question of who gets to determine the threshold: with 2N x 2N blocks, most pixels are covered by 4 blocks - so which threshold are you gonna use? Min/Max them? Interpolate across the block? More complexity in the code, and way more complex to analyze theoretically too.

October 2, 2010 at 12:58 PM

Blogger cbloom said...

"Interestingly, there's not as much research into deringing filters as into deblocking filters, although there are a few papers. That would presumably help wavelet coders a lot in terms of subjective performance"

Yeah, there's also only a handful of papers on perceptually tuned wavelet coders. It seems clear that with modern techniques you could make a wavelet coder that is perceptually very good.

There's sort of an over-reaction belief going around now that "the promise of wavelets was a myth because they are not perceptually very good". That's not really quite right - what's true is that a non-perceptually-tuned wavelet coder (eg. the old ones) is not as good as a perceptually-tuned transform coder, but you are not really comparing apples to apples there.

So far as I know a fully "modern" wavelet coder (with more encoder choice, maybe directional wavelets, perceptual RDO, maybe adaptive wavelet shapes, etc.) doesn't exist. But it would probably fail in terms of complexity tradeoff anyway.

October 2, 2010 at 1:19 PM

Blogger cbloom said...

BTW I think there are some interesting areas to explore in the realm of visual quality :

1. Non-linear quantization (eg. not equal-sized buckets). We've basically given up on this because linear quantization is good for RMSE and easy to entropy code. But non-linear adaptive quantization might in fact be much better for visual quality. In fact I have a very primitive version of this in the new Rad lossy transform coder and even that very simple hack was a big perceptual win.
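A minimal sketch of what non-uniform buckets could look like. The edge values below are invented purely for illustration, and the sign is assumed to be coded separately:

```python
# Hypothetical non-linear scalar quantizer: bucket widths grow with
# magnitude, keeping fine detail for small coefficients and coarsening
# large ones. Worse in RMSE terms, but it shapes where the error goes.

import bisect

EDGES = [0, 4, 8, 16, 32, 64, 128]  # made-up bucket edges for magnitudes

def quantize_mag(m):
    """Map a nonnegative magnitude to a bucket index (entropy-coded later)."""
    i = bisect.bisect_right(EDGES, m) - 1
    return min(i, len(EDGES) - 2)

def dequantize_mag(i):
    """Midpoint restoration, kept deliberately simple here; the choice of
    restoration point within the bucket is a separate knob (see point 2)."""
    lo, hi = EDGES[i], EDGES[i + 1]
    return (lo + hi) / 2
```

Note the decoder only needs the shared edge table, so the non-uniformity costs nothing in signaling; making the edges *adaptive* is where the side-information tradeoffs start.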

2. Noise reinjection, perhaps via quantization that doesn't just restore to the average expected value within a bucket (center of bucket restoration is archaic btw). This is not at all trivial, and as usual hurts RMSE, but something like this might be very good at preserving the amount of detail even if it's the wrong detail (eg. avoid over-smoothing).
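One way to make point 2 concrete, assuming a Laplacian-shaped coefficient distribution (my assumption for the sketch, not something stated in the comment): restore to the conditional mean within the bucket, or go further and sample from the truncated distribution to reinject noise. The formulas follow from the exponential density; bucket and scale values in the test are hypothetical.

```python
import math
import random

def bucket_mean_laplacian(lo, hi, b):
    """Conditional mean of an exponential(scale b) coefficient magnitude
    restricted to the bucket [lo, hi). Biased toward zero relative to the
    bucket center, which is why center-of-bucket restoration is archaic."""
    w = hi - lo
    e = math.exp(-w / b)
    return lo + b - w * e / (1.0 - e)

def restore_with_noise(lo, hi, b, rng):
    """Noise reinjection: inverse-CDF sample from the same truncated
    distribution, preserving the *amount* of detail rather than returning
    one fixed point (hurts RMSE, may look less over-smoothed)."""
    u = rng.random()
    w = hi - lo
    return lo - b * math.log(1.0 - u * (1.0 - math.exp(-w / b)))
```

The sampled restoration has the right first moment by construction, so averaged over many coefficients it behaves like mean restoration while keeping per-coefficient variance - the "wrong detail, right amount" effect.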

3. Adaptive transform bases. Again these are a small win for RMSE, but I wonder if with a better understanding of perceptual quality it might be a huge win. (note that the H264 I predictors can be seen as a very basic form of this - they just add a constant shape to all the basis functions, and choosing a predictor is choosing one of these bases).
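The parenthetical about the H264 I predictors can be made concrete with a toy 1D example. The 4-point orthonormal Haar below is a stand-in transform (nothing here is H264's actual math): transforming the residual against a predictor is equivalent to analyzing the block in a basis whose functions are all offset by the predictor shape.

```python
import math

S = 1.0 / math.sqrt(2.0)
HAAR = [  # 4-point orthonormal Haar basis, rows are basis functions
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
    [S, -S, 0.0, 0.0],
    [0.0, 0.0, S, -S],
]

def forward(x):
    return [sum(HAAR[i][j] * x[j] for j in range(4)) for i in range(4)]

def inverse(c):
    return [sum(HAAR[i][j] * c[i] for i in range(4)) for j in range(4)]

def code_with_predictor(block, predictor):
    """Code (block - predictor); the decoder adds the predictor back.
    Choosing the predictor is choosing a shifted version of the basis."""
    residual = [x - p for x, p in zip(block, predictor)]
    coeffs = forward(residual)
    recon = [p + r for p, r in zip(predictor, inverse(coeffs))]
    return coeffs, recon
```

With a DC predictor equal to the block mean, the DC coefficient of the residual vanishes and the bits move to where the predictor was wrong - a degenerate but real form of basis adaptation.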

October 2, 2010 at 1:24 PM

Blogger ryg said...

4. Non-orthogonal transforms

Orthogonality is a no-brainer if you're optimizing for L2 error (=MSE=PSNR), but not so much when you're looking at other metrics. Even KLT/PCA gives you orthogonality, and from a coding standpoint that's not the metric to optimize for.

From a coding standpoint, what we exploit is the sparsity of the transform coefficients post-quantization. If we have a non-orthogonal transform (either because the basis functions aren't orthogonal or because we have more "basis" functions than we need) that leads to a higher degree of sparsity, that's a win.

From a perceptual standpoint, artifacts such as ringing are a result of orthogonality: we deliberately construct our basis such that quantization errors we make on coefficient 5 can't be compensated by the other coefficients.

One approach would be to go in the opposite direction: Use a highly redundant set of "basis" functions and try to optimize for maximum sparsity (another term for your cost function). You then need to code which functions you use per block, but that should be well-predicted from context. (There's lots of refinements of course: don't pick individual functions but groups of N so you have less sideband information to encode, etc.)
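A greedy way to "optimize for maximum sparsity" over a redundant set is matching pursuit. Here is a toy version; the dictionary of impulses, steps, and a ramp is invented for illustration, not anything proposed in the comment:

```python
# Toy matching pursuit over a made-up overcomplete dictionary: at each
# step, pick the atom most correlated with the residual and subtract
# its projection. Overcompleteness can buy a sparser representation
# than any single orthogonal basis.

import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def build_dictionary(n):
    atoms = []
    for i in range(n):                       # impulses
        atoms.append(normalize([1.0 if j == i else 0.0 for j in range(n)]))
    for i in range(1, n):                    # step edges
        atoms.append(normalize([1.0 if j >= i else 0.0 for j in range(n)]))
    atoms.append(normalize([float(j) for j in range(n)]))  # ramp
    return atoms

def matching_pursuit(signal, atoms, n_terms):
    residual = list(signal)
    picks = []
    for _ in range(n_terms):
        best = max(range(len(atoms)),
                   key=lambda k: abs(sum(a * r for a, r in zip(atoms[k], residual))))
        coef = sum(a * r for a, r in zip(atoms[best], residual))
        residual = [r - coef * a for r, a in zip(residual, atoms[best])]
        picks.append((best, coef))
    return picks, residual
```

On a step edge the overcomplete dictionary nails the signal with one atom, where the impulse basis alone would need several; the price is exactly the signaling cost discussed above - which atoms were used has to be coded.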

It's basically just a finer granularity version of the "adaptive transform bases" idea.

October 2, 2010 at 3:17 PM

Blogger   said...

This all makes more sense to me if instead of looking at Google as an engineering company I start looking at Google as an advertising company. Advertising is Google's main revenue source. They need to make waves and be in the public eye all the time. To not do so is to die, for them. It doesn't matter to them if something is better or not (of course better is always preferable, I'm sure).

October 4, 2010 at 8:34 PM

Blogger Aaron said...

Google has engineers that *could* have done better than VP8/WebP, but... they didn't, and instead half-assed it to bang something out and stay in the public eye?

October 4, 2010 at 10:13 PM

Blogger ryg said...

I think advertising is pretty much Google's only revenue source. But Google doesn't make money from people hearing about Google all the time, they make money from people using Google search. (They also make money from ads in their other products, but search is by far the biggest chunk).

They've been fairly hit and miss the last two years or so: Buzz, Wave, the slowly degrading usability of the Google search page, "search as you type", now WebP... either they're spread too thin or they're losing touch.

October 4, 2010 at 10:37 PM

Blogger   said...

I actually thought search-as-you-type was pretty awesome :) That and priority inbox are pretty cool.

October 5, 2010 at 8:25 AM

Blogger cbloom said...

Meh, I think it's just an example of the Google operating model. They don't really have a "strategy", just random groups that do random things, and they see what sticks. I'm sure this is just some random portion of the WebM group that said "hey, we can do this random thing", and it doesn't necessarily fit into any kind of master plan.

October 5, 2010 at 10:34 AM

Blogger Unknown said...

More marketing, along the same lines as WebP:
http://www.hipixpro.com/

October 6, 2010 at 2:28 AM

Blogger cbloom said...

Yeah, there's been tons of them over the years. None of the ones that rely on the consumer choosing to use them has ever taken off, because they just don't make sense.

Google has the rare ability to *force* a new format down everyone's throat. eg. if the Google image search cache automatically gave you WebP images, or Picasa automatically converted uploads to WebP or whatever. (YouTube already does this of course with its video recompression)

October 6, 2010 at 10:40 AM

Blogger   said...

Pretty soon that ability may be less rare. Two reasons:
1) Javascript is getting faster
2) HTML5 Canvas

You can implement any image compression format you want with those two tools. The question is: could you develop a JS implementation of a compression algorithm that is better than JPG with reasonable performance? If JS keeps getting faster, it might be possible.

October 6, 2010 at 10:55 AM

Blogger cbloom said...

That's crazy talk.

October 6, 2010 at 10:58 AM

Blogger   said...

Yeah, probably. But ya never know what those crazy JS people will do. Would be an interesting challenge.

Perhaps you could leverage WebGL and do some or all of the work on the GPU?

October 6, 2010 at 11:04 AM

Blogger cbloom said...

What you could do is send an H264 video with just a bunch of I frames and then run a little WebGL/JS to blit the frames out to bitmaps that you then use in your page.

October 6, 2010 at 11:07 AM

Blogger ryg said...

"Perhaps you could leverage WebGL and do some or all of the work on the GPU?"
You can do the DSP-ish stuff on the GPU, but even in C/C++ implementations without heavily optimized DSP code, you still spend a significant amount of time doing bitstream parsing/decoding - the part you can't easily offload to the GPU. For that part, you're just gonna have to live with whatever you get out of your JavaScript implementation.

October 6, 2010 at 9:24 PM
