
Post a Comment On: cbloom rants

"08-25-09 - Oodle Image Compression Looking Back"

11 Comments

Anonymous Anonymous said...

As per our bad image quality metrics, sounds like you are suggesting we need the image analog of what we have for audio compression (psychoacoustical Bark scale, equal-loudness contours, model of frequency masking, etc).

August 26, 2009 at 7:08 AM

Blogger cbloom said...

Yes, though even with audio I'm not aware of automated programs that rate the quality with those metrics.

What we really need is a function call that can compare two images and give you back a quality rating.
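The shape of the API being asked for might look something like this toy sketch, which rates block-wise local statistics (SSIM-flavored means/variances) instead of raw per-pixel error; the two constants are the standard SSIM stabilizers for 8-bit data, everything else here is a placeholder, not a validated metric:

```python
import numpy as np

def quality_score(ref, test, block=8):
    """Toy quality rating in [0, 1] (1.0 = identical).

    Compares block-wise local statistics (mean, variance, covariance)
    rather than raw per-pixel error, so a uniform shift of a region is
    penalized less than an isolated pixel change. A sketch of the API
    shape only, not a tuned perceptual metric.
    """
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    h, w = ref.shape
    scores = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            a = ref[y:y + block, x:x + block]
            b = test[y:y + block, x:x + block]
            mu_a, mu_b = a.mean(), b.mean()
            va, vb = a.var(), b.var()
            cov = ((a - mu_a) * (b - mu_b)).mean()
            c1, c2 = 6.5025, 58.5225  # SSIM stabilizers for 8-bit range
            s = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
                ((mu_a**2 + mu_b**2 + c1) * (va + vb + c2))
            scores.append(s)
    return float(np.mean(scores))
```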

Actually a lot of that psycho-perceptual stuff does not work well for images, because it presumes a certain viewing method and viewing environment. For example, the original JPEG visibility threshold research is actually quite good, but it is specific to a certain pixel size, a certain monitor type, a certain sitting distance, etc. It models visual masking in the eyes, and the result is that it's actually very bad when the image is used as a texture.

I actually think the visual masking and relative importance of different colors and things like that is not as important as the global effects. Maybe I'll try to make some example images to show what I mean.

August 26, 2009 at 9:21 AM

Anonymous Anonymous said...

I was thinking something more along the lines of a function which would provide a perceptual weight per DCT coef given a source block and the surrounding 8 blocks as context.

Looking at the micro context (just block encoding with context) could be a good start.
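One hypothetical shape for such a function, using the surrounding blocks only as an activity (masking) estimate; the frequency falloff curve and all constants here are invented for illustration, not from any standard:

```python
import numpy as np

def dct_coef_weights(block, neighbors):
    """Hypothetical sketch: per-coefficient perceptual weights for one
    8x8 block given its neighbors as context. Higher weight = error in
    that coefficient is more visible, so quantize it more finely.
    """
    # Pool activity over the block and its neighborhood: busy
    # surroundings mask coding errors.
    acts = [np.var(n) for n in neighbors] + [np.var(block)]
    activity = float(np.mean(acts)) + 1.0
    # CSF-like falloff: high spatial frequencies are less visible.
    u = np.arange(8)
    freq = np.sqrt(u[:, None]**2 + u[None, :]**2)  # radial freq index
    base = 1.0 / (1.0 + 0.25 * freq)               # made-up falloff
    # Masking: visibility drops slowly as local activity rises.
    return base / np.log2(2.0 + activity)
```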

Macro context stuff, like ability to shift around content to match basis functions as long as things like edges and relative tonal gradients are preserved globally, just gets awfully complicated (but might indeed provide better overall gains in perceptual quality per bitrate).

August 26, 2009 at 9:44 AM

Blogger cbloom said...

Yeah that's a start, and certainly would be easier to integrate into existing frameworks which rely on linear error, but my intuition is that the global stuff actually matters a lot. The simplest example is just if you have a big solid color patch of value 137 surrounded by noise, then changing any one pixel in there to 138 is very bad visually, but changing the whole patch to 138 is almost zero perceptual error.
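That example is easy to check numerically: MSE ranks the two distortions exactly backwards from perception (a quick numpy sketch):

```python
import numpy as np

# The 137/138 example: one flipped pixel in a flat patch is glaring,
# while shifting the entire patch by one level is invisible -- yet MSE
# scores the invisible edit as 256x worse.
patch = np.full((16, 16), 137.0)

one_pixel = patch.copy()
one_pixel[8, 8] = 138.0                  # very visible in a flat region

whole_patch = np.full((16, 16), 138.0)   # perceptually ~zero error

def mse(a, b):
    return float(np.mean((a - b) ** 2))

print(mse(patch, one_pixel))    # 1/256 = 0.00390625 -- "small" error
print(mse(patch, whole_patch))  # 1.0 -- 256x "larger" error
```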

August 26, 2009 at 10:45 AM

Blogger cbloom said...

I should say : using immediate neighbors is what I am doing now for my hacky improvement in Oodle, and it does in fact get you a huge win. I just feel like that's only the tip of the iceberg.

August 26, 2009 at 12:49 PM

Blogger ryg said...

I don't see how audio compression is much better in that regard. Yes, they incorporate psychoacoustic models at the encoder side, but only in a very local form of quantizer optimization, not unlike R-D optimization in image/video coders (though by now with relatively consistent success). They don't have any global quality optimization either, and compared with e.g. video encoders, their bitrate allocation strategies are very short-sighted. They also have basically the same problems as image/video coders: for example, like the individual over-smoothed blocks that Charles mentioned, perceptual audio codecs have the tendency to mess up transients, causing pre-echoes and overall mushiness (what would be called "blurriness" for visual signals, basically).

Finally, audio and image codecs are also in the same league compression ratio-wise. If you take 24bit RGB images and 44.1kHz 16-bit stereo CD audio (comes out at 1411kbit/s), you can see what I mean: at a ratio of 12:1 (2bits/pixel for images, 120kbit/s for audio) you get decent quality, 32:1 (0.75bits/pixel or 45kbit/s for audio) is recognizable but has very notable artifacts.

If anything, images actually do a bit better than audio: 1bit/pixel tends to look okay for most natural images, while the corresponding 60kbit/s for audio is already well into the region where most audio codecs produce really annoying artifacts for music.
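Spelling out the arithmetic behind those ratios (assuming 24 bpp RGB images and 44.1 kHz 16-bit stereo CD audio):

```python
# CD audio rate: 44100 samples/s * 16 bits * 2 channels = 1411.2 kbit/s
cd_rate_kbps = 44100 * 16 * 2 / 1000
image_bpp = 24  # uncompressed RGB

for ratio in (12, 32):
    bpp = image_bpp / ratio
    kbps = cd_rate_kbps / ratio
    print(f"{ratio}:1 -> {bpp} bits/pixel, {kbps:.1f} kbit/s")
# 12:1 gives 2.0 bits/pixel and ~118 kbit/s (the "decent quality" point);
# 32:1 gives 0.75 bits/pixel and ~44 kbit/s (notable artifacts).
```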

August 26, 2009 at 1:04 PM

Blogger cbloom said...

Yeah, let's not even get started on audio, Jeff and I could do a long rant on how bad the audio coders suck. In compression there are two separate but related ways that you get wins :

1. Modeling the data for prediction ; eg. using self-similarity and global similarity to make more likely streams smaller. In general audio codecs suck pretty bad at this; they know nothing about instruments or music theory.

2. Modeling the data for lossiness; eg. knowing how you are free to mutate the data in ways that make it perceptually similar to the original. In audio there are obviously tons and tons of degrees of freedom in how you could make different bit streams that sound the same, but codecs aren't sophisticated enough to know this. For example, have someone hit a cymbal 1000 times. Every one of those sound segments will sound perceptually interchangeable to the human ear, but will code very differently.

August 26, 2009 at 1:13 PM

Blogger ryg said...

Exactly; it's really the same problem in both audio and image compression. All the lossiness is targeted at processes happening at the lowest levels of information processing in the brain; the models are all on a signal processing level. For both audio and image coding, the gains between the first algorithms to do this and the subsequent refinements we're using now are really quite small; AAC compresses maybe 2x as well as MPEG-1 Layer 2, and current still image coders gain less than that over JPEG for most images. Video has seen larger improvements than that, but video coding efficiency is severely constrained by the requirement that it be playable in realtime, which limited options until the late 90s.

I'm pretty certain that it's possible to gain at least one order of magnitude in all of these applications, but that's not gonna happen by tweaking current state of the art approaches; in fact, all of them (JPEG 2k, AAC, H.264) are over-engineered and pretty far down the curve into diminishing returns already.

What's really necessary to make a big dent is to get away from the signal level and into higher-level semantic properties like the cymbal example, or the "all the gazillion different instances of uniform random noise blocks look exactly the same to the human eye" example that's already been mentioned.

Of course, everyone dealing with lossy compression knows this, and nobody really has a good solution.

A big problem is that we don't know all that much about how these parts of human perception work internally, either; the "plumbing" of both the aural and visual systems was researched a long time ago and is well understood by now, but e.g. methods to determine the similarity between two different shapes are still a very active research area, and that's after you've distilled them into a relatively concise, abstract representation. It's either a set of genuinely hard problems (which seems likely), or there's something subtle but crucial going on that everyone's been missing so far.

August 26, 2009 at 1:54 PM

Blogger   said...

I've done exactly what you described in a previous project... classified regions as noisy, smooth, etc. and then changed the quantization to match. In a noisy area the quantization can be harsh and you won't notice, but in a smooth area you want finer quantization steps. I had a table of quantization characteristics which the pixels could get classified into. It very much improved the compression ratio. I'd also recommend trying out injecting noise into certain areas on a block basis, as some parts of the image can be helped by it and others hurt. That was a big win in perceptual quality as well.
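A minimal sketch of that classify-then-quantize idea; the variance threshold and step sizes below are made-up placeholders, not the actual table the commenter describes:

```python
import numpy as np

def quant_step_for_block(block, smooth_q=4.0, noisy_q=24.0, thresh=50.0):
    """Classify a block by variance, then pick a quantizer step.

    Noisy blocks tolerate coarse quantization; smooth blocks need fine
    steps to avoid visible banding. All constants are placeholders.
    """
    return noisy_q if np.var(block) > thresh else smooth_q

def quantize_block(block, step):
    # Uniform scalar quantization: snap to the nearest multiple of step.
    return np.round(np.asarray(block, dtype=np.float64) / step) * step
```

A real version would classify into several categories (noisy, smooth, edge, ...) with a tuned step table per category, but the control flow is the same.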

August 27, 2009 at 1:56 PM

Blogger Jaba Adams said...

Uh, so I know nothing about compression ...

What about borrowing from the machine vision community? Look for feature points in the source image, then look for feature points in the compressed image.

Wave hands about a suitable definition of perceptual features.

Naturally, I'm wary of any proposed solution that is congruent with General AI.

August 31, 2009 at 11:23 AM

Blogger cbloom said...

I'm actually using some machine vision stuff in my new video work, I'll write about it soon. A lot of their stuff is very hacky.

August 31, 2009 at 11:31 AM
