Bleh, I just spent the whole day tracking down a bug in my MS-SSIM implementation that turned out to be a total
red herring. I was trying to test my MS-SSIM on the TID2008 database to confirm their results (they find that MS-SSIM is
the best by far). (Note that perceptual image papers currently have the nasty property that every author demonstrates
that their method is the "best", because they all use different testing methods on different databases.)
Anyway, I was getting totally wrong results, so I went over my MS-SSIM implementation with a fine-toothed comb and checked
everything against the original Matlab; I found a few deviations, but they were all irrelevant. The problem was that I was
running the files in the TID database in a different order than the TID test program expected, so it was matching the wrong
file to the wrong file.
As part of that, I downloaded all the implementations of MS-SSIM I could find. The best one is in
Metrix Mux. Most of the others have some deviation from
the original. For example, many people get the Gaussian window wrong (a Gaussian with sdev 1.5 is e^(- 0.5 * (x/1.5)^2 ) - people leave
off the 0.5); others incorrectly apply the window at every pixel position (you should only apply it where the whole window is inside
the image, not hanging off any edge); another common flaw is getting the downsample wrong: the way the original authors do it is with a Daub9 lowpass filter
and then *point* downsampling (they use 1:2:end Matlab notation, which is a point downsample). Anyway, a lot of these details are
pretty arbitrary. Also of note: Metrix Mux uses Rec. 601 luma and does not gamma correct.
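To make the first two pitfalls concrete, here's a minimal sketch in Python/NumPy (function names and defaults are illustrative, not from any particular implementation) of the correct Gaussian window and the filter-then-point-downsample step:

```python
import numpy as np

def gaussian_window(radius=5, sdev=1.5):
    # Correct SSIM window: exp(-0.5 * (x/sdev)^2), then normalize.
    # The common bug is leaving the 0.5 out of the exponent.
    x = np.arange(-radius, radius + 1, dtype=float)
    w = np.exp(-0.5 * (x / sdev) ** 2)
    return w / w.sum()

def msssim_downsample(im, lowpass):
    # Lowpass filter (separably, per axis), then *point* downsample,
    # i.e. Matlab's im(1:2:end, 1:2:end) - not 2x2 block averaging.
    f = np.apply_along_axis(lambda r: np.convolve(r, lowpass, mode="same"), 1, im)
    f = np.apply_along_axis(lambda c: np.convolve(c, lowpass, mode="same"), 0, f)
    return f[0::2, 0::2]
```

Note that the window should also only be evaluated where it lies fully inside the image ("valid" correlation); this sketch leaves that to the caller.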
The TID2008 perceptual distortion database is a lot better than the Live database, but it's still not great.
The problem with both of them is that the set of distortions applied is just too small a subset. Both of
them mainly just apply some random noise, and then they both apply JPEG and JPEG2000 distortions.
That's okay if you want a metric that is good at specifically judging the human evaluation of those types of
distortions. But it's a big problem in general.
It means that metrics which consider other factors are not given credit for their considerations.
For example, TID2008 contains no hue rotations, or images that have constant luma channels but visible detail
in chroma. That means that metrics which only evaluate luma fidelity do quite well on TID2008. It has no images
where just R or just G or just OpponentBY is modified, so you can't tell anything about the importance of different
color channels to perception of error.
TID2008 has 25 images, which is too many really; you only need about 8. Too many of the TID images are the same,
in the sense that they are photographs of natural scenes. You only need 2 or 3 of those to be a reasonable
representative sample, since natural scene photos have amazingly consistent characteristics. What is needed are
more distortion types.
Furthermore, TID has a bunch of distortion types that I believe are bogus; in particular all the "exotic" distortions,
such as injecting huge solid color rectangles into the image, or changing the mean luminance. The vast majority of
metrics do not handle this kind of distortion well, and TID scores unfairly penalize those metrics. The reason it's
bogus is that I believe these types of distortions are irrelevant to what we are doing most of the time, which is
measuring compressor artifacts. No compressor will ever make distortions like that.
And also on that thread, too many of the distortions are very large. Many metrics only work well near the threshold
of detection (that is, where the distorted image looks almost the same as the original). That limited function is actually okay,
because that's the area we really care about. The most interesting area of work is near threshold, because that is
where we want to be making our lossy compressed data - you want it to be as small as possible, but still pretty close to
visually unchanged. By having very huge distortions in your database, you give too many points to metrics that handle those
huge distortions well, and you penalize metrics that are accurate near threshold.
Lastly, because the databases are all too small, any effort to tune to the databases is highly dubious. For example, you
could easily do something like the Netflix Prize winners did and create an ensemble of experts - various image metrics that
you combine by estimating how good each metric will be for the current query. But training that on these databases would
surely just give you overfitting, not a good general-purpose metric.
(as a simpler example, MS-SSIM has tons of tweaky hacky bits, and I could easily optimize those to improve the scores
on the databases, but that would be highly questionable).
Anyway, I think rather than checking scores on databases, there are two better ways to test metrics:
1. Metric competition as in "Maximum differentiation (MAD) competition: A methodology for comparing computational models of perceptual quantities".
Unfortunately, for any kind of serious metric the gradient is hard to solve analytically, so it requires numerical optimization, and this can be
very time consuming. The basic method I would take is: take an image, try a few random steps, measure how they affect metrics M1 and M2, then
form a linear combo of the steps such that the effect on M2 is zero while the effect on M1 is maximized. Obviously the metrics are not actually linear,
but in the limit of tiny steps they are.
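That linearized step could be sketched like this (hypothetical code; M1 and M2 are any callables from image to scalar, and the step count and epsilon are arbitrary choices):

```python
import numpy as np

def mad_step(x, M1, M2, n_steps=8, eps=1e-3, rng=None):
    # One MAD-competition step (sketch): find a tiny perturbation that
    # changes metric M1 as much as possible while leaving M2 unchanged
    # to first order. Metrics are linearized by finite differences.
    rng = np.random.default_rng() if rng is None else rng
    base1, base2 = M1(x), M2(x)
    steps = [eps * rng.standard_normal(x.shape) for _ in range(n_steps)]
    g1 = np.array([M1(x + s) - base1 for s in steps])  # effect of each step on M1
    g2 = np.array([M2(x + s) - base2 for s in steps])  # effect of each step on M2
    # Pick weights w that maximize w.g1 subject to w.g2 == 0 and |w| == 1:
    # project g1 onto the hyperplane orthogonal to g2.
    w = g1 - (g1 @ g2) / (g2 @ g2) * g2
    w = w / np.linalg.norm(w)
    return sum(wi * si for wi, si in zip(w, steps))
```

Iterating this (and the symmetric version holding M1 fixed) produces the pair of maximally-disagreeing images the MAD paper uses for human comparison.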
2. Metric RD optimization. Using a very flexible hypothetical image coder (*), do R/D optimization using the metric you wish to test. A very
general brute-force RD optimizer will dig out any oddities in the metric. You can test metrics against each other by comparing the images that this
compressor generates at 1.0 bits per pixel (or whatever).
* = you shouldn't just use JPEG or something here, because then you are only exploring the space of how the metric rates DCT truncation errors. You
want an image compressor that can make lots of weird error shapes, so you can see how the metric rates them. It does not have to be an actual
competitive image coder, in fact I conjecture that for this use it would be best to have something that is *too* general. For example one idea
is to use an over-complete basis transform coder, let it send coefficients for 64 DCT shapes, and also 64 Haar shapes and maybe some straight
edges and linear ramps, etc. H264-Intra at least has the intra predictors, but maybe also in-frame motion vectors would add some more
useful freedom.
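As a sketch of how crude such an optimizer can be, here's a hypothetical greedy R/D loop (all names illustrative; rate is charged as a flat unit per atom sent rather than real bits) that keeps a dictionary atom only when it lowers J = D + lambda*R under the metric being tested; the atoms could be DCT shapes, Haar shapes, ramps, whatever:

```python
import numpy as np

def greedy_rd_optimize(image, atoms, metric, lam):
    # Brute-force R/D sketch: walk an over-complete dictionary of unit-norm
    # atoms, greedily keeping each one only if it lowers J = D + lambda*R,
    # where D is the metric under test (lower = better).
    recon = np.zeros_like(image, dtype=float)
    rate = 0.0
    best_J = metric(image, recon)          # rate 0, pure distortion
    for atom in atoms:
        coef = (image - recon).ravel() @ atom.ravel()  # project the residual
        trial = recon + coef * atom
        J = metric(image, trial) + lam * (rate + 1.0)
        if J < best_J:
            recon, rate, best_J = trial, rate + 1.0, J
    return recon
```

Swapping in a different metric changes which atoms survive at a given lambda, and comparing the reconstructions at equal rate is exactly the side-by-side test described above.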
"11-18-10 - Bleh and TID2008"