I hate H264. It's very good. If you were making a video codec from scratch right now you would be hard pressed to
beat H264. And that's part of why I hate it. Because you have to compete with it, and it's a ridiculously over-complex
bucket of bolts. There are so many unnecessary modes and different kinds of blocks, different entropy coders, different
kinds of motion compensation, even making a fully compliant *decoder* is a huge pain in the ass.
And the encoder is where the real pain lies. H264, like many of the standards that I hate, is not a one-to-one transform
between decoded streams and code streams.
There is no explicit algorithm to find the optimal stream for a given bit rate. With all the different choices
that the encoder has of different block types, different bit allocation, different motion vectors, etc. there's a massive
amount of search space, and getting good compression quality hinges entirely on having a good encoder that searches
that space well.
All of this stifles innovation, and also means that there are very few decent implementations available because it's so
damn hard to make a good implementation. It's such a big arcane standard that's tweaked to the Nth degree, there are
literally thousands of papers about it (and the Chinese seem to have really latched on to working on H264 improvements,
which means there are thousands of papers written by non-English speakers, yay).
I really don't like overcomplex standards, especially this style that specifies the decoder but not the encoder. Hey,
it's a nice idea in theory, it sounds good - you specify the decoder, and then over the years people can innovate and
come up with better encoders that are still compatible with the same decoder. Sounds nice, but it doesn't work.
What happens in the real world is that a shitty encoder gains acceptance in the mass market and that's what everyone
uses. Or NO encoder ever takes hold, such as with the so-called "MPEG 4" layered audio spec, for which zero
mainstream encoders exist because it's just too damn complex.
Even aside from all that annoyance, it also just bugs me because it's not optimal. There are lots of ways to encode
the exact same decoded video, and that means you're wasting code space. Any time the encoder has choices that let it
produce the same output with different code streams, it means you're wasting code space. I talked about this a bit in
the past in the LZ optimal parser article, but it should be intuitively obvious - you could take some of those redundant
code streams and make them decode to something different, which would give you more output possibilities and thus reduce
error at the same bit rate. Obviously H264 still performs well so it's not a very significant waste of code space, but
you could make the coder simpler and more efficient by eliminating those choices.
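To put a rough number on that waste, here's a toy back-of-the-envelope sketch (my own illustration, nothing from the H264 spec): if every decodable output can be reached by k distinct code streams, the encoder is burning log2(k) bits of signaling capacity per choice - bits that could have addressed new outputs instead.

```python
import math

def wasted_bits(num_equivalent_streams):
    """Bits of code space lost when this many distinct code streams
    all decode to the exact same output."""
    return math.log2(num_equivalent_streams)

# Hypothetical example: if an encoder can signal the same decoded block
# via 4 different mode choices, it wastes 2 bits of address space there.
print(wasted_bits(4))  # 2.0
```

The point is just that redundancy in the code-stream-to-output mapping is directly convertible into rate: collapse the equivalent streams and you get those bits back.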
Furthermore, while the motion compensation and all that is very fancy, it's still "ghetto". It's still a gross approximation
of what we really want to do, which is *predict* the new pixel from its neighbors and from the past sequence of frames.
That is, don't just create motion vectors and subtract the value and encode the difference - doing subtraction is a very
primitive form of prediction.
Making a single predicted value and subtracting is okay *if* the predicted probability distribution is a unimodal
Laplacian, and you also use what information you can to predict the width of that Laplacian. But often it isn't. Often
there are pixels for which you can make a good prediction that the value is very likely either A or B, with each mode
Laplacian, but with a single prediction you'd have to guess (A+B)/2, which is no good. (An example of this is along the
edges of moving objects, where you can very strongly predict any given pixel to come either from the still background
or from the edge of the moving object - a bimodal distribution.)
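A quick numeric sketch of why the midpoint guess is so bad (my own toy numbers, assuming a 50/50 mix of two Laplacians at A and B): code a sample that actually came from the A mode, once under a single Laplacian centered at (A+B)/2, and once under the two-mode mixture.

```python
import math

def laplace_pdf(x, mu, b):
    """Laplacian density with center mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2 * b)

def mixture_pdf(x, a, bb, b):
    """50/50 mixture of Laplacians centered at a and bb, same scale b."""
    return 0.5 * laplace_pdf(x, a, b) + 0.5 * laplace_pdf(x, bb, b)

A, B, scale = 0.0, 40.0, 4.0   # hypothetical background/object levels
x = 1.0                        # a pixel that really came from the A mode

# Ideal code length in bits is -log2(probability assigned to the sample).
cost_single  = -math.log2(laplace_pdf(x, (A + B) / 2, scale))  # midpoint guess
cost_mixture = -math.log2(mixture_pdf(x, A, B, scale))         # keep both modes

print(cost_single, cost_mixture)  # the mixture codes this sample in far fewer bits
```

The single prediction puts the bulk of its probability mass at values that almost never occur, so it pays several extra bits on nearly every pixel along that edge; modeling the bimodal distribution directly keeps the mass where the data actually is.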
"12-02-08 - H264"