
Post a Comment On: cbloom rants

"12-16-14 - Daala PVQ Emails"

11 Comments
Blogger Unknown said...

About PVQ indexing schemes: lots of these are possible. You can see the first one I came up with at , before I'd even heard of Fischer or found his 1986 paper. Eventually for CELT we switched to the original Fischer indexing scheme (because it was simpler to implement and being published over 20 years ago trumped any other minor advantages in bit error robustness another scheme might have). If you look at Section 4.3.4.2, you can see it's pretty simple to decode, and in Section 5.3.8.2, it's _really_ simple to encode.
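Fischer-style enumeration is compact enough to sketch in a few lines. Below is an illustrative Python sketch, not the CELT/RFC 6716 ordering; the order in which signs and magnitudes are enumerated here is one simple but arbitrary choice:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def V(n, k):
    # number of n-dim integer vectors x with sum(|x_i|) == k
    if k == 0:
        return 1
    if n == 0:
        return 0
    return V(n - 1, k) + V(n, k - 1) + V(n - 1, k - 1)

def encode(x):
    # map a pulse vector to a unique index in [0, V(len(x), K))
    n, k = len(x), sum(abs(v) for v in x)
    idx = 0
    for i, v in enumerate(x):
        rem = n - i                  # dimensions left, including this one
        for j in range(abs(v)):      # count vectors with a smaller magnitude here
            idx += V(rem - 1, k - j) if j == 0 else 2 * V(rem - 1, k - j)
        if v < 0:                    # positive sign enumerates first
            idx += V(rem - 1, k - abs(v))
        k -= abs(v)
    return idx

def decode(idx, n, k):
    # inverse of encode: recover the pulse vector from its index
    x = []
    for i in range(n):
        rem = n - i
        j = 0
        while True:
            cnt = V(rem - 1, k - j) if j == 0 else 2 * V(rem - 1, k - j)
            if idx < cnt:
                break
            idx -= cnt
            j += 1
        if j == 0:
            x.append(0)
        else:
            pos = V(rem - 1, k - j)  # positive-sign branch comes first
            x.append(j if idx < pos else -j)
            if idx >= pos:
                idx -= pos
        k -= j
    return x
```

The recurrence counts vectors by whether the first coordinate is zero, the pulse count shrinks, or both; any consistent enumeration order gives a bijection onto [0, V(L,K)).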

You don't want to plug these kinds of dimension-at-a-time schemes directly into an arithmetic coder, though, because for large dimensions the probabilities become highly skewed. Enough that it's hard to model with typical arithmetic coder precision (and since you pay computational overhead per symbol, highly skewed distributions mean you spend a lot of cycles without learning much new about your signal). Instead, you'd want to do something like what we describe for SILK in Section 4.2.7.8.3, where you split the vector and say how many pulses lie in each half. That gives you much more balanced probabilities.
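A minimal sketch of that splitting idea, assuming the total pulse count k is already known to the decoder: each split emits "pulses in the left half" as one symbol from an alphabet of k+1 outcomes, which an entropy coder can model with far more balanced probabilities than per-coefficient symbols:

```python
def split_encode(x, syms):
    # recursively code pulse placement; syms collects (symbol, alphabet_size)
    # pairs as they would be handed to an entropy coder
    k = sum(abs(v) for v in x)
    if k == 0:
        return                       # nothing to say about an all-zero subvector
    if len(x) == 1:
        syms.append((0 if x[0] > 0 else 1, 2))   # just the sign at a leaf
        return
    h = len(x) // 2
    kl = sum(abs(v) for v in x[:h])
    syms.append((kl, k + 1))         # how many of the k pulses go left
    split_encode(x[:h], syms)
    split_encode(x[h:], syms)

def split_decode(it, n, k):
    # it yields the (symbol, alphabet_size) pairs in encode order
    if k == 0:
        return [0] * n
    if n == 1:
        s, _ = next(it)
        return [k if s == 0 else -k]
    h = n // 2
    kl, _ = next(it)
    return split_decode(it, h, kl) + split_decode(it, n - h, k - kl)
```

A real implementation would drive an arithmetic coder with modeled distributions for each (n, k) split instead of collecting raw symbols.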

But as Jean-Marc mentioned, we don't use these uniformly distributed schemes in Daala. They're a fun aside (highly relevant for audio, not so much for video).

Inter-band masking: we were doing this back in April (when I made the slides you point to). However, we removed it in October: . Its effect was always pretty minor, it was fairly expensive (as originally implemented, I'm sure it could've been simplified), and removing it actually improved metrics slightly. More importantly, it introduced dependencies between the entropy coder and the "signal processing" parts of PVQ, which prevented deeply pipelining these stages in hardware.

Also, apologies for the example image I used in those slides. It was literally the first image I tried.

Directional intra: I'm sure you've seen . After spending far too much time trying to get this to work in a reasonable computational complexity, we eventually disabled it. We now use a much simpler scheme that can only predict pure horizontal or pure vertical edges.

Lossy AC encoding: I suspect for very low bitrates, rather than try fill with random noise or using splitting planes or whatever, what you really want to do is just use small, trained VQ codebooks. At low rates the codebooks shouldn't be very large, and any other scheme is going to wind up equivalent to some form of VQ, without the benefit of training. One thing we've discussed is trying to classify blocks based on the first (few) band(s), and then using the classification to select from one of several different VQ codebooks for the remaining bands. For inter, you could potentially do the classification on the predictor instead of the LF, so you could apply this scheme to the LF bands, too. For intra, where we don't have good directional predictors anymore, you could even use the trained VQ codeword as the PVQ "reference" for the diagonal bands, letting you extend this scheme to higher rates.
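The classify-then-select step might look like the sketch below (hypothetical names; the classifier producing `class_id` from the first band(s) or the predictor is assumed to exist, and each trained codebook is just an array searched by brute force):

```python
import numpy as np

def code_band(band, codebooks, class_id):
    # pick the trained codebook for this block class, then find the
    # nearest codeword in squared error; return its index and reconstruction
    cb = codebooks[class_id]                 # shape: (n_entries, n_coeffs)
    err = np.sum((cb - band) ** 2, axis=1)
    i = int(np.argmin(err))
    return i, cb[i]
```

At low rates the codebooks stay small, so the index costs only a few bits and the brute-force search is cheap.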

December 17, 2014 at 5:44 AM

Comment deleted

This comment has been removed by a blog administrator.

December 17, 2014 at 5:45 AM

Blogger Fabian Giesen said...

"Lossy AC encoding: I suspect for very low bitrates, rather than try fill with random noise or using splitting planes or whatever, what you really want to do is just use small, trained VQ codebooks."

Yep, Charles and I talked about this in person right after that thread, and that's basically what we ended up with. Pre-baked codebooks do mean that you want to have a small set of dimensionalities you're coding.

That one's a bit awkward; take the Daala subband partitioning (as in the SPIE abstract), which has (up to 16x16 blocks) subbands with 15 coeffs (top-left 4x4 minus DC), 8 coeffs (8x8 top/left), 32 coeffs (8x8 bottom-right and 16x16 top/left), and 128 coeffs (16x16 rest).

The 32- and 128-coeff subbands, you just wouldn't bother trying to find a short code for, I think. Coding anything meaningful in there is just gonna take a bunch of bits, and the codebooks (even with a small number of entries) would be large and very hard to train without picking up a lot of bias from your training set. But for the 8-coeff and 15-coeff subbands, a specialized codebook for the low-rate cases is potentially interesting. Except that with prediction and biprediction (which remove 1 coeff each), you really have 6 cases: [6,8] coeffs and [13,15] coeffs. That's a lot of codebooks...

It might seem that this is over-thinking it, but at least for Bink 2 (which is a *super*-simple codec optimized for low complexity and low memory use, not state of the art low-bitrate performance, mind), it really paid off visually to tweak the coding so that basically at any point in the code stream, you could spend another 5 bits and get a reduction in distortion, even when doing so made the coding "asymptotically worse" in the sense of being suboptimal in the medium-to-high bit rate regime.

A typical problem with RD optimization is the "one blurry block in the middle of a sharp region" case. This is perceptually really bad (you notice it immediately), and it usually happens when a block is coded intra but you have no way of coding more than DC and maybe 1 AC coeff without going over your bit budget. Giving your codec (perceptually) good options in that case is definitely worthwhile.

December 17, 2014 at 9:18 AM

Blogger cbloom said...

"Directional intra: I'm sure you've seen"

Presumably this is one of the difficulties in working with lapped transforms, and predicting in transformed-space, etc.

Directional intra is really just great for edges. I've often wondered if a more explicit edge-adaptive predictor that uses the neighbors and no explicit signalling would be better.

"Lossy AC encoding: I suspect for very low bitrates, rather than try fill with random noise or using splitting planes or whatever, what you really want to do is just use small, trained VQ codebooks."

Yeah. My concern about this is that if you're always restoring the missing high-frequency data in the same way, the pattern will become visible. But that remains to be seen, and perhaps it could be mitigated with a few codebooks, etc.

This is getting a bit into the realm of fantasy land, but in a lot of cases what you'd really like to do is send some parameters for a noise-generating function. You just send the total energy of high-frequency noise, and you have some trained noise generators that you select from; e.g., obviously for film grain. Even for things like grass and leaves, what you want to do is send a small patch once at full precision, and then make all future blocks have the same spectrum as the sample.
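The "same spectrum as the sample" idea can be sketched as random-phase noise shaped by the sample's magnitude spectrum. A rough illustration assuming NumPy; a real codec would also handle Hermitian phase symmetry (so the inverse transform is exactly real), windowing, and blending across block edges:

```python
import numpy as np

def synth_from_sample(sample, rng):
    # shape random-phase noise to the magnitude spectrum of the sample patch
    mag = np.abs(np.fft.fft2(sample))
    phase = np.exp(2j * np.pi * rng.random(sample.shape))
    # taking the real part is a quick hack that halves the energy;
    # a proper version would build a Hermitian-symmetric spectrum
    return np.real(np.fft.ifft2(mag * phase))
```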

"One thing we've discussed is trying to classify blocks based on the first (few) band(s), and then using the classification to select from one of several different VQ codebooks for the remaining bands. For inter, you could potentially do the classification on the predictor instead of the LF, so you could apply this scheme to the LF bands, too."

Right. I've used a similar idea in wavelet experiments. When a high-band is cut off, rather than fill with zeros, fill from the parent, scaled down by some factor. Because of the subband octave correlation property. You can tell from just the first few AC's what kind of detail is in the block (usually).
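That parent-band fill is essentially a one-liner in NumPy terms (the 0.5 damping factor below is an arbitrary placeholder for whatever trained inter-octave scale you'd actually use):

```python
import numpy as np

def fill_from_parent(parent, scale=0.5):
    # synthesize a missing high band by upsampling its wavelet parent
    # (each parent coeff covers a 2x2 block of children) and damping it
    return scale * np.repeat(np.repeat(parent, 2, axis=0), 2, axis=1)
```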

You could have a VQ codebook that comes from PCA/KLT of the previous frame.

December 18, 2014 at 10:33 AM

Blogger Unknown said...

Well, as I said, classifying things by directionality in the LF and using that to select a VQ codebook for the HF is one potential way to get small codebooks that still do something useful. I.e., if you have a straight edge with a known orientation, the HF is not empty, but it does not actually have that many degrees of freedom.

That means lots of codebooks for the large bands, though, which is exactly where you don't need them. It may wind up just too expensive.

One other approach we have (by which I mean Monty has) been playing with is to train filters that will compact the energy from straight edges in the HF into the LF, and then predict the HF from just the LF coefficients. In VQ terms, you could think of this like an adaptive codebook, as is used in speech codecs (though the mechanism is very different). The problem is that you've traded memory complexity of large codebook tables for the computational complexity of the filters (which are not nice separable things). So it's not clear yet any of this will be worth the cost.

"At least their proposal was to mix overlapping-block MC and Control Grid Interpolation (CGI, essentially you specify a small mesh with texture coordinates and do per-pixel tex coord interpolation)."

Actually, we removed CGI fairly early on. It was just too expensive. You basically can't vectorize the loads without a chip with N MMUs available, and not even x86 with its new gather instructions really has that... a GPU-style 2D texture cache would work, but that's also a metric boatload of transistors. That might have been something we could live with, but didn't have any clear gains. It could do something _different_, but it wasn't clear the encoder I had for it was taking advantage of that to do something _useful_, nor how to write an encoder that could. I definitely could not reproduce the gains that other researchers reported by switching methods, though that may be related to the extra constraints added to avoid blocking artifacts where the switch occurred.

One thought Jean-Marc had instead is to have a codeword that signals, "Make this whole region 4x4 blocks and fill in the MVs with interpolation instead of coding them." I.e., it's CGI, but at the block level instead of the pixel level, so you get at least 4-way parallelism in your loads.

December 19, 2014 at 3:25 AM

Blogger Unknown said...

"I've often wondered if a more explicit edge-adaptive predictor that uses the neighbors and no explicit signalling would be better."

Well, at the very least you need to be able to shut something like this off, because when it's wrong it will be disastrously so.

"This is getting a bit into the realm of fantasy land, but in a lot of cases what you'd really like to do is send some parameters for a noise generating function."

This has been done for still images. See Marcus Nadenau's 2000 Ph.D. thesis, "Integration of Human Color Vision Models into High Quality Image Compression", for one example with wavelets. Extending this to video is hard, for the reasons Jean-Marc gave above.

"Even for things like grass and leaves, what you want to do is send a small patch once at full precision, and then make all future blocks have the same spectrum as the sample."

"Same spectrum" here has to be a bit vague, because of phase offsets and leakage and various other things that will keep it from being "the same". However, my canonical example of "video compression is nowhere near the limit of what it could do if we had infinite CPU available" is to apply some of the dynamic texture analysis research that's been done to generate low-dimensional models for things like water, leaves, etc., and run that analysis in the decoder, so that the encoder can just start sending a few model parameters to get you waves or fluttering leaves or whatever. The kinds of things that are predicted very poorly with traditional motion compensation, and for which local metrics like MSE are completely irrelevant.

December 19, 2014 at 3:56 AM

Blogger Fabian 'ryg' Giesen said...

(Using texture synthesis to fill in detail)
In the "infinite CPU" mindset, you probably wouldn't even bother sending a lot of model coefficients, but go for a predictor-corrector kind of structure on both the encoder and decoder side: the decoder "learns" structure from frames (slices, tiles...) it's seen before. After one good keyframe with useful texture the decoder could learn it and replicate that structure later. Given (say) a motion vector with high weight, the decoder would know that the target area should have similar texture properties to the region it was copied from. Hmm.

You could still use them as a parametric model for synthesis, but purely "appearance-preserving mocomp" would probably help reduce the generation artifacts (and resulting wasted bit rate) for edge-heavy structure. Alas, not cheap at all.

Another thing that would need a lot more CPU than we're willing to throw at it right now: dictionary-based schemes (linear algebra dictionaries, i.e. over-complete "bases"). Orthogonal matching pursuit, K-SVD etc. That's been done for still images (either sending the dictionary atoms in the bit stream, or having a pre-baked dict for special-purpose applications like faces) but with video you could conceivably start with a generic basis (DCT-derived or similar) and then keep adapting it to match the data you actually have.
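Plain matching pursuit, the greedy core these schemes build on, fits in a few lines (an illustrative sketch; OMP adds re-projection onto the selected atoms at each step, and K-SVD alternates this with dictionary updates):

```python
import numpy as np

def matching_pursuit(signal, D, n_atoms):
    # D: dictionary with unit-norm atoms as columns (possibly over-complete)
    r = np.asarray(signal, dtype=float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        c = D.T @ r                   # correlate residual with every atom
        j = int(np.argmax(np.abs(c))) # greedily pick the best match
        coeffs[j] += c[j]
        r -= c[j] * D[:, j]           # subtract its contribution
    return coeffs, r
```

With an orthonormal dictionary this reduces to keeping the n_atoms largest transform coefficients; the interesting (and expensive) cases are the over-complete ones.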

There's no shortage of things to try, all we need is a couple orders of magnitude increase in cycles per pixel. :)

December 19, 2014 at 4:09 PM

Blogger cbloom said...

"In the "infinite CPU" mindset, you probably wouldn't even bother sending a lot of model coefficients, but go for a predictor-corrector kind of structure on both the encoder and decoder side: the decoder "learns" structure from frames (slices, tiles...) it's seen before."

Yeah, but the problem is you probably never sent a great sample of the texture in a previous frame. (if you did, it would be weird because it would be too high quality for the bit rate).

So you'd have to send a patch of 32x32 pixels as side data to be used as the sample for generation for a certain type of texture. Then you do standard texture-from-example kind of stuff.

"Another thing that would need a lot more CPU than we're willing to throw at it right now: dictionary-based schemes "

I think there are fundamental theoretical problems with these schemes that are unsolved. It's not CPU holding them back; there are definite steps in this direction that could be done right now. For example the 4x4 spatial transforms can be done as a matrix multiply, so custom matrices could be sent per frame.
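Mechanically, a 4x4 separable transform really is just two matrix multiplies, so swapping in a signaled per-frame matrix is trivial. The sketch below uses the orthonormal 4-point DCT-II as a stand-in; any orthonormal 4x4 matrix sent in the bitstream would slot in the same way:

```python
import numpy as np

def dct4_matrix():
    # orthonormal 4-point DCT-II basis (rows are basis vectors)
    M = np.array([[np.cos((2 * j + 1) * i * np.pi / 8) for j in range(4)]
                  for i in range(4)])
    M[0] *= 0.5
    M[1:] *= np.sqrt(0.5)
    return M

def forward_4x4(block, M):
    # separable 2D transform: transform rows, then columns
    return M @ block @ M.T

def inverse_4x4(coeffs, M):
    # exact inverse because M is orthonormal
    return M.T @ coeffs @ M
```

The hard part, as noted above, is everything downstream of the transform, not the transform itself.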

The problem is once you start making the bases adaptive, then you can no longer hand-tweak your codec for them. Where are the subbands, what should the CSF's be, or the quantizers, how do you bit-rate allocate for them. All the perceptual kludges fall apart.

I believe you can only go to adaptive bases when you have a really good software perceptual metric to evaluate them, so that we aren't relying on offline evaluation for encoder tweaks. And we're still quite far off there.

(and running the theoretical super-perceptual-metric in-loop to make coding decision is insanely CPU prohibitive)

December 19, 2014 at 8:19 PM

Blogger Fabian 'ryg' Giesen said...

Yeah, but the problem is you probably never sent a great sample of the texture in a previous frame. (if you did, it would be weird because it would be too high quality for the bit rate).

Not really true. Things like x264 MBTree already spend time to identify areas that are going to be frequently used as motion references, and boost their bit budget accordingly. That's not experimental, that's been shipping (and on by default for two-pass encoding) for years.

You're not gonna have a pristine reference, but you can definitely assume that a) often-referenced areas will start out higher quality than the rest (and thus definitely useful candidate references to take "probes" from), b) that working somewhat harder to "leak" less of that quality over time would save bits overall.

December 19, 2014 at 10:42 PM

Blogger cbloom said...

Well I don't agree. (this is getting way off topic though)

The whole point of the texture synthesis (or the lossy AC stuff) is to send detail at levels *beyond* what you can otherwise send.

By the very definition of the problem, you don't have that detail in previous frames.

You can try to force that detail into previous frames, which is really rather more like VP8's "golden frames" than MB-tree, because it's a big difference, not a small tweak, and because you need source regions larger than one block.

But doing that is a very nasty non-local optimization problem. Should I put way more bits on these blocks than usual so that they can be used as texture source for future frames? You have to try putting more bits on those blocks and see how it affects N frames in the future. Totally unpossible. Also really hard to not make it a weird pop in quality.

MB-tree is a very small bit rate adjustment. Not really the same thing.

Of course there's a general problem with this idea, which is the binary nature of "this block gets texture synth and this one doesn't" that would also create weird visual quality variation. All of a sudden you get tweeds that look really amazing because it gets good texture synth, but everything else is lower quality. I guess this is an issue with any category based S/D/E type coder, you have to be very careful to not be creating perceptual quality level differences in the different block types.

December 20, 2014 at 12:37 PM

Blogger Jarek Duda said...

Hi,
Just two comments:
- If you want a more uniform lattice on the l2 sphere, it is worth adding some deformation; e.g., a coordinate-wise power x^p reduces MSE by up to 26 percent ( https://arxiv.org/pdf/1705.05285 ).
- Enumerative coding often requires large-number arithmetic, and it also loses ~half a bit in the final flush. For speed, and to avoid this inefficiency, it can be folded into the entropy coding. For that we need the combinatorial formulas to find the probabilities: for the first digit, for all possible (L,K).
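A sketch of that coordinate-wise power deformation (hypothetical helper; the exponent p would be optimized per use case, as in the linked paper):

```python
import numpy as np

def deform(x, p):
    # coordinate-wise power |x_i|^p with signs preserved,
    # renormalized back onto the unit l2 sphere
    y = np.sign(x) * np.abs(x) ** p
    return y / np.linalg.norm(y)
```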
Best,
Jarek

May 15, 2020 at 12:37 AM
