
Post a Comment On: cbloom rants

"08-11-10 - Huffman - Arithmetic Equivalence"

9 Comments
Blogger won3d said...

This is a great illustration. It even suggests a template-parameterized implementation for table-based decoders. An obvious-but-interesting thing to me: this continuum extends to coding simpler than table-based Huffman.

I wonder if it would be faster to do static huffman by compiling the tables directly into code. Maybe it is a lose because your branch targets are unpredictable, but maybe it is a win because you can more easily do things like decode multiple symbols, use immediate shifts, and have shorter load dependency chains. A fast implementation would likely look similar to a really simple indirect-threaded bytecode interpreter.

So this would be a compiled-table approach, but now what could you do if your bits->symbols mapping had structure that was easier to transpose into code? Specifically, let's get rid of the switch dispatch. I believe this leads you toward Rice/Golomb/Gamma/etc. codes.
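[Ed: the "structured codes need no dispatch" point can be sketched concretely. Below is a minimal Rice decode with no table lookup at all, just a unary quotient and k raw remainder bits; the `BitReader` and its helpers are hypothetical names, not from the thread, and the buffer is read MSB-first.]

```c
#include <stdint.h>

typedef struct {
    uint64_t bits; /* MSB-first bit buffer */
} BitReader;

static int get_bit(BitReader *br) {
    int b = (int)(br->bits >> 63);
    br->bits <<= 1;
    return b;
}

static uint32_t get_bits(BitReader *br, int n) {
    uint32_t v = 0;
    while (n--)
        v = (v << 1) | (uint32_t)get_bit(br);
    return v;
}

/* Rice code with parameter k: value = q * 2^k + remainder,
   sent as q ones, a zero, then k raw remainder bits. */
static uint32_t rice_decode(BitReader *br, int k) {
    uint32_t q = 0;
    while (get_bit(br)) /* unary prefix */
        q++;
    return (q << k) | get_bits(br, k);
}
```

No table, no switch: the code's structure *is* the decoder, which is exactly what makes it attractive to inline or compile directly.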

August 12, 2010 at 3:16 AM

Blogger cbloom said...

"I wonder if it would be faster to do static huffman by compiling the tables directly into code. Maybe it is a lose because your branch targets are unpredictable, but maybe it is a win because you can more easily do things like decode multiple symbols, use immediate shifts, and have shorter load dependency chains. A fast implementation would likely look similar to a really simple indirect-threaded bytecode interpreter."

Yeah we talked about doing this. If you had to actually interpret commands it would be a lose, but if you could compile directly into machine code and insert it in your decoder, it would be a win.

"So this would be a compiled-table approach, but now what could you do if your bits->symbols mapping had structure that was easier to transpose into code? Specifically, let's get rid of the switch dispatch. I believe this leads you to with Rice/Golomb/Gamma/etc. codes. "

Ah yeah, that's an interesting idea. The most compelling case would come if your encoder was a bit flexible, e.g. you had some space/speed tradeoff parameter and it could choose to build tables that were not quite optimal and had more directly encodable structure.

For example, one case that's very common is to have a fast decode table for the first 8-12 bits of the huffman code. Then, for the remainder of the symbols instead of decoding the true huffman code, you could just use a variable-length code of some sort, with very little loss because those symbols are rare anyway. But then speeding up that branch is also not very important. However one nice win of that approach is that sending the code lens can be done smaller and simpler because you don't send any of the code lens > fastLen. Hmmm...
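[Ed: the fast-table idea described above can be illustrated with a toy example. Below is a hypothetical sketch, not the shipping code: a tiny prefix code ('a'=0, 'b'=10, 'c'=11) with a table indexed by a FAST_BITS peek, where every index whose top bits match a code repeats that code's entry so the lookup consumes only the code's true length.]

```c
#include <stdint.h>

#define FAST_BITS 2

typedef struct { uint8_t sym; uint8_t len; } Entry;

static const Entry fast_table[1 << FAST_BITS] = {
    { 'a', 1 }, /* 00 -> code "0"  */
    { 'a', 1 }, /* 01 -> code "0"  */
    { 'b', 2 }, /* 10 -> code "10" */
    { 'c', 2 }, /* 11 -> code "11" */
};

/* Peek FAST_BITS from an MSB-first 32-bit buffer at bit position *pos,
   look the symbol up, and consume only entry.len bits. A real decoder
   would reserve some entries as an escape to a slow path for codes
   longer than FAST_BITS. */
static uint8_t decode_one(uint32_t bits, int *pos) {
    uint32_t peek = (bits >> (32 - FAST_BITS - *pos)) & ((1u << FAST_BITS) - 1);
    Entry e = fast_table[peek];
    *pos += e.len;
    return e.sym;
}
```

The escape path is where the "not quite Huffman" fallback code would live: anything past fastLen decodes through a simple variable-length code instead, so no code lens > fastLen ever need to be transmitted.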

August 12, 2010 at 10:32 AM

Anonymous Anonymous said...

Whoa, color! It's like a rainbow, all the way through the blog.

Double colors, oh my god. It's a double-colored blog, all the way. Oh my god. It's so bright and vivid! Oh! Ah! What does this mean?

August 12, 2010 at 3:28 PM

Anonymous Anonymous said...

by the way, since it's not clear, that is not me bandwagonning on that meme, I actually hate that meme.

it is actually me making fun of myself, because my first reaction to this post HONESTLY was "holy crap, it's in color, Charles is using color, unbelievable". And when I went to post that I realized that perhaps I was overdoing it a bit, and should make fun of myself for it.

August 12, 2010 at 4:19 PM

Blogger cbloom said...

Oh, I didn't even notice the rainbow meme reference.

I knew I would blow some reader's minds with the use of this new "rich text" technology.

August 12, 2010 at 8:16 PM

Blogger won3d said...

"Yeah we talked about doing this. If you had to actually interpret commands it would be a lose, but if you could compile directly into machine code and insert it in your decoder, it would be a win."

It would be fun to JIT your Huffman decoder. I imagine a simple table-based codegen (which I'm sure you have lying around) would be easy to throw together.

A few random questions about table-based decoders.

What is the size/speed tradeoff of looking up symbol length after decoding the symbol? If you looked up both together, you might get some more cache misses, but your load dependency chain shrinks. I suppose you mentioned that on in-order PPC there's some advantage to having 8-bit arrays, but you're going to be eating 3-5 cycles even if your L1 hits.

It seems like it might be advantageous to be able to decode many small symbols simultaneously. If Huffman is able to compress 8-bit bytes to 4 bits on average, you expect to be able to decode 2-3 bytes from 8-12 bits. Have you ever tried this? Dealing with these special cases might not be easy in table-based implementations, but might be more feasible with codegen.
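[Ed: the multi-symbol idea is actually very table-friendly. A hypothetical sketch, reusing the toy code 'a'=0, 'b'=10, 'c'=11: index by a 3-bit peek, and each entry lists up to two symbols plus the total bits they consume; indices where the second code is incomplete emit just one symbol.]

```c
#include <stdint.h>

typedef struct { uint8_t nsyms, len; uint8_t sym[2]; } PairEntry;

static const PairEntry pair_table[8] = {
    { 2, 2, { 'a', 'a' } }, /* 000 */
    { 2, 2, { 'a', 'a' } }, /* 001 */
    { 2, 3, { 'a', 'b' } }, /* 010 */
    { 2, 3, { 'a', 'c' } }, /* 011 */
    { 2, 3, { 'b', 'a' } }, /* 100 */
    { 1, 2, { 'b', 0  } }, /* 101: second code incomplete */
    { 2, 3, { 'c', 'a' } }, /* 110 */
    { 1, 2, { 'c', 0  } }, /* 111: second code incomplete */
};

/* One lookup emits up to two symbols; returns how many were decoded.
   bits is an MSB-first 32-bit buffer, *pos the current bit position. */
static int decode_pair(uint32_t bits, int *pos, uint8_t *out) {
    uint32_t peek = (bits >> (32 - 3 - *pos)) & 7;
    const PairEntry *e = &pair_table[peek];
    for (int i = 0; i < e->nsyms; i++)
        out[i] = e->sym[i];
    *pos += e->len;
    return e->nsyms;
}
```

The cost is table blowup: the entry count is exponential in peek width, which is why this pays off mainly for very short average code lengths.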

"However one nice win of that approach is that sending the code lens can be done smaller and simpler because you don't send any of the code lens greater than fastLen. Hmmm..."

You could make it so that rare symbols are transmitted as literals with a common prefix. Then you send the rare symbols + code lens, and the code lens build a huffman tree that excludes the rare symbols, but includes the rare symbol marker. This was just the naive and obvious thing I thought of just now, so I'm sure you could do much better.

Whenever I see "code lens" I think that there is some deep philosophical implication. Like: "through the lens of the algorithm" instead of "I am too lazy to type gth."

August 13, 2010 at 5:16 AM

Blogger cbloom said...

"It would be fun to JIT your Huffman decoder. I imagine a simple table-based codegen (which I'm sure you have lying around) would be easy to throw together."

Yeah there are lots of fun things you could do with dynamic code. Unfortunately, it's forbidden on most consoles for security reasons.

"What is the size/speed tradeoff of looking up symbol length after decoding the symbol? If you looked up both together, you might get some more cache misses"

Yes, the exact details of this are platform dependent. We have tried looking them up together, looking them up in two arrays both indexed by peek, looking up sym then looking up codelen from sym, etc. The code posted is not what we ship with. There are platform quirks but generally the biggest issue is the dependency chain and when you need which values.

"It seems like it might be advantageous to be able to decode many small symbols simultaneously."

Yeah, this is in the old Huffman2 that's in crblib. It's not very useful in general because you generally have a mixed stream, like huffman-raw bits-huffman.

"You could make it so that rare symbols are transmitted as literals with a common prefix."

Indeed. There's a whole spectrum to be explored where mutating the coding a bit to be not perfectly Huffman could reduce memory use and improve speed, and then you could play with the space/speed tradeoff.

August 13, 2010 at 9:51 AM

Blogger ryg said...

"Yeah, this is in the old Huffman2 that's in crblib. It's not very useful in general because you generally have a mixed stream, like huffman-raw bits-huffman."
Depends on the anatomy of your code stream (and your decoder). If the raw bits are part of the same (logical) symbols, it's definitely a win to include them in the table.

Two examples: First one is LZ offsets. You typically have a couple different symbols to encode the magnitude of the offset followed by a couple of raw bits to specify it fully. If the full code (prefix + extra bits) is shorter than your full table size, you can just decode the whole thing at once. Basically, instead of doing

slot = DecodeHuffSymbol()
offs = Base[slot]
if (ExtraBits[slot])
    offs += GetBits(ExtraBits[slot])

you store Base/ExtraBits directly in your fast decode table. Then you do something like

entry = DecodeHuffSymbol()
offs = entry.offs
if (entry.extraBits) {
    slot = entry.slot
    // code as above, except we know
    // that extraBits != 0
}

Second example: JPEG-style value coding. Similar to the above, you have a huffman tree encoding run length and log2(abs(value)) followed by extra bits to specify sign and "mantissa". You can put decoding for sign+mantissa into the table for small values (which typically have short codes anyway). Now you have 3 paths: fast path (decode huff bits + trailing bits in one table lookup), normal path (decode huff bits in one table lookup) and slow path (multiple table lookups or one table lookup + linear scan afterwards). For something JPEG-ish, it's not unusual for 85% or more of all codes to go through the fast path even with moderate table sizes (did this with a 9-bit index and 16 bits/table entry, so a 1k table total... pretty good deal).
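[Ed: one plausible packing for the 16-bit entries ryg mentions, as an illustration only (the actual layout isn't given in the thread): low nibble = total bits consumed, with 0 flagging the slow path; next nibble = zero-run length; top byte = the sign-extended coefficient for fast-path entries.]

```c
#include <stdint.h>

/* Hypothetical 16-bit fast-path table entry for JPEG-ish run/value coding. */
#define ENTRY_LEN(e)   ((e) & 0xF)          /* bits consumed; 0 = slow path */
#define ENTRY_RUN(e)   (((e) >> 4) & 0xF)   /* zero-run length */
#define ENTRY_VAL(e)   ((int16_t)(e) >> 8)  /* decoded coefficient, sign-extended */

static uint16_t make_entry(int len, int run, int val) {
    return (uint16_t)((len & 0xF) | ((run & 0xF) << 4) | ((val & 0xFF) << 8));
}
```

With everything in one 16-bit load, the fast path is a peek, one lookup, and a shift per field, which is what makes the 85%+ hit rate pay off.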

So this definitely works if you can fold multi-step decoding of a single logical symbol into one table lookup. It's useless for anything that actually controls execution flow in the decoder, though. Reading two symbols that influence control flow separately adds N+M states your decoder can be in. Parsing them together introduces (N x M) states - usually a bad idea, unless a significant fraction is trivial or similar and can be unified.

August 16, 2010 at 1:25 AM

Blogger cbloom said...

These are good points.

August 21, 2010 at 11:49 PM
