Googles appar
Huvudmeny

Post a Comment On: cbloom rants

"07-02-10 - Length-Limitted Huffman Codes"

9 Comments -

1 – 9 of 9
Anonymous Anonymous said...

I've never actually looked at this either--are the codes significantly different from what you'd get by building the regular huffman tree and then shuffling the leaves upwards?

July 3, 2010 at 7:41 AM

Blogger cbloom said...

"--are the codes significantly different from what you'd get by building the regular huffman tree and then shuffling the leaves upwards?"

Well of course "it depends".

The short answer is that yes, they are very different, but it also doesn't matter.

The question is how much pressure you are putting on the code length limit. eg. what % of the original symbols were over length limit. If the length limit is reasonably long, then that % will be very small because the least likely symbols are the long ones.

So because of that, even though the heuristic way may be far from optimal, it just doesn't matter.

July 3, 2010 at 12:42 PM

Blogger cbloom said...

For example, on a real file ("pic") with limit = 16, these are the number of codes of each length generated by the heuristic or optimal builder :

Heuristic :

1 : 1 , 0.628931%
4 : 1 , 0.628931%
5 : 2 , 1.257862%
6 : 14 , 8.805032%
7 : 11 , 6.918239%
8 : 9 , 5.660378%
9 : 9 , 5.660378%
10 : 7 , 4.402516%
11 : 10 , 6.289308%
12 : 9 , 5.660378%
13 : 13 , 8.176101%
14 : 20 , 12.578616%
15 : 3 , 1.886792%
16 : 50 , 31.446541%

Optimal :

1 : 1 , 0.628931%
4 : 1 , 0.628931%
5 : 2 , 1.257862%
6 : 14 , 8.805032%
7 : 11 , 6.918239%
8 : 9 , 5.660378%
9 : 9 , 5.660378%
10 : 7 , 4.402516%
11 : 10 , 6.289308%
12 : 9 , 5.660378%
13 : 11 , 6.918239%
14 : 21 , 13.207547%
15 : 14 , 8.805032%
16 : 40 , 25.157232%

you can see they are in fact quite different, but in both cases the actual coded entropy is exactly 1.661

July 3, 2010 at 12:44 PM

Blogger cbloom said...

If you artificially put a lot of pressure on the system you can get big differences. For example with a code length limit of 8 :

heuristic :

2 : 1 , 0.628931%
7 : 34 , 21.383648%
8 : 124 , 77.987419%
entropy : 2.655

optimal :

2 : 1 , 0.628931%
5 : 1 , 0.628931%
6 : 5 , 3.144654%
7 : 12 , 7.547170%
8 : 140 , 88.050316%
entropy : 2.607

July 3, 2010 at 12:46 PM

Anonymous Anonymous said...

What does the unconstrained version of that look like? I'm confused why a heuristic algorithm wouldn't have preserved a shorter code for the second symbol.

July 4, 2010 at 12:01 AM

Blogger cbloom said...

" I'm confused why a heuristic algorithm wouldn't have preserved a shorter code for the second symbol. "

Because it's heuristic so it's not ideal.

It bumps all the long code lens up to 8. Then it has to fix the codes by making others longer. It starts with the lowest count codes. In this case there is a real problem with that length 1 code, but before it gets there it first bumps the length 4 code all the way up to 7. But that isn't enough so it winds up making the length 1 code into length 2.

Then it does the backwards pass to shorten some codes. You'd like it to shorten that second most likely one back to 5, but it can't do that unless it also lengthens some of the 7's up to 8, which it isn't considering any more.

The way to solve this correctly of course is just to do package merge, which is addressed in the follow-up to this post.

July 4, 2010 at 12:26 AM

Anonymous Anonymous said...

In this case there is a real problem with that length 1 code, but before it gets there it first bumps the length 4 code all the way up to 7.

This is what I missed, thanks.

July 4, 2010 at 8:19 AM

Anonymous Anonymous said...

I wonder if there's a better heuristic algorithm where you figure out the highest leaf you'll need to split/move-down to have enough room, you move that down, repeat, so that you don't move things down too far.

It's just it seems like, even if you throw out the actual frequency data, the tree itself gives you information about the approximate frequencies that you ought to be able to leverage straightforwardly. (Maybe nobody's bothered looking at it because the gain is minimal in practice?)

July 4, 2010 at 12:46 PM

Blogger cbloom said...

Yeah there's a better algorithm, it's called package-merge ;)

July 4, 2010 at 1:21 PM

You can use some HTML tags, such as <b>, <i>, <a>

This blog does not allow anonymous comments.

Comment moderation has been enabled. All comments must be approved by the blog author.

You will be asked to sign in after submitting your comment.