
Post a Comment On: cbloom rants

"On the entropy of distance coding"

2 Comments
Blogger Charles said...

Nice article; relatively easy to follow :). Let me see how much I got right here.

I think your argument boils down to:

Algorithms which say things like 'repeat the sequence we saw 2 tokens back' are less efficient than ones which say 'repeat the sequence at token 5'.

And you present some evidence centered on a dataset with 256 tokens of 4 bytes each.

An obvious counterargument: if we increase the size of the file in an unbounded way (say, still 256 tokens of 4 bytes each, but 50 GB of data), the distance between repeats of a sequence will, on average, remain the same, while the absolute index of its most recent occurrence grows in an unbounded (but much slower) way.

Unless we cheat and assume we can reference the first occurrence of a value (whose index will probably be quite small) for all later instances of that value. You do seem to be making that assumption here, and I don't really see a strong reason why you shouldn't, especially with the additional constraint that 'your data consists of a finite set of tokens'. You call it out a bit in the hat example.

As an interesting extension to this: your assumptions here would not hold up as well if the data was not random and tended towards 'chunking'. For example, if we had 4 billion bytes of 'black', then 4 billion of 'white', sending the index of a white pixel would use 4 bytes or so, whereas sending the 'last reference' would be (on average) 1 byte.

So if there is a time component to the data (i.e., it changes from predominantly one color to another over time), sending distance might work better. I wonder if that is more common in practice? How strong does this tendency have to be to make distance more attractive?
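A rough way to sanity-check the byte estimates in the chunked example above (a hypothetical sketch, not from the post; `bytes_to_encode` is a made-up helper that just counts the bytes a plain binary encoding of a number would need):

```python
import math

def bytes_to_encode(value: int) -> int:
    # Rough cost: how many bytes a plain binary encoding of `value` needs.
    return max(1, math.ceil(value.bit_length() / 8))

# Hypothetical chunked stream: 4 billion 'black' bytes, then 4 billion 'white'.
first_white_index = 4_000_000_000  # absolute position of the first 'white'
typical_distance = 1               # most whites just repeat the previous byte

print(bytes_to_encode(first_white_index))  # 4 bytes, as estimated above
print(bytes_to_encode(typical_distance))   # 1 byte
```

This ignores entropy coding entirely; it only illustrates why a small, stable distance can be cheaper to state than a large absolute index when the data shifts over time.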

PS; the lack of an open-source decompression library for Oodle makes me sad. I really want to extract images from one of my favorite games, but they use Oodle and cause me much sadness.

June 14, 2020 at 1:58 AM

Blogger cbloom said...

No, you're not sending the index as in a position. The alternative to distance coding is indexing by *content*.

Instead of coding "repeat the thing at 200 bytes back" you say "repeat the 10th previous thing with unique contents of length 4"

It's really just about removing repeats from the indexing.

Another way to see it is, say you want to send length 4 tokens. As your file goes to 50 GB, the number of distances you can send goes up to 50 GB, a large number. But if you only count the number of unique 4-byte strings that occur, that number goes up much more slowly, because some repeat.

How much more slowly depends on the entropy. At high entropy (random bytes) the two counts grow at the same rate; at maximal low entropy (all bytes zero) the number of unique strings doesn't grow at all.

June 14, 2020 at 8:15 AM
