Sparsity & Temporality show up as big issues in lots of funny ways.
For example, they are the whole reason that we have to do funny hand-tweaks of compressors.
eg. exactly what you use as your context for various parts of the compressor will make a huge difference. In PPM the most
sensitive part is the SEE; in CM it looks like the most sensitive part is the contexts that are used for the mixer itself
(or the APM or SSE) (not the contexts that are looked up to get statistics to mix).
In these sensitive parts of the coder, you obviously want to use as much context as possible, but if you use too much your
statistics become too sparse and you start making big coding mistakes.
This is why these parts of the model have to be carefully hand tweaked; eg. use a few bits from here, compact it into a log2-ish scale,
bucketize these things. You want only 10-16 bits of context or something but you want as much good information in that context as
possible.
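A minimal sketch of that kind of hand compaction, just to show the bit-packing pattern; the particular features chosen here (previous byte, match length, run count) are hypothetical, not from any specific coder:

```python
def quantize_log2(count, bits=3):
    """Map an unbounded count onto a small log2-ish bucket in [0, 2^bits)."""
    return min(count.bit_length(), (1 << bits) - 1)

def compact_context(prev_byte, match_len, run_count):
    """Pack a few hand-picked features into a ~10-bit mixer context.
    The feature choice is purely illustrative."""
    ctx = (prev_byte >> 4) & 0xF                   # top 4 bits of previous byte
    ctx = (ctx << 3) | quantize_log2(match_len)    # 3 bits: log2-ish match length
    ctx = (ctx << 3) | quantize_log2(run_count)    # 3 bits: log2-ish run count
    return ctx                                     # fits in 10 bits
```

The log2 scaling is what keeps a huge dynamic range of counts from exploding the table size while preserving the information that matters most.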
The problem is that the ideal formation of these contexts depends on the file of course, so it should be figured out adaptively.
There are various possibilities for hacky solutions to this. For example something that hasn't been explored very much in general in
text compression is severely asymmetric coders. We do this in video coding for example, where the encoder spends a long time figuring
things out then transmits simple commands to the decoder. So for example the encoder could do some big processing to try to figure out
the ideal compaction of statistics and sends it to the decoder. (* maybe some of these do exist)
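The encoder-side search could be sketched like this (the function and its cost callback are illustrative, not taken from any real coder):

```python
def choose_quantizer(data, candidates, cost):
    """Asymmetric encoder-side search: evaluate each candidate context
    compaction scheme on the actual data, keep the cheapest one, and send
    only its index to the decoder (a few bits in the stream header)."""
    return min(range(len(candidates)), key=lambda i: cost(data, candidates[i]))
```

The decoder never runs the search; it just reads the index and looks up the scheme, which is what makes the scheme asymmetric.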
If sparsity wasn't an issue, you would just throw every bit of information you have at the model. But in
fact we have tons of information that we don't use because we aren't good enough at detecting what information
is useful and merging up information from various sources, and so on.
For example an obvious one is : in each context we generally store only something like the number of times each
character has occurred. We might do something like scale the counts so that more recent characters count
more. eg. you effectively do {all counts} *= 0.9 and then add in the count of the new character as ++.
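That decayed-count update, as a sketch:

```python
def update_counts(counts, sym, decay=0.9):
    """Exponential decay: scale all counts down, then bump the new symbol,
    so recent characters weigh more than old ones."""
    for s in counts:
        counts[s] *= decay
    counts[sym] = counts.get(sym, 0.0) + 1.0

def predict(counts, sym):
    """Probability of sym from the decayed counts."""
    total = sum(counts.values())
    return counts.get(sym, 0.0) / total if total else 0.0
```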
But actually we have way more information than that. We have the exact time that each character occurred
(time = position in the file).
And, for each time we've used the context in the past, we know what was predicted from it and whether that was
right. All of that information should be useful to improve coding, but it's just too much because it makes
secondary statistics too sparse.
BTW it might pop into your head that this can be attacked using the very old-school approaches to sparsity
that were used in Rissanen's "Context" or DMC for example. Their approach was to use a small context, then as you
see more data you split the context, so you get richer contexts over time. That does not work, because it is
too conservative about not coding from sparse contexts; as I mentioned before, you cannot tell whether a sparse
context is good or not from information seen in that context, you need information from an outside source,
and what Context/DMC do is exactly the wrong thing - they try to use the counts within a context to decide
whether it should be split or not.
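The count-driven splitting rule being criticized can be sketched roughly as follows (threshold values are illustrative; real DMC implementations vary):

```python
def should_clone(edge_count, other_count, min_edge=2, min_other=2):
    """Classic DMC cloning rule: split a state when the transition into it is
    hot enough AND the state still gets traffic from elsewhere. Note both
    tests use only counts observed inside the context -- the weakness the
    post points out."""
    return edge_count >= min_edge and other_count >= min_other
```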
"09-14-10 - Challenges in Data Compression 2.5 - More on Sparsity"
8 Comments -
Hey, you just gave me an idea to improve DMC. You build a state machine like regular DMC, but instead of predicting from the ratio of counts, feed the counts to a secondary model.
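A rough sketch of that idea: bucket the node's (n0, n1) counts and use them to index an adaptive secondary probability table, instead of predicting from the ratio directly. All names and constants here are made up for illustration:

```python
def bucket(n, bits=3):
    """Lossy log2-ish bucket for a count."""
    return min(int(n).bit_length(), (1 << bits) - 1)

class SecondaryModel:
    """Instead of predicting n1/(n0+n1) directly from a DMC node's counts,
    use the bucketed (n0, n1) pair to index an adaptive probability table."""
    def __init__(self, bits=3):
        self.bits = bits
        self.p = [0.5] * (1 << (2 * bits))   # start every bucket unbiased

    def index(self, n0, n1):
        return (bucket(n0, self.bits) << self.bits) | bucket(n1, self.bits)

    def predict(self, n0, n1):
        return self.p[self.index(n0, n1)]

    def update(self, n0, n1, bit, rate=0.05):
        i = self.index(n0, n1)
        self.p[i] += rate * (bit - self.p[i])   # move toward the observed bit
```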
September 14, 2010 at 4:49 PM
> In these sensitive parts of the coder, you
> obviously want to use as much context as
> possible, but if you use too much your
> statistics become too sparse and you start
> making big coding mistakes.
In other words, the context quantization function
is a lossy compression function, but unlike
lossy media coders, we have a well defined
quality metric here.
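That quality metric is just the total code length, which a toy harness can measure directly (the adaptive update rate here is arbitrary):

```python
import math

def code_length(bits_stream, contexts, quantize):
    """Total code length in bits when each context is first pushed through a
    lossy quantize() and each resulting bucket holds one adaptive probability.
    A quantization that groups genuinely similar contexts yields a shorter code."""
    p = {}
    total = 0.0
    for ctx, bit in zip(contexts, bits_stream):
        q = quantize(ctx)
        prob = p.get(q, 0.5)
        total += -math.log2(prob if bit else 1.0 - prob)
        p[q] = prob + 0.1 * (bit - prob)      # crude adaptive update
    return total
```

Comparing two candidate quantization functions on the same data then reduces to comparing two numbers.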
> For example something that hasn't been explored
> very much in general in text compression is
> severely asymmetric coders.
I guess you just missed it completely.
In text compression, there are preprocessors, like
http://www.ii.uni.wroc.pl/~inikep/
And in special cases, like Hutter Prize, people
put a lot of work into selection and arrangement
of words in the dictionary.
Also there are quite a few projects with a
parameter optimization pass.
For example, see
http://sites.google.com/site/toffer86/m1-project
There's an "o" processing mode which builds a
model profile for given data samples (which
includes context masks and counter update
constants and the like).
There are also many other projects with a similar
approach (eg. epmopt and beeopt), and most of my
coders are made like that, just that it makes
more sense to use in the development stage than
in public utilities.
September 14, 2010 at 5:08 PM
Increasing context length, as you mentioned, sometimes hurts compression. Moreover, it affects both speed and memory usage. If we treat context determination in a "lossy" way, we can definitely improve compression: "similar" statistics (context clustering) tend to give better predictions, and "lossy context compression" enables that. That's one of the reasons we use FSMs in our compressors (like BIT, M1, CCM, PAQ, etc).
September 15, 2010 at 5:58 PM
Yep, absolutely. My point is that that lossy quantization really should be adaptive, and we should be using some system that finds it for us based on what's optimal for the data. That doesn't exist yet, so we usually wind up hand-tweaking it (or sometimes training it offline; I believe M1 does this).
There's not much of a theoretical formalism behind all this context merging and formation of the lossy FSMs, which means it's hard to tell how close they are to optimal and what's being left on the table.
September 15, 2010 at 6:18 PM
M1 optimizes most of its model parameters: context masks, weights, non-linear quantization buckets, and FSM construction parameters. Many recent context mixing compressors are already tuned by an automated process (including all of Shelwien's, Toffer's, and my compressors).
BTW, Shelwien has an idea to compress bit history in a lossy way, to use it as a form of context clustering.
September 15, 2010 at 11:57 PM
I didn't think my crappy stb.h compressor was actually interesting, but I saw that Matt Mahoney's "DCE" book mentioned the NTFS compression file format and some numbers for it, and they seemed suspicious so I went and tested mine against it and found mine was actually better, despite having been written mainly to keep the source code small.
So I wrote mine up just for thoroughness, not that I think it really matters.
September 17, 2010 at 2:49 AM
I believe LZ-NTFS is the infamous "Stac" algorithm, which Stac sued MS over (and won), despite the fact that it was almost identical to the well-known prior works LZSS and LZRW1.
At the time MS was much hated, so lots of people cheered that David was punching Goliath, but in reality it was a travesty of patent abuse.
September 17, 2010 at 10:11 AM