Post a Comment On: cbloom rants

"09-14-10 - A small note on structured data"

14 Comments
Blogger Matt Mahoney said...

Yeah, I'm amazed that 7zip beats paq8o too. Clearly you need a 3-D model (4 x 18 x n). paq would find a record length of either 4 or 72 but it needs both.

September 14, 2010 at 5:02 PM

Blogger Shelwien said...

Please try a newer paq version, like paq8px or paq8p:
http://paq8.hys.cz/

Also worth considering are
nanozip http://nanozip.net/download.html
bsc
http://libbsc.com/downloads.aspx
and ccm
http://www.megaupload.com/?d=QYGGVZXJ

September 14, 2010 at 5:21 PM

Blogger won3d said...

For my first job out of college, I had to write something that would load a bunch of volumetric data. They were all of the same general form: weird header + blob of bytes. Instead of trying to reverse engineer the headers, all I did was do an auto-correlator kind of thing, which figured out the width/height/depth of the volumes by looking at the sum-of-absolute differences of the file under various shifts of itself. This was pretty reliable but slow, so I would use it to convert to a simpler format.
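
The shift-based trick won3d describes can be sketched in a few lines (a toy reconstruction, not his actual code): slide the file against itself and take the shift with the lowest mean absolute difference as the candidate record width.

```python
def detect_stride(data, max_shift=256):
    """Guess the record width of a binary blob by self-similarity:
    the shift that minimizes the mean sum-of-absolute-differences
    of the file against a shifted copy of itself."""
    best_shift, best_sad = 1, float("inf")
    for shift in range(1, min(max_shift, len(data) - 1) + 1):
        pairs = len(data) - shift
        # Normalize by pair count so large shifts aren't unfairly favored.
        sad = sum(abs(a - b) for a, b in zip(data, data[shift:])) / pairs
        if sad < best_sad:
            best_shift, best_sad = shift, sad
    return best_shift
```

As written it's O(n * max_shift), which matches the "reliable but slow" caveat; you'd run it once per file and cache the answer, or only sample a window of the data.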

More on-topic: auto-transposing arrays of structs sounds like something your C++ reflect thingy might be able to do?

Also, it seems like context sorting could easily go together with transposition, just by manipulating your comparison. Instead of comparing adjacent bytes, compare bytes separated by some stride. You could possibly also use more information about the structs. Knowledge about the fields could inform which bytes within the struct are good to sort on.

September 14, 2010 at 5:33 PM

Blogger cbloom said...

"Yeah, I'm amazed that 7zip beats paq8o too. Clearly you need a 3-D model (4 x 18 x n). paq would find a record length of either 4 or 72 but it needs both."

Yeah. I think another issue is that LZMA uses (pos&3) in its contexts. I think PAQ does something like that in some places, but not in all its mixer contexts. eg. when the pos is in certain places you want to favor different models.

But it's one of those issues with sparsity and file-specific stuff; you don't want to just add N bits of pos to all your contexts because on many files that would hurt.
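
To make the (pos & 3) idea concrete, here's a hypothetical sketch (not LZMA's or PAQ's actual machinery) that conditions a simple adaptive byte model on the low bits of position, so bytes at the same offset within a record share statistics:

```python
import math
from collections import Counter

def cost_bits(data, pos_bits=0):
    """Ideal code length in bits for an adaptive order-0 model
    keyed by (pos & mask), with +1 smoothing over 256 symbols."""
    mask = (1 << pos_bits) - 1
    models = [Counter() for _ in range(mask + 1)]
    bits = 0.0
    for i, b in enumerate(data):
        m = models[i & mask]
        total = sum(m.values()) + 256
        bits -= math.log2((m[b] + 1) / total)
        m[b] += 1  # adapt after coding, as a real coder would
    return bits
```

On data with a 4-byte record structure, pos_bits=2 costs far fewer bits than pos_bits=0; on unstructured files the extra contexts just dilute the statistics, which is exactly the sparsity problem with blindly adding N bits of pos everywhere.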

September 14, 2010 at 5:46 PM

Blogger cbloom said...

"More on-topic: auto-transposing arrays of structs sounds like something your C++ reflect thingy might be able to do?"

Sure. More generally if you have the full struct data you could create a custom compressor or preprocessor for that struct that does rejiggling to floats, etc.
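
The transpose half of that is tiny for fixed-size records; a minimal sketch (the float rejiggling would layer on top per field):

```python
def transpose_records(data, record_len):
    """Array-of-structs -> struct-of-arrays: group byte 0 of every
    record, then byte 1 of every record, and so on."""
    assert len(data) % record_len == 0
    n = len(data) // record_len
    return bytes(data[rec * record_len + field]
                 for field in range(record_len)
                 for rec in range(n))

def untranspose_records(data, record_len):
    """Exact inverse of transpose_records."""
    assert len(data) % record_len == 0
    n = len(data) // record_len
    return bytes(data[field * n + rec]
                 for rec in range(n)
                 for field in range(record_len))
```

With full struct reflection you'd transpose per field rather than per byte, and could apply field-specific transforms (delta on counters, sign/exponent splitting on floats) before the general-purpose back end.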

"Also, it seems like context sorting could easily go together with transposition, just by manipulating your comparison. Instead of comparing adjacent bytes, compare bytes separated by some stride."

Hmmmm I'm not sure how that would go, but that's a pretty interesting idea. You could do some kind of record-length-aware BWT. Obviously you could use any predicate other than just strcmp, but it's not obvious how that would affect inversion.

September 14, 2010 at 5:49 PM

Blogger won3d said...

There's RadixZip, which is very similar to my idea:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.3977

Which even dispenses with suffix sorting.

September 14, 2010 at 6:56 PM

Blogger cbloom said...

Oh, RadixZip! How did I not know about this?

How do you do the inverse sort order efficiently?

September 15, 2010 at 9:45 AM

Blogger won3d said...

You mean as opposed to inverting the sort transform and de-transposing in sequence? I haven't thought about it, but I would guess that it is certainly possible.

September 15, 2010 at 10:29 AM

Blogger won3d said...

So I think for fixed-width, you just do something like inverse-BWT, and the transpose is "free" by manipulating the way you index the output array. For variable-length records, I'm not sure there's anything faster than doing the inversion/transpose separately.

September 15, 2010 at 8:25 PM

Blogger cbloom said...

Yeah, the details aren't obvious to me but I need to sit down with a pen and paper to work it out.

It seems like within the record, you could sort using data[-4] as the predecessor instead of data[-1], or maybe use data[-4] as the immediate predecessor and then data[-1] as the next predecessor for breaking ties. Seems like that would do well on this data.
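
A toy forward-transform sketch of that comparator (assuming a fixed 4-byte record; as noted, whether this stays cheaply invertible like a normal BWT is the open question):

```python
def context_sorted_output(data, stride=4):
    """Sort positions primarily by the byte one record back
    (data[i - stride]), breaking ties with the ordinary previous
    byte (data[i - 1]). Negative indices wrap, so the context is
    cyclic like a standard BWT rotation sort."""
    idx = sorted(range(len(data)),
                 key=lambda i: (data[i - stride], data[i - 1]))
    return bytes(data[i] for i in idx)
```

This only emits the permuted bytes; a real transform would also need enough bookkeeping (an index, like BWT's primary index) to invert the permutation.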

September 15, 2010 at 8:30 PM

Anonymous Anonymous said...

That RadixZip is pretty cool.

The paper is kinda confusing since they e.g. transpose the "matrix" as a major algorithm step but then in all their examples continue drawing it non-transposed. They'd have been better off leaving that as an implementation detail rather than making it part of the algorithm. In general there are a few things like that that made me take more iterations of reading to decipher it than I'd have liked.

Also, why do papers make you wade through so much analysis and formalism before getting to the actual original/clever idea? I guess abstracts aren't currently for this, but couldn't the abstract have had a sentence like "For a stream which can be interpreted as distinct multi-character tokens, use only up to the beginning of the token as the context for a BWT-esque transform of the content" and saved me from having to read the first half of the paper to get to that tidbit?

I dunno about their O(N) claims. You need a stable sort, so you're paying significant overhead even if it's really O(N); but also, if you have 100 1-byte tokens and 1 100-byte token, don't you end up sorting 10000 bytes? I guess maybe you can virtualize the short ones, but at least as presented I thought they said you had to keep track of the short ones in the partitioning order. Inexplicably, they offer a proof of the first trivial lemma, and then omit it for the harder ones.

September 16, 2010 at 9:56 PM

Blogger won3d said...

I think the problem with academic-y papers is that they are written to get into conferences, so they end up with competing issues of clear exposition and "look, we did something challenging and it is the best thing ever." Blogs (like this one) seem much more valuable.

I saw the paper a while ago, and had a similar doubt about the O(n) stuff, and I vaguely remember convincing myself that it was OK, but I'll have to think it through at a more reasonable hour.

And to take the tangent a bit further: RadixZip is doing MTF + RLE + Huffman, similar to bzip2. Although, does it do RLE then? I thought it did RLE as a preprocess to BWT because of their bad suffix sorting, but maybe it does it post-MTF as well. In any case, what have people been doing after context sorting, besides the various MTF variants?
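
For reference, MTF itself is only a few lines; a minimal sketch of the encode/decode pair that sits between the context sort and the RLE/Huffman stages:

```python
def mtf_encode(data):
    """Move-to-front: emit each byte's current rank in a recency
    list, then move it to the front. Runs of similar bytes become
    runs of small ranks, which RLE + Huffman then exploit."""
    table = list(range(256))
    out = []
    for b in data:
        r = table.index(b)
        out.append(r)
        table.pop(r)
        table.insert(0, b)
    return out

def mtf_decode(ranks):
    """Inverse: maintain the same recency list and look ranks up."""
    table = list(range(256))
    out = bytearray()
    for r in ranks:
        b = table.pop(r)
        out.append(b)
        table.insert(0, b)
    return bytes(out)
```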

September 17, 2010 at 12:24 AM

Blogger cbloom said...

"In any case, what have people been doing after context sorting, besides the various MTF variants?"

There are various papers on this. Maybe I'll do a post with a bunch of compression papers & links. Short answer:

Maniscalco and others have an interesting approach using the suffix tree directly

"distance coding" is a new way that might be slightly better than MTF

there are also some papers on reconstructing the contexts from the post-BWT data

but most of the useful work has just been on modeling the post-MTF data well (eg. mixing etc., in bbb, bsc, etc.).

September 17, 2010 at 12:31 AM

Blogger ryg said...

"I think the problem with academic-y papers is that they are written to get into conferences, so they end up with competing issues of clear exposition and 'look, we did something challenging and it is the best thing ever.'"

Clear exposition my ass. They're written to get accepted at conferences and in journals, all right. If a paper actually offers a fundamentally new insight, clear exposition is a goal, since you want the reviewers (and readers) to get it (and cite you afterwards). But the vast majority of papers are incremental improvements on existing approaches, and at least in CS there's a definite tendency to pad things out (and make them more complex than necessary) to get one paper's worth of text out of an idea that can be succinctly explained in a single paragraph.

Also, most academic researchers won't put two closely related ideas into one paper if there's a realistic chance that they can milk it for two or more papers. The latter will give them more published articles and more citations.

Some people will even intentionally omit key implementation details to slow the "competition" down.

I've had all of this explained to me as the "way to go" by one of my CS professors - who happened to be the youngest and most successful (academically) professor in the whole faculty. At the same time I was working for another department where half of the PhD students were working on projects that they knew were dead-ends and without any practical relevance, because that's what they ended up getting funding for - and so on.

Bottom line: Academic research papers just aren't set up to communicate ideas with maximum efficiency or clarity - your best bet is usually to extract the basic idea from the paper, then write to the authors and ask them about the details. You do get the peer review though. Not a magic bullet, but the process is fairly good at weeding out brain farts (and complete BS) and finding oversights. With blogs, you gotta do that yourself.

September 20, 2010 at 12:32 AM
