Post a Comment On: cbloom rants

"03-15-15 - LZ Literal Correlation Images"

6 Comments -

1 – 6 of 6
Blogger Paul W. said...

I think that the square circled in green in the first picture is from 48-57, not 48-58.

That's the range of ASCII digit characters 0-9, and presumably what you're seeing is due to numeric data represented as text, and any digit being about equally likely to follow any other digit, once you've stripped away the (typically more redundant) leading digit strings with an LZ match.
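A minimal sketch of how one might check this numerically, assuming the picture is an order-1 histogram of (previous byte, current byte) pairs. The function names are made up for illustration and are not from the post.

```python
from collections import Counter

# ASCII '0'..'9' is 48..57 inclusive, so the box is 10x10.
DIGITS = range(ord("0"), ord("9") + 1)

def order1_counts(data: bytes) -> Counter:
    """Count (previous byte, current byte) pairs: an order-1 histogram."""
    return Counter(zip(data, data[1:]))

def digit_box_fraction(data: bytes) -> float:
    """Fraction of adjacent byte pairs where both bytes are ASCII digits."""
    counts = order1_counts(data)
    total = sum(counts.values())
    inside = sum(n for (prev, cur), n in counts.items()
                 if prev in DIGITS and cur in DIGITS)
    return inside / total if total else 0.0
```

If digits really do follow each other about uniformly, the counts inside that box should be roughly flat rather than concentrated on the diagonal.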

May 2, 2015 at 8:44 AM

Blogger Paul W. said...

BTW, what is the purpose of this post? Is it to pose a puzzle for your readers---what are the sources that look like this under these transforms?

Are the fez, lzt24, and lzt99 standard test files like enwik?

May 3, 2015 at 7:29 AM

Blogger cbloom said...

Fez, lzt24 & lzt99 are some of my test files from my collection of videogame test data. I picked them because they seem to be pretty good representatives of some data types (lzt99 is an aggregate of several files).

(testing on enwik is considered harmful)

The point is that we were investigating LZ literal compression and I thought it might be helpful to visualize the models and see if anything stands out.

You can certainly see how different LO correlation is vs O1.

You can see that Fez is in fact perfect sub data. You can see that lzt24 has some perfect sub data, but also some strong order0 peaks that are screwed up by sub and xor.
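To make the sub vs. xor distinction concrete, here is a rough sketch, assuming each literal is paired with some predicted byte (for example the byte at the last match offset; that choice is an assumption here, as is the helper name).

```python
def residuals(literal: int, predicted: int) -> tuple:
    """Return (sub, xor) residuals of a literal byte against a predicted byte."""
    sub = (literal - predicted) & 0xFF  # byte-wise difference, wraps mod 256
    xor = literal ^ predicted           # bit-wise difference
    return sub, xor

# A value that steps from 0x7F to 0x80 is a tiny change under sub,
# but flips every bit under xor.
print(residuals(0x80, 0x7F))  # (1, 255)
```

On smoothly varying numeric data sub keeps residuals near zero, while a literal that is almost always the same value (an order-0 peak) gets smeared across many residuals by either transform, because the prediction it is combined with keeps changing.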

May 3, 2015 at 9:42 AM

Blogger Paul W. said...

Can you redistribute your test files? I'd be interested in plotting them with (my modded version of) Matt Mahoney's fv program, which shows different regularities and usually makes discontinuities within a file clear.

The order 1 raw picture of Fez shows a faint feature in approximately the same place on the diagonal as the bright box in enwik7 for ASCII lower-case letters following each other. (But it sorta looks like 4 faint blobs arranged in a square for some other reason.) Is there any ASCII at all in that file---maybe a text header or something?

May 3, 2015 at 10:03 AM

Blogger cbloom said...

I can't redist files from games.

You can get "Fez_Essentials.pak" by downloading or buying the game "Fez".

It's a simple tar-like pak file; it has a small text header followed by the binary data.

I can redist lzt24 since it is RAD-owned data. You can email me or maybe I'll just post it.

May 3, 2015 at 10:09 AM

Blogger Paul W. said...

What kind of LZ match are you doing? Is it a fixed-length (3 byte?) match, or greedy, or what?

--

This is all interesting to me because I'm looking into cheap feature detectors and classifiers to figure out how to compress whatever input you get, without having to run a bunch of models and adapt between them like a context mixer...

lzt99 looks like it may have at least 7-bit ASCII text in it... it's got obvious activity in the lowercase ASCII box as well as the digits box. Hard to tell if it's using any extended or multibyte chars (like enwik) because it's so faint.

From just the order 1 raw picture, it looks like fez is multibyte integer data where the values only cover a fraction of the range, so that the high few bits of the high bytes are all 0's for positive numbers, or all 1's for negative numbers, with no evident imbalance between the two, and a skew toward small absolute values, like in a variable-amplitude waveform.
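A small sketch of where the high byte lands for each sign, assuming signed 16-bit little-endian two's complement samples. The width and byte order are guesses for illustration; nothing here is known about the actual file.

```python
import struct

def high_bytes(values):
    """High byte of each value when packed as signed 16-bit little-endian."""
    return [struct.pack("<h", v)[1] for v in values]

# Small magnitudes cluster at the two ends of the byte range.
print([hex(b) for b in high_bytes([3, -3, 300, -300])])
# ['0x0', '0xff', '0x1', '0xfe']
```

With values skewed toward small magnitudes, the high bytes pile up near 0x00 and 0xFF, which would give four clusters in an order-1 plot of high byte against high byte.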









May 3, 2015 at 10:37 AM
