
Post a Comment On: cbloom rants

"08-20-10 - Deobfuscating LZMA"

6 Comments

Blogger ryg said...

On the delta literal stuff: "When I first saw this I thought "oh it's predicting that the char at the last offset is similar, so the xor makes equal values zero" , but that's not the case at all. For one thing, xor is not a great way to handle correlated values, subtract mod 256 would be better."
There's a crucial bit that you're missing here. In structured binary data, even on a non-match, the top bits of the byte that failed to match are usually a good predictor for the top bits of the byte you're trying to code (I'll give an example in a minute). With xor, matching top bits are guaranteed to be zero; with sub they could be either all-0 or all-1 depending on whether a carry occurs or not. Assuming that top bits do tend to match, the sub has a near-random MSB which will take roughly 1 bit to code, while xor results in a MSB that's most likely zero, same as the other bits.
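A tiny illustration of the point (the byte values are mine, not from the thread): when two bytes agree in their top bits, xor is guaranteed to zero those bits regardless of which value is larger, while subtraction mod 256 can borrow and turn them all to 1.

```python
# Bytes whose top bits match (both 0x4_): xor zeroes the matching bits,
# but sub mod 256 borrows when a < b, setting the top bits to all-ones
# and making the MSB of the residual near-random.

def delta_xor(a, b):
    return a ^ b

def delta_sub(a, b):
    return (a - b) & 0xFF

a, b = 0x41, 0x43
print(f"xor: {delta_xor(a, b):#04x}")  # 0x02 -> top bits stay zero
print(f"sub: {delta_sub(a, b):#04x}")  # 0xfe -> borrow sets top bits to 1
```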

Why top bits would match: A lot of values in binary files are either small integers (clustering near zero) or offsets (which exhibit high locality). Say you have a 200k app binary. All called addresses are gonna be within a 200k window of the code segment base address. That's a bit less than 18 bits worth of payload, with the rest (=the top bits) always being equal.
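The arithmetic checks out: 200 KB = 204800 bytes, just under 2^18 = 262144, so any two code addresses in such a window agree in everything above the low 18 bits (as long as the window doesn't straddle a 2^18 boundary). A throwaway check, with a made-up base address:

```python
# Hypothetical 200k code window; the base address is invented for illustration.
base = 0x00401000
window = 200 * 1024            # 204800 < 2**18 == 262144

addr1 = base + 0x01234
addr2 = base + 0x2F000         # both inside the window
# The two addresses differ only in their low 18 bits, so the top 14 bits
# of a 32-bit address are identical:
assert (addr1 >> 18) == (addr2 >> 18)
print(f"{addr1:#010x} and {addr2:#010x} share top bits {addr1 >> 18:#x}")
```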

To exploit this even more, the preprocessing filters used for things like x86 code compression usually turn call offsets (which are relative to make the code relocatable) into absolute addresses and byte-swap them into big endian. Why does this help?

Absolute addresses help because calls to the same function are now identical, instead of just having similar offsets (kinda similar doesn't cut it for LZ or context coders!). Big endian helps because the bytes that are likely to be equal (the high bits of the address) are now grouped together with the call instruction. And the xor trick means that the bytes that don't match perfectly are still likely to agree in the top few bits.
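A minimal sketch of such a filter, in the spirit of the BCJ/E8-style transforms (a real filter also worries about false positives and instruction boundaries; this one naively rewrites every 0xE8 it sees):

```python
import struct

E8 = 0xE8  # x86 near-call opcode, followed by a 32-bit relative offset

def e8_filter(code: bytes) -> bytes:
    """Turn relative call offsets into big-endian absolute targets (sketch)."""
    out = bytearray(code)
    i = 0
    while i + 5 <= len(out):
        if out[i] == E8:
            rel = struct.unpack_from("<i", out, i + 1)[0]
            absolute = (rel + i + 5) & 0xFFFFFFFF   # target, relative to buffer start
            struct.pack_into(">I", out, i + 1, absolute)  # store big-endian
            i += 5
        else:
            i += 1
    return bytes(out)
```

After the transform, two call sites targeting the same function carry identical bytes, which LZ matches and context models can actually exploit.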

August 21, 2010 at 8:20 PM

Blogger cbloom said...

" with sub they could be either all-0 or all-1 depending on whether a carry occurs or not. Assuming that top bits do tend to match, the sub has a near-random MSB which will take roughly 1 bit to code, while xor results in a MSB that's most likely zero, same as the other bits."

I didn't mean that you would actually sub. I meant you would either use (sub + 128) or sub folded. That's what compression people mean when they say "sub" ;)

Xor does weird things. If you actually thought the values were predicted to be close, xor would be bad. For example 0x40 and 0x3F are 1 apart, but xor'd they are 0x7F.
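The 0x40/0x3F case generalizes: whenever two close values straddle a power-of-two boundary, xor flips a whole run of bits.

```python
# Pairs that differ by exactly 1 but straddle a power-of-two boundary:
for a, b in [(0x40, 0x3F), (0x80, 0x7F), (0x10, 0x0F)]:
    print(f"{a:#04x} ^ {b:#04x} = {a ^ b:#04x}, |a-b| = {abs(a - b)}")
# 0x7f, 0xff, 0x1f respectively -- large xor residuals for distance-1 pairs
```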

August 21, 2010 at 8:37 PM

Blogger ryg said...

"I didn't mean that you would actually sub. I meant you would either use (sub + 128) or sub folded. That's what compression people mean when they say "sub" ;)"
sub + 128 == sub ^ 128 so the MSB is just as random. Folding would work though.
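The identity is easy to check exhaustively: adding 128 mod 256 can never carry into the low 7 bits, so it does exactly the same thing as flipping the MSB with xor. (The `fold` below is one hedged reading of "sub folded" -- a zigzag map so small magnitudes of either sign get small codes; the name and mapping are my illustration, not from the thread.)

```python
# Adding 0x80 mod 256 never carries into the low 7 bits, so it is
# identical to xor with 0x80 -- the MSB stays just as random:
assert all(((d + 128) & 0xFF) == (d ^ 128) for d in range(256))

# A possible "folded sub": zigzag-map the signed delta so that small
# magnitudes (positive or negative) land on small code values.
def fold(d):
    s = d - 256 if d >= 128 else d       # reinterpret byte delta as signed
    return 2 * s if s >= 0 else -2 * s - 1
```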

"Xor does weird things. If you actually thought the values were predicted to be close, xor would be bad. For example 0x40 and 0x3F are 1 apart, but xor'd they are 0x7F."
Depends on how you're coding and what your expectations of closeness are. Xor corresponds to a notion of closeness in terms of hamming distance, not absolute value of the difference. The tree-structured binary contexts that LZMA uses are well-suited to exploit that kind of structure where it exists - much better than a standard "counting-based" model for a multisymbol arith coder.
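For concreteness, here is a sketch of how a bit-tree literal coder like LZMA's gives each prefix of already-coded bits its own binary context (probability adaptation and the range coder are omitted; this shows only the context indexing):

```python
# Sketch of LZMA-style bit-tree contexts: a byte is coded as 8 binary
# decisions, and the context for each decision is the tree node reached
# by the bits seen so far.
def bit_tree_contexts(byte):
    """Yield (context_index, bit) pairs in coding order."""
    ctx = 1                        # node index; 1 is the root
    for shift in range(7, -1, -1):
        bit = (byte >> shift) & 1
        yield ctx, bit             # each bit-prefix owns an adaptive model
        ctx = (ctx << 1) | bit     # descend: 0 -> left child, 1 -> right
    # after 8 steps, ctx == 0x100 | byte
```

Because each prefix has its own adaptive probability, the model can learn per-bit-position structure, which is exactly where hamming-style closeness pays off.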

August 21, 2010 at 9:13 PM

Blogger Ben Slivka said...

Fascinating analysis!
Thank you for sharing.
(I worked at Microsoft 1985-1999, and in 1993-1994 I defined the CAB file format, wrote the COMPRESS.EXE (née DIAMOND.EXE) tool to build CAB files, and purchased both the Quantum and then LZX compressors from their respective authors.)

August 29, 2017 at 5:15 PM

Blogger cbloom said...

Interesting connection. LZX and Quantum were sort of underground works of genius; they made big steps in compression that went largely unnoticed until LZMA came along and continued that thread, while everyone else was going in different directions. Around 1998 or so I worked for Dave Stafford, who wrote Quantum.

August 29, 2017 at 7:41 PM

Blogger cbloom said...

Dave explained Quantum's optimal parse to me at the time, but in hindsight I probably didn't understand him. I wonder if it was in fact quite similar to the LZMA parse.

August 29, 2017 at 7:57 PM
