Googles appar
Huvudmeny

Post a Comment On: cbloom rants

"09-27-11 - String Match Stress Test"

4 Comments -

1 – 4 of 4
Anonymous Anonymous said...

I understand how the later tests are testing for O(N^2) traps and etc., but I don't understand how twobooks is relevant to that stuff--isn't it just testing whether you can find long matches at all? There shouldn't be anything O(N^2) about it, should there?

(Also, if you're making a compressor that operates on, say, independent 256KB blocks, it's kind of an irrelevant test.)

September 27, 2011 at 6:37 PM

Blogger cbloom said...

" I understand how the later tests are testing for O(N^2) traps and etc., but I don't understand how twobooks is relevant to that stuff ... "

Yeah, there is. (this is for non-greedy parsing only). That's why twobooks is interesting, it isolates one specific O(N^2) case that is rarely handled.

Any string matcher that does not have a "follows" implementation will come to the second occurance of book1 and do something like this :

find the longest match, which is N characters, which takes N character compares in the strcmp

go to the next pos, find the longest match, that takes N-1 character compares

etc. N+(N-1)+(N-2) ... = O(N^2)

You can fail to see these traps if you think of the string compare as being O(1).

"(Also, if you're making a compressor that operates on, say, independent 256KB blocks, it's kind of an irrelevant test.)"

You could take the first 128k of book1 and do the same test.

But really it applies to any long match scenario in non-greedy parsing.

Obviously most people (such as myself in the exist rrLZH code) limit their exposure to these kind of traps by doing things like bailing out of optimal parsing when they see a match longer than 256 or whatever.

September 27, 2011 at 7:10 PM

Anonymous Anonymous said...

Oh, right, I see, ok. Ugh.

Of course a match of length K at position N always implies a match of K-1 at position N+1, so you could probably avoid this specific case pretty easily (and optimizing that is useful in general for long matches, not just in this specific case).

September 27, 2011 at 7:30 PM

Blogger cbloom said...

Yeah, true, "twobooks" is actually a very easy to fix degenerate case of this problem. You can patch a solution to it onto any underlying string matcher just by tracking {lastoffset,lastml} and checking them first.

But that doesn't really solve the underlying problem. It's sort of like patching on handling of the degenerate RLE case, that doesn't fix the ababab case, etc.

You can make a trickier "follows" stress test by repeating a long common prefix which is then followed by a variety of things. And also suffixes of the long prefix, such that the simple {lastml,lastoffset} no longer gives you the best match.

September 27, 2011 at 7:36 PM

You can use some HTML tags, such as <b>, <i>, <a>

This blog does not allow anonymous comments.

Comment moderation has been enabled. All comments must be approved by the blog author.

You will be asked to sign in after submitting your comment.