Googles appar
Huvudmeny

Post a Comment On: cbloom rants

"09-26-11 - Tiny Suffix Note"

8 Comments -

1 – 8 of 8
Blogger Shelwien said...

> Suffix arrays are even worse in that they don't slide at all.

They kinda do. Its pretty simple to add or remove a symbol actually.
But these simple operations have O(N) complexity because parts of
a table have to be moved.
And when we try to optimize it for speed, trees appear, and it becomes
hard to recognize any relation to BWT :)

> One is how to look up a string in the chunk that it is not a member of

I'd do BWT on adjacent pairs of chunks (overlapped),
then we'd automatically know where to look next.
But this requires 8N memory, and 8N structures like MMC suddendly
become more attractive.

Btw, there's also that - http://encode.ru/threads/1278-Nibble-MMC-matchfinder

September 27, 2011 at 4:53 AM

Blogger Shelwien said...

Ah, so to use suffix arrays for matchfinding, I'd just compute BWT
of overlapped 2*winsize blocks. It would take the same memory as
MMC or binary tree, but with up to 2x window size.

September 27, 2011 at 4:58 AM

Blogger cbloom said...

Yeah, so the problem with 2*winsize blocks is the second problem - you can have find lots of matches that are ahead of you instead of behind you.

September 27, 2011 at 9:50 AM

Blogger Shelwien said...

No, half of them would be certainly behind. Also the best match would be nearest in SA, so we can apply iteration limits like in hash chains to avoid O(N).

It might be also possible to add some links and skip the "future" pointers, but that would be again at least 8N which doesn't make sense.

September 27, 2011 at 10:32 AM

Blogger cbloom said...

No, read my post again.

September 27, 2011 at 10:47 AM

Blogger ryg said...

"No, half of them would be certainly behind."
Nope. Half of the *data* you want to compress is in the second half of the block, but it can contain way more than half of all potential matches.

Simple example: say your data consists of two blocks, each having winsize bytes. The first block consists of groups of 4 characters each - 50% of them are 4 random bytes, the other 50% are always "abab". The second block is just "ababababab"...

With high likelihood, almost all of your matches are going to be "abab", and most of the positions for "abab" in your suffix array are going to be in second block. (In the first block, there's going to be about winsize/8 locations for "abab"; in the second block, there's winsize/2-1 of them).

This may sound constructed, but you actually get patterns very much like that on highly repetitive structured data.

September 27, 2011 at 12:17 PM

Blogger cbloom said...

Yup. Unfortunately on a file of size N and window size W, the suffix array matcher is O(N*W). (or you can limit the walk but suffer an unknown amount of missed matches)

This is rare in practice, but I can't say that suffix arrays are the awesome solution when they have such a bad degeneracy possible.

September 27, 2011 at 12:36 PM

Blogger cbloom said...

Booya. I got it. Post later today.

September 27, 2011 at 1:51 PM

You can use some HTML tags, such as <b>, <i>, <a>

This blog does not allow anonymous comments.

Comment moderation has been enabled. All comments must be approved by the blog author.

You will be asked to sign in after submitting your comment.