Googles appar
Huvudmeny

Post a Comment On: cbloom rants

"09-24-11 - Suffix Tries 1"

7 Comments -

1 – 7 of 7
Blogger Cyan said...

Afaiu, the "follow pointer" provides a great speed boost.

It seems logical. However, i guess for this to be correct that the suffix trie is no longer timely ordered (which means pointer to sibling and child always go to older positions) ?

Here is an example where the follow pointer seems to fail finding the best match length :

Imagine there is a match at
BAND; (4 characters)
then the "follow pointer" should point to
AND; (3 characters)
but maybe, in between, there was a position with :
AND THIS ONE IS BETTER; (22 characters)

September 26, 2011 at 5:54 AM

Blogger cbloom said...

No, a correct implementation of follow pointers will point to the *node* for "AND" not the leaf for "AND;" so if there are longer matches possible it will find them.

If a follow pointer is original made pointing at a leaf, it must be updated to point to a node when that leaf is later split.

For example say "bands..." is originally put in the tree as a leaf, the follow pointer will point to "ands.." also a leaf. Later on "andy.." is seen and the "and" prefix is split into a node that branches to 's' and 'y'. Now the follow pointer for "band" needs to be updated to point to the node for "and".

This requires some maintenance structures but doesn't change the O().

September 26, 2011 at 8:46 AM

Blogger Cyan said...

OK, so a node does not point towards a position in the input stream, only a leaf does.

September 26, 2011 at 9:05 AM

Anonymous Anonymous said...

Is there a way to create the follow pointer efficiently other than looking up the following chars?

If you were doing M queries and N inserts and M >> N, then pushing the work into the insert would make sense. But in LZ compression, you have N suffixes and N positions at which you try to find the best match, so I would think you're looking at N inserts and N queries.

If the follow makes the N queries O(1) time at the cost of making creating the follow take as much time as the query would have been, you're not saving anything.

But maybe there's some trick I'm missing.

September 26, 2011 at 5:26 PM

Blogger cbloom said...

Yeah, so :

1. In suffix tries the insert & find are really the same operation so the only thing that makes sense is to make a combined "find_and_insert".

2. Follow pointer maintenance and such really only work great if you are doing "find_and_insert" on every position.

3. So, here's how it works :

at position N I want to do a "find_and_insert". I start in the tree by using the follow pointer from the match at position N-1.

As I do my match at position N I make some new nodes. Those are missing follow pointers, and I just set them aside (I don't immediately fill them out). (with path compression this is only ever one new node)

I now move on to position N+1 and use the follow pointer from position N to start my search. I'm now in the right part of tree to fill out the follow pointer that was missing which I set aside in the previous search.

I actually understand it now as a funny way of sequentially walking through the tree. At every position N you are filling out the previous follow pointer, using the current follow pointer, and trying to find the next follow pointer. You are almost never walking the tree from the root to the leaves, instead you're jumping around on these links.

I actually think it's closely related to the BWT and the tree-structured BWT coders by Maniscalco and others but I haven't quite grocked that yet.

September 26, 2011 at 5:47 PM

Blogger cbloom said...

It should be reasonably easy to see that you can walk an entire file in O(N).

Any time you fail to grow a follow (that is, you had lastml at the previous spot and you now have lastml-1) that is done in O(1).

If you do find a longer match, say it's longer by L, that is O(L). But you carry that L forward to the next spot. So if you add up all the match grow lengths, the sum of all of them is O(N).

Therefore the whole walk is O(N).

September 26, 2011 at 6:05 PM

Anonymous Anonymous said...

Oh wait, I see. It's contingent on this idea that we're always tracking two places at once, the current insertion and the best match. Then we can use the follow from the best match to guide the insertion.

(Well, and even if we weren't desiring to track a best match, you could still track the best match just to run this algorithm faster. Weird.)

September 26, 2011 at 8:15 PM

You can use some HTML tags, such as <b>, <i>, <a>

This blog does not allow anonymous comments.

Comment moderation has been enabled. All comments must be approved by the blog author.

You will be asked to sign in after submitting your comment.