Challenge to the reader : can this be extended for windowed matching?
You will need to construct an array of "next pos preceding but in window".
September 30, 2011 at 5:00 PM
A suffix sorter (such as the excellent divsufsort by Yuta Mori) provides a list of the suffix positions
in an array in sorted order. Eg. sortedSuffixes[i] is the ith suffix in order.
You can easily invert this table to make sortLookup such that sortLookup[ sortedSuffix[i] ] == i . eg.
sortLookup[i] is the sort order for position i.
Now at this point, for each suffix sort position i, you know that the longest match with another suffix is
either at i-1 or i+1.
Next we need the neighboring pair match lengths for the suffix sort. This can be done in O(N)
as
previously described here . So we now have a sortSameLen[] array such that sortSameLen[i] tells
you the match length between (sorted order) elements i and i+1.
Using just these you can find all the match lengths for any suffix in the array thusly :
For a suffix start at index pos
Find its sort order : sortIndex = sortLookup[pos]
In each direction (+1 and -1)
current_match_len = infinite
step to next sort index
current_match_len = MIN(current_match_len,sortSameLen[sort index])
Okay. This is all old news. But it has a problem that has been
discussed previously .
When matching strings for LZ and such, we don't want the longest match in the array, we want the longest
match that occurs earlier. Handled naively this ruins the great O() performance of suffix array string matching.
But you can do better.
Run
Algorithm Next Index with Lower Value on the sortedSuffix[] array. This provides an array
nextSuffixPreceding[]. This is exactly what you need - it provides the next closest suffix with a
preceding index.
Now instead of the longest match being at +1 and -1, the longest match is at nextSuffixPreceding[i] and
priorSuffixPreceding[i].
There's one last problem - if my current suffix is at position pos, and I look up si = sortIndex[pos] and
from that nextSuffixPreceding[si] - I need to walk up to that position one by one doing MIN() on the
adjacent pair match lengths (sortSameLen). That ruins my O() win.
But there's a solution - simply build the match length as well when you run "next index with lower value".
This can be done easily by tracking the match length back to the preceding "fence". This adds no complexity
to the algorithm.
The total sequence of operations is :
sort suffixes : O(NlogN) or so
build sort lookup : O(N)
build sort pair same len : O(N)
build next/prior pos preceding with match lengths : O(N)
now to find a match length :
at position "pos"
si = sortLookup[pos]
for each direction (following and preceding)
matchpos = nextSuffixPreceding[si]
matchlen = nextSuffixPreceding_Len[si]
that is, the match length lookup is a very simple O(1) per position (or O(N) for all positions).
One minor annoyance remains, which is that the suffix array string searcher does not provide the lowest
offset for a given length of match. It gives you the closest in suffix order, which is not what you want.
"09-28-11 - String Matching with Suffix Arrays"
1 Comment -
Challenge to the reader : can this be extended for windowed matching?
You will need to construct an array of "next pos preceding but in window".
September 30, 2011 at 5:00 PM