You knew that couldn't be the end.
SuffixArray3 : suffix array string matcher which uses a min/max tree to find allowed offsets.
The min/max tree is a binary hierarchy ; at level L there are (size>>L) entries, and each entry covers a
range of size (1 << L). Construction is O(N) because N/2+N/4+N/8 ... = N
The min/max tree method is generally slightly slower than the elegant "chain of fences" approach
used for SuffixArray2, but it's close. The big advantage is the min/max tree can also be used for
windowed matching, which is not easy to integrate in SA2.
(ADDENDUM : this is super unclear, see more at end)
First check that it satisfies the O(N) goal on the stress tests :
[Image]
0 = stress_all_as
1 = stress_many_matches
2 = stress_search_limit
3 = stress_sliding_follow
4 = stress_suffix_forward
5 = twobooks
Yep. Then check optimal parse, large window vs. the good options :
[Image]
0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin
The good Suffix Trie is clearly the best, but we're in the ballpark.
Now optimal parse, 17 bit (128k) window :
totals:
Test_MMC2 : DNF
Test_LzFind2 : 506.954418 , 1355.992865
Test_SuffixArray3 : 506.954418 , 514.931740
Test_MMC1 : 13.095260 , 1298.507490
Test_LzFind1 : 12.674177 , 226.796123
Test_Hash1 : 503.319301 , 1094.570022
[Image]
Finally greedy parse, 17 bit window :
totals:
Test_MMC2 : 0.663028 , 110.373098
Test_LzFind2 : DNF
Test_SuffixArray3 : 0.663036 , 236.896551
Test_MMC1 : 0.663028 , 222.626069
Test_LzFind1 : 0.662929 , 216.912409
Test_Hash1 : 0.662718 , 62.385071
[Image]
average match length :
[Image]
And once more for clarity :
Greedy parse, 16 bit window , just the good candidates :
totals:
Test_SuffixArray3 : 0.630772 , 239.280605
Test_LzFind1 : 0.630688 , 181.430093
Test_MMC2 : 0.630765 , 88.413339
Test_Hash1 : 0.630246 , 51.980073
[Image]
[Image]
It should be noted that LzFind1 is approximate, and Hash1 is even more approximate. Though looking at the match length chart
you certainly can't see it.
ADDENDUM : an email I wrote trying to explain this better :
L
That is, level 0 has intervals of size 1, that's just
Tree_0[S] = sortIndexInverse[S]
(eg. it's just the file position of that sort pos)
(in fact we don't make this level of the tree, we just use sortIndexInverse which we already have)
Tree_1[i] covers steps of size 2, that is :
file positions sortIndexInverse[i*2] and sortIndexInverse[i*2+1]
I store the min and max of that
Tree_2[i] covers steps of size 4, that is min and max of Tree_1[i*2] and Tree_1[i*2+1]
Now once you have this binary interval tree you can do the normal kind of binary tree walk to find the closest neighbor that's in the range you want.
Also see the following blog on
how I walk binary interval trees
and the
released source code for StringMatchTest contains the full
code for this matcher.
"10-01-11 - String Match Results Part 6"
No comments yet. -