Some wrapping up.
I've only talked about LZSA for the case of static dictionaries. What about the more common case
of scanning through a file, where the dictionary is built dynamically from previously transmitted data?
Well, in theory you could use LZSA there. You'd need a streaming/online suffix trie construction. That does exist,
and it's O(N) just like offline suffix array construction. So in principle it's possible.
But it's really no good. The problem is that LZSA is transmitting substrings with probability based on their
occurrence counts in the dictionary. That's ideal for data generated by a truly static source (that also has no finite state patterns).
Real world data is not like that. Real world sources are always locally changing (*). This means that low-offset recent
data is much better correlated than old data. That is easy to model in LZ77 but very hard to model in LZSA; you would have
to somehow work the offsets of the strings into their code space allocation. Data that has finite state patterns also
benefits from numeric modeling of the offsets (eg. binary 4-8 patterns and so on).
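To make the counts-based coding concrete, here's a toy Python sketch (the function names are mine, not from the series): for a static dictionary, the probability LZSA allocates to a substring is just the number of dictionary suffixes that start with it, over the total suffix count.

```python
import bisect

def build_sorted_suffixes(dictionary):
    # Suffix "array" as a sorted list of suffix strings.
    # Materializing the strings is O(N^2) space - toy-sized input only.
    return sorted(dictionary[i:] for i in range(len(dictionary)))

def count_matches(suffixes, s):
    # All suffixes starting with s form one contiguous run in sorted order.
    lo = bisect.bisect_left(suffixes, s)
    hi = lo
    while hi < len(suffixes) and suffixes[hi].startswith(s):
        hi += 1  # linear scan of the run; fine for a toy
    return hi - lo

def substring_probability(suffixes, s):
    # LZSA's static model: P(s) = (# suffixes starting with s) / N
    return count_matches(suffixes, s) / len(suffixes)
```

For the dictionary "abracadabra", "a" occurs at 5 of the 11 suffix starts, so it gets probability 5/11 regardless of how recently it was seen - which is exactly the property that breaks down on locally-changing real data.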
(* = there's also a kind of funny "Flatland" type thing that happens in data compression. If you're in a 2d Flatland,
then 2d rigid bodies keep their shape, even under translation. However, a 3d rigid body that's translating through
your slice will appear to change shape. The same thing happens with data compression models. Even if the source comes
from an actual static model, if your view of that model is only a partial view (eg. your model is simpler than the real
model) then it will appear to be a dynamic model to you. Say you have a simple model with a few parameters;
the best values of those parameters will change over time, so the source appears dynamic.)
Furthermore, at the point that you're doing LZSA-dynamic, you've got the complexity of PPM and you may as well just do PPM.
The whole advantage of LZSA is that it's an incredibly simple way to get a PPM-like static model. You just take your
dictionary and suffix sort it and you're done. You don't have to go training a model, you don't have to find a way to
compact your PPM nodes. You just have suffix sorted strings.
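That "suffix sort and you're done" setup can be sketched in a few lines. This is a hedged illustration, not the series' actual implementation: a naive suffix sort of the static dictionary, then longest-match lookup by binary searching the sorted suffixes and checking the two neighbors.

```python
import bisect

def build_suffix_array(dictionary):
    # Offline suffix sort. A real implementation would use an O(N)
    # construction; this naive sort is just for illustration.
    return sorted(range(len(dictionary)), key=lambda i: dictionary[i:])

def longest_match(dictionary, sa, target):
    # The longest dictionary match for `target` is against one of the
    # two suffixes adjacent to its insertion point in sorted order.
    suffixes = [dictionary[i:] for i in sa]  # toy: materialized strings
    pos = bisect.bisect_left(suffixes, target)
    best_len, best_off = 0, -1
    for j in (pos - 1, pos):
        if 0 <= j < len(sa):
            s = suffixes[j]
            # Common-prefix length with this neighbor suffix.
            l = 0
            while l < min(len(s), len(target)) and s[l] == target[l]:
                l += 1
            if l > best_len:
                best_len, best_off = l, sa[j]
    return best_off, best_len
```

No training pass, no node pools: sorting the dictionary once gives you both the match finder and (via suffix counts, as in earlier parts) the code space allocation.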
Some papers for further reading :
"A note on the Ziv - Lempel model for compressing individual sequences" - Langdon's probability model equivalent of LZ
"Unbounded Length Contexts for PPM" - the PPM* paper
"Dictionary Selection using Partial Matching" - LZ-PM , an early ROLZ
"PPM Performance with BWT Complexity" - just a modification of PPM* with some notes.
"Data compression with finite windows" - LZFG ; the "C2" is the Suffix Trie LZFG that I usually think of.
Mark Nelson's Suffix Tree writeup ; see also
Larsson, Senft, and Ukkonen.
There's obviously some relation to ACB, particularly LZSA*, just in the general idea of finding the longest backwards context
and then encoding the longest forward match from there, but the details of the coding are totally unrelated.
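That ACB idea can be sketched naively (this is my illustrative code, with none of ACB's actual sliding-window machinery or its coding): among all earlier positions, pick the one whose preceding bytes share the longest common suffix with the current context, then measure the forward match from there.

```python
def acb_style_match(data, pos):
    # Find the earlier position whose backwards context best matches
    # the context at `pos`, then return (that position, forward match len).
    def back_match(i, j):
        # Length of the common suffix of data[:i] and data[:j].
        l = 0
        while l < i and l < j and data[i - 1 - l] == data[j - 1 - l]:
            l += 1
        return l

    # O(N) scan per position - a real ACB keeps contexts sorted instead.
    # Ties broken toward the most recent position.
    best = max(range(pos), key=lambda j: (back_match(pos, j), j), default=None)
    if best is None:
        return None, 0
    fwd = 0
    while pos + fwd < len(data) and data[best + fwd] == data[pos + fwd]:
        fwd += 1  # overlapped LZ77-style matches are allowed
    return best, fwd
```

In "xyzabc_xyzabd" at the second "a" (index 10), the backwards context "xyz" points at index 3, and the forward match from there is "ab", length 2.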
"02-06-15 - LZSA - Part 6"