Class | SuffixArrayDelta::DeltaGenerator |
In: | lib/sadelta.rb |
Parent: | Object |
Uses a SuffixArray, a source, a target, and an Emitter to create a sequence of INSERT/MATCH events. The emitter is responsible for using these events to do something useful.
SHORT_MATCH_THRESHOLD | = | 30 |
short_match_threshold | [R] | |
short_match_threshold | [W] |
Initializes the generator so that generate can do its work. The short match threshold (see generate) defaults to 30, which informally seemed to produce the best overall deltas. Changing the short_match_threshold hasn’t been fully tested, so it’s not allowed right now. In the future it might be a good idea to make this adapt to the input.
Does the actual work of generating the INSERT/MATCH events. The algorithm is dead simple: a while loop repeatedly calls SuffixArray#longest_nonmatch and produces the required events from each result. It continues until the target data is exhausted.
The only strange part is the use of @short_match_threshold as the third parameter of the SuffixArray#longest_nonmatch call. The short match threshold is a setting that helps create more efficient deltas by folding any MATCH smaller than the threshold into the non-match region. In other words, longest_nonmatch treats both the bytes it cannot find and any MATCH shorter than the threshold as part of the non-matching region. It stops extending the non-match once it finds a MATCH greater than the short match threshold.
Currently it defaults to 30, which in my quick tests seemed to be a good limit on the size of a match. A more adaptive algorithm would be better, one where the short_match_threshold is adjusted based on either the size of the file or the size of each match found.
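The loop and the threshold rule described above can be sketched roughly as follows. This is only an illustration, not the code in lib/sadelta.rb: naive_longest_nonmatch is a hypothetical brute-force stand-in for SuffixArray#longest_nonmatch, its return shape of [nonmatch_length, match_start, match_length] is an assumption, and events are collected into an array rather than handed to an Emitter. The helper names longest_match_at and generate_events are likewise made up for this example.

```ruby
# Hypothetical helper: longest run of target[pos..] found anywhere in source.
# Brute force stands in for the suffix array lookup.
def longest_match_at(source, target, pos)
  best_start = 0
  best_len = 0
  (0...source.length).each do |s|
    len = 0
    len += 1 while s + len < source.length &&
                   pos + len < target.length &&
                   source[s + len] == target[pos + len]
    best_start, best_len = s, len if len > best_len
  end
  [best_start, best_len]
end

# Hypothetical stand-in for SuffixArray#longest_nonmatch.
# Returns [nonmatch_length, match_start, match_length] (assumed shape).
def naive_longest_nonmatch(source, target, start, threshold)
  i = start
  while i < target.length
    match_start, match_len = longest_match_at(source, target, i)
    # Assumption: only a match strictly longer than the threshold ends the
    # non-match region; anything shorter or equal is absorbed into it.
    return [i - start, match_start, match_len] if match_len > threshold
    i += 1
  end
  [target.length - start, 0, 0] # ran off the end: everything left is non-match
end

# The generate loop: keep asking for the next non-match/match pair until the
# target is exhausted, recording an INSERT and/or MATCH event each time.
def generate_events(source, target, threshold = 30)
  raise ArgumentError, "threshold must be >= 0" if threshold < 0
  events = []
  i = 0
  while i < target.length
    nonmatch_len, match_start, match_len =
      naive_longest_nonmatch(source, target, i, threshold)
    events << [:insert, target[i, nonmatch_len]] if nonmatch_len > 0
    events << [:match, match_start, match_len] if match_len > 0
    i += nonmatch_len + match_len
  end
  events
end
```

With a small threshold, short shared runs become MATCH events; with a large one they are swallowed by the surrounding INSERT, which is the trade-off the threshold controls. For example, generate_events("the quick brown fox jumps", "a quick brown cat jumps", 4) yields two INSERT and two MATCH events, while the default threshold of 30 yields a single INSERT covering the whole target.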