Class SuffixArrayDelta::DeltaGenerator
In: lib/sadelta.rb
Parent: Object

Uses a SuffixArray, a source, a target, and an Emitter to create a sequence of INSERT/MATCH events. The emitter is responsible for using these events to do something useful.
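
As a rough illustration of the emitter's role, the sketch below records INSERT/MATCH events and then replays them against the source to rebuild the target. The Emitter interface is not shown on this page, so the class name RecordingEmitter, the methods insert and match, and the helper apply_events are all hypothetical names chosen for this example.

```ruby
# Hypothetical emitter: the real Emitter interface is not documented here,
# so these method names and argument orders are assumptions.
class RecordingEmitter
  attr_reader :events

  def initialize
    @events = []
  end

  # Literal data that has no usable counterpart in the source.
  def insert(data)
    @events << [:insert, data]
  end

  # A run of +length+ bytes copied from the source starting at +offset+.
  def match(offset, length)
    @events << [:match, offset, length]
  end
end

# Rebuilds the target by replaying the recorded events against the source.
def apply_events(source, events)
  events.each_with_object(+"") do |event, out|
    case event.first
    when :insert then out << event[1]
    when :match  then out << source[event[1], event[2]]
    end
  end
end

source  = "the quick brown fox"
emitter = RecordingEmitter.new
emitter.match(0, 10)    # copies "the quick " from the source
emitter.insert("red")   # new data that is not in the source
emitter.match(15, 4)    # copies " fox" from the source
rebuilt = apply_events(source, emitter.events)
```

A real emitter would more likely serialize the events into a delta file than keep them in memory; recording them simply makes the event stream easy to inspect.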

Methods

generate   new  

Constants

SHORT_MATCH_THRESHOLD = 30

Attributes

short_match_threshold  [R] 
short_match_threshold  [W] 

Public Class methods

Initializes the generator so that generate can do its thing. The short match threshold (see generate) defaults to 30, which informally seemed to produce the best overall deltas. Changing short_match_threshold hasn’t been fully tested, so it’s not allowed right now. It might be a good idea in the future to make this adaptable based on the input.

Public Instance methods

Does the actual work of generating the INSERT/MATCH events. The algorithm is dead simple: a while loop repeatedly calls SuffixArray#longest_nonmatch and produces the required events, continuing until the target data is exhausted.

The only strange part is the use of @shortest_match_threshold as the third parameter of the SuffixArray#longest_nonmatch call. The shortest match threshold is a setting that helps create more efficient deltas by folding any MATCH that is shorter than this threshold into the non-match region. In effect, longest_nonmatch treats any bytes not found, and any MATCH shorter than the threshold, as part of the non-matching region. It stops extending the non-match once it finds a MATCH longer than the threshold.

Currently it defaults to 30, which in my quick tests seemed to be a good lower limit on the size of a match. A more adaptive algorithm would be better, one where the shortest_match_threshold is adjusted based on either the size of the file or the size of each match found.
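
The loop and the threshold described above can be sketched roughly as follows. This is not the library's implementation: the longest_nonmatch function here is a brute-force stand-in that does not use a suffix array, and its signature, its return value, and the event format are assumptions made for the example. It also assumes a non-negative threshold.

```ruby
# Brute-force stand-in for SuffixArray#longest_nonmatch (assumed signature).
# Returns [nonmatch_length, match_offset, match_length]. The match fields are
# nil and 0 when the target runs out before a long enough match is found.
def longest_nonmatch(source, target, start, threshold)
  pos = start
  while pos < target.length
    best_off = nil
    best_len = 0
    # Find the longest run in the source that matches the target at pos.
    (0...source.length).each do |off|
      len = 0
      len += 1 while off + len < source.length &&
                     pos + len < target.length &&
                     source[off + len] == target[pos + len]
      best_off, best_len = off, len if len > best_len
    end
    # Only a match longer than the threshold ends the non-match region.
    return [pos - start, best_off, best_len] if best_len > threshold
    pos += 1 # short or missing match: absorb this byte into the non-match
  end
  [pos - start, nil, 0]
end

# The generate loop: call longest_nonmatch until the target is exhausted,
# producing an INSERT for each non-match and a MATCH for each long match.
def generate(source, target, threshold)
  events = []
  pos = 0
  while pos < target.length
    nonmatch_len, match_off, match_len =
      longest_nonmatch(source, target, pos, threshold)
    events << [:insert, target[pos, nonmatch_len]] if nonmatch_len > 0
    events << [:match, match_off, match_len] if match_len > 0
    pos += nonmatch_len + match_len
  end
  events
end

source = "hello wonderful world"
target = "hello big wonderful world!"
low_threshold_events  = generate(source, target, 3)
high_threshold_events = generate(source, target, 10)
```

With a threshold of 3, the six-byte "hello " run is emitted as its own MATCH. With a threshold of 10 it is too short to count, so it is folded into the INSERT that precedes the long " wonderful world" MATCH, giving fewer events overall.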
