Module SuffixArrayDelta
In: lib/sadelta.rb
A Suffix Array Delta (also called a Suffix Tree Delta) is a method of producing a delta that is reasonably small, favors small changes, and is fairly fast. It works by searching for matching and non-matching regions using a suffix array.
The Suffix Array implementation is a C extension which wraps a suffix array construction algorithm written by Sean Quinlan and Sean Dorward. Their library is licensed under the Plan 9 Open Source license (see ext/sarray/sarray.c for details).
If you want to quickly make and read deltas then use the SuffixArrayDelta#make_delta and SuffixArrayDelta#apply_delta functions. Also look at the ApplyDeltaCommand, MakeDeltaCommand, and ShowDeltaCommand for alternative ways to use this module.
The SuffixArrayDelta module consists of four classes:
[SuffixArray] C extension that does the grunt work of creating a suffix array and doing searches.
[Emitter] Responsible for taking INSERT and MATCH events and "doing something" with them.
[DeltaGenerator] Takes a Suffix Array, a target, and an Emitter, and then generates INSERT and MATCH events until finished. This makes delta files, but the emitter allows other options.
[DeltaReader] Reads in a delta "file" (input source) and generates INSERT/MATCH events to an Emitter. This allows applying the delta (ApplyEmitter), or printing it out (LogEmitter).
The SuffixArray is not defined in this module, but in the suffix_array.c extension.
This design allows for flexible configuration of the delta processing operations and lets people use them in new situations, but most of the time you'll just want to make and apply deltas without having to worry about how to "wire" the classes together. The two functions SuffixArrayDelta#make_delta and SuffixArrayDelta#apply_delta both do this for you and are used in the ApplyChangeSetCommand and MakeChangeSetCommand classes.
The actual delta creation algorithm is very simple since the SuffixArray class does all the heavy lifting. The key to how it works is the fact that we can make a suffix array fairly quickly, and then use that suffix array to find matching/non-matching regions between two strings. The delta creation simply involves taking the SuffixArray and calling SuffixArray#longest_nonmatch until we’ve exhausted the target string’s data.
Each call to SuffixArray#longest_nonmatch returns a triplet of [non-match length, match start, match-length] which is used to send INSERT and MATCH events to the Emitter.
Refer to SuffixArrayDelta#generate for more details, and SuffixArray#longest_nonmatch for how matching/non-matching is done.
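The generation loop described above can be sketched in plain Ruby. This is only an illustration: naive_longest_nonmatch is a hypothetical, brute-force stand-in for SuffixArray#longest_nonmatch (the real one lives in the C extension and uses the suffix array), the min_match threshold is an assumption, and events are collected into an array instead of being sent to an Emitter.

```ruby
# Hypothetical brute-force stand-in for SuffixArray#longest_nonmatch.
# Starting at byte offset `from` in target, returns the triplet
# [non-match length, match start, match length].
def naive_longest_nonmatch(source, target, from, min_match = 3)
  pos = from
  while pos < target.bytesize
    best_start = 0
    best_len = 0
    # Try every source offset and keep the longest common run.
    (0...source.bytesize).each do |s|
      len = 0
      len += 1 while s + len < source.bytesize &&
                     pos + len < target.bytesize &&
                     source.getbyte(s + len) == target.getbyte(pos + len)
      if len > best_len
        best_start = s
        best_len = len
      end
    end
    return [pos - from, best_start, best_len] if best_len >= min_match
    pos += 1
  end
  # No further match: everything left in the target is non-matching.
  [target.bytesize - from, 0, 0]
end

# Walks the target, turning each triplet into INSERT and MATCH events.
def generate_events(source, target)
  events = []
  pos = 0
  while pos < target.bytesize
    nonmatch_len, match_start, match_len =
      naive_longest_nonmatch(source, target, pos)
    events << [:insert, target.byteslice(pos, nonmatch_len)] if nonmatch_len > 0
    events << [:match, match_start, match_len] if match_len > 0
    pos += nonmatch_len + match_len
  end
  events
end

p generate_events("the quick brown fox", "the quick red fox")
# prints [[:match, 0, 10], [:insert, "red"], [:match, 15, 4]]
```

The brute-force search is quadratic and only there to make the loop runnable on its own; the point of the suffix array is to answer the same question much faster.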
Once a series of INSERT/MATCH records is recorded, we can reconstruct the target file given only the delta and the source. We simply process each record by sending it to an ApplyEmitter which either INSERTs the required block of data, or writes/copies the MATCH region from the source. The end result is a (hopefully) exact replica of the target file.
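A minimal sketch of that reconstruction step follows. The event tuples and the apply_events name are assumptions made for illustration and are not the ApplyEmitter interface; the logic is just "write INSERT data as-is, copy MATCH regions out of the source".

```ruby
require "stringio"

# Replays a list of records against the source, writing the result to out.
# Each record is assumed to be [:insert, data] or [:match, start, length].
def apply_events(source, events, out)
  events.each do |kind, a, b|
    case kind
    when :insert
      out.write(a)                       # literal data carried in the delta
    when :match
      out.write(source.byteslice(a, b))  # region copied from the source
    else
      raise ArgumentError, "unknown record type: #{kind.inspect}"
    end
  end
  out
end

source = "the quick brown fox"
events = [[:match, 0, 10], [:insert, "red"], [:match, 15, 4]]
out = StringIO.new
apply_events(source, events, out)
puts out.string
# prints "the quick red fox"
```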
Right now the algorithm is written to be as correct as possible, but not as fast as possible. Some possible improvements are:
* Use some of the more recent suffix array construction algorithms which are possibly faster.
* Implement a better search algorithm. Currently the search algorithm is a traditional binary search and must rescan the target until it finds a full match.
* Use a smaller delta encoding. Currently uses a byte followed by a set of 32 bit integers and possible INSERT data. BER encoding would work, but the current Array#pack and String#unpack functions don't handle streaming very well.
* Experiment with different caching options, pre-generating the suffix array, and maybe mmap files.
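To give a rough sense of what a smaller encoding could save, the snippet below compares Ruby's fixed 32-bit little-endian directive ("V") with its BER-compressed integer directive ("w") on a few sample values. The values are arbitrary, and this only shows the size difference; it does not address the streaming concern mentioned above.

```ruby
values = [1, 300, 70_000]

fixed = values.pack("V*")  # 4 bytes per integer, little-endian
ber   = values.pack("w*")  # BER-compressed: 7 bits of payload per byte

puts fixed.bytesize  # prints 12
puts ber.bytesize    # prints 6 (1 + 2 + 3 bytes)

# Both encodings decode back to the same integers.
raise "round trip failed" unless ber.unpack("w*") == values
raise "round trip failed" unless fixed.unpack("V*") == values
```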
The only format that matters at the moment is the delta file format created by the SuffixArrayDelta::FileEmitter, and read by the SuffixArrayDelta::DeltaReader. The file consists of a sequence of INSERT and MATCH records. Each record has the format:
[INSERT] byte=0 uint32(length) string(data) -- string is not 0 terminated.
[MATCH] byte=1 uint32(start) uint32(length)
Each uint32 is stored in little-endian (think Intel) byte order. This is partly an artifact of my using an Intel machine to make the program, and partly a choice based on the fact that most machines in use are little-endian, so converting to network byte order would be pointless overhead. Future versions may change this.
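As a sketch of the record layout above, the following encodes and decodes records with Array#pack and String#unpack, where "C" is an unsigned byte and "V" is a 32-bit little-endian unsigned integer. The helper names (encode_insert, encode_match, read_records) are hypothetical and are not the FileEmitter or DeltaReader API.

```ruby
require "stringio"

INSERT = 0
MATCH  = 1

# INSERT record: type byte, uint32 length, then the raw data (no terminator).
def encode_insert(data)
  [INSERT, data.bytesize].pack("CV") + data.b
end

# MATCH record: type byte, uint32 start, uint32 length.
def encode_match(start, length)
  [MATCH, start, length].pack("CVV")
end

# Reads records from an IO-like object until end of input.
def read_records(io)
  records = []
  until io.eof?
    type = io.read(1).unpack1("C")
    case type
    when INSERT
      length = io.read(4).unpack1("V")
      records << [:insert, io.read(length)]
    when MATCH
      start, length = io.read(8).unpack("VV")
      records << [:match, start, length]
    else
      raise "unknown record type #{type}"
    end
  end
  records
end

delta = encode_match(0, 10) + encode_insert("red") + encode_match(15, 4)
puts delta.bytesize  # prints 26 (9 + 8 + 9 bytes)
p read_records(StringIO.new(delta))
# prints [[:match, 0, 10], [:insert, "red"], [:match, 15, 4]]
```

This sketch assumes well-formed input; a real reader would also need to handle a delta that is truncated in the middle of a record.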
apply_delta: A convenience method that takes a source data set (String-like), a delta input source (IO-like), and an output target (IO-like). It then wires together the necessary SuffixArrayDelta objects to re-create a file based on the source and delta, writing the results to out.
make_delta: A convenience method that takes a source data set (String-like), a target data set (String-like), and an output target (IO-like). It then wires together all of the objects in SuffixArrayDelta required to create a delta and write it to output.