Benutzer:Schnark/js/diff/core

aus Wikipedia, der freien Enzyklopädie
Zur Navigation springen Zur Suche springen

Documentation for the enhanced diff algorithm implemented in Benutzer:Schnark/js/diff.js/core.js.

To use this algorithm for displaying diffs of wiki pages, put

mw.loader.load('https://de.wikipedia.org/w/index.php?title=Benutzer:Schnark/js/diff.js&action=raw&ctype=text/javascript');

into your common.js. See Benutzer:Schnark/js/diff for a longer documentation (in German). The rest of this page documents the backend, which can be used in other scripts, too.

The algorithm is based on Heckel's and Cacycle's implementation of it.

Split old and new text into words

[Bearbeiten | Quelltext bearbeiten]

First both the old and the new text are split into words. A "word" is any string separated by whitespace or punctuation. Not all such Unicode characters are included, but the most important ones (at least I hope so).

Connect words in old and new text

[Bearbeiten | Quelltext bearbeiten]

Now we want to connect equal words in the old and the new text. The basic principle is this: For each word in the old and the new text we look how often it appears. First we connect all unique words, then neighbours to already connected words if they are equal. If a word is too short, we just append the next word until it is long enough. Please note that this requires the splitting into words being done for both texts in the same way (i.e. we shouldn't split one A-BC and the other AB-C). We don't take fragments of whitespace into consideration, remember the old proverb "One whitespace doesn't make connection." This approach has two problems: If a word occurs twice (or even more often) the algorithm is confused, and secondly when connecting neighboured words a word could belong to the block left of it or to the block right to it, the output would depend on the first direction.

The second problem we solve this way: We begin connecting to the right, but stop when we reach a new line sign, as this is a good break. Then we connect to the left. At last we connect to the right again, now without stopping. In addition we connect the first two words if they are equal when going from left to right, likewise the last two. We could have canceled the common suffix and prefix right at the beginning, but then the prefix and the suffix might be too greedy.

The first problem is a bit more difficult. I finally[1] decided to use the idea from Cacycle's implementation: We look for unresolved islands in the old text and their equivalent in the new text (determined by the borders). If it is an island in the new text too, we recursively connect the words in only these two islands.

As output we have now two arrays, one for the old words which contains either -1 (if the word was deleted) or the index in the new words. Similar the array for the new words contains either -1 (if the word was inserted) or the index in the old words.

Determining blocks

[Bearbeiten | Quelltext bearbeiten]

For each old word we now determin the block number. For deleted words this is -1, for all others an increasing integer. Block borders are next to deleted words and next to words that are not neighboured in the new text.

Next we determin the block numbers in the new text. Again it is -1 for inserted words and the block number in the old text for all others.

We are interested in the sequence of the new blocks, i. e. without -1 and repetitions. If no blocks were moved this is just 0, 1, …, n, but if some blocks were moved we get a permutation of this. To say which blocks were moved and which stayed in there place we will take the longest increasing subsequence. For this a decreasing subsequence of following numbers doesn't matter, so for something like k, k − 1, k − 2 we will only take the longest block. This is to prevent a longer block to be marked as moved round a shorter one.

Now we are ready for the first output: The blocks in the longest increasing subsequence are those that stayed the same, all other blocks were moved or (if they are labeled with -1) inserted/deleted. Short block moves are changed into a deletion and an insertion.

Sometimes a block is moved and a word is changed inside this block. If you want the algorithm can show this nested changes since version 1.12. We look for two consecutive moves with an optional deletion where the blocks were moved from, which end up again next to each other with an optional insertion in between.

Until now we considered the diff only on the level of words. Now we look for following blocks of deletion and insertion and generate a diff on character level with just the same algorithm. The only change is that when looking for unique words we don't look at the single characters (they are to short) but at sequences of letters.

Clean up breaks

[Bearbeiten | Quelltext bearbeiten]

If you changed

[[Category:A]]
[[Category:C]]

into

[[Category:A]]
[[Category:B]]
[[Category:C]]

the algorithm will tell you, that you inserted

            ]]
[[Category:B

or

           B]]
[[Category:

so next we look for blocks between two blocks that stayed equal. If they have a common prefix or suffix we move this a bit such that the block borders and a line or word break coincide. If there are no such breaks, we look if we can get at least a "stronger" break if we move the whole part.

The code provides an object schnarkDiff, which contains all functions. There are two ways to get access to this object:

Either you load the script via $.getScript or some similar mechanism that will inform you when the script has finished loading. After the script has loaded, you can access the object as mw.libs.schnarkDiff (or mediaWiki.libs.schnarkDiff).

Or you add a function to the hook mw.hook('userjs.load-script.diff-core'). It will be called once the script has finished loading, the object will be the first (and only) parameter of this function.

The configuration can be accessed through schnarkDiff.config.get(), resp. schnarkDiff.config.set().

property default meaning
charDiff 1 simplify diff on character level, number of passes, 0 to disable
wordDiffQual 0.3 quality of a diff on character level, 0 to always show a diff on character level whenever possible, 1 to show it only in very simple cases
recursion 5 maximal recursion level, 1 to disable
tooShort 1 length of a word too short to be linked on uniqueness
smallRegion 9 length (in words/chars) of a region where such a word is considered as long enough
minMovedLength 10 minimal length of a moved block
for HTML output
indicateMoves 1 0: don't show moves (i.e. treat them as a deletion and insertion), 1: show moved text only at new position (and a placeholder at the old), 2: show moved text at both the new and old position (like deletions and insertions, but with different colors)
showPar '¶' sign for deleted or inserted paragraphs
movedLeft '^' sign for blocks moved left
movedRight '^' sign for blocks moved right
lengthOmit 100 middle part of blocks of at least this length should be omitted
lengthOmitPar 20 characters to show at least before/after an omission at a paragraph
lengthOmitSpace 40 characters to show at least before/after an omission at a space
lengthOmitOther 50 characters to show at least before/after an omission inside a (very long) word
lengthOmitJoin 20 minimal number of characters that must be omitted, otherwise blocks are joined again

The CSS for HTML output can be accessed by schnarkDiff.style.get(), resp. schnarkDiff.style.set().

property meaning
equal equal text
omit omitted text
ins inserted text
del deleted text
movedDel deleted text that was moved somewhere else
movedIns inserted text that was moved from somewhere else
movedFrom place where a block was moved from (a link to the new place)
movedTo moved text
movedHover moved text (when the user hovers over the original or new place, you have to call addEvents to make it work)
movedTarget moved text (when the user clicked on the link)
blocks array of styles for different blocks
repeatBlocks how often to repeat the colors for blocks (i.e. if set to 2, and blocks contains 5 entries, there will be styles for 10 blocks)
additional additional CSS code

All public functions are members of schnarkDiff.

Returns CSS code (string) that can be used with mw.util.addCSS.

Parameters: old and new text (string), flag whether to output nested blocks (boolean)

Returns the diff in an array. Each entry is an array of the form ['text', 'action'] where 'text' is the text of the block and action is one of the following symbols:

symbol meaning
= text stayed the same
+ text was inserted
- text was deleted
: text was moved from here
. text was moved to here
< text was moved to here, begin of a nested block
> text was moved to here, end of a nested block

For the actions :, . and < there is a third entry, indicating the block number (each occuring twice).

Note that it may happen, that text is an empty string.

Parameters: old and new text (string), flag whether to output nested blocks (boolean)

Returns the diff as HTML (string).

Note that nested blocks only make sense with indicateMoves of value 1. Don't expect anything sensible if you ignore this.

Parameters: jQuery object of diff

Adds event handlers to handle hovering over moved text.

{config|style}.get

[Bearbeiten | Quelltext bearbeiten]

Parameters: key (string)/array of keys/nothing

Gets the value, a hash for all keys in the array, or a hash of all key-value-pairs.

{config|style}.set

[Bearbeiten | Quelltext bearbeiten]

Parameters: key (string)/hash, value (if the first parameter isn't a hash)

Sets the value for the key, or values for all keys provided by the hash.

Parameters: character or character range (string), display (string)

Adds a character to provide a substitute (possibly with tooltip) for in HTML diff.

Parameters: separators (string)

The given string will be used as character class in a regular expression to split the text into words.

The code can be found at Benutzer:Schnark/js/diff.js/core.js. Benutzer:Schnark/js/diff.js/core.js/test.js provides some tests for the algorithm.

  1. Look at the history of this documentation and of the script, if you want to see other ideas.