Benutzer:Schnark/js/diff/core

Documentation for the enhanced diff algorithm implemented in Benutzer:Schnark/js/diff.js/core.js.

Use

To use this algorithm for displaying diffs of wiki pages, put

mw.loader.load('https://de.wikipedia.org/w/index.php?title=Benutzer:Schnark/js/diff.js&action=raw&ctype=text/javascript');

into your common.js. See Benutzer:Schnark/js/diff for a longer documentation (in German). The rest of this page documents the backend, which can be used in other scripts, too.

Algorithm

The algorithm is based on Heckel's and Cacycle's implementation of it.

Split old and new text into words

First both the old and the new text are split into words. A "word" is any string separated by whitespace or punctuation. Not all such Unicode characters are included, but the most important ones (at least I hope so).

Connect words in old and new text

Now we want to connect equal words in the old and the new text. The basic principle is this: For each word in the old and the new text we look how often it appears. First we connect all unique words, then neighbours to already connected words if they are equal. If a word is too short, we just append the next word until it is long enough. Please note that this requires the splitting into words being done for both texts in the same way (i.e. we shouldn't split one A-BC and the other AB-C). We don't take fragments of whitespace into consideration, remember the old proverb "One whitespace doesn't make connection." This approach has two problems: If a word occurs twice (or even more often) the algorithm is confused, and secondly when connecting neighboured words a word could belong to the block left of it or to the block right to it, the output would depend on the first direction.

The second problem we solve this way: We begin connecting to the right, but stop when we reach a new line sign, as this is a good break. Then we connect to the left. At last we connect to the right again, now without stopping. In addition we connect the first two words if they are equal when going from left to right, likewise the last two. We could have canceled the common suffix and prefix right at the beginning, but then the prefix and the suffix might be too greedy.

The first problem is a bit more difficult. I finally^[1] decided to use the idea from Cacycle's implementation: We look for unresolved islands in the old text and their equivalent in the new text (determined by the borders). If it is an island in the new text too, we recursively connect the words in only these two islands.

As output we have now two arrays, one for the old words which contains either -1 (if the word was deleted) or the index in the new words. Similar the array for the new words contains either -1 (if the word was inserted) or the index in the old words.

Determining blocks

For each old word we now determin the block number. For deleted words this is -1, for all others an increasing integer. Block borders are next to deleted words and next to words that are not neighboured in the new text.

Next we determin the block numbers in the new text. Again it is -1 for inserted words and the block number in the old text for all others.

We are interested in the sequence of the new blocks, i. e. without -1 and repetitions. If no blocks were moved this is just 0, 1, …, n, but if some blocks were moved we get a permutation of this. To say which blocks were moved and which stayed in there place we will take the longest increasing subsequence. For this a decreasing subsequence of following numbers doesn't matter, so for something like k, k − 1, k − 2 we will only take the longest block. This is to prevent a longer block to be marked as moved round a shorter one.

First output

Now we are ready for the first output: The blocks in the longest increasing subsequence are those that stayed the same, all other blocks were moved or (if they are labeled with -1) inserted/deleted. Short block moves are changed into a deletion and an insertion.

Nested blocks

Sometimes a block is moved and a word is changed inside this block. If you want the algorithm can show this nested changes since version 1.12. We look for two consecutive moves with an optional deletion where the blocks were moved from, which end up again next to each other with an optional insertion in between.

Character diff

Until now we considered the diff only on the level of words. Now we look for following blocks of deletion and insertion and generate a diff on character level with just the same algorithm. The only change is that when looking for unique words we don't look at the single characters (they are to short) but at sequences of letters.

Clean up breaks

If you changed

[[Category:A]]
[[Category:C]]

into

[[Category:A]]
[[Category:B]]
[[Category:C]]

the algorithm will tell you, that you inserted

            ]]
[[Category:B

or

           B]]
[[Category:

so next we look for blocks between two blocks that stayed equal. If they have a common prefix or suffix we move this a bit such that the block borders and a line or word break coincide. If there are no such breaks, we look if we can get at least a "stronger" break if we move the whole part.

Access to code

The code provides an object schnarkDiff, which contains all functions. There are two ways to get access to this object:

Either you load the script via $.getScript or some similar mechanism that will inform you when the script has finished loading. After the script has loaded, you can access the object as mw.libs.schnarkDiff (or mediaWiki.libs.schnarkDiff).

Or you add a function to the hook mw.hook('userjs.load-script.diff-core'). It will be called once the script has finished loading, the object will be the first (and only) parameter of this function.

Configuration

The configuration can be accessed through schnarkDiff.config.get(), resp. schnarkDiff.config.set().

property	default	meaning
`charDiff`	`1`	simplify diff on character level, number of passes, 0 to disable
`wordDiffQual`	`0.3`	quality of a diff on character level, 0 to always show a diff on character level whenever possible, 1 to show it only in very simple cases
`recursion`	`5`	maximal recursion level, 1 to disable
`tooShort`	`1`	length of a word too short to be linked on uniqueness
`smallRegion`	`9`	length (in words/chars) of a region where such a word is considered as long enough
`minMovedLength`	`10`	minimal length of a moved block
for HTML output
`indicateMoves`	`1`	0: don't show moves (i.e. treat them as a deletion and insertion), 1: show moved text only at new position (and a placeholder at the old), 2: show moved text at both the new and old position (like deletions and insertions, but with different colors)
`showPar`	`'¶'`	sign for deleted or inserted paragraphs
`movedLeft`	`'^'`	sign for blocks moved left
`movedRight`	`'^'`	sign for blocks moved right
`lengthOmit`	`100`	middle part of blocks of at least this length should be omitted
`lengthOmitPar`	`20`	characters to show at least before/after an omission at a paragraph
`lengthOmitSpace`	`40`	characters to show at least before/after an omission at a space
`lengthOmitOther`	`50`	characters to show at least before/after an omission inside a (very long) word
`lengthOmitJoin`	`20`	minimal number of characters that must be omitted, otherwise blocks are joined again

CSS

The CSS for HTML output can be accessed by schnarkDiff.style.get(), resp. schnarkDiff.style.set().

property	meaning
`equal`	equal text
`omit`	omitted text
`ins`	inserted text
`del`	deleted text
`movedDel`	deleted text that was moved somewhere else
`movedIns`	inserted text that was moved from somewhere else
`movedFrom`	place where a block was moved from (a link to the new place)
`movedTo`	moved text
`movedHover`	moved text (when the user hovers over the original or new place, you have to call `addEvents` to make it work)
`movedTarget`	moved text (when the user clicked on the link)
`blocks`	array of styles for different blocks
`repeatBlocks`	how often to repeat the colors for blocks (i.e. if set to 2, and `blocks` contains 5 entries, there will be styles for 10 blocks)
`additional`	additional CSS code

Functions

All public functions are members of schnarkDiff.

`getCSS`

Returns CSS code (string) that can be used with mw.util.addCSS.

`diff`

Parameters: old and new text (string), flag whether to output nested blocks (boolean)

Returns the diff in an array. Each entry is an array of the form ['text', 'action'] where 'text' is the text of the block and action is one of the following symbols:

symbol	meaning
`=`	text stayed the same
`+`	text was inserted
`-`	text was deleted
`:`	text was moved from here
`.`	text was moved to here
`<`	text was moved to here, begin of a nested block
`>`	text was moved to here, end of a nested block

For the actions :, . and < there is a third entry, indicating the block number (each occuring twice).

Note that it may happen, that text is an empty string.

`htmlDiff`

Parameters: old and new text (string), flag whether to output nested blocks (boolean)

Returns the diff as HTML (string).

Note that nested blocks only make sense with indicateMoves of value 1. Don't expect anything sensible if you ignore this.

`addEvents`

Parameters: jQuery object of diff

Adds event handlers to handle hovering over moved text.

`{config|style}.get`

Parameters: key (string)/array of keys/nothing

Gets the value, a hash for all keys in the array, or a hash of all key-value-pairs.

`{config|style}.set`

Parameters: key (string)/hash, value (if the first parameter isn't a hash)

Sets the value for the key, or values for all keys provided by the hash.

`addInvisible`

Parameters: character or character range (string), display (string)

Adds a character to provide a substitute (possibly with tooltip) for in HTML diff.

`setWordSep`

Parameters: separators (string)

The given string will be used as character class in a regular expression to split the text into words.

Code

The code can be found at Benutzer:Schnark/js/diff.js/core.js. Benutzer:Schnark/js/diff.js/core.js/test.js provides some tests for the algorithm.

Remarks

↑ Look at the history of this documentation and of the script, if you want to see other ideas.

[1] Look at the history of this documentation and of the script, if you want to see other ideas.

[1]

Benutzer:Schnark/js/diff/core

Inhaltsverzeichnis

Use

Algorithm

Split old and new text into words

Connect words in old and new text

Determining blocks

First output

Nested blocks

Character diff

Clean up breaks

Access to code

Configuration

CSS

Functions

`getCSS`

`diff`

`htmlDiff`

`addEvents`

`{config|style}.get`

`{config|style}.set`

`addInvisible`

`setWordSep`

Code

Remarks

Navigationsmenü

Benutzer:Schnark/js/diff/core

Use

Algorithm

Split old and new text into words

Connect words in old and new text

Determining blocks

First output

Nested blocks

Character diff

Clean up breaks

Access to code

Configuration

CSS

Functions

getCSS

diff

htmlDiff

addEvents

{config|style}.get

{config|style}.set

addInvisible

setWordSep

Code

Remarks

Navigationsmenü

Suche

`getCSS`

`diff`

`htmlDiff`

`addEvents`

`{config|style}.get`

`{config|style}.set`

`addInvisible`

`setWordSep`