R·ex / Zeng


音游狗、安全狗、攻城狮、业余设计师、段子手、苦学日语的少年。 MUGer, hacker, developer, amateur UI designer, punster, Japanese learner.

JSON Diff Kit: A Combination of Several Simple Algorithms

Background

Recently, I received a requirement: display a diff with 2 JSON strings, which are basically an API's request and response data. Furthermore, we should show the diff in a Git-like style.

Unfortunately, after a brief exploration, I found no suitable library for me.

No suitable library...

It seems I should implement it by myself.

A "Git" Way

The JSON string can be pretty informal. For example, these 2 JSON strings are equal:

{"a":1,"b":2}
{   "a" :1,   "b":   2}

Luckily, a powerful built-in function, JSON.stringify, can serialise a JavaScript object to a standard JSON string. Moreover, it can also format the string with the given indent level:

const original = `{   "a" :1,   "b":   2}`;
const formatted = JSON.stringify(JSON.parse(original), null, 4);
console.log(formatted);
// ...which will result in:
// {
//     "a": 1,
//     "b": 2
// }

The result looks pretty much like the lower-right item in the screenshot before. Let's find a proper way to perform the diff.

Since what I need to display is a Git-like style, how about using the diff algorithm used by Git directly?

Git uses the Myers algorithm to generate a beautiful diff result; maybe I can use a Git library. But I found two cons for this method quickly:

First, it can not handle the trailing comma perfectly; I had seen many conflicts in package.json when I started a merge request. In the above example, the a line is recognised as "different" only because of its trailing comma.

  {          |   {
-     "a": 1 |
             | +     "a": 1,
             | +     "b": 2
  }          |   }

Second, JSON.stringify can not handle the different key orders. For example:

// the results are different!
JSON.stringify(JSON.parse({"a":1,"b":2}))
JSON.stringify(JSON.parse({"b":2,"a":1}))

So the "Git" way failed; I should consider making a diff method myself.

A Basic JSON Diff Method

First of all, JSON has many types: number, string, boolean, null, object, array. If two values have different types, we can easily say they are different. There is no need to diff them recursively, just output "the left is removed, the right is added". Likewise, if they have the same primitive type, we can easily say they are different, just output "the left is removed, and the right is added".

Now we can only deal with two values with the same type object or array. Let's start with object first.

Object Diff Algorithm

Assume we have two objects, left and right, a direct diff algorithm can be:

  1. Sort all keys for left and right; assume the result is keysLeft and keysRight.
  2. Use two pointers p and q pointing at keysLeft[0] and keysRight[0].
  3. If there is no more value in keysLeft, output "the rest of keysRight is added"; if there is no more value in keysRight, output "the rest of keysLeft is removed".
  4. Compare the two pointed values:
  5. If p is smaller, p++, output "p is removed", add a blank line for q;
  6. If q is smaller, q++, output "q is added", add a blank line for p;
  7. If they are equal, increase them and...

    • If the values are also equal, output "equal";
    • If the values are not the same type, output "p is removed, and q is added".
    • If the values are both objects, diff them recursively.
    • Else (if the values are both arrays), use the array diff method (we will find it later).

Here comes an example:

[start]
        p: ↓
 keysLeft: a  d  f  g  i  j  k  l
keysRight: a  b  c  d  i
        q: ↑

[step 1]
        p:    ↓
 keysLeft: a  d  f  g  i  j  k  l
keysRight: a  b  c  d  i
        q:    ↑

[step 2]
        p:       ↓
 keysLeft: a  _  d  f  g  i  j  k  l
keysRight: a  b  c  d  i
        q:       ↑

[...various steps skipped]

[result]
        p:                               ↓
 keysLeft: a  _  _  d  f  g  i  j  k  l
keysRight: a  b  c  d  _  _  i  _  _  _
        q:                               ↑

This algorithm looks just like the merging process in the "merge sort" algorithm. By addling empty lines, it can ensure the left and right results have the same number of lines.

Array Diff Algorithm

Compared with the objects, the arrays seem easier to handle. Assume we have two arrays, left and right:

  1. Let p and q point to left[0] and right[0].
  2. If there is no more value in left, output "the rest of right is added"; if there is no more value in right, output "the rest of left is added".
  3. If p and q are different:

    • If p and q are both objects, call object diff;
    • If p and q are both arrays, diff them recursively;
    • Else, output "p is removed, and q is added".
  4. Else, output "equal".

The above algorithms should work for most cases. 🎉

Icing On The Cake

Although the algorithms can give a good result, the result may sometimes make us confusing.

Comply With Intuition

There may be multiple diff results for two JSONs:

- 1 |       - 1 |           | + 3
- 2 |           | + 3       | + 4
    | + 3   - 2 |           | + 5
    | + 4       | + 4   - 1 |
    | + 5       | + 5   - 2 |
   ✅          🤔           🤔

Obviously, the first one is the best. It complies with our intuition: 1) first "remove" then "add", 2) as many continuous lines with the same type as possible. That is what the output produced by the Mayers algorithm look like. Can we adjust our output to achieve this goal? I came out with a solution inspired by the "bubble sort" algorithm.

If we have a diff result, we might swap two lines to get a better result (in this example, we swap the line 2 and 3):

- 1 |        - 1 |
    | + 3    - 2 |
- 2 |     =>     | + 3
    | + 4        | + 4
    | + 5        | + 5

If two adjacent lines, x and y, x is [empty, added], y is [removed, empty], we can swap them. The result is a [removed, empty] followed by [empty, added].

The swap process may be executed several times. But like the bubble sort, if there is no swap in the current loop, the algorithm can be terminated early.

Better Matching Algorithm For Arrays

Sometimes, the difference between two arrays is due to the insertion or deletion of only a few items. For example, we should not tell "remove" in this situation:

- 2 |           | + 1
- 3 |         2 |   2
    | + 1     3 |   3
    | + 2
    | + 3
   🤔           ✅

A dynamic programming algorithm called LCS (short for Longest Common Subsequence) can help us find the longest common subsequence for two arrays. For example, the LCS is [2, 3] in the previous two arrays.

A fundamental way is we find the LCS, diff the LCS with two arrays. The extra items in a should be identified as "remove", those in b should be identified as "added".

But there is an easier way. During the LCS calculation, we use a backtrack matrix to record current status ("left", "up", "diag"). When backtracking, we can tell that the current item in a should be removed if the current status is "left" and that the current item in b should be added if the current status is "up".

A backtracking process for LCS

And then... Booyah!

Identifying Modifications

Git diff is a good way to display results, but we can do more with JSON strings. For example, if we receive {"a":1} and {"a":2}, we can tell that "the field a is changed from 1 to 2".

Let's review the object part to improve the algorithm:

If p and q are different:

  • ...
  • Output "p is removed, and q is added".

The output should be "p is modified to q".

The array part is a bit more complex. If we use normal matching, just output "a[i] is modified to b[i]" is enough. But if we use LCS matching, we should change the backtracking process a little.

We can tell that the current item in a should be removed if the current status is "left" and that the current item in b should be added if the current status is "up".

How do we recognise "there is an a remove followed by b add" when backtracking? The solution is easy: when encountering an "up", we take a step forward and see if the next step (actually, it's the "previous" step since we are backtracking) is a "left". If a "left" followed by an "up" appears in a path, we can tell that "a[i] is removed and b[j] is added". Then it can be optimised to "a[i] is modified to b[j]".

So far, we have found an algorithm good enough to handle the diff for JSON strings. We can output these fields for each line:

  • Type (add, modify, remove, equal);
  • Line number;
  • Indent level;
  • Text;
  • Whether to have a trailing comma.

The interface for diff results can be like this:

export interface DiffResult {
  level: number;
  type: 'modify' | 'add' | 'remove' | 'equal';
  text: string;
  comma?: boolean;
  lineNumber?: number;
}

The Differ & The Viewer

To decouple the diff and display logic, I write a Differ class to calculate the diff result and a Viewer component to display the diff result. I package these logics to a library json-diff-kit (a more complex usage is here), developers can use this library easily:

import { Differ } from 'json-diff-kit';

const before = {
  a: 1,
  b: 2,
  d: [1, 5, 4],
  e: ['1', 2, { f: 3, g: null, h: [5], i: [] }, 9],
};
const after = {
  b: 2,
  c: 3,
  d: [1, 3, 4, 6],
  e: ['1', 2, 3, { f: 4, g: false, i: [7, 8] }, 10],
};

// all configs are optional
const differ = new Differ({
  detectCircular: true,    // default `true`
  maxDepth: Infinity,      // default `Infinity`
  showModifications: true, // default `true`
  arrayDiffMethod: 'lcs',  // default `"normal"`
});

const diff = differ.diff(before, after);
import { Viewer } from 'json-diff-kit';
import type { DiffResult } from 'json-diff-kit';

import 'json-diff-kit/dist/viewer.css';

interface PageProps {
  diff: [DiffResult[], DiffResult[]];
}

const Page: React.FC<PageProps> = props => {
  return (
    <Viewer
      diff={props.diff}  // required
      indent={4}         // default `2`
      lineNumbers={true} // default `false`
    />
  );
};

And the result is clear:

A good diff result

Conclusion

I combine (or am inspired by) several simple algorithms:

  • Merging process in merge sort
  • Bubble sort
  • LCS & backtracking
  • Recursion

These algorithms is enough to make a good diff result for two JSON strings.

版权声明:除文章开头有特殊声明的情况外,所有文章均可在遵从 CC BY 4.0 协议的情况下转载。
上一篇: Golang 内存泄漏排查之旅
下一篇: 由 Use Zoom For DSF 导致的幽灵 Bug

这是我们共同度过的

第 2367 天