Michael Müller

SemanticDiff - Language aware diffs for VS Code

What I am building

SemanticDiff is a programming-language-aware diff for Visual Studio Code. The extension helps you understand code changes faster by hiding style-only changes and by detecting moved code blocks and refactorings.

I think it is easier to understand using the following image:
[Image: SemanticDiff]

If you want to see the features in action, try out the online demo. Simply select one of the existing examples or enter your own code.

Why I am building SemanticDiff

Most diff tools, such as git diff, detect changes between two versions of a file by comparing their text line by line. While this works well in many cases, it can produce a lot of noise if you reformat your code or perform other kinds of refactoring.

For example, splitting the parameters of a complex function call across multiple lines produces a diff that isn't very useful. It looks like you have completely replaced the code:

- verify_token = generate_token(user, models.TokenType.EmailVerification, datetime.timedelta(days=2), email=user.email)
+ verify_token = generate_token(
+     user,
+     models.TokenType.EmailVerification,
+     datetime.timedelta(days=3),
+     email=user.email,
+ )

You would now need to manually compare all the parameters to spot that I extended the duration from 2 to 3 days. Wouldn't it be great if your diff tool could filter out all the irrelevant changes, so that you can immediately see the changed parameter?

That is what SemanticDiff does. It makes reviewing code less tedious and more secure. By hiding style-only changes and highlighting modifications within moved code blocks, you are less likely to overlook anything important and you have to review less overall.

How does it work?

SemanticDiff implements a pipeline consisting of three stages:

[Image: SemanticDiff Pipeline]

1. Code Parsing

The old and new code are each parsed into an abstract syntax tree (AST). These trees contain all the information that a compiler or interpreter would need to compile or interpret your code. This approach gives us two advantages over the original text representation:

  1. We have additional information about the meaning of individual characters (e.g. that generate_token is the name of a function that is going to be called).
  2. All characters that don't have an effect on the program flow, like whitespace or line breaks outside of strings, are automatically filtered out.
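To make both points concrete, here is a minimal sketch using Python's built-in ast module. It only illustrates the idea and is not SemanticDiff's actual parser. The helper name normalized is made up for this example, and the duration is kept at 2 days in both versions so that only the formatting differs.

```python
import ast

# The same call written on one line and split across several lines.
OLD = (
    "verify_token = generate_token(user, models.TokenType.EmailVerification, "
    "datetime.timedelta(days=2), email=user.email)"
)
NEW_REFORMATTED = """verify_token = generate_token(
    user,
    models.TokenType.EmailVerification,
    datetime.timedelta(days=2),
    email=user.email,
)"""


def normalized(source: str) -> str:
    """Parse the source and dump its tree without line/column information."""
    return ast.dump(ast.parse(source))


# Advantage 2: whitespace, line breaks and the trailing comma are not part
# of the tree, so both versions produce the same dump.
same_structure = normalized(OLD) == normalized(NEW_REFORMATTED)

# Advantage 1: the tree knows which names are functions being called.
called_names = [
    node.func.id
    for node in ast.walk(ast.parse(OLD))
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
]

print(same_structure)
print(called_names)
```

Changing days=2 to days=3 in one of the two versions would make the dumps differ, which is exactly the kind of change that should remain visible.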

2. Tree Matching

The nodes of the old and new tree are matched to identify which parts of the code have changed and which are still the same. This involves comparing all the nodes of the old and new tree to find those that are identical.
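A very simplified version of this step could look like the sketch below. It assumes that two nodes match only if their entire subtrees are identical, which is much cruder than what a real tree matcher has to do. The function name match_identical_subtrees and the sample statements are made up for illustration.

```python
import ast
from collections import defaultdict


def match_identical_subtrees(old_src: str, new_src: str):
    """Pair up nodes whose whole subtrees are structurally identical."""
    # Index every old node by the dump of its subtree.
    old_by_shape = defaultdict(list)
    for node in ast.walk(ast.parse(old_src)):
        old_by_shape[ast.dump(node)].append(node)

    pairs, unmatched_new = [], []
    for node in ast.walk(ast.parse(new_src)):
        candidates = old_by_shape.get(ast.dump(node))
        if candidates:
            # Consume one old node so it cannot be matched twice.
            pairs.append((candidates.pop(), node))
        else:
            unmatched_new.append(node)
    return pairs, unmatched_new


OLD = "timeout = datetime.timedelta(days=2)"
NEW = "timeout = datetime.timedelta(\n    days=3,\n)"

pairs, unmatched_new = match_identical_subtrees(OLD, NEW)

# The only unmatched leaf value in the new tree is the changed constant.
changed_constants = [
    node.value for node in unmatched_new if isinstance(node, ast.Constant)
]
print(changed_constants)
```

Note that every ancestor of the changed constant (the keyword argument, the call, the assignment) is also reported as unmatched here, because its subtree is no longer identical. A practical matcher needs to pair such partially changed nodes as well.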

Since we know the structure of the code, we can implement a more advanced comparison than a normal text diff. For example, the following three Python statements look quite different when comparing the code character by character, but it is easy to verify with the AST that they are identical.

a = "Hello\nWorld"
a = 'Hello\n' \
    'World'
a = """Hello
World"""
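You can check this yourself with Python's built-in ast module. The snippet below is only a small demonstration of the claim, not how SemanticDiff compares nodes internally. The variable names are made up for this example.

```python
import ast

# The three statements from above, stored as raw strings so that the
# escape sequences reach the parser unchanged.
SINGLE_LINE = r'''a = "Hello\nWorld"'''
CONCATENATED = r'''a = 'Hello\n' \
    'World'
'''
TRIPLE_QUOTED = r'''a = """Hello
World"""'''

# ast.dump() omits line and column numbers by default, so only the
# structure and the string value are compared.
dumps = {
    ast.dump(ast.parse(source))
    for source in (SINGLE_LINE, CONCATENATED, TRIPLE_QUOTED)
}

# A set with a single element means all three trees are identical.
print(len(dumps) == 1)
```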

3. Text Diff Generation

The last step is to create a side-by-side diff so that a developer has an easy way to understand what has changed. This involves aligning the old and new source code using the generated mapping. All old nodes that cannot be found in the new tree are marked as deleted in the diff, and all new nodes that cannot be found in the old tree are marked as added.

Since we allow mapping any pair of nodes across the two trees, some parts of the code cannot be aligned properly. This occurs, for example, when a block of code has been moved. These cases require special handling.
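The marking part of this step can be sketched as follows. This is a simplified illustration under a few assumptions: it only looks at names and constants, it matches nodes by exact equality, and it does no alignment or move handling at all. The names leaves, unmatched and mark_changes are made up for this example.

```python
import ast
from collections import Counter

# Assumption for this sketch: only names and constants are compared.
LEAF_TYPES = (ast.Name, ast.Constant)


def leaves(source: str):
    """Return all name and constant nodes of the parsed source."""
    return [n for n in ast.walk(ast.parse(source)) if isinstance(n, LEAF_TYPES)]


def unmatched(nodes, other_nodes):
    """Return the nodes that have no identical counterpart in other_nodes."""
    available = Counter(ast.dump(n) for n in other_nodes)
    result = []
    for node in nodes:
        key = ast.dump(node)
        if available[key] > 0:
            available[key] -= 1  # each counterpart can be used only once
        else:
            result.append(node)
    return result


def mark_changes(old_src: str, new_src: str):
    """List (kind, line, column, text) for every deleted or added leaf."""
    old_leaves, new_leaves = leaves(old_src), leaves(new_src)
    deleted = [
        ("deleted", n.lineno, n.col_offset, ast.get_source_segment(old_src, n))
        for n in unmatched(old_leaves, new_leaves)
    ]
    added = [
        ("added", n.lineno, n.col_offset, ast.get_source_segment(new_src, n))
        for n in unmatched(new_leaves, old_leaves)
    ]
    return deleted + added


OLD = "token = generate_token(user, datetime.timedelta(days=2))"
NEW = "token = generate_token(\n    user,\n    datetime.timedelta(days=3),\n)"

changes = mark_changes(OLD, NEW)
for change in changes:
    print(change)
```

The line and column stored with each node are what allow the marked nodes to be mapped back onto the original source text for display.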

Want to try it out?

SemanticDiff will soon enter closed beta. If you are interested in better diffs integrated directly into VS Code, join the waitlist. You get a notification email as soon as the beta starts.

In the meantime you can play around with the online demo.

Also let me know if you want to know more about the inner workings of SemanticDiff 😃️.
