DEV Community

Herrington Darkholme
Herrington Darkholme

Posted on

Search Multi-language Documents in ast-grep

Introduction

ast-grep is a powerful structural code search tool that leverages syntax trees to find patterns in source code. This makes it more accurate and flexible than traditional text-based search tools.

ast-grep works well searching files of one single language, but it is hard to extract a sub language embedded inside a document.

However, in modern development, it's common to encounter multi-language documents. These are source files containing code written in multiple different languages. Notable examples include:

  • HTML files: These can contain JavaScript inside <script> tags and CSS inside <style> tags.
  • JavaScript files: These often contain regular expression, CSS style and query languages like graphql.
  • Ruby files: These can contain snippets of code inside heredoc literals, where the heredoc delimiter often indicates the language.

These multi-language documents can be modeled in terms of a parent syntax tree with one or more injected syntax trees residing inside certain nodes of the parent tree.

ast-grep now supports a feature to handle language injection, allowing you to search for code written in one language within documents of another language.

ast-grep lang diagram

This concept and terminology come from tree-sitter's language injection, which implies you can inject another language into a language document. (BTW, neovim also embraces this terminology.)

Example: Search JS/CSS in the CLI

Let's start with a simple example of searching for JavaScript and CSS within HTML files using ast-grep's command-line interface (CLI).
ast-grep has builtin support to search JavaScript and CSS inside HTML files.

Using sg run: find patterns of CSS in an HTML file

Suppose we have an HTML file like below:



<style>
  h1 { color: red; }
</style>
<h1>
  Hello World!
</h1>
<script>
  alert('hello world!')
</script>


Enter fullscreen mode Exit fullscreen mode

Running this ast-grep command will extract the matching CSS style code out of the HTML file!



sg run -p 'color: $COLOR'


Enter fullscreen mode Exit fullscreen mode

ast-grep outputs this beautiful CLI report.



./test.html
2│  h1 { color: red; }


Enter fullscreen mode Exit fullscreen mode

ast-grep works well even if just providing the pattern without specifying the pattern language!

Using sg scan: find JavaScript in HTML with rule files

You can also use ast-grep's rule file to search injected languages.

For example, we can warn the use of alert in JavaScript, even if it is inside the HTML file.



id: no-alert
language: JavaScript
severity: warning
rule:
  pattern: alert($MSG)
message: Prefer use appropriate custom UI instead of obtrusive alert call.


Enter fullscreen mode Exit fullscreen mode

The rule above will detect usage of alert in JavaScript. Running the rule via sg scan.



sg scan --rule no-alert.yml


Enter fullscreen mode Exit fullscreen mode

The command leverages built-in behaviors in ast-grep to handle language injection seamlessly. It will produce the following warning message for the HTML file above.



warning[no-alert]: Prefer use appropriate custom UI instead of obtrusive alert call.
  ┌─ ./test.html:8:3
  │
8 │   alert('hello world!')
  │   ^^^^^^^^^^^^^^^^^^^^^


Enter fullscreen mode Exit fullscreen mode

How language injections work?

ast-grep employs a multi-step process to handle language injections effectively. Here's a detailed breakdown of the workflow:

  1. File Discovery: The CLI first discovers files on the disk via the venerable ignore crate, the same library under ripgrep's hood.

  2. Language Inference: ast-grep infers the language of each discovered file based on file extensions.

  3. Injection Extraction: For documents that contain code written in multiple languages (e.g., HTML with embedded JS), ast-grep extracts the injected language sub-regions. At the moment, ast-grep handles HTML/JS/CSS natively.

  4. Code Matching: ast-grep matches the specified patterns or rules against these regions. Pattern code will be interpreted according to the injected language (e.g. JS/CSS), instead of the parent document language (e.g. HTML).

Customize Language Injection: styled-components in JavaScript

You can customize language injection via the sgconfig.yml configuration file. This allows you to specify how ast-grep handles multi-language documents based on your specific needs, without modifying ast-grep's built-in behaviors.

Let's see an example of searching CSS code in JavaScript. styled-components is a library for styling React applications using CSS-in-JS. It allows you to write CSS directly within your JavaScript via tagged template literals, creating styled elements as React components.

The example will configure ast-grep to detect styled-components' CSS.

Injection Configuration

You can add the languageInjections section in the project configuration file sgconfig.yml.



languageInjections:
- hostLanguage: js
  rule:
    pattern: styled.$TAG`$CONTENT`
  injected: css


Enter fullscreen mode Exit fullscreen mode

Let's break the configuration down.

  1. hostLanguage: Specifies the main language of the document. In this example, it is set to js (JavaScript).

  2. rule: Defines the ast-grep rule to identify the injected language region within the host language.

- `pattern`: The pattern matches styled components syntax where `styled` is followed by a tag (e.g., `button`, `div`) and a template literal containing CSS.
- the rule should have a meta variable `$CONTENT` to specify the subregion of injected language. In this case, it is the content inside the template string.
Enter fullscreen mode Exit fullscreen mode
  1. injected: Specifies the injected language within the identified regions. In this case, it is css.

Example Match

Consider a JSX file using styled components:



import styled from 'styled-components';

const Button = styled.button`
  background: red;
  color: white;
  padding: 10px 20px;
  border-radius: 3px;
`

exporrt default function App() {
  return <Button>Click Me</Button>
}


Enter fullscreen mode Exit fullscreen mode

With the above languageInjections configuration, ast-grep will:

  1. Identify the styled.button block as a CSS region.
  2. Extract the CSS code inside the template literal.
  3. Apply any CSS-specific pattern searches within this extracted region.

You can search the CSS inside JavaScript in the project configuration folder using this command:



sg -p 'background: $COLOR' -C 2


Enter fullscreen mode Exit fullscreen mode

It will produce the match result:



./styled.js
2│
3│const Button = styled.button`
4│  background: red;
5│  color: white;
6│  padding: 10px 20px;
```

## Using Custom Language with Injection

Finally, let's look at an example of searching for GraphQL within JavaScript files.
This demonstrates ast-grep's flexibility in handling custom language injections.

### Define graphql custom language in `sgconfig.yml`.

First, we need to register graphql as a custom language in ast-grep. See [custom language reference](https://ast-grep.github.io/advanced/custom-language.html) for more details.

```yaml
customLanguages:
  graphql:
    libraryPath: graphql.so # the graphql tree-sitter parser dynamic library
    extensions: [graphql]   # graphql file extension
    expandoChar: $          # see reference above for explanation
```

### Define graphql injection in `sgconfig.yml`.

Next, we need to customize what region should be parsed as graphql string in JavaScript. This is similar to styled-components example above.

```yaml
languageInjections:
- hostLanguage: js
  rule:
    pattern: graphql`$CONTENT`
  injected: graphql
```

### Search GraphQL in JavaScript

Suppose we have this JavaScript file from [Relay](https://relay.dev/), a GraphQL client framework.

```js
import React from "react"
import { graphql } from "react-relay"

const artistsQuery = graphql`
  query ArtistQuery($artistID: String!) {
    artist(id: $artistID) {
      name
      ...ArtistDescription_artist
    }
  }
`
```

We can search the GraphQL fragment via this `--inline-rules` scan.

```sh
sg scan --inline-rules="{id: test, language: graphql, rule: {kind: fragment_spread}}"
```

Output

```sh
help[test]:
  ┌─ ./relay.js:8:7
  │
8 │       ...ArtistDescription_artist
  │       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
```

## More Possibility to be Unlocked...

By following these steps, you can effectively use ast-grep to search and analyze code across multiple languages within the same document, enhancing your ability to manage and understand complex codebases.

This feature extends to various frameworks like [Vue](https://vuejs.org/) and [Svelte](https://svelte.dev/), enables searching for [SQL in React server actions](https://x.com/peer_rich/status/1717609270475194466), and supports new patterns like [Vue-Vine](https://x.com/hd_nvim/status/1815300932793663658).

Hope you enjoy the feature! Happy ast-grepping!
Enter fullscreen mode Exit fullscreen mode

Top comments (0)