Attila Molnar

Posted on May 15, 2023

Search in your Jupyter notebooks from the CLI, fast.

#jupyter #datascience #python #cli

My colleagues and I have written a large number of Jupyter notebooks, and searching them efficiently has been a recurring problem.

Jupyter notebooks are JSON files, so using traditional search tools such as grep on them is tedious. At least without a bit of tinkering 😉

Initially, I found a script on the internet called "nbgrep", but it did not work out for us. So, I wrote my own version.

It requires jq for JSON processing and GNU parallel for concurrent searches in the notebooks.

These are awesome tools anyway and can be very handy for data scientists. jq makes it easy to write queries against JSON files, while parallel can be used to execute any kind of code concurrently, or even on multiple machines (using ssh) in a very simple way.

They are easy to install:

Debian and friends:

sudo apt-get install jq parallel

macOS:

brew install jq parallel

You can find my script as a gist here, or if you cannot install parallel for some reason, here is the non-parallel version. I wrote a non-parallel version specifically for this post, so please notify me if something is wrong.

So when you run it as:

nbgrep 'read_[a-z]'

you will get something like:

./foo/bar.ipynb
        df_a = pd.read_csv(
        df_b = pd.read_csv(
./foobar/barfoo.ipynb
        G = obonet.read_obo(url)

(I had to rename most of the results.)

The script:

#!/bin/bash
set -euo pipefail

catch() {
  echo "ERROR $1 occurred on $2"
}
trap 'catch $? $LINENO' ERR

pattern="${1? You must provide a search pattern}"

jupyter-search() {
  file="$1"
  pattern="$2"

  matches=$(< "$file" jq '.cells[].source[]' -r \
    | grep -P "$pattern" \
    | xargs -I '%' echo -e "\t%"
  )

  if [ ! -z "$matches" ]
  then
    echo "$file"
    echo "$matches"
  fi

}
export -f jupyter-search

find . \
  -type 'f' \
  -iname '*.ipynb' \
  -not -path '*/.ipynb_checkpoints/*' \
  | parallel jupyter-search {} "$pattern"

Now let's see how the script works.

#!/bin/bash
set -euo pipefail

#!/bin/bash tells the kernel which interpreter to run the script with. set -euo pipefail is often called bash's "strict mode". By default, bash keeps executing after a command fails and silently expands undefined variables to empty strings. With -e the script stops at the first failing command, and with -u a reference to an undefined variable becomes an error. -o pipefail makes a pipeline return a non-zero exit code if any command in it fails, not only the last one.
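To see what these options change, here is a minimal sketch that is not part of the script. The variable names are only for illustration, and the -u check runs in a child shell so that the expected failure does not end the demo.

```bash
#!/bin/bash
# Without pipefail, only the last command of a pipeline decides its status.
set +o pipefail
false | true
without_pipefail=$?

# With pipefail, a failure in any stage makes the whole pipeline fail.
set -o pipefail
false | true
with_pipefail=$?

echo "without pipefail: $without_pipefail, with pipefail: $with_pipefail"

# -u turns a reference to an unset variable into an error.
if bash -u -c 'echo "$undefined_variable"' 2>/dev/null; then
  nounset_result="no error"
else
  nounset_result="error"
fi
echo "-u on an unset variable: $nounset_result"
```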

catch() {
  echo "ERROR $1 occurred on $2"
}
trap 'catch $? $LINENO' ERR

By trapping errors this way, we can see which exit code ($?) the failing command returned and on which line ($LINENO) the error occurred.
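A small way to try the trap in isolation, as a sketch: the throwaway script below is hypothetical, written to a temporary file, and fails on purpose with the false command. It should print a line of the form "ERROR 1 occurred on <line>" and never reach the final echo.

```bash
#!/bin/bash
# Write a throwaway script that installs the same trap and then fails.
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#!/bin/bash
set -euo pipefail
catch() {
  echo "ERROR $1 occurred on $2"
}
trap 'catch $? $LINENO' ERR
false                 # fails with exit status 1
echo "never reached"
EOF

# "|| true" keeps the expected failure from affecting this demo.
output=$(bash "$demo_script" || true)
rm -f "$demo_script"
echo "$output"
```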

pattern="${1? You must provide a search pattern}"

This is the search pattern to look for in the notebooks. It can be any Perl-compatible regular expression that grep -P accepts (the -P option is specific to GNU grep). The ${1? ...} expansion also aborts the script with a helpful error message if no pattern is given.
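The behaviour of this expansion can be tried in isolation. In the sketch below, require_pattern is a hypothetical helper that exists only for the demo. The failing call runs inside a command substitution, so only that subshell is aborted.

```bash
#!/bin/bash
# Hypothetical helper wrapping the same expansion as the script.
require_pattern() {
  pattern="${1? You must provide a search pattern}"
  echo "searching for: $pattern"
}

# With an argument, the expansion simply yields the argument.
with_arg=$(require_pattern 'read_[a-z]')
echo "$with_arg"

# Without one, bash prints the message to stderr and exits the subshell
# with a non-zero status.
missing_msg=$( { require_pattern; } 2>&1 ) && missing_status=0 || missing_status=$?
echo "status: $missing_status, message: $missing_msg"
```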

jupyter-search() {
  file="$1"
  pattern="$2"

  matches=$(< "$file" jq '.cells[].source[]' -r \
    | grep -P "$pattern" \
    | xargs -I '%' echo -e "\t%"
  )

  if [ ! -z "$matches" ]
  then
    echo "$file"
    echo "$matches"
  fi
}

This is a bash function definition. The function takes two arguments: $file, read from the first positional argument, and the search $pattern, read from the second one. Let's concentrate on the search part:

  matches=$(< "$file" jq '.cells[].source[]' -r \
    | grep -P "$pattern" \
    | xargs -I '%' echo -e "\t%"
  )

Here, the output of the pipeline (commands connected by pipes |) will be assigned to the matches variable.

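The first command of the pipeline feeds the notebook into the jq JSON processor, which prints the source lines of every cell (markdown cells as well as code cells). The -r flag makes jq print raw text instead of JSON-quoted strings. These lines are piped into grep, which applies the given $pattern as a Perl-compatible regular expression (-P). The last command in the pipeline indents each match found by grep with a tab. Note that xargs interprets quotes and backslashes in its input, so quote characters in a matched line may be missing from the output.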

The if statement at the end of the function will print the results, given that $matches is not an empty string.
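If you only care about code cells, or want matched lines to keep their quotes, the pipeline can be adapted. The following is a sketch of one possible variant, not the script from the gist. The two-cell notebook is a hypothetical sample reduced to the fields the query touches, the jq filter selects cells by their cell_type field, and sed replaces xargs for the indentation.

```bash
#!/bin/bash
# Hypothetical sample notebook with one markdown cell and one code cell.
notebook=$(mktemp)
cat > "$notebook" <<'EOF'
{"cells": [
  {"cell_type": "markdown", "source": ["# We call read_csv below\n"]},
  {"cell_type": "code", "source": ["import pandas as pd\n", "df = pd.read_csv('data.csv')\n"]}
]}
EOF

# Keep only code cells, search their source lines, then indent each match.
# sed copies the line as-is, so quotes and backslashes survive.
matches=$(jq -r '.cells[] | select(.cell_type == "code") | .source[]' "$notebook" \
  | grep -P 'read_[a-z]' \
  | sed $'s/^/\t/')
rm -f "$notebook"
echo "$matches"
```

Here the markdown cell also mentions read_csv, but only the line from the code cell is reported.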

export -f jupyter-search

parallel runs each job in a new shell process, which inherits only what the parent shell (the one executing the script) has exported. Functions are not exported by default, so export -f is needed to make jupyter-search available in the shells started by parallel.
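The effect of export -f can be checked without parallel. In this sketch a plain bash -c stands in for the shell that parallel would start, and greet is a hypothetical function used only for the demo.

```bash
#!/bin/bash
# Hypothetical function, defined only in the current shell for now.
greet() { echo "hello from $1"; }

# Not exported yet: a freshly started bash does not know the function.
if bash -c 'greet child' 2>/dev/null; then
  before="found"
else
  before="not found"
fi

# After exporting, child bash processes can call it.
export -f greet
after=$(bash -c 'greet child')

echo "before export: $before"
echo "after export: $after"
```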

find . \
  -type 'f' \
  -iname '*.ipynb' \
  -not -path '*/.ipynb_checkpoints/*' \
  | parallel jupyter-search {} "$pattern"

This code finds all regular files (so no symlinks, directories or device files) whose name ends in .ipynb, matched case-insensitively. It searches recursively from the directory where you started the script, skipping only the .ipynb_checkpoints directories. The notebook paths are streamed into parallel, which calls the jupyter-search function on each of them with the given search $pattern, running by default one job per CPU core. All of these processes write to your standard output, but parallel groups the output of each job, so results from different notebooks are not interleaved.
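To check what the find filter selects, here is a sketch that runs it against a hypothetical directory tree created under a temporary directory. It uses -print0 with a null-delimited read loop, which is an assumption of this example rather than what the script does, so that paths containing spaces are handled safely.

```bash
#!/bin/bash
# Build a hypothetical tree: two notebooks, one checkpoint copy, one text file.
workdir=$(mktemp -d)
mkdir -p "$workdir/project/.ipynb_checkpoints" "$workdir/my notes"
touch "$workdir/project/analysis.ipynb" \
      "$workdir/project/.ipynb_checkpoints/analysis-checkpoint.ipynb" \
      "$workdir/my notes/Draft.IPYNB" \
      "$workdir/project/readme.txt"

# Same filter as in the script, but null-delimited and sorted for stable output.
found=()
while IFS= read -r -d '' file; do
  found+=("${file#"$workdir"/}")   # store the path relative to workdir
done < <(find "$workdir" -type f -iname '*.ipynb' \
           -not -path '*/.ipynb_checkpoints/*' -print0 | sort -z)

rm -rf "$workdir"
printf '%s\n' "${found[@]}"
```

The checkpoint copy and the text file are skipped, while Draft.IPYNB is found despite its upper-case extension. If you adopt -print0 in the real script, parallel needs its -0 option to read the null-delimited list.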

In conclusion, we have a very fast way to search in Jupyter notebooks from the command line. I hope some of you will find it helpful. If you have ideas about how to improve it, I am open to suggestions.
