My colleagues and I have written a large number of Jupyter notebooks. It has been a common problem to search efficiently within them.
Jupyter notebooks are JSON files, so searching them with traditional tools such as grep is tedious. At least, it is without a bit of tinkering 😉
Initially, I found a script on the internet called "nbgrep", but it did not work out for us. So, I wrote my own version.
It requires jq for JSON processing and GNU parallel for concurrent searches in the notebooks.
These are awesome tools anyway and can be very handy for data scientists. jq makes it easy to write queries against JSON files, while parallel can be used to execute any kind of code concurrently, or even on multiple machines (using ssh), in a very simple way.
They are easy to install:
Debian and friends:
sudo apt-get install jq parallel
macOS:
brew install jq parallel
You can find my script as a gist here, or if you cannot install parallel for some reason, here is the non-parallel version. I wrote a non-parallel version specifically for this post, so please notify me if something is wrong.
So when you run it as:
nbgrep 'read_[a-z]'
you will get something like:
./foo/bar.ipynb
    df_a = pd.read_csv(
    df_b = pd.read_csv(
./foobar/barfoo.ipynb
    G = obonet.read_obo(url)
(I had to rename most of the results.)
The script:
#!/bin/bash
set -euo pipefail

catch() {
    echo "ERROR $1 occurred on $2"
}
trap 'catch $? $LINENO' ERR

pattern="${1? You must provide a search pattern}"

jupyter-search() {
    file="$1"
    pattern="$2"
    matches=$(< "$file" jq '.cells[].source[]' -r \
        | grep -P "$pattern" \
        | xargs -I '%' echo -e "\t%"
    )
    if [ ! -z "$matches" ]
    then
        echo "$file"
        echo "$matches"
    fi
}
export -f jupyter-search

find . \
    -type 'f' \
    -iname '*.ipynb' \
    -not -path '*/.ipynb_checkpoints/*' \
    | parallel jupyter-search {} "$pattern"
Now let's see how the script works.
#!/bin/bash
set -euo pipefail
#!/bin/bash just tells the kernel where to find the interpreter for the script. set -euo pipefail is often called bash's "strict mode". Without it, bash will not stop the script on an error (-e) or on encountering an undefined variable (-u). -o pipefail makes sure that if any command in a pipeline fails (returns a non-zero exit code), the whole pipeline is considered to have failed.
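The effect of -o pipefail is easiest to see on a pipeline whose first command fails. The following is a minimal sketch (the sample text and the variable names are made up for illustration). It relies on grep exiting with status 1 when nothing matches:

```shell
#!/bin/bash
# grep finds nothing and exits with 1; cat succeeds and exits with 0.

set +o pipefail
status_without=0
printf 'import pandas\n' | grep 'read_csv' | cat || status_without=$?
# Without pipefail only the last command (cat) counts, so the status stays 0.

set -o pipefail
status_with=0
printf 'import pandas\n' | grep 'read_csv' | cat || status_with=$?
# With pipefail the failing grep decides the status of the whole pipeline.

set +o pipefail   # restore the default for anything that follows
echo "without pipefail: $status_without, with pipefail: $status_with"
```

With both -e and pipefail active, a grep that simply finds no match counts as a failure, which is worth keeping in mind when adapting the script.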
catch() {
    echo "ERROR $1 occurred on $2"
}
trap 'catch $? $LINENO' ERR
By trapping errors this way, we can see on which line the error occurred.
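To watch the trap fire without killing your own shell, the same lines can be run in a child bash. This is only a sketch: the failing command (false) and the echo messages are placeholders, and the reported line number depends on where the failing command sits in the script.

```shell
#!/bin/bash
# The demo runs in a child bash so that its failure does not end this shell.
output=$(bash -c '
set -euo pipefail
catch() {
    echo "ERROR $1 occurred on $2"
}
trap "catch \$? \$LINENO" ERR
echo "before the failure"
false                    # fails with exit code 1 and fires the ERR trap
echo "never reached"     # skipped, because set -e stops the script
' 2>&1 || true)
echo "$output"
```

The output should contain the line printed before the failure, followed by the ERROR message with exit code 1.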
pattern="${1? You must provide a search pattern}"
This is the search pattern that you want to find in the notebooks. It can be any Perl-like regex pattern. This line will also provide a helpful error message if the search pattern is not provided.
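The behaviour of the ${1? ...} expansion can be tried out in isolation. In the sketch below, the check variable and the "searching for" message are hypothetical stand-ins for the top of the script, run in a child bash named nbgrep:

```shell
#!/bin/bash
# Hypothetical stand-in for the first lines of the script.
check='pattern="${1? You must provide a search pattern}"; echo "searching for: $pattern"'

# With an argument, the expansion simply yields the argument.
with_arg=$(bash -c "$check" nbgrep 'read_[a-z]' 2>&1) && with_status=0 || with_status=$?

# Without an argument, bash prints the message to stderr and exits with a non-zero status.
without_arg=$(bash -c "$check" nbgrep 2>&1) && without_status=0 || without_status=$?

echo "$with_arg (exit status $with_status)"
echo "$without_arg (exit status $without_status)"
```

Note that ${1? ...} only complains when the argument is missing. The ${1:? ...} form would also reject an empty string.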
jupyter-search() {
    file="$1"
    pattern="$2"
    matches=$(< "$file" jq '.cells[].source[]' -r \
        | grep -P "$pattern" \
        | xargs -I '%' echo -e "\t%"
    )
    if [ ! -z "$matches" ]
    then
        echo "$file"
        echo "$matches"
    fi
}
This is a bash function definition. The function takes two arguments: $file, read from the first positional argument, and the search $pattern, read from the second one. Let's concentrate on the search part:
matches=$(< "$file" jq '.cells[].source[]' -r \
    | grep -P "$pattern" \
    | xargs -I '%' echo -e "\t%"
)
Here, the output of the pipeline (commands connected by pipes, |) will be assigned to the matches variable.
The first command of the pipeline reads the notebook into the jq JSON processor, which extracts the source lines of every cell as raw text (-r). Note that the .cells[].source[] filter does not distinguish code cells from markdown cells. These lines are piped into a grep command, which applies the given $pattern as a Perl-like regexp (-P). The last command in the pipeline prefixes each match found by grep with a tab.
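To see what the jq stage produces, it can be run on a tiny hand-written notebook. The file below is made up and heavily trimmed (a real notebook has more fields), and the second filter, which keeps only code cells, is a hypothetical variant that the script itself does not use:

```shell
#!/bin/bash
# A made-up, heavily trimmed notebook: one markdown cell and one code cell.
notebook=$(mktemp)
cat > "$notebook" <<'EOF'
{
  "cells": [
    {"cell_type": "markdown", "source": ["# We read_data here\n"]},
    {"cell_type": "code", "source": ["import pandas as pd\n", "df = pd.read_csv('a.csv')"]}
  ]
}
EOF

if command -v jq > /dev/null; then
    # The filter from the script: every source line of every cell, as raw text.
    all_lines=$(< "$notebook" jq -r '.cells[].source[]')
    # Hypothetical variant: only the source lines of code cells.
    code_lines=$(< "$notebook" jq -r '.cells[] | select(.cell_type == "code") | .source[]')
    echo "$all_lines"
    echo "---"
    echo "$code_lines"
else
    echo "jq is not installed"
fi
rm -f "$notebook"
```

Source lines that end in a newline show up followed by an empty line in the raw output. That does no harm here, because grep only passes on the lines that match.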
The if statement at the end of the function will print the results, provided that $matches is not an empty string.
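One caveat about the last stage of the pipeline: xargs treats quotes and backslashes in its input specially, so a matched line that contains quotes is not printed verbatim (and an unmatched quote makes xargs complain). The sketch below compares it with a sed-based indent, which is a swapped-in alternative and not part of the original script. The sample line is made up.

```shell
#!/bin/bash
line='df = pd.read_csv("data.csv")'

# The approach used in the script: xargs removes the double quotes.
via_xargs=$(printf '%s\n' "$line" | xargs -I '%' echo -e "\t%")

# Alternative sketch: sed puts a tab in front of each line and changes nothing else.
via_sed=$(printf '%s\n' "$line" | sed $'s/^/\t/')

printf '%s\n' "$via_xargs" "$via_sed"
```

If exact output matters to you, replacing the xargs stage with the sed command is one possible adjustment.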
export -f jupyter-search
parallel will execute the given command in a new shell, which does not inherit functions or unexported variables from the parent shell (the shell executing the script itself). Therefore, it is necessary to export the previously defined function, so that it is accessible in the shell started by parallel.
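The need for export -f can be checked without parallel, since any child bash behaves the same way. In this sketch say_hello is a hypothetical helper defined only for the demonstration:

```shell
#!/bin/bash
# Hypothetical helper function, used only for this demonstration.
say_hello() {
    echo "hello from $1"
}

# A child bash does not know the function until it is exported.
before=$(bash -c 'say_hello child' 2> /dev/null || echo "function not found")

export -f say_hello
after=$(bash -c 'say_hello child')

echo "before export: $before"
echo "after export:  $after"
```

Keep in mind that export -f is a bash feature, so both the script and the shell that runs the function have to be bash.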
find . \
    -type 'f' \
    -iname '*.ipynb' \
    -not -path '*/.ipynb_checkpoints/*' \
    | parallel jupyter-search {} "$pattern"
This code will find all regular files (excluding symlinks, directories and device files) with the .ipynb extension, matched case-insensitively. It searches recursively from the directory where you started the script, omitting only the contents of .ipynb_checkpoints directories. The notebook files found are streamed into the parallel command, which applies the jupyter-search function to each of them with the given search $pattern, by default running as many jobs at once as you have CPU cores. All of these processes write to your standard output, but parallel groups the output of each job, so the results of different notebooks do not get interleaved.
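The find part can be tried on its own in a throwaway directory. The directory layout and file names below are invented for the example, and the final sort is only there to make the listing stable:

```shell
#!/bin/bash
# Build a throwaway directory tree with made-up file names.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/project/.ipynb_checkpoints"
touch "$demo_dir/project/analysis.ipynb" \
      "$demo_dir/project/UPPER.IPYNB" \
      "$demo_dir/project/notes.txt" \
      "$demo_dir/project/.ipynb_checkpoints/analysis-checkpoint.ipynb"

# Same find expression as in the script, run from inside the demo directory.
found=$(cd "$demo_dir" && find . \
    -type 'f' \
    -iname '*.ipynb' \
    -not -path '*/.ipynb_checkpoints/*' \
    | LC_ALL=C sort)
echo "$found"
rm -rf "$demo_dir"
```

Only the two notebooks outside .ipynb_checkpoints are listed. The text file and the checkpoint copy are skipped.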
So in conclusion we have a very fast way to search in Jupyter notebooks from the command line. I hope some of you will find it helpful. If you have ideas about how to improve it, I am open to suggestions.