Jean

Originally published at blog.gitguardian.com

How to scan local files for secrets in Python using the GitGuardian API

Do you know how many secrets, like API keys or credentials, are hidden in your local files? Today, we're going to show you how you can scan files and directories for sensitive information like secrets. To accomplish this, we'll use the GitGuardian API and its Python wrapper. By the end of this tutorial, you will understand how the API works so you can start building your own custom secret detection scripts.

What our script will do

We will create a Python script that scans all files within a local directory for secrets. To do this we will be using the GitGuardian API and its Python wrapper; we recommend reviewing these resources before starting.

Our script will:

  • Detect secrets and other policy breaks in your file directory.
  • Print the filename, policy break, and matches for all policy breaks found.
  • Output the results in JSON format.

Getting set up

Before we get started writing our script, let's get the necessary components set up.

Installing the GitGuardian Python API client

Install the GitGuardian Python API client using the de facto package manager, pip.

In your terminal or command line execute the command:

pip3 install --upgrade pygitguardian
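
You can confirm the installation succeeded by asking pip for the package details:

pip3 show pygitguardian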

Obtaining the GitGuardian API token

Sign up for a free developer account from GitGuardian using your GitHub account or email at https://dashboard.gitguardian.com.

From the menu, navigate to the ‘API’ tab, scroll to ‘Generate new API key’ and select ‘Create new API key’. Make sure you give it an appropriate name.

You will not be able to view the API key again so make sure you immediately copy it to your clipboard before navigating away.

Setting up the directory and files

Open your terminal or command line.

Create a new directory in the location where you wish to save your script:

mkdir directory-scan

Enter the directory using:

cd directory-scan

Setting environment variables

As sticklers for good coding practices, we will use environment variables in this tutorial to store our API token (rule number one: never hardcode secrets in source code!).

I recommend using a tool called python-dotenv which will allow you to import environment variables, or you can set the API token in your console.
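
If you prefer setting the token in your console instead, you can export it as an environment variable (macOS/Linux syntax shown; use set or setx on Windows):

export GG_API_KEY=**INSERT API TOKEN**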

Create a new file called .env:

touch .env

Open this file in your chosen text editor and create a variable to store our API token:

GG_API_KEY=**INSERT API TOKEN**
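
If this directory is, or will become, a git repository, make sure the .env file is ignored so the token is never committed:

echo ".env" >> .gitignore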

Writing our script

Importing the modules and setting up our client

Next let's create a file called directory_scan.py:

touch directory_scan.py

Open this file with your chosen text editor.

First we need to import the modules we need:


import glob
import os
import sys
import traceback

All of these are standard modules. Note glob in particular: it will allow us to get the paths and file names of all the files within our directory.

Importing environment variables

If you are using python-dotenv, we need to load our API token from the .env file and store it in a variable:


from dotenv import load_dotenv
load_dotenv()
API_KEY = os.getenv("GG_API_KEY")

Now, thanks to load_dotenv(), you’ll be able to retrieve the GG_API_KEY this way and store it in a variable.
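
As an optional extra (not in the original script), you can fail fast if the variable was not found:

# Optional: stop early if the token was not loaded from .env or the environment
if not API_KEY:
    sys.exit("GG_API_KEY is not set; check your .env file.")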

Importing our GitGuardian API client modules

Next, import the GitGuardian API client:

from pygitguardian import GGClient
from pygitguardian.config import MULTI_DOCUMENT_LIMIT

GGClient is the core class of our API client: it will handle the data we are scanning, send it to the GitGuardian scanning engine, and receive the results.

The GitGuardian API only allows a maximum of 20 files, or a total size of 2MB, per request, to allow for asynchronous scanning. Importing the MULTI_DOCUMENT_LIMIT constant gives us this limit so we don't send invalid requests to the server.

This does not mean you can only scan 20 files at a time. Our script will handle this by breaking the files into ‘chunks’ that meet the API limits, sending multiple requests, and collating the results at the end.

Now we just need to initialize the GGClient by attaching our API key:

# Initializing GGClient
client = GGClient(api_key=API_KEY) 
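
Optionally, you can verify the key before scanning. This is a hedged sketch: pygitguardian provides a health_check() method, though treat the exact response fields as version-dependent:

# Optional sanity check: confirm the API key and connectivity (sketch)
health = client.health_check()
if not health.success:
    sys.exit(f"GitGuardian API check failed: {health.detail}")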

Loading files into an array

We now need to load all the files and file paths within our current directory. Our script scans recursively from the working directory (the directory from which the script is called):

# Create a list of dictionaries for scanning
to_scan = []
for name in glob.glob("**/*", recursive=True):
    # Skip the .env file and directories; only regular files can be read
    if ".env" in name or os.path.isdir(name):
        continue
    with open(name) as fn:
        to_scan.append({"document": fn.read(), "filename": os.path.basename(name)})

The ‘glob’ module allows us to create a list of file and path names, which we add into an array called to_scan so we can scan them.

We also add an if statement that excludes our .env file and skips any directories, which would otherwise cause an error when we try to open them (the files within those directories will still be added).

This will scan files recursively from the working directory the script is run in, but if you want to scan a different directory you can add its path to the pattern.

Example (forward slashes also work on Windows and avoid backslash-escaping issues):
for name in glob.glob("users/user/documents/**", recursive=True):

If you want to check that your code is working so far, add print(to_scan) on a new line. You should get a list of all the files and their contents within your current directory. Comment this out or remove it before continuing.

Processing the files in ‘chunks’ and making the API request

As previously mentioned, the API will only accept 20 files per request, with a maximum of 1MB per file. So we are going to break up our files into acceptable chunks to send as requests:

# Process in a chunked way to avoid passing the multi document limit
to_process = []
for i in range(0, len(to_scan), MULTI_DOCUMENT_LIMIT):
    chunk = to_scan[i : i + MULTI_DOCUMENT_LIMIT]
    try:
        scan = client.multi_content_scan(chunk)
    except Exception as exc:
        # Handle exceptions such as schema validation
        traceback.print_exc(2, file=sys.stderr)
        print(str(exc))
        # Skip this chunk; scan was never assigned if the request raised
        continue
    if not scan.success:
        print("Error scanning some files. Results may be incomplete.")
        print(scan)
        continue
    to_process.extend(scan.scan_results)

First, we create an empty array to hold the scan results from the API; we call this array to_process.

We are going to loop through our to_scan array and break it into chunks. To do this we use the range function, which takes a start value, an end value, and a step value.

range(start_value, end_value, step)
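
For example, with 45 files and a chunk size of 20 (the limit the API enforces per request), the loop would start a chunk at each of these indices:

# Illustration only: where each chunk begins for 45 files
print(list(range(0, 45, 20)))  # [0, 20, 40] -> chunks of 20, 20, and 5 files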

We load the current slice of the array into a variable called ‘chunk’.

Using a try block, we will scan our current chunk using the multi_content_scan method of the GitGuardian API client.

Of course, we need to handle any exceptions where the scan will fail, for example if a filename is too long for our schema.

The traceback will show the exact line where it failed.

Let's add in a message in the scenario our scan fails.

Finally, we extend our ‘to_process’ array with the scan results.

FAQ: If I need to scan 200 files, will this count as 1 or 10 API requests in my dashboard? It will count as 10 (200 files at 20 per request), but don't worry, you have 1,000 API requests a month.

Printing results

Now we will loop through our results. If a policy break is detected, it will be captured by the has_policy_breaks attribute; if this is true, we will print that result:

# Printing the results
for i, scan_result in enumerate(to_process):
    if scan_result.has_policy_breaks:
        # to_scan[i] lines up with to_process[i] as long as every chunk scanned successfully
        print(f"{to_scan[i]['filename']}: {scan_result.policy_break_count} break/s found")

Code Checkpoint 1

We are ready to run our first scan, so let's quickly make sure our code is the same.

import glob
import os
import sys
import traceback
from dotenv import load_dotenv
load_dotenv()
API_KEY = os.getenv("GG_API_KEY")

from pygitguardian import GGClient
from pygitguardian.config import MULTI_DOCUMENT_LIMIT

# Initializing GGClient
client = GGClient(api_key=API_KEY)

# Create a list of dictionaries for scanning
to_scan = []
for name in glob.glob("**/*", recursive=True):
    if ".env" in name or os.path.isdir(name):
        continue
    with open(name) as fn:
        to_scan.append({"document": fn.read(), "filename": os.path.basename(name)})

# Process in a chunked way to avoid passing the multi document limit
to_process = []
for i in range(0, len(to_scan), MULTI_DOCUMENT_LIMIT):
    chunk = to_scan[i : i + MULTI_DOCUMENT_LIMIT]
    try:
        scan = client.multi_content_scan(chunk)
    except Exception as exc:
        # Handle exceptions such as schema validation
        traceback.print_exc(2, file=sys.stderr)
        print(str(exc))
        continue
    if not scan.success:
        print("Error scanning some files. Results may be incomplete.")
        print(scan)
        continue
    to_process.extend(scan.scan_results)

# Printing the results
for i, scan_result in enumerate(to_process):
    if scan_result.has_policy_breaks:
        print(f"{to_scan[i]['filename']}: {scan_result.policy_break_count} break/s found")

Running the script

We are now ready to run our first directory scan.

You can download some example files that contain expired secrets here so you can test your script.

Move the directory_scan.py file into the directory you want to scan.

Open your terminal, navigate to the directory and run the command:

python3 directory_scan.py

Congratulations, you just scanned your directory for policy breaks!

After your script has run, you will receive feedback with the number of policy breaks that have been found.

main.py: 1 break/s found
sample.yaml: 1 break/s found

Now we know which files have policy breaks.

But we don't know if a policy break is a secret, or what kind of secret it is. So next we will add some additional detail to our results.

Displaying additional information

Including policy break type and matches

Now that we have detected policy breaks, we may wish to know which policies have been broken, for example whether it was a Slack token, an AWS key, or a filename policy.

Let's add a line to the output that tells us which policies have been broken.

You can find more information on policy breaks in the GitGuardian dashboard.

# Printing the results
for i, scan_result in enumerate(to_process):
    if scan_result.has_policy_breaks:
        print(f"{to_scan[i]['filename']}: {scan_result.policy_break_count} break/s found")
        # Printing policy break type
        for policy_break in scan_result.policy_breaks:
            print(f"\t{policy_break.break_type}:")

We add a nested loop within our previous loop and, for each policy break, use the break_type attribute to print the type of policy break that has occurred (in other words, the type of secret, filename, or extension that triggered the alert).

Now if we run our script again, we will get the same results, but this time we will also get the name of the policy break next to each file.

    main.py: 1 break/s found
        AWS Key: 
    sample.yaml: 1 break/s found
        Google API Key:

Adding matches

It is not always appropriate to print the matches we find, but in the case of this example we are going to do just that.

We are going to create another for loop, again nested. For each match object we print its match attribute, which gives us the content of the policy break.

# Printing the results
for i, scan_result in enumerate(to_process):
    if scan_result.has_policy_breaks:
        print(f"{to_scan[i]['filename']}: {scan_result.policy_break_count} break/s found")
        # Printing policy break type
        for policy_break in scan_result.policy_breaks:
            print(f"\t{policy_break.break_type}:")
            # Printing matches
            for match in policy_break.matches:
                print(f"\t\t{match.match_type}:{match.match}")

Let's run this again and we should now get:

  • File name and number of policy breaks.
  • Policy break types (secrets if any).
  • The matches for each policy break.

    main.py: 1 break/s found
        AWS Key: *********************************
    sample.yaml: 1 break/s found
        Google API Key: *****************************

Retrieving the output as JSON

Now let's say we need to output these results in JSON format.

The API client has built-in functionality to convert results into JSON format.

Let's loop through our files and if our scan results have policy breaks within them, we will print them in JSON.

# Getting results in JSON format
for i, scan_result in enumerate(to_process):
    if scan_result.has_policy_breaks:
        print(scan_result.to_json())
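
If you would rather collect everything into a single JSON file instead of printing, here is a minimal sketch; the results.json filename is just an example, and it relies only on the to_json() method shown above:

# Collect all results that contain policy breaks into one JSON document
import json

results = [json.loads(scan_result.to_json())
           for scan_result in to_process if scan_result.has_policy_breaks]

with open("results.json", "w") as out:
    json.dump(results, out, indent=2)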

Checkpoint 2

You're done! Let's do a final code review to make sure your code is correct.

import glob
import os
import sys
import traceback
from dotenv import load_dotenv
load_dotenv()
API_KEY = os.getenv("GG_API_KEY")

from pygitguardian import GGClient
from pygitguardian.config import MULTI_DOCUMENT_LIMIT

# Initializing GGClient
client = GGClient(api_key=API_KEY)

# Create a list of dictionaries for scanning
to_scan = []
for name in glob.glob("**/*", recursive=True):
    if ".env" in name or os.path.isdir(name):
        continue
    with open(name) as fn:
        to_scan.append({"document": fn.read(), "filename": os.path.basename(name)})

# Process in a chunked way to avoid passing the multi document limit
to_process = []
for i in range(0, len(to_scan), MULTI_DOCUMENT_LIMIT):
    chunk = to_scan[i : i + MULTI_DOCUMENT_LIMIT]
    try:
        scan = client.multi_content_scan(chunk)
    except Exception as exc:
        # Handle exceptions such as schema validation
        traceback.print_exc(2, file=sys.stderr)
        print(str(exc))
        continue
    if not scan.success:
        print("Error scanning some files. Results may be incomplete.")
        print(scan)
        continue
    to_process.extend(scan.scan_results)

# Printing the results
for i, scan_result in enumerate(to_process):
    if scan_result.has_policy_breaks:
        print(f"{to_scan[i]['filename']}: {scan_result.policy_break_count} break/s found")
        # Printing policy break type
        for policy_break in scan_result.policy_breaks:
            print(f"\t{policy_break.break_type}:")
            # Printing matches
            for match in policy_break.matches:
                print(f"\t\t{match.match_type}:{match.match}")

# Getting results in JSON format
for i, scan_result in enumerate(to_process):
    if scan_result.has_policy_breaks:
        print(scan_result.to_json())

Warning

Please note that you should only scan for secrets in places they should not exist, and revoke any that are found. As a general rule, any secrets that end up in remote locations not specifically designed to secure sensitive data should be considered compromised. This includes data sent to the GitGuardian API.

Next Steps

Now that you have created your first script using the GitGuardian API and Python wrapper, you can create your own awesome scripts to scan files.

The next tutorial will help you scan files at pre-commit time or in CI.

If you have any questions about the API, please email us at mackenzie.jackson@gitguardian.com.
