How to uncover leaked secrets with regex + entropy analysis
As a developer, I admit that I’ve committed secrets to public GitHub repositories before. Hardcoded secrets have always been a problem in organizations and are one of the first things I look for during a penetration test.
When developers write secrets such as passwords and API keys directly into source code, these secrets can make their way to public repos or application packages, then into an attacker’s hands. As microservice architectures and API-centric applications become mainstream, developers often need to exchange credentials and other secrets programmatically. This means that developers can sometimes make mistakes when handling sensitive data.
To put this into context, let’s look at an instance of hardcoded credentials. This bug report was submitted to reverb.com. The researcher discovered a pair of basic authentication credentials used to access Cloudinary. The secret was embedded in the source code of Reverb’s Android app. Anyone who downloads the Android app can extract this credential and gain the ability to access, edit, and delete all files in the Cloudinary instance.
private static final java.lang.String CONFIG = "cloudinary://434762629765715:█████@reverb";
This type of vulnerability is not rare by any means. As a penetration tester, I’ve found everything from basic auth credentials to AWS keys and GitHub API keys in many organizations' public source code or binaries. Sometimes, the only thing attackers need to do to compromise an organization is to search its GitHub repositories for accidentally committed credentials.
Using regexes
How do we detect these secrets before they cause an info leak? The most straightforward way to detect hardcoded credentials is to use text search and regex.
Hardcoded credentials such as API keys, encryption keys, and database passwords can often be discovered by grepping for keywords such as “key”, “secret”, “password”, or “aws”. These searches target identifiers, like variable names, that are used to refer to the secrets. Similarly, you can use string searches to look for keywords, known file names, and file formats that indicate a secret. RSA private key files, for instance, start with the string -----BEGIN RSA PRIVATE KEY-----.
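The keyword search described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete scanner: the function name flag_suspicious_lines and the reason labels are made up for this example, and the keyword list is just the one mentioned in the text.

```python
import re

# Keywords from the text above; extend this list for your own codebase.
KEYWORDS = re.compile(r"key|secret|password|aws", re.IGNORECASE)

# Known marker that opens an RSA private key file.
PRIVATE_KEY_HEADER = "-----BEGIN RSA PRIVATE KEY-----"


def flag_suspicious_lines(text):
    """Return (line_number, reason, line) for lines that look like they hold a secret."""
    flagged = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        # Check the private key header first so it gets the more specific label.
        if PRIVATE_KEY_HEADER in line:
            flagged.append((line_number, "private_key_header", line.strip()))
        elif KEYWORDS.search(line):
            flagged.append((line_number, "keyword", line.strip()))
    return flagged
```

A search this broad will produce false positives (any variable called "keyboard" matches "key"), so treat the results as leads to review rather than confirmed leaks.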
Many API keys also adhere to a specific format. You can detect these by looking for patterns in source code using regex searches. For instance, AWS access key IDs commonly start with the string “AKIA”, followed by 16 uppercase alphanumeric characters. So if you do a regex search of AKIA[0-9A-Z]{16}, you can very reliably identify strings of this format. Twilio API keys start with “SK” followed by 32 lowercase alphanumeric characters, so you can locate them with the regex pattern SK[a-z0-9]{32}. Passwords in URLs can be detected by searching for patterns that indicate basic authentication syntax: [a-zA-Z]{3,15}:\/\/[^\/\:@]+:[^\/\:@]+@.{1,100}. This regex pattern will discover credentials included in URLs of the form protocol://username:password@example.com. Identify the key formats for the services you use, and target your search using those patterns.
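Here is a rough sketch of how these format-based patterns could be applied to a block of source text in Python. The three regexes come straight from the paragraph above; the labels in PATTERNS and the function name find_secrets are illustrative choices, not part of any particular tool.

```python
import re

# Regexes from the text above, keyed by an illustrative label.
PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "twilio_api_key": re.compile(r"SK[a-z0-9]{32}"),
    "url_credentials": re.compile(r"[a-zA-Z]{3,15}://[^/:@]+:[^/:@]+@.{1,100}"),
}


def find_secrets(text):
    """Return (line_number, label, matched_text) for every pattern hit."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((line_number, label, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = (
        'aws_key = "AKIAIOSFODNN7EXAMPLE"\n'
        'db_url = "postgres://admin:hunter2@db.example.com/app"\n'
    )
    for hit in find_secrets(sample):
        print(hit)
```

To cover other services, add one entry per key format to the dictionary. The sample values are placeholders, not real credentials.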
These two strategies can discover most hardcoded credentials. But by relying on text searches, you risk missing secrets that don’t adhere to a specific format. This is where entropy scanning comes in.
Let’s talk about entropy.
For our purposes, you can think of entropy as how random and unpredictable something is. For instance, a string composed of only one character, such as aaaaa, has very low entropy. A longer string with a larger set of characters, such as wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY, has higher entropy.
You can test these strings out and see how entropy is calculated here (Kozlowski, L. Shannon entropy calculator).
Entropy is a good tool to find highly randomized and complex strings, which often indicate a secret. By measuring the entropy of string literals in your source code, you can discover suspicious strings of any format.
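To make this concrete, here is a small sketch that computes Shannon entropy (in bits per character) and uses it to flag string literals. The minimum length of 20 and the threshold of 4.0 are assumptions chosen for illustration, and the quoted-string regex is deliberately simple: it ignores escaped quotes and multi-line strings. You would need to tune all of these for a real codebase.

```python
import math
import re
from collections import Counter

# Naive matcher for single- or double-quoted literals on one line.
STRING_LITERAL = re.compile(r"""(["'])(.*?)\1""")


def shannon_entropy(s):
    """Shannon entropy of s in bits per character (0.0 for an empty string)."""
    if not s:
        return 0.0
    length = len(s)
    counts = Counter(s)
    # Sum p * log2(1/p) over the frequency p of each distinct character.
    return sum((c / length) * math.log2(length / c) for c in counts.values())


def find_high_entropy_strings(source, min_length=20, threshold=4.0):
    """Return (literal, entropy) for quoted strings that look random."""
    results = []
    for match in STRING_LITERAL.finditer(source):
        literal = match.group(2)
        if len(literal) < min_length:
            continue  # short strings do not carry enough signal
        entropy = shannon_entropy(literal)
        if entropy >= threshold:
            results.append((literal, entropy))
    return results
```

With this definition, aaaaa scores exactly 0, while the 40-character example key above scores well over 4 bits per character. Ordinary identifiers and English text tend to fall in between, which is why a threshold is needed at all.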
What now?
You should monitor your public repositories for accidentally committed secrets. Any credentials that are leaked to public repositories should be considered stolen and should be rotated.
Of course, not all code is open-sourced, and not all hardcoded secrets will be committed to public repositories. But hardcoded secrets can still be an issue if leaked through application binaries, logs, or stolen source code. A good strategy to minimize the risk of hardcoded secrets is to employ a scan that combines pattern searching with entropy analysis before code makes it to production and to store secrets in configuration files or secret management services instead.
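As one possible way to keep a secret out of source code, the sketch below reads it from an environment variable at runtime. That variable could in turn be populated from a configuration file or a secret management service at deploy time. The helper name load_secret and the variable name in the usage comment are hypothetical.

```python
import os


def load_secret(name):
    """Read a secret from the environment and fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Missing required secret: set the {name} environment variable"
        )
    return value


# Usage (hypothetical variable name):
# cloudinary_url = load_secret("CLOUDINARY_URL")
```

Failing at startup when the variable is unset is a deliberate choice here: it discourages falling back to a hardcoded default, which would reintroduce the original problem.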
Sometimes, it might feel necessary to store secrets in code that users can get their hands on. An example of this is API keys used in mobile applications. In this case, you can take steps to prevent these keys from being found. For instance, avoid naming your sensitive variables with easily guessable identifiers like “api_key” or “password”, and obfuscate your code so that it’s harder to extract secrets from your application. Finally, run the parts of your application that require third-party services on the server to avoid packaging keys into application files.
Always scan your codebase for hardcoded secrets and analyze whether they could end up on an attacker’s screen. See if the secrets need to be there and if you are protecting them properly.
I hope you had fun with this tutorial! Static analysis is the most efficient way of uncovering hardcoded secrets in your applications. ShiftLeft’s static analysis tool NG-SAST is equipped with a secrets scanner that can automate this process for you. If you’re interested in learning more about NG-SAST, visit us here:
Thanks for reading! What is the most challenging part of developing secure software for you? I’d love to know. Feel free to connect on Twitter @vickieli7.