We all know the risks that vulnerabilities pose: application crashes, data loss, or privacy breaches. In this article, we'll look at examples that illustrate the core aspects of an approach, where programmers can find vulnerabilities at the development stage.
How can one prevent vulnerabilities?
How to look for vulnerabilities
The foreword sets the scene: security flaws can lead to the loss of privacy, integrity, and availability of data, as well as disruption to an application. Hackers can also exploit vulnerabilities for cyber attacks. Examples of such vulnerabilities include XSS, XXE, and SQL injection. If you find it easier to digest information in the meme form, here's a prime example:
What are the prevention measures against such vulnerabilities? Three options come to mind immediately:
- testing, which includes manual testing and pentesting;
- dynamic analysis (DAST), which is a testing method that uses specialized tools to analyze a running application and identify vulnerabilities by simulating real-world attack scenarios;
- development control, which involves using security standards in development, raising employee awareness, and regular code reviews.
But how can we avoid vulnerabilities at the development stage without relying on human involvement? This is where Static Application Security Testing (SAST) comes in. It works directly on the source code and doesn't depend on a human.
A little spoiler: we'll focus on the theoretical aspects of the issue here. In the next article, we'll look under the hood of a specific SAST solution, PVS-Studio.
Why is it difficult to detect tainted data?
How to work with code
So, we have some code. But how do we even work with it? Here's the simplest example of SQL injection: an unvalidated argument is injected into the database from the command line.
public static void main(String[] args) throws SQLException{
var query = "SELECT * FROM foo WHERE bar = '" + args[0];
var conn = getConn();
var st = conn.createStatement();
var rs = st.executeQuery(query);
while (rs.next()) {
System.out.println(rs.getString("baz"));
}
}
We immediately reject such silly things as searching for code fragments via regular expressions. The well-known compilers, which translate the code into machine code (or intermediate representation), undertake the task of code parsing.
More specifically, it is the parsers that process the source code by turning a token stream into an abstract syntax tree. In our recent article, my team and I have already detailed how to develop an analyzer from scratch—we've described the process. I'd like to also note here that you don't have to write a parser from scratch. You can either use parser generators, like ANTLR, or ready-made libraries, including compiler APIs.
Let's go back to our code. After parsing, AST will look something like this (the artist's impression):
It's a bit simplistic but quite illustrative. If you're in the mood for some fun, you could compare the source code with the resulting AST. Now we have a code representation that is easy to work with programmatically, but what's next?
How do we recognize tainted data?
To distinguish the normal code from potentially vulnerable one, we can turn to the annotation mechanism. We can use it to mark up the following:
- Sinks, which are dangerous parts of an application where tainted data may get into. If this happens, a potentially dangerous operation is performed.
- Sources, which indicate where the data may come from. These include: controller methods, desktop application forms, or console input.
- Sanitization, which are methods or checks that validate data.
It's really simple: we just mark up library methods, like java.sql.statement.executeQuery
, and assign them as sinks. Similarly, we mark data sources.
Or you're making some kind of a query builder for a custom ORM. Yes, it's often a Sisyphean task, but I've been there, so send me your condolences. So, concatenation is unavoidable there, but since the methods are private, no external data can get through. By the way, this is why public methods are considered potential taint sources—it's possible to accidentally pass unvalidated data there. Thus, to stop complaining about fairly secure code, we need to look at data sources and, consequently, their validation (sanitization).Why over-complicate things if we can just use sinks?
We can illustrate this with the SQL injection example: while using concatenation instead of query parameters is generally bad practice, sometimes it doesn't cause harm. For example, when values come from a whitelist:
String parameter;
switch (args[0]) {
case "1":
parameter = "qux";
break;
case "2":
parameter = "quux";
break;
default:
throw new IllegalArgumentException("Unexpected argument");
}
var query = "SELECT * FROM foo WHERE bar = '" + parameter;
// ....
The specific mechanism of annotation implementation doesn't matter much. The key thing here is that we should be able to find the sinks and sources. For the sake of clarity, we'll mark the source in orange and the sink in scarlet for the example above:
Can we determine whether there's a connection between the source and the sink, knowing what the source and the sink are? This is easy in our case, we just need to make sure that args
contributes to the creation of the query
variable. However, what if we had a more complicated example? I quickly came up with not-so-sophisticated code to illustrate:
int mode = 0;
String defaultQ = "false";
String field = "";
Statement sql;
public static void main(String[] args) throws SQLException {
var st1 = args[0];
var statement = args[1];
var flag = Boolean.parseBoolean(args[2]);
String query;
if (flag) {
st1 = statement;
statement = "SELECT * FROM TBL";
query = st1 + statement;
} else if (st1.equals("foo")) {
if (this.mode == 1) {
query = this.defaultQ;
} else {
query = statement + "LIMIT 1";
}
} else {
query = ";";
}
query = field + query;
sql.executeQuery(query);
}
And here's its AST with the source and sink marked (it's clickable):
We aren't on the same page unless your face has changed in a similar way after that:
Yes, tracing data from the source to the sink still seems easy, but how do we even track data movement if AST has no direction? Okay, if you've been paying attention, you might have noticed the similarities: in the picture, the operations go from top to bottom, and the nested operation bodies go to the right. Yet AST doesn't differentiate between regular assignments, conditional statements, or loops—they're all just syntactic constructs to it. So, "climbing" a tree is definitely possible, but it's hard. We need to keep an eye on:
- reassignments (the initial value of a variable doesn't always match its final value);
- validation (input data can be checked for a security threat, or it can be removed from the data in other ways);
- control flow (it's essential to track where each particular node is located in relation to all conditions and loops within the method body, otherwise it's impossible to understand where the data comes from).
Those were just the things that immediately came to my mind. All in all, we need to keep building the foundation. If you find this topic interesting, you may also want to explore annotations further on our terminology page.
Control-flow integrity
I've already mentioned that we need to keep an eye on the control flow. This brings us to the need for a full-fledged data flow analysis, which begins with the control-flow graph.
If the abstract syntax tree aims to show—oh, don't faint here—the language syntax, the CFG helps display the order of statements in the code. The CFG is based on AST (remember I mentioned that you can trace the direction of code execution from it?) and once you've built it once, you'll find it much easier to get information from it.
Let me just show you the CFG for that dreaded AST above:
If you've ever had to draw block diagrams during your uni years, you might have felt a twinge, as these things are inherently similar.
Are we ready to start the analysis? Not quite yet—better buckle up; we're only halfway there :) Okay, jokes aside, but it's too early to start the analysis. Here's why: with a CFG, we'd need to analyze all nodes in the graph, while we're only interested in the nodes containing external data flowing to the sink.
Two approaches can help with this issue. While only one is necessary, we'll explore both for clarity: the SSA (static single assignment) form and the DU graph. We'll start with the first one.
Intermediate representation
The intermediate representation topic is so broad that we could extend the discussion to bytecode (by the way, in .NET, the intermediate language is aptly named Common Intermediate Language (CIL)). So, I'll outline the issues we're trying to address:
- tracking variable value overrides is difficult;
- monitoring variable usage across branches can be tricky.
However, these challenges could be addressed at the bytecode level (and it would be even easier in some ways), but I'd rather not expand the scope of the article. I'll just say that this approach can be a perfect choice if you're analyzing only bytecode-based languages and not afraid to start from scratch.
Let's get back to business—or rather to issues. We can handle them using a simple form of intermediate representation compatible with the language, which is the aforementioned static single assignment (SSA) form. The rule is simple: each variable can be assigned only once. This code then:
var a = 5;
a = a + c;
var b = a;
Turns into this one:
var a1 = 5;
var a2 = a1 + c;
var b = a2;
Things get a bit more complicated when it comes to conditions and loops. We take the φ (phi) function approach and use the functions to heuristically determine the branching result. The code:
int x = 5;
if (cond) {
x = x + 3;
} else {
x = a;
}
System.out.println(x);
Becomes this one:
int x1 = 0, x2 = 0;
int x0 = 5;
if (cond) {
x1 = x0 + 3;
} else {
x2 = a;
}
int x3 = phi(x1, x2);
System.out.println(x3);
Yes, we had to initialize the variables right away to ensure syntax compatibility.
Let's go back to our example again and get the following SSA:
int mode = 0;
String defaultQ = "false";
String field = "";
Statement sql;
public static void main(String[] args) throws SQLException {
var st1_0 = args[0];
var statement_0 = args[1];
var flag = Boolean.parseBoolean(args[2]);
String query_1 = null;
String query_2 = null;
if (flag) {
var st1_1 = statement_0;
var statement_1 = "SELECT * FROM TBL";
query_1 = st1_1 + statement_1;
} else {
String query_2_0 = null;
String query_2_3 = null;
if (st1_0.equals("foo")) {
String query_2_1 = null;
String query_2_2 = null;
if (this.mode == 1) {
query_2_1 = this.defaultQ;
} else {
query_2_2 = statement_0 + "LIMIT 1";
}
query_2_0 = phi(query_2_1, query_2_2);
} else {
query_2_3 = ";";
}
query_2 = phi(query_2_0, query_2_3);
}
var query_3 = phi(query_1, query_2);
var query_4 = field + query_3;
sql.executeQuery(query_4);
}
It may be hard to read, but it's much easier to analyze. The example didn't even have any assignment to object fields, otherwise it would've been a headache. Luckily, they aren't there :) I suggest we take it easy and stick to the simpler stuff for now. I hope you're still here. We're almost done, just one more step to go.
Use chains
We've modified the code, but we don't have to analyze it directly. Based on the SSA form, we can build the last thing we need today—def-use chains, or definition-use chains. There's also their opposite analog, UD chains, but we'll focus on the former for now.
Oh yes, earlier I've mentioned that the SSA and DU chains don't have to be built together, so the previous step could be skipped as well as this one. However, for our purposes, we'll build both of them here, so we don't have to keep track of overrides and branching. In addition, building SSA reduces the number of edges in the graph, which also minimizes memory consumption.
So, if you're still confused about what these chains are—or their name doesn't make it clear—let me explain: DU chains link the variable value-initialization and the further use of this variable. This way we've created a new variable for each new override of the query
value. If we build chains for all of them and connect them somehow, we get something like this:
Looks a bit intimidating, right? Let me walk you through it:) Since each override is used only once, each chain has two elements: a definition on the left, and a use on the right. Their link is also shown on the left. From the program's point of view, reading it is easy as we can go to any end of the chain and then continue through the linked chains.
It's easier to see the benefit of def-use traversal when it's shown over the CFG:
In the explanation, we'll move from bottom to top:
- The potentially unsafe SQL query is marked in red.
- Elements of chains built from
query
that may contain tainted data are marked in orange. - Green marks chain elements beyond which there's no point in searching for tainted data. This is clear for literals, but less so for object fields. The thing is, static analysis struggles to interprocedurally calculate a potential value of a field, so we won't attempt it.
- Blue and purple mark the nodes from other variable chains (
statement
andst1
) that may contain dangerous data. - Since purple resides at the top, the data comes from the
main
parameters, indicating that we have found two paths where tainted data can flow. - White is left for things that don't belong to the chains we're interested in.
As you may have noticed a bit earlier, this would be more difficult to illustrate with unlinked chains :) In fact, we'll come back to the idea of traversal later, but for now we have everything we need to traverse the method. What if we need to traverse multiple methods, though?
The call graph
I have to admit that we missed one thing. If we modify the very first example even in this trivial way:
public static void main(String[] args) throws SQLException {
var foo = findFoo(args[0]);
// ....
}
private static Foo findFoo(String bar) throws SQLException {
var query = "SELECT * FROM foo WHERE bar = '" + bar;
var conn = getConn();
var st = conn.prepareStatement(query);
var rs = st.executeQuery(query);
// ....
return foo;
}
We won't find any errors, because the data goes into the method parameters. Looks like all our efforts have been in vain.
Okay, drama aside, even though we can't learn from the method where it's called, we have a tool that will help us. This tool is the call graph, which we need to build before analyzing all methods. In our case, it looks pretty boring:
It's straightforward to build: just identify all method calls in the project beforehand and link them with edges. Call graphs of large projects can sometimes produce interesting patterns. For example, here's the call graph for the Lua analyzer from the aforementioned article. You can explore it by clicking below.
We can see parts that aren't connected to the clusters in the center. This happens because polymorphism makes it difficult to tell exactly where a method is called from. While addressing polymorphism presents its own challenges, let's not complicate things and move on.
How to find tainted data
If you've been attending your discrete mathematics courses or gearing up for FAANG interview, you might have already seen where we're going with this. Having started with just a text file containing the code, we now have a complete toolkit to work with the program. This reduces the whole task to a simple traversal of a graph, or rather graphs.
Well, let's do this together and revisit the code from the last section. We'll trace the data as it moves from its source:
public static void main(String[] args) throws SQLException {
var foo = findFoo(args[0]);
// ....
}
Entering foo
, we don't see anything wrong, but we find the findFoo
call:
From the call graph, we find this method.
It wasn't difficult here, but the other way around would be harder. Anyway, this is how we get to findFoo
:
private static Foo findFoo(String bar) throws SQLException {
var query = "SELECT * FROM foo WHERE bar = '" + bar;
var conn = getConn();
var st = conn.prepareStatement(query);
var rs = st.executeQuery(query);
// ....
return foo;
}
Let's build the CFG for it:
We don't need to create SSA, because everything has already been assigned once. Oh, how lucky we are :) The only thing left to do is complete the chain for query
and bar
. For clarity, I marked their connection.
At the top, we have the chain for bar
(with a signature hidden behind BEGIN_2) and for query
at the bottom. When we reach the execution of the SQL query (executeQuery
) without encountering any validation or use of query parameters, we realize that here it is—a potential path for tainted data.
There's no validation against SQL injection, but technically we could say it's something like checking for ';'.
Well, since the exception isn't handled anywhere, it's thrown directly to the console—here we have a full house of security violations (never do this). Yes, the example with main
is a bit far-fetched, but absolutely the same logic applies to controller methods and to any other source.
Why did I say at the beginning of the section that it came down to simply traversing the graph, and the last section was called "Why is it difficult to detect tainted data"? Well, because this is the journey we had to take to get here :)
Afterword
I believe this covers the essentials for detecting tainted data in source code. Each of these points deserves its own in-depth article, and a scientific one at that, but I wanted to give a general overview of the technologies used for this task. For the same reason, I focused mostly on trivial examples. I hope you've made it this far in the article. If so, I'd be eager to hear your thoughts in the comments :)
Just a reminder, that this is the first article of the two I've planned. In the second one, we'll talk about how do we look for tainted data. Yes, you read that right, we've recently taught our analyzer to detect tainted data in code. You'll help us a lot if you try it out.
To stay tuned for more articles on code quality like this one, you can subscribe to:
Top comments (0)