DEV Community

Tom den Braber
Tom den Braber

Posted on • Originally published at moxio.com on

The What, Why and How of Type Inference

In my previous post concerning the exceptional flow of PHP programs, I presented a global overview of my project. In this post, I am going to cover one of the building blocks of the algorithm in more depth: type inference. We will discuss what type inference is, why it is needed and how it can be done.

Wait, what?

The words ‘type inference’ might sound scary, but the principle behind it is quite simple. Consider the following code:

<?php 
class Elf { 
    public function __construct(string $name) {
        /* constructor code */
    }

    public function getName() : string { 
        /*... an implementation ...*/ 
    } 

    public function say(string $what_to_say) : void { 
        /*... also an implementation ... */ 
    } 
}

$an_elf = new Elf("Legolas");

if ($an_elf->getName() === "Legolas") { 
    $an_elf->say("They're taking the hobbits to Isengard!"); 
}

Looking at this code, we can say a few things. For example, we can say that the expression "Legolas" is of type string, because it is text enclosed by quotes. We also know that any occurrence of the $an_elf variable is of type Elf, because $an\_elf is only assigned a value once, namely the expression new Elf("Legolas") which is of type Elf. I don't know whether you noticed or not, but... we already inferred a few types. Type inference is nothing else than deducing the types of expressions in a program at Compile Time, i.e. without actually running the program.

But why?

So why would you want to do this? Well, it is likely that you are already using it, maybe even without knowing it. If you are using an IDE, it probably informs you about the types of expressions and variables. It probably supports ‘click to go to definition’: you click on an expression with type Elf and your cursor automagically moves to the definition of the Elf class. You already guessed it: these kind of interactions are possible because of type inference.

Type inference can also help to spot errors before even running your code. Try to spot the error in the snippet below:

<?php
function tell_me(string $what_to_tell) { /* some implementation */}
tell_me(["where is Gandalf, for I much desire to speak to him"]);

When your IDE has type inference, it could warn you that you are calling the function tell_me with an argument of type array, whereas the definition of the function clearly states that it requires a string. In this case, it is quite easy to see. But what if these statements were located in different files? When types of expressions are known at compile time, IDEs and static analysis tools can warn programmers that they are making a mistake. Type inference can be a great help while developing. Even better: you are probably already using tools that are powered by type inference algorithms, maybe without even knowing it.

Awesome! How can we do this?

When we look back at the process we used to infer types in the first code snippet, we can see that we used two main methods.

  1. We looked at ‘stand-alone’ expressions, like "Legolas" or new Elf(...) and inferred the types of those expressions.
  2. We propagated types of expressions to other expressions. E.g. each occurrence of $an_elf has type Elf because it is assigned the expression new Elf(...) of type Elf at the start of the program.

The interesting thing is that these methods go hand in hand. For example, if we want to know the type of $an_elf->getName(), we have to know the type of $an_elf: this type needs to be propagated from a definition earlier in the program. The complete expression can be resolved when we know that $an_elf is of type Elf, as we can then look at the the definition of Elf::getName(), which says that this function returns a string. The type of $an_elf->getName() is thus string.

Implementing a type inference algorithm which uses only infers types on ‘stand-alone’ expressions is not too hard. Combining a simple traversal of the Abstract Syntax Tree (AST) with enough knowledge of the language of a program will do the job. It gets interesting when you also want to propagate the types of expressions to other expressions. In order to do that, we need to know which expressions are dependent on other expressions. We can discern two types of dependences here: control dependence and data dependence. Control dependence says something about the order of execution, whereas data dependences describe which variables influence other variables.

Control flow data is often represented as a Control Flow Graph (CFG). The nodes in the graph are sets of instructions which are executed linearly, i.e. one after another. The edges represent the ‘jumps’ that might occur in the program. Consider the following snippet.

<?php 
$a = rand_int(0,5); 
if ($a > 3) { 
    $a = "a string"; 
} else { 
    $a = false; 
} 
some_function($a);

The CFG of this snippet looks as follows:

From the CFG, we can clearly see that either the if-branch or the else-branch will be executed, but certainly not both in the same run. In the snippet, the variable $a is used. But what is the type of $a in this snippet? The correct answer would be that $a can be a string, an integer or a boolean. However, this is not really useful: although it is true, it does not say a lot. As $a changes multiple times in the program, it is hard to determine what is meant by the variable $a.

But what if we slightly transform the program? We could say, for example, that each variable might be assigned only once. In this way, we can be sure what the type of the variable is at all times, as it is only initialised but not changed. This form is called the Single Static Assignment (SSA) form. The snippet above would look as follows in SSA:

<?php
$a_1 = rand_int(0,5); 
if ($a_1 > 3) { 
    $a_2 = "a string"; 
} else { 
    $a_3 = false; 
} 
some_function(φ($a_2, $a_3));

This snippet has exactly the same behaviour as the earlier snippet, but each variable is assigned only once. This makes the process of type inference somewhat easier, as a variable can now have only one type during its entire life. By keeping a link to from the variables in SSA form to the variables in the original program, we can transfer the inferred types back to the original program in non-SSA form. However, translating the program to SSA form only gets you so far. You might have noticed the strange φ function in the last line of the snippet. The φ actually tells us that we cannot really know which variable will be used here. The value of φ($a_2, $a_3) will be either $a_2 or $a_3: we know for sure that one of those variables will be used as parameter for some_function, but we cannot know which one at compile time. The type of the φ-function is the union of all the types of its parameters. In this example, the type is either a string (originating from $a_2) or boolean (originating from $a_3).

Wrapping up

We covered a lot of ground in this post. To summarise, type inference…

  • is the process of deducing types of expressions at compile time;
  • enables all kinds of tools which can be used to help you, the software developer;
  • can be done by combining deep knowledge of the programming language and flow information.

Stay tuned for the next post concerning call graph construction!

Tools

There are some tools around for creating CFGs and performing type inference. For example:

  • PHP-CFG — a tool for generating CFGs of PHP programs
  • PHP-Types — a tool for performing type inference on CFGs.

Originally published at www.moxio.com on December 17,2017.


Top comments (0)