Developers coming from C know that variables should always be initialized. Not initializing your variables means they contain junk, and this can result in undefined behavior. For example:
#include <stdio.h>

int main(void) {
    char buffer[256];
    char answer;
    char* name;

    printf("Do you want to enter a name? [yn] ");
    answer = getchar();
    while (getchar() != '\n') { } // because we need CR for getchar but it doesn't read the CR...

    if (answer == 'y') {
        printf("Please enter name: ");
        name = fgets(buffer, 256, stdin);
        if (name == 0) {
            name = "<too long>";
        }
    } else if (answer == 'n') {
        name = "<user refused to enter name>";
    }

    printf("The name is %s\n", name);
    return 0;
}
If the user enters a character that is not y or n, none of the name = ...; statements will be executed, and name will still hold whatever value it had when main started. What is that value? In a release-mode C build, that would be whatever data happened to be in the piece of memory name was assigned. And then we take that arbitrary number and pass it to printf, where it gets treated as if it were a string pointer!
If we are lucky, we'll hit an illegal memory address and the OS will stop us. If we aren't, it'll just go to some arbitrary place in memory and start printing whatever it finds there: passwords, credentials, application tokens...
And of course, this will not be reproducible: each time you run the program there may be a different value at that place in memory, so you may get different results.
To avoid these problems, C developers have conditioned themselves to always initialize their variables. If you don't have something meaningful to put at the point of declaration - just put 0:
#include <stdio.h>

int main(void) {
    char buffer[256] = {0}; // {0} rather than {}: empty braces are not valid C before C23
    char answer = '\0';
    char* name = 0;

    printf("Do you want to enter a name? [yn] ");
    answer = getchar();
    while (getchar() != '\n') { } // because we need CR for getchar but it doesn't read the CR...

    if (answer == 'y') {
        printf("Please enter name: ");
        name = fgets(buffer, 256, stdin);
        if (name == 0) {
            name = "<too long>";
        }
    } else if (answer == 'n') {
        name = "<user refused to enter name>";
    }

    printf("The name is %s\n", name);
    return 0;
}
While null pointer dereference is still formally undefined behavior, it is much better than dereferencing a random pointer, because your operating system will probably turn it into a SEGFAULT - which is better than a security leak.
OK, but that's C. What about more modern languages?
There are two main reasons this was so needed in C:
- Uninitialized variables having junk data.
- Inability to declare variables in the middle of a block.
More modern languages allow declaring variables in the middle of a block, so it is usually preferable to only declare the variable at the point where you have something meaningful to put in it.
This greatly reduces the cases where you have to initialize something with a default value - but does not prevent all of them. In our case, for example, name gets its value inside the if branches - if we declared it there we wouldn't be able to use it after the if. Some languages (mostly the functional ones) have an easy syntactic solution, but in most mainstream languages you'd have to either extract it to a function or declare the variable outside the block.
When going with the latter solution, because C is such a common background, many developers will initialize the value. So if we convert our code to Java:
import java.util.Scanner;

public class Main {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);

        System.out.print("Do you want to enter a name? [yn] ");
        String answer = scanner.nextLine();

        String name = null;

        if ("y".equals(answer)) {
            System.out.print("Please enter name: ");
            name = scanner.nextLine();
        } else if ("n".equals(answer)) {
            name = "<user refused to enter name>";
        }

        System.out.printf("The name is %s\n", name);
    }
}
Sure, this is Java, a language with managed memory that will never allow undefined behavior from uninitialized variables, so we don't really need to initialize name to null, but better safe than sorry, right?
WRONG!
Java analyses code paths to make sure no local variable can be used without being assigned first. So if we remove the initialization:
import java.util.Scanner;

public class Main {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);

        System.out.print("Do you want to enter a name? [yn] ");
        String answer = scanner.nextLine();

        String name;

        if ("y".equals(answer)) {
            System.out.print("Please enter name: ");
            name = scanner.nextLine();
        } else if ("n".equals(answer)) {
            name = "<user refused to enter name>";
        }

        System.out.printf("The name is %s\n", name);
    }
}
We'll get a compilation error:
$ javac Main.java
Main.java:18: error: variable name might not have been initialized
        System.out.printf("The name is %s\n", name);
                                              ^
1 error
I just broke the compilation, but this is a good thing - the compiler found a bug! It is the same bug we had in the C version: what if the user enters something which isn't y or n? The Java compiler sees that there are three possible code paths that reach the last line, but name is assigned in only two of them.
To be able to compile again, we must tell Java what to do in case the user gave an invalid answer. Failure is also an option - as long as we do it intentionally:
import java.util.Scanner;

public class Main {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);

        System.out.print("Do you want to enter a name? [yn] ");
        String answer = scanner.nextLine();

        String name;

        if ("y".equals(answer)) {
            System.out.print("Please enter name: ");
            name = scanner.nextLine();
        } else if ("n".equals(answer)) {
            name = "<user refused to enter name>";
        } else {
            System.err.printf("Illegal answer \"%s\". The only legal answers are \"y\" and \"n\".", answer);
            return;
        }

        System.out.printf("The name is %s\n", name);
    }
}
Now there are still three code paths, but in the third we return from the function early, before printing name. The Java compiler can determine that there are no code paths where name is used without being assigned a value first - and thus the compilation succeeds.
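The other route mentioned earlier, extracting the logic to a function, gets a similar guarantee: a Java method with a return type must return or throw on every path, so a forgotten branch is again a compile error rather than a silent default. The following is only a sketch of that idea; the class NamePrompt and the helper resolveName are hypothetical names, and throwing IllegalArgumentException for an invalid answer is an assumed choice, not something the original code does:

```java
import java.util.Scanner;

class NamePrompt {
    // Hypothetical helper: every path either returns a name or throws.
    // Removing the final throw makes javac report a missing return statement.
    static String resolveName(String answer, Scanner scanner) {
        if ("y".equals(answer)) {
            System.out.print("Please enter name: ");
            return scanner.nextLine();
        } else if ("n".equals(answer)) {
            return "<user refused to enter name>";
        }
        throw new IllegalArgumentException("Illegal answer \"" + answer + "\"");
    }

    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);

        System.out.print("Do you want to enter a name? [yn] ");
        String answer = scanner.nextLine();

        // name is declared exactly where a meaningful value is available.
        String name = resolveName(answer, scanner);

        System.out.printf("The name is %s\n", name);
    }
}
```

With this shape, name is declared at the point where it gets its value, so the question of a default never comes up.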
This is still initialization
Despite the clickbaity title, we do actually initialize name. We don't do it on declaration, but we are initializing it nevertheless. This compiles:
import java.util.Scanner;

public class Main {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);

        System.out.print("Do you want to enter a name? [yn] ");
        String answer = scanner.nextLine();

        final String name;

        if ("y".equals(answer)) {
            System.out.print("Please enter name: ");
            name = scanner.nextLine();
        } else if ("n".equals(answer)) {
            name = "<user refused to enter name>";
        } else {
            System.err.printf("Illegal answer \"%s\". The only legal answers are \"y\" and \"n\".", answer);
            return;
        }

        System.out.printf("The name is %s\n", name);
    }
}
Wait - how? Didn't they teach us that you can't change the value of a final variable?
Well, yes, but we are not changing the value of any final variable here - we are just initializing it. Since name has never been assigned before in any of the paths that assign to it, these assignments are actually initializations - which are perfectly fine for final variables. It wouldn't have worked with final String name = null, but without the initialization on declaration it's fine. Even without the final keyword, name could be used in lambdas (provided they appear after the first assignment).
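To illustrate the lambda remark, here is a small sketch in which name has no final modifier and no initializer, yet can be captured by a lambda because it is assigned exactly once on every path, which makes it effectively final. The class EffectivelyFinalDemo and the method describe are made-up names for this illustration, and the "<unknown>" fallback is an assumption added only to keep the example short:

```java
import java.util.function.Supplier;

class EffectivelyFinalDemo {
    static String describe(String answer) {
        String name; // no `final`, no initializer

        if ("n".equals(answer)) {
            name = "<user refused to enter name>";
        } else {
            name = "<unknown>"; // assumed fallback, for illustration only
        }

        // name is definitely assigned here and was assigned once per path,
        // so the lambda is allowed to capture it.
        Supplier<String> message = () -> "The name is " + name;
        return message.get();
    }

    public static void main(String[] args) {
        System.out.println(describe("n"));
    }
}
```

Adding a second assignment to name anywhere in describe, or going back to String name = null followed by the branch assignments, means the variable is no longer effectively final and the lambda stops compiling.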
Conclusion
Do initialize your variables - but don't always force a default value when you can't initialize them with a proper one. Know how your language behaves with uninitialized variables and pick the best strategy for uncovering bugs.
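One way to apply this in newer Java, assuming Java 14 or later where switch expressions are available, is to compute the value in a single expression, so the variable is initialized at its declaration and the compiler insists that every possible answer is covered. This is a sketch only; SwitchExpressionDemo and nameFor are hypothetical names, and enteredName stands in for the line read from the user:

```java
class SwitchExpressionDemo {
    static String nameFor(String answer, String enteredName) {
        // The switch is an expression, so name gets its value at the declaration.
        // Without the default branch this does not compile, because a switch
        // expression over a String must cover all possible values.
        final String name = switch (answer) {
            case "y" -> enteredName;
            case "n" -> "<user refused to enter name>";
            default -> throw new IllegalArgumentException("Illegal answer \"" + answer + "\"");
        };
        return "The name is " + name;
    }

    public static void main(String[] args) {
        System.out.println(nameFor("y", "Alice"));
    }
}
```

The invalid-answer case is still handled deliberately, as in the version with the early return, but there is no window in which name exists without a value.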
Top comments (5)
I'd say that when you think you need some default value to initialise your variable, most probably what you need is a new method that would return a properly constructed value. Most of the time it would result in slightly more readable code. The same is valid for C. :)
I wouldn't get too dogmatic about it though. Extracting stuff to functions doesn't always improve readability. As a rule of thumb, I find that it only increases readability if you can find a good name for that function. If the name of the function is not simpler to understand than its body, you will actually reduce readability, because now readers will have to derail their train of thought and look up the meaning of that function.
Also, it's not always possible. In C, for example, there is that rule of only being able to declare variables at the top of the block.
Consider, for example, this function:
Even if we ignore the fact that we need to pass a pointer to fscanf instead of getting the result from it, we have the problem of the early return in case we couldn't open the file. We can only read the value into result after that first if, but we can only declare result before that if. Extracting the initialization of result will not work here.
Yeah, you don't have to declare variables at the top of the block even in C, as was mentioned in another comment. And C99 is almost 20 years old now. :) And no, I never claimed that it should always result in better code. That's why I always use a lot of 'maybe' or 'most probably'. It always leaves space to back out. :)
But that piece of code is a good challenge. I've just realized that I never actually thought about what clean code should look like in C. I like it. I think I'll take a few days to think it over and then show my version of it (or admit that it was impossible to write) :)
Only a minor observation: since C99 you can declare variables at any point in the function, not only at the top. You can even declare the variable inside the for clause.
True, but old habits die hard.