Francesco Tisiot

Posted on Feb 28, 2024 • Edited on May 7, 2024 • Originally published at ftisiot.net

11 Lessons to learn when using NULLs in PostgreSQL®

#postgres #sql #null #isnull

A boolean value should only contain two values, True or False, but is it correct? Usually people assume so, but sometimes miss the fact that there could be the absence of the value all-together. In databases this is absence is usually stored as NULL and this blog showcases how to find them, use them properly and 11 lessons to learn to be a NULL Pro!

Keep in mind, it's not only booleans that can contain NULL values, it's all the columns where you don't define a NOT NULL constraint!

If you need a FREE PostgreSQL database?
🦀 Check Aiven's FREE plans! 🦀
⚡️ Need to optimize your SQL query with AI? ⚡️
🐧 Check Aiven AI database optimizer! Powered by EverSQL 🐧

It all starts with some columns and rows

Let's start from the basics: you have a PostgreSQL® database and a table, called users like:

CREATE TABLE users (
    id SERIAL,
    username TEXT PRIMARY KEY,
    name TEXT,
    surname TEXT,
    age INT
);

Hey, do you want to test the above code but not having a PostgreSQL database handy? Past your code in PostgreSQL Playground and quickly check the results!

Let's insert some data:

INSERT INTO users (username, name, surname, age) VALUES
    ('jdoe', 'Jon', 'Doe', 25),
    ('lspencer','Liz', 'Spencer', 35),
    ('hlondon','Hanna', 'London', 45);

Querying the data showcases the table with all the columns filled.

 id | username | name  | surname | age 
----+----------+-------+---------+-----
  1 | jdoe     | Jon   | Doe     |  25
  2 | lspencer | Liz   | Spencer |  35
  3 | hlondon  | Hanna | London  |  45
(3 rows)

Insert NULLs

Now, let's check if we can insert some NULLs, let's try by inserting them in the name, surname and age columns:

INSERT INTO users (username, name, surname, age) VALUES ('test',NULL, NULL, NULL);

This works since we don't have any constraint

 id | username | name  | surname | age 
----+----------+-------+---------+-----
  1 | jdoe     | Jon   | Doe     |  25
  2 | lspencer | Liz   | Spencer |  35
  3 | hlondon  | Hanna | London  |  45
  4 | test     |       |         |    
(4 rows)

We can even simplify the insert with the same effect

INSERT INTO users (username) VALUES ('test1');

Result:

 id | username | name  | surname | age 
----+----------+-------+---------+-----
  1 | jdoe     | Jon   | Doe     |  25
  2 | lspencer | Liz   | Spencer |  35
  3 | hlondon  | Hanna | London  |  45
  4 | test     |       |         |    
  5 | test1    |       |         |    
(5 rows)

Inserting NULLs

The first step to avoid NULLs from appearing in a table is to forbid inserting them. The section below showcases a few situations on how to avoid inserting NULLs in new and existing columns.

Inserting NULLs in the primary key

Can I insert a NULL in the table primary key? In theory this could be allowed since there's no explicit NOT NULL constraint on the column. Let's try:

INSERT INTO users (username) VALUES (NULL);

However this fails with:

ERROR:  null value in column "username" of relation "users" violates not-null constraint
DETAIL:  Failing row contains (6, null, null, null, null).

💡 Lesson 1 💡: Primary Keys are by default NOT NULL

Create a new column with NOT NULL constraint

Let's add a new column to our table:

ALTER TABLE users ADD COLUMN points INT NOT NULL;

This will fail, since the previously inserted data don't have any associated value for points

ERROR:  column "points" of relation "users" contains null values

Let's add a default value with:

ALTER TABLE users ADD COLUMN points INT DEFAULT 0 NOT NULL;

The table was altered, and now contains 0 points for all the existing rows.

 id | username | name  | surname | age | points 
----+----------+-------+---------+-----+--------
  1 | jdoe     | Jon   | Doe     |  25 |      0
  2 | lspencer | Liz   | Spencer |  35 |      0
  3 | hlondon  | Hanna | London  |  45 |      0
  4 | test     |       |         |     |      0
  5 | test1    |       |         |     |      0
(5 rows)

If we try to insert a new row without specifying the points column:

INSERT INTO users (username) VALUES ('usrwithnullpoints');

The row is inserted with the default 0 points

 id |     username      | name  | surname | age | points 
----+-------------------+-------+---------+-----+--------
  1 | jdoe              | Jon   | Doe     |  25 |      0
  2 | lspencer          | Liz   | Spencer |  35 |      0
  3 | hlondon           | Hanna | London  |  45 |      0
  4 | test              |       |         |     |      0
  5 | test1             |       |         |     |      0
  7 | usrwithnullpoints |       |         |     |      0
(6 rows)

What if we try to push a NULL into points?

INSERT INTO users (username, points) VALUES ('forcenullpoints', NULL);

We get the ERROR: null value in column "points" of relation "users" violates not-null constraint stating that our insert is violating the NOT NULL constraint.

💡 Lesson 2 💡: Setting a NOT NULL constraint doesn't allow inserting NULLs

💡 Lesson 3 💡: If we want to add a NOT NULL column to an existing table with data, we need to specify the default value for existing rows.

Add a NOT NULL constraint to an existing column

If we try to change an existing column (e.g. name) to add the NOT NULL constraint:

ALTER TABLE users ALTER COLUMN name SET NOT NULL;

We get the similar error ERROR: column "name" of relation "users" contains null values, therefore we need also need to amend the current NULL values (and possibly set a default)

ALTER TABLE users 
    ALTER COLUMN name TYPE TEXT USING (COALESCE(name, 'Hugo')),
    ALTER COLUMN name SET DEFAULT 'Hugo',
    ALTER COLUMN name SET NOT NULL
    ;

In the above we:

Altered the type, setting it to TEXT again, to apply the function COALESCE(name, 'Hugo') which is replacing NULLs in the name field with Hugo
Set the default value for the name column to Hugo <- This is not strictly necessary
Applied the NOT NULL constraint since we don't have any NULL values anymore

Note: the above method locks the entire table until done. A faster approach could be to split the COALESCE(name, 'Hugo') into a dedicated update and then only perform the NOT NULL and DEFAULT settings on the ALTER TABLE statement.

The results are in line with the expected, no NULLs in the name column, being replaced by Hugo

 id |     username      | name  | surname | age | points 
----+-------------------+-------+---------+-----+--------
  1 | jdoe              | Jon   | Doe     |  25 |      0
  2 | lspencer          | Liz   | Spencer |  35 |      0
  3 | hlondon           | Hanna | London  |  45 |      0
  4 | test              | Hugo  |         |     |      0
  5 | test1             | Hugo  |         |     |      0
  7 | usrwithnullpoints | Hugo  |         |     |      0
(6 rows)

💡 Lesson 4 💡: If we need to add a NOT NULL column to an existing column in a table, we need to amend the existing NULL values and optionally set the default value for incoming rows.

What if I want to remove the NOT NULL constraint? We can do it with:

ALTER TABLE users 
    ALTER COLUMN name DROP NOT NULL
    ;

Is DEFAULT enough?

What if we leave the NOT NULL constraint aside, and just set the DEFAULT? Let's try with the date_first_login new column

ALTER TABLE users ADD COLUMN date_first_login DATE DEFAULT CURRENT_DATE;

Checking what's in the table, we see all the date_first_login filled

 id |     username      | name  | surname | age | points | date_first_login 
----+-------------------+-------+---------+-----+--------+------------------
  1 | jdoe              | Jon   | Doe     |  25 |      0 | 2024-02-27
  2 | lspencer          | Liz   | Spencer |  35 |      0 | 2024-02-27
  3 | hlondon           | Hanna | London  |  45 |      0 | 2024-02-27
  4 | test              | Hugo  |         |     |      0 | 2024-02-27
  5 | test1             | Hugo  |         |     |      0 | 2024-02-27
  7 | usrwithnullpoints | Hugo  |         |     |      0 | 2024-02-27
(6 rows)

If we try to insert a new row, without specifying the date_first_login, it's correctly assigned to today

INSERT INTO users (username) VALUES ('user_with_no_date_first_login');

The above is correctly inserted with a not null date_first_login

 id |           username            | name  | surname | age | points | date_first_login 
----+-------------------------------+-------+---------+-----+--------+------------------
  1 | jdoe                          | Jon   | Doe     |  25 |      0 | 2024-02-27
  2 | lspencer                      | Liz   | Spencer |  35 |      0 | 2024-02-27
  3 | hlondon                       | Hanna | London  |  45 |      0 | 2024-02-27
  4 | test                          | Hugo  |         |     |      0 | 2024-02-27
  5 | test1                         | Hugo  |         |     |      0 | 2024-02-27
  7 | usrwithnullpoints             | Hugo  |         |     |      0 | 2024-02-27
  9 | user_with_no_date_first_login | Hugo  |         |     |      0 | 2024-02-27
(7 rows)

So... is the NOT NULL constraint really needed? The reply is YES since I could insert NULLs by specifying it in the INSERT statement

INSERT INTO users (username, date_first_login) VALUES ('user_with_null_date_first_login', NULL);

Which will be stored correctly with a NULL value for date_first_login

 id |            username             | name  | surname | age | points | date_first_login 
----+---------------------------------+-------+---------+-----+--------+------------------
  1 | jdoe                            | Jon   | Doe     |  25 |      0 | 2024-02-27
  2 | lspencer                        | Liz   | Spencer |  35 |      0 | 2024-02-27
  3 | hlondon                         | Hanna | London  |  45 |      0 | 2024-02-27
  4 | test                            | Hugo  |         |     |      0 | 2024-02-27
  5 | test1                           | Hugo  |         |     |      0 | 2024-02-27
  7 | usrwithnullpoints               | Hugo  |         |     |      0 | 2024-02-27
  9 | user_with_no_date_first_login   | Hugo  |         |     |      0 | 2024-02-27
 10 | user_with_null_date_first_login | Hugo  |         |     |      0 | 
(8 rows)

💡 Lesson 5 💡: The DEFAULT is going to prevent NULLs from being inserted only if the NULL is NOT specified in the target column. To enforce the value we need an explicit NOT NULL constraint

Querying NULLs

In the above section we tried to avoid storing NULLs in our systems, what if we need to deal with existing NULLs in our column? Here we can use a few SQL functions to detect them and change their value.

Detecting NULLs with the `IS NULL` condition

What if we want to detect the NULL values? We can use the IS NULL condition, for example:

SELECT * from users where surname IS NULL;

Which will retrieve the list of NULLs surnames

 id |            username             | name | surname | age | points | date_first_login 
----+---------------------------------+------+---------+-----+--------+------------------
  4 | test                            | Hugo |         |     |      0 | 2024-02-27
  5 | test1                           | Hugo |         |     |      0 | 2024-02-27
  7 | usrwithnullpoints               | Hugo |         |     |      0 | 2024-02-27
  9 | user_with_no_date_first_login   | Hugo |         |     |      0 | 2024-02-27
 10 | user_with_null_date_first_login | Hugo |         |     |      0 | 
(5 rows)

The opposite list could be retrieved by applying the IS NOT NULL condition:

SELECT * from users where surname IS NOT NULL;

Resulting in

 id | username | name  | surname | age | points | date_first_login 
----+----------+-------+---------+-----+--------+------------------
  1 | jdoe     | Jon   | Doe     |  25 |      0 | 2024-02-27
  2 | lspencer | Liz   | Spencer |  35 |      0 | 2024-02-27
  3 | hlondon  | Hanna | London  |  45 |      0 | 2024-02-27
(3 rows)

What if I try to use the equal comparison (=)

SELECT * from users where surname = NULL;

The above retrieves NO rows since the NULL value means there is NO value and is NOT equal to any other value (NULL=NULL is False)

 id | username | name | surname | age | points | date_first_login 
----+----------+------+---------+-----+--------+------------------
(0 rows)

If we apply the different operator (<>) with NULL we also get back NO rows, since the NULL value means there is NO value, and therefore is not equal or different to other values.

SELECT * from users where surname <> NULL;

💡 Lesson 6 💡: To find NULLs we should always use the IS NULL; the IS NOT NULL to fetch the not null rows.

Associating a default value to NULLs with COALESCE

In the above we saw how to find NULLs, what if we need to include them in a SELECT statement with other values? We can either add an OR condition like

SELECT * from users 
WHERE surname = 'London' 
OR surname IS NULL;

Or we can apply a transformation to change the NULLs into a value we can compare

SELECT * from users 
WHERE COALESCE(surname,'Verona') in ('London', 'Verona');

Both will have the same result

 id |            username             | name  | surname | age | points | date_first_login 
----+---------------------------------+-------+---------+-----+--------+------------------
  3 | hlondon                         | Hanna | London  |  45 |      0 | 2024-02-27
  4 | test                            | Hugo  |         |     |      0 | 2024-02-27
  5 | test1                           | Hugo  |         |     |      0 | 2024-02-27
  7 | usrwithnullpoints               | Hugo  |         |     |      0 | 2024-02-27
  9 | user_with_no_date_first_login   | Hugo  |         |     |      0 | 2024-02-27
 10 | user_with_null_date_first_login | Hugo  |         |     |      0 | 
(6 rows)

However, if we apply the second option, we must be sure that Verona (the label we associate to NULL values) is a value not present in the column or a value we want to include in our selection.

💡 Lesson 7 💡: When using COALESCE check out that the value you want to use as substitute of NULL is not already present in the table and outside of your business logic otherwise you might include in the filter more values than expected.

Performing Math on NULL columns

A similar scenario is due when we need to perform a calculation over a NULLable column. Let's take the example of the age column: what if we want to calculate the average age?

Let's see the data:

 id |            username             | name  | surname | age | points | date_first_login 
----+---------------------------------+-------+---------+-----+--------+------------------
  1 | jdoe                            | Jon   | Doe     |  25 |      0 | 2024-02-27
  2 | lspencer                        | Liz   | Spencer |  35 |      0 | 2024-02-27
  3 | hlondon                         | Hanna | London  |  45 |      0 | 2024-02-27
  4 | test                            | Hugo  |         |     |      0 | 2024-02-27
  5 | test1                           | Hugo  |         |     |      0 | 2024-02-27
  7 | usrwithnullpoints               | Hugo  |         |     |      0 | 2024-02-27
  9 | user_with_no_date_first_login   | Hugo  |         |     |      0 | 2024-02-27
 10 | user_with_null_date_first_login | Hugo  |         |     |      0 | 
(8 rows)

We can calculate the average age with:

select avg(age) 
from users;

Resulting in 35 which is the average between the 3 NOT NULL ages! Similarly if we count the date_first_logins with:

select count(date_first_login) 
from users;

The result is 7, one row for every user having a not null date_first_login

💡 Lesson 8 💡: When performing Mathematical operation with nullable columns, only NOT NULL values will be elaborated!

What if we try to calculate the average points? First let's update the table to give jdoe 25 points

UPDATE users set points=25 where username='jdoe';

Now let's calcuate the average points

select avg(points) from users;

The result is 3.125 which is the mathematical average between the 25 points of jdoe and the 0 of the rest due to us previously setting 0 as default value. Is this correct? It depends! If we wanted to calculate the average number of points of users actually playing some games, this could be a misleading calculation.

💡 Lesson 9 💡: When performing Mathematical operation with nullable columns, the default value of a column will impact the results.

Joining on NULL columns

Building on what we explored before about the NULL not being equal or different to any other value, we need to pay closer attention when we deal with nullable columns in join condition.

Let's create another table called sessions and insert some data. For a weird reason, the design of this table will force us to join with users on the surname which is a NULLABLE field.

CREATE TABLE sessions (id serial, surname text, session_time timestamp);
INSERT INTO sessions (surname, session_time) values
    ('Doe', CURRENT_TIMESTAMP - INTERVAL '1 day'),
    ('Doe', CURRENT_TIMESTAMP - INTERVAL '1 day'*2),
    ('Spencer', CURRENT_TIMESTAMP - INTERVAL '1 day'*2),
    ('Spencer', CURRENT_TIMESTAMP),
    (NULL, CURRENT_TIMESTAMP),
    (NULL, CURRENT_TIMESTAMP - INTERVAL '1 day'*2),
    (NULL, CURRENT_TIMESTAMP - INTERVAL '1 day'*2),
    (NULL, CURRENT_TIMESTAMP - INTERVAL '1 day'*1),
    (NULL, CURRENT_TIMESTAMP);

Now, if we want to count users having a session on each day we could think of simply writing the above query:

SELECT EXTRACT(DAY FROM session_time),
    count(distinct username) 
FROM sessions join users 
    on sessions.surname=users.surname
group by EXTRACT(DAY FROM session_time);

However the results shows only the 4 sessions below

 extract | count 
---------+-------
      25 |     2
      26 |     1
      27 |     1
(3 rows)

Why? It's again because of NULL values in the join condition. This problem of poor data quality on both ends of the join is stopping us from reporting any useful number. How could we fix it? Well, the fix should be done uphill by setting the sessions table to use the username instead of the surname and defining a proper FOREIGN KEY to the users table.

💡 Lesson 10 💡: Joining tables on nullable columns without caring about the NULL management will impact the results by reducing the number of rows included in the join.

Still, if we want a simplistic solution, we could do the apply the COALESCE to both sides of the join:

SELECT EXTRACT(DAY FROM session_time),
    count(distinct username) 
FROM sessions join users 
    on coalesce(sessions.surname, 'NoSurname')=coalesce(users.surname, 'NoSurname')
group by EXTRACT(DAY FROM session_time);

The result shows an increased count of sessions

 extract | count 
---------+-------
      25 |     7
      26 |     6
      27 |     6
(3 rows)

However this is still not the correct answer, since we are seeing 19 session being aggregated while only 9 sessions were inserted in the original sessions table. Why is that? It's because we now casted all the NULLs as NoSurname on both sides of the join. Since we had 5 sessions without surname and 5 users without surname we are creating a cartesian join of all the empty surnames with the empty sessions.

We had the following users without surname

test
test1
usrwithnullpoints
user_with_no_date_first_login
user_with_null_date_first_login

And the following sessions without surname

  6 | 2024-02-27 14:10:08.419021
  7 | 2024-02-25 14:10:08.419021
  8 | 2024-02-25 14:10:08.419021
  9 | 2024-02-26 14:10:08.419021
 10 | 2024-02-27 14:10:08.419021

Each of the users was associated with each sessions, creating a overflow of sessions.

A more correct solution, attribuiting all the NULL session to a same phantom user NoUser could be:

SELECT EXTRACT(DAY FROM session_time),
    count(distinct coalesce(username,'NoUser'))
FROM sessions left outer join users 
    on sessions.surname=users.surname
group by EXTRACT(DAY FROM session_time);

However this is an effort to fix data quality issues at query time, which is never a good idea.

💡 Lesson 11 💡: Fixing data quality issues related to nullable columns at query time is a risky approach. Solving them with proper table design and constraints allows you to keep quality data in the tables.

Conclusion

Dealing with NULLs is something to care upfront, during table design, in order to avoid data quality issues downstream!
This is it for the 11 lessons learnt using NULLs, do you have others to share? Feel free to reach me on X/Twitter at @ftisiot or on Linkedin

DEV Community

11 Lessons to learn when using NULLs in PostgreSQL®

It all starts with some columns and rows

Insert NULLs

Inserting NULLs

Inserting NULLs in the primary key

Create a new column with NOT NULL constraint

Add a NOT NULL constraint to an existing column

Is DEFAULT enough?

Querying NULLs

Detecting NULLs with the `IS NULL` condition

Associating a default value to NULLs with COALESCE

Performing Math on NULL columns

Joining on NULL columns

Conclusion

Top comments (0)

It all starts with some columns and rows

Insert NULLs

Inserting NULLs

Inserting NULLs in the primary key

Create a new column with NOT NULL constraint

Add a NOT NULL constraint to an existing column

Is DEFAULT enough?

Querying NULLs

Detecting NULLs with the IS NULL condition

Associating a default value to NULLs with COALESCE

Performing Math on NULL columns

Joining on NULL columns

Conclusion

Detecting NULLs with the `IS NULL` condition