While developing ETL pipelines for a new data warehouse, I encountered unexpected query results related to NULL values. This prompted me to delve into the underlying SQL behavior to better understand the discrepancy. In this article, I aim to elucidate the contrasting treatment of NULLs by the DISTINCT and UNIQUE keywords. I'll present examples based on my analysis of the data warehouse.
When I sought to explore the categories or ranges of values within a specific column—let's say the "jobs" column—the initial result appeared as follows:
[doctor, engineer, ..., null]
This outcome raised a question: Does this mean that there is only a single NULL value? My previous understanding of PostgreSQL led me to believe that NULL values are treated as distinct. However, when I meticulously counted the occurrences of NULL values in the "jobs" column using the query:
SELECT COUNT(*) as nulljobs FROM jobs WHERE jobs IS NULL;
The result unveiled a surprising 140 rows containing NULL values. It seemed that DISTINCT treated NULL values as equivalent. There could be a reasonable explanation for this behavior. The DISTINCT operation is typically employed for visualization or categorization purposes. Hence, treating NULLs as distinct might not make practical sense. However, this raises the question of why NULLs are treated as distinct values in the first place.
This treatment is rooted in the concept of Three-Valued Logic (3VL), which consists of true, false, and unknown. Crucially, "unknown" is not equal to "unknown" in this context. Keep in mind that the comparison "unknown != unknown" doesn't hold. To shed light on this, consider that infinity + infinity equals infinity. This implies that unknown values can be aggregated into a single value or category.
The behavior of NULL values under the UNIQUE constraint in PostgreSQL introduces interesting nuances. UNIQUE constraints are particularly tailored for use with partial indexes. In this specific context, the interpretation of NULL values differs significantly. Rather than serving as indicators of mean or central tendencies, NULL values under UNIQUE constraints embody the distinctiveness of other non-null values within a row. As a result, UNIQUE constraints do not revolve around the treatment of NULL values in isolation. Instead, they focus on preserving distinctions among values contained within individual rows.
To illustrate this point further, let's consider a practical scenario involving a PostgreSQL database:
Suppose we have a table named "employees" with columns "employee_id" and "email." We want to ensure that each email address in the "email" column is unique. However, NULL values should not interfere with this uniqueness requirement.
CREATE TABLE employees ( employee_id serial PRIMARY KEY, email VARCHAR(255) UNIQUE ); -- Create a partial index to enforce uniqueness among non-null ids CREATE UNIQUE INDEX unique_email ON employees (email) WHERE email IS NOT NULL;
In this case, PostgreSQL will enforce the UNIQUE constraint on the "email" column, allowing multiple NULL values. This aligns with the concept that the UNIQUE constraint is chiefly concerned with maintaining the distinctiveness of non-null values.
For instance, the following insertions are valid:
-- Valid insertions INSERT INTO employees (email) VALUES ('email@example.com'); INSERT INTO employees (email) VALUES ('firstname.lastname@example.org'); INSERT INTO employees (email) VALUES (NULL);
However, attempting to insert duplicate non-null values would result in a violation of the UNIQUE constraint:
-- Invalid insertion due to violation of UNIQUE constraint INSERT INTO employees (email) VALUES ('email@example.com');
when you SELECT some info using the partial index already made with EXPLAIN it is performing an index scan.
EXPLAIN SELECT * FROM employees WHERE email IS NOT NULL;
Sure, I can help you write a summary and conclusion for your article. Here is a possible way to end your article:
- DISTINCT allows duplicate NULLs in the result set, as it considers NULLs as distinct values that represent unknown or missing data. This behavior is based on the three-valued logic of SQL, where unknown is not equal to unknown.
- UNIQUE constraints treat NULLs as equal, as they enforce uniqueness among non-null values in a column or a set of columns. This behavior is useful for creating partial indexes that exclude NULL values from the index.
- The difference between DISTINCT and UNIQUE has an impact on the query performance and data integrity, as it affects how the data is filtered, sorted, and indexed. Developers should be aware of this difference and use appropriate techniques to handle NULL values in their queries.