We often see articles warning that using SELECT *
is a bad habit and columns should be explicitly specified in queries, such as in the latest Alibaba Java Coding Guidelines (Taishan Edition) where it is a mandatory rule. However, people only have a vague understanding of the reasons behind avoiding SELECT *
. This article explains 9 reasons to avoid using SELECT *
from two perspectives, and also discusses two reasonable exceptional cases to use SELECT *
.
Performance Issues
From a performance perspective, using SELECT *
in applications can cause performance problems in queries, mainly manifesting in the following six aspects:
Increased parsing cost:
SELECT *
requires the database to parse more objects, permissions, attributes and other related content. This impact may be small, but it does put some pressure on the database's metadata queries.Increased IO operations: For useless large fields like varchar, blob, text, IO operations will increase. Databases generally store large fields above a certain length in separate tablespaces, so accessing these fields will incur an additional IO operation.
Increased network consumption: Bringing useless large text fields like LOB/TEXT will cause the data transmission to increase exponentially, especially if the database and application are not on the same machine, this overhead is very noticeable.
Increased memory consumption: Whether or not you use these columns, your application still needs to receive them into memory, which can unnecessarily consume large amounts of memory, affecting program performance and robustness, and even causing memory overflow and application crashes.
Impact on index selection: For query statements using
SELECT *
, the optimizer will abandon the possibility of optimizing using covering indexes, resulting in having to access the table or full table scans.Impact on index recommendations: PawSQL's index recommendation engine can analyze the structure of SQL statements and recommend the creation of indexes on eligible tables (three cases: locate data, avoid sorting, avoid accessing the table again) to improve query performance. Please refer to Guidelines for Creating Efficient Indexes for details. For query statements using
SELECT *
, PawSQL's index recommendation engine will not consider recommending covering indexes.
Maintenance Cost
From a code maintenance perspective, using SELECT *
in applications can make maintenance difficult, mainly manifesting in the following three aspects:
Code readability: Using SELECT * reduces the readability of the query, because using SELECT * makes the query statement difficult to understand. Developers need to check the table definition to determine exactly what data is being queried, and it is also difficult to debug.
-
Column alignment:
- When developing applications using the Mybatis framework, adding or deleting fields can easily become inconsistent with the resultMap configuration;
- When using
SELECT *
to define views, adding or deleting fields may cause the view to become invalid; - You may often use
SELECT * into INSERT . . .
and other statements to copy some data from one table to another. If the column order is slightly different between the two tables, some incorrect data may be copied to the wrong columns.
Column name conflicts: If you use
SELECT *
in joined queries, duplicate column names in multiple tables will cause column name conflicts and lead to consumers using the wrong data.
Two Exceptional Cases
SELECT *
is not unsuitable for use in all situations. There are two common reasonable scenarios in our daily development:
- Ad-hoc Queries
When we explore data or locate problems, we need to manually write some SQL on the fly to view certain data tables. We don't know what columns the table has, so we can use SELECT *
to complete our query. We don't need to research the underlying column names in advance. We can even use SELECT *
to determine column names and sample data, especially when a table has a large number of columns, SELECT *
can make data exploration or problem location more convenient and faster.
- When * represents a row rather than all columns
When * represents a row rather than all columns, the use of * is reasonable.
For example, in the following use case, * represents rows that meet a certain condition. If you use column names instead of *, it will count the number of rows where the column value is not NULL
.
SELECT COUNT(*) FROM table;
Similarly, in the following query, * represents rows that meet the join condition.
SELECT c_custname FROM customer c WHERE EXISTS ( SELECT * FROM orders o WHERE o.c_custkey = c.c_custkey);
Some people use the primary key o.c_custkey
of table orders
or the number 1 in the SELECT list, but these conventions are basically meaningless. You are querying all rows that meet a certain condition, which is what * means. For the database optimizer, these two query statements are actually the same.
PawSQL Optimizations Related to SELECT *
There are three PawSQL checks or optimizations related to SELECT *
:
Avoid Using SELECT *
PawSQL analyzes the SQL statements entered by users and warns for SELECT *
found in non-exceptional cases:
- Warn for the following SQL:
select * from customer as c where c.c_acctbal > 100
- Do not warn for the following SQL, which is an exceptional case:
select c_custname from customer c where EXISTS ( select * from orders o where o.c_custkey = c.c_custkey);
Projection Pushdown Rewrite
PawSQL's projection pushdown rewrite optimization can rewrite SELECT *
in subqueries to remove unnecessary SELECT *
:
- SQL before optimization,
SELECT *
exists in subquery:
select c_nationkey, count(*)
from (select *
from customer as c
where c.c_acctbal > 100)
group by c_nationkey
- After applying projection pushdown:
select c_nationkey, count(*)
from (select c_nationkey
from customer as c
where c.c_acctbal > 100)
group by c_nationkey
To verify PawSQL's handling logic of
SELECT *
with projection pushdown, you need to first disable the Query Folding rewrite optimization, because query folding will consolidate the subquery rewrite.
Covering Index Recommendation
As mentioned in the Performance Issues section, for query statements using SELECT *
, PawSQL's index recommendation engine will not consider recommending covering indexes.
About PawSQL
PawSQL focuses on automating and intelligizing database performance optimization, supporting MySQL, PostgreSQL, OpenGauss, etc. PawSQL products include:
PawSQL Cloud, an online automated SQL optimization tool that provides SQL review, intelligent query rewriting, cost-based index recommendations, suitable for DBAs and data application developers.
PawSQL Advisor, an IntelliJ plugin suitable for data application developers, can be installed from the IDEA/DataGrip marketplace by searching for "PawSQL Advisor".
Top comments (0)