Hasnain Somani

The Internals of PostgreSQL: Chapter 3 Query Processing (Part 1,2,3)

This post summarizes Chapter 3 of the book "The Internals of PostgreSQL".
3.1:
PostgreSQL query processing comprises several subsystems that work together to respond to client queries. The main subsystems involved are:

  1. Parser: The parser receives an SQL statement as plain text and generates a parse tree that represents the syntactic structure of the query. The parser performs only syntax checking; semantics are not checked at this stage.
  2. Analyzer: Semantic analysis of the parse tree (generated by the parser) is carried out here. The result is a query tree, which represents the structure and metadata of the query, such as the target list, range table, and sort clause.
  3. Rewriter: The rewriter applies the rules stored in the rule system so that the necessary transformations can be made to the query tree. The query tree is therefore modified on the basis of pre-defined rules.
  4. Planner: The planner generates a plan tree from the query tree it receives from the rewriter. The plan tree represents the most efficient execution plan for the query, chosen with a cost-based optimization approach in which statistics and available resources play a vital role.
  5. Executor: At this stage, the plan tree generated by the planner is executed. The executor follows the instructions in the plan tree to access tables and indexes, retrieving and processing data in the order the plan tree specifies. Temporary files are created if needed, and the executor applies concurrency control mechanisms to maintain data consistency and isolation during query execution. (A way to peek at the trees each stage produces is sketched just after this list.)
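
To make the pipeline concrete, PostgreSQL provides debug settings that dump the intermediate trees to the server log. The sketch below only assumes a hypothetical table tbl; the debug_print_* parameters themselves are standard PostgreSQL settings.

```sql
SET debug_print_parse = on;      -- parse tree produced by the parser/analyzer
SET debug_print_rewritten = on;  -- query tree after the rewriter
SET debug_print_plan = on;       -- plan tree chosen by the planner
SET client_min_messages = log;   -- show the LOG-level output in this session

SELECT id, data FROM tbl WHERE id < 300;  -- hypothetical table
```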

3.2:
The cost of an operation in PostgreSQL is a dimensionless value that is useful for comparing the relative performance of operations, and it varies with the operation. The types of costs are:

  1. Start-up cost: The cost before the first tuple is fetched.
  2. Run cost: The cost of fetching all tuples. There are different formulas for the different types of scans.
  3. Total cost: This is the sum of start-up cost and run cost.

In this section, cost estimation for the sequential scan and the index scan is examined.
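
As a worked illustration of these numbers, consider a sequential scan over a hypothetical table tbl with 10,000 rows stored in 45 pages, using PostgreSQL's default cost parameters (seq_page_cost = 1.0, cpu_tuple_cost = 0.01, cpu_operator_cost = 0.0025). The EXPLAIN output below is only indicative of the shape of the result.

```sql
-- Run cost of a sequential scan, following the book's formula:
--   run cost = (cpu_tuple_cost + cpu_operator_cost) * Ntuple
--            + seq_page_cost * Npage
--            = (0.01 + 0.0025) * 10000 + 1.0 * 45 = 170
-- The start-up cost of a plain sequential scan is 0, so total cost = 170.
EXPLAIN SELECT * FROM tbl WHERE id < 8000;
--                       QUERY PLAN
-- --------------------------------------------------------
--  Seq Scan on tbl  (cost=0.00..170.00 rows=8000 width=8)
--    Filter: (id < 8000)
```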

3.3:
This section discusses how a plan tree is created for a single-table query in PostgreSQL. The three main steps are:

  1. Preprocessing: this is done on the PlannerInfo structure. Target lists, LIMIT clauses, and other expressions are simplified using built-in functions, and nested AND/OR expressions are flattened.
  2. Getting the cheapest access path: the costs of all possible access paths are estimated, and the cheapest one is selected. A RelOptInfo structure is created to store the access paths and their costs; sequential scan, index scan, and bitmap scan paths are all held in this structure.
  3. Creating the plan tree: the planner generates a plan tree from the cheapest access path. The root of the plan tree contains information about the type of operation and the tables involved in the query. The tree consists of plan nodes, each corresponding to an operation such as a sequential scan or an index scan. These nodes carry information such as the estimated costs, the number of rows to be scanned, and the qual conditions, and they are linked together to form the plan tree (an illustrative example follows this list).
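
The sketch below shows what such a plan tree looks like for a single-table query with a sort, assuming a hypothetical table tbl_1; the node structure (a Sort node whose child is a Seq Scan node) is the point, not the exact cost figures.

```sql
EXPLAIN SELECT id, data FROM tbl_1 WHERE id < 300 ORDER BY data;
--                          QUERY PLAN
-- ---------------------------------------------------------------
--  Sort  (cost=182.34..183.09 rows=300 width=8)
--    Sort Key: data
--    ->  Seq Scan on tbl_1  (cost=0.00..170.00 rows=300 width=8)
--          Filter: (id < 300)
-- Each indented entry is a plan node: the Sort node is the root, and its
-- child is the Seq Scan node carrying its qual condition and cost estimates.
```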

3.4:
This section discusses how the executor performs query processing by executing plan nodes from the bottom of the plan tree up towards the root node. Every plan node corresponds to an operation, such as an index scan, a sequential scan, or sorting, and the functions associated with it are located in the src/backend/executor/ directory.
The EXPLAIN command is the most convenient way to examine this process, and its output makes the most sense when read from the bottom line to the top line. EXPLAIN ANALYZE is a related command that actually runs the query and provides additional information such as the true row counts, memory usage, and run time.
Overall, the EXPLAIN command is useful for understanding how the executor carries out query processing. It also helps reveal the use of temporary files in the process, which are created in the base/pgsql_tmp subdirectory.
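
For example, running EXPLAIN ANALYZE on a sorted single-table query annotates each plan node with actual row counts and timings (the names and numbers below are illustrative only). Reading the output from the bottom up mirrors the executor's order: the Seq Scan produces tuples that the Sort node then consumes.

```sql
EXPLAIN ANALYZE SELECT id, data FROM tbl_1 WHERE id < 300 ORDER BY data;
--  Sort  (cost=182.34..183.09 rows=300 width=8)
--        (actual time=0.719..0.740 rows=300 loops=1)
--    Sort Key: data
--    Sort Method: quicksort  Memory: 25kB
--    ->  Seq Scan on tbl_1  (cost=0.00..170.00 rows=300 width=8)
--              (actual time=0.015..0.663 rows=300 loops=1)
--          Filter: (id < 300)
--  Planning Time: 0.123 ms
--  Execution Time: 0.812 ms
-- If the sort does not fit in work_mem, the Sort Method line reports an
-- external merge on disk, and temporary files appear under base/pgsql_tmp.
```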

3.5:
PostgreSQL supports various join operations including nested loop join, merge join and hash join:

  1. Nested loop join: This is the most basic join operation and can be used with any join condition. It has five variations in PostgreSQL. The materialized nested loop join reduces the cost of rescanning the inner table by storing its tuples in a temporary file, and the indexed nested loop join uses an index on the inner table to look up matching tuples directly instead of scanning it sequentially; the other variations reduce the cost in their own ways.
  2. Merge join: This join is used for natural joins and equi-joins and has four variations. The materialized merge join works in a similar way, using temporary files, and other variations include the merge join with an outer index scan, which reduces cost by using an index on the outer table.
  3. Hash join: This join hashes the join keys of both the inner and outer tables and then matches tuples on the basis of their hash values, using an in-memory hash table for the matching. This approach is preferred for large data sets or joins on non-indexed columns, and it has two variations of its own.

The choice of algorithm depends on conditions such as the size of the input relations, the join condition, and user preferences expressed through the planner's settings. Examples of all three joins, along with their EXPLAIN output, are shown in this section of the book; an illustrative sketch follows.
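
One simple way to observe the different algorithms is to discourage a method with the planner's enable_* settings and re-run EXPLAIN. The table names below are hypothetical, the index on tbl_b is assumed to exist, and the plans shown are only one possible outcome.

```sql
EXPLAIN SELECT * FROM tbl_a AS a JOIN tbl_b AS b ON a.id = b.id;
--  Hash Join  (cost=... rows=... width=...)
--    Hash Cond: (a.id = b.id)
--    ->  Seq Scan on tbl_a a ...
--    ->  Hash
--          ->  Seq Scan on tbl_b b ...

SET enable_hashjoin = off;   -- discourage the hash join for this session
SET enable_mergejoin = off;  -- and the merge join as well
EXPLAIN SELECT * FROM tbl_a AS a JOIN tbl_b AS b ON a.id = b.id;
--  Nested Loop  (cost=... rows=... width=...)
--    ->  Seq Scan on tbl_a a ...
--    ->  Index Scan using tbl_b_pkey on tbl_b b ...  (assumed index)
--          Index Cond: (id = a.id)
RESET enable_hashjoin;
RESET enable_mergejoin;
```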

3.6:
This section highlights the creation of a plan tree for a multiple-table query. The steps are broadly similar to the single-table case:

  1. Preprocessing: the planner pulls up subqueries in the FROM clause using the built-in function pull_up_subqueries(). Moreover, during preprocessing, outer join queries are transformed into inner join queries where possible.
  2. Getting the cheapest path: the planner examines combinations of join methods and indexes to determine an optimal plan tree. It uses dynamic programming when fewer than 12 tables are involved; if the number of tables is larger, the Genetic Query Optimizer (GEQO) is used instead, which finds a reasonable plan without an exhaustive search. The planner selects the cheapest access path at each level, from level 1 upwards.
  3. Getting the cheapest path of a triple-table query: the planner makes cost estimations, and the cheapest path is determined using a combination of RelOptInfo structures. Join paths are also evaluated (a short illustration follows this list).
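
The switch between dynamic programming and GEQO is governed by the geqo_threshold setting (12 by default). The triple-table example below uses hypothetical tables and simply shows the kind of query whose join order and methods the planner has to choose.

```sql
SHOW geqo;             -- 'on' by default: the genetic optimizer is available
SHOW geqo_threshold;   -- 12 by default: queries joining this many or more
                       -- items are planned by GEQO instead of the standard
                       -- dynamic-programming search

EXPLAIN SELECT *
FROM tbl_a AS a
JOIN tbl_b AS b ON a.id = b.id
JOIN tbl_c AS c ON b.id = c.id;
-- The planner builds the cheapest paths level by level (single relations,
-- then pairs, then the three-way join) and returns the cheapest overall plan.
```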

In summary, this chapter provides an overview of query processing and query optimization in PostgreSQL. Beginning with an introduction to how a query is processed, it explains the stages of query optimization, including cost estimation and plan generation, and also discusses the creation of plan trees.
