This presentation will guide participants in implementing their first SQL language optimizer using Egg, particularly catering to young developers. The objective is to implement a range of classic optimization rules in approximately 1000 lines of code and apply optimization techniques to real TPC-H queries.
Background
Egg is a program optimization library written in Rust. It leverages e-graph and equality saturation technologies to efficiently and flexibly construct custom languages and optimize them. This presentation will guide participants in implementing their first SQL language optimizer using Egg, particularly catering to young developers. The objective is to implement a range of classic optimization rules in approximately 1000 lines of code and apply optimization techniques to real TPC-H queries.
This presentation primarily encompasses the following topics:
1.Project Background and Motivation for Utilizing Egg
2.Fundamental Principles and Utilization of Egg
3.The Key Process of Implementing an SQL Optimizer Using Egg
4.Demonstration of Results and Analysis of the Advantages and Limitations of Egg
Why We Use Egg
The story begins with the RisingLight project. This is an OLAP educational database written in the Rust programming language, initiated and maintained by several students during their internship at RisingWave. The primary purpose of creating RisingLight is to provide an opportunity for interested students, engineers, and community developers to gain a deep understanding of the fundamental principles and implementations of databases. Therefore, RisingLight is designed to be simple, comprehensive, and easy to comprehend.
As an OLAP database, the query engine is a core component of it. The following diagram illustrates its overall structure:
When a user inputs an SQL statement, it goes through the following steps: Parser, Binder, to create a Logical Plan, then it's fed to the Optimizer to generate a Physical Plan, which is finally handed over to the Executor for execution.
Here, we'll illustrate the relationship between an SQL statement, a query plan, and an optimized query plan through an example:
Database optimization for SQL statements primarily falls into two main categories: Rule-Based Optimization (RBO) and Cost-Based Optimization (CBO). Both of these approaches involve transforming a specific pattern within a query plan into another pattern. The key difference lies in how they make this transformation. In RBO, it is assumed that the rewritten pattern is always better than the previous one; whereas in CBO, we need to estimate the cost of each pattern based on statistical information and ultimately select the one with the lowest cost from multiple equivalent representations.
RisingLight initially adopted the approach of developing its optimizer from scratch. However, over time, we gradually identified several issues associated with this approach.
Firstly, despite Rust having the pattern matching syntax of "match", it is not always as intuitive and fluent for describing certain patterns. Secondly, performing local rewrites on tree-like structures composed of structs and enums was not very convenient. Lastly, we only implemented RBO and did not implement CBO. This is because CBO requires relatively complex search strategies, data structures, and dynamic programming algorithms. It necessitates a comprehensive framework, which we didn't have the resources to implement at the time.
It wasn't until one day that I came across the "egg" project within the Rust community. After reading the project's website introduction, I immediately realized that this was the Rust optimization framework I had been searching for.
The basic principles of program optimization in Egg
The project homepage of Egg provides a clear illustration of its fundamental principles and optimization process through this image.
In Egg, the core data structure is called an "e-graph," where "e" stands for equivalent. The e-graph builds upon traditional graph structures but introduces the concept of "e-classes." An e-class is a collection composed of several nodes called e-nodes, representing that these nodes are logically equivalent to each other. For example, in the diagram below, "a * 2" and "a << 1" are considered equivalent expressions when 'a' is an integer, so they belong to the same e-class.
The process of rewriting an expression on the e-graph is illustrated in the diagram below. First, we have a rewriting rule that transforms the left-hand pattern (* ?x 2)
into the right-hand pattern (<< ?x 1)
, where ?x
represents a match for any subtree. In Egg, the process begins by matching the left-hand pattern ① within the original graph. Then, based on the right-hand pattern, a new subtree is inserted ②, and finally, their root nodes are merged (union) together ③. This completes the rewriting process for a rule.
In practical optimization, we typically define a series of rewriting rules in advance and provide an initial expression. Egg's approach is to continually attempt to match various rules within the graph and insert new expressions until it can no longer expand with new nodes. This state is referred to as "Saturation ①", meaning it is saturated. After saturation, Egg uses a user-defined cost function to find the representation with the lowest cost among all possible representations. This becomes the final optimization result. This process is known as "Equality Saturation." It's worth noting that this is a cost-based optimization method, making it naturally suited for implementing CBO rules.
Furthermore, another powerful feature in Egg is its support for Program Analysis. It allows us to associate arbitrary data with each e-class to describe some attribute of it. (You can contemplate why it's associating with e-classes rather than e-nodes.) For example, we can define Constant Analysis, where we associate an Option<Value>
with each e-class to describe whether each expression within it is a constant. As nodes are dynamically added and merged in the e-graph, the Analysis is also updated accordingly. This capability can be leveraged to implement dynamic constant propagation and constant folding optimizations.
Using Egg to implement an SQL optimizer
Here's an overview of the main process:
The first step when using Egg is to define the language. Egg provides the define_language!
macro that allows you to specify the different types of nodes with an enum.
Here, we have defined four types of nodes:
1.Value nodes: Represent constants or variables.
2.List nodes: Represent ordered lists composed of multiple nodes.
3.Expression operations: Represent operations like addition, subtraction, multiplication, and division.
4.SQL operators: Represent the main nodes of a query plan tree.
The specific specifications for each node are defined in the comments on the right. It's worth noting that this language directly describes query plans rather than SQL statements. Therefore, you'll notice that column names have been transformed into Column IDs by the binder, for example, $1.1
represents the first column of the first table.
Once the language is defined, you can succinctly describe an expression using a Lisp-like S-expression format. In Egg, the RecExpr
container type is used to store expressions. Internally, it's an array of nodes, where later nodes can reference earlier nodes, with the last one being the root node. This contiguous and compact memory layout is more efficient than recursive Box structures.
Next, we continue to use Egg's pattern matching language to define rules. First, we'll start with expression simplification rules, which are foundational rules that need to be implemented in various languages. The Egg official documentation provides many examples, and I believe that most people can understand them at a glance as they are quite intuitive.
Similarly, for SQL operators, there are simple rules like merging two adjacent identical operators, swapping the order of two operators, eliminating redundant operators, and so on. In this context, an (empty)
node is introduced to describe an operator that has no output.
Next, let's examine a less trivial optimization, a classic RBO known as Predicate pushdown. As shown in the diagram below, the goal is to push the Filter operator down below the Join operator. This allows the Filter to be executed earlier, reducing the amount of data that needs to be processed.
The challenge in this optimization lies in handling different scenarios for predicate expressions on the Filter operator:
1.↙Predicates that only involve columns from the left table A, which can be pushed down to the left subtree.
2.↘Predicates that only involve columns from the right table B, which can be pushed down to the right subtree.
3.□Predicates that involve columns from both left and right tables A and B, making it impossible to push down.
To implement this optimization, we define three rules:① First, we push the predicates on the Filter operator down to the Join conditions. Then, ② and ③ respectively determine whether each predicate can be pushed down to the left or right.
Here, we use conditional rewriting (if
) and introduce the columns_is_subsetfunction
to make these determinations.
To implement this function, we also introduce an analysis called "Column Analysis." Similar to classic liveness analysis, it associates all expression nodes with the columns they use and associates all operator nodes with the columns included in their output. This way, we can simply check if "the set of expressions is a subset of the set of operators" to determine whether pushing down is possible.
Finally, let's take a look at CBO. One of the most classic optimizations is Join Reordering. When multiple tables are being joined, the order in which they are joined can have a significant impact on performance. Additionally, most real-world joins involve equijoins on primary keys, which can be optimized using hash tables. This optimization can reduce computational complexity by an order of magnitude compared to the original Nested Loop Join. Therefore, these two optimizations are crucial for multi-table queries.
In the implementation, we use Rule ① to achieve right rotation of the Join subtree. It works on an initial left-deep tree and explores all possible combinations. Then, we define Rules ② and ③ to match patterns of equijoins and transform such Joins into HashJoins. These rules, when combined, can generate numerous ways to join multiple tables effectively.
Finally, to guide Egg in finding the truly optimal solution among all the possible combinations, we need to define a cost function. The entire code for this function has been provided on the right. However, debugging this function can be a challenging task. Any oversight or mistake can prevent Egg from finding the desired optimal result, and pinpointing the error can be quite difficult.
The primary limitation of Egg in this context is its lack of support for heuristic search. During the rule application process, it does not employ pruning based on the cost function. Instead, it attempts to exhaustively expand all representations before identifying the optimal solution, adhering to the original principles of Equality Saturation. As a consequence, when dealing with a substantial number of rules, Egg encounters the issue of combinatorial explosion, rendering it unable to produce reasonable results within a finite timeframe, even for moderately complex queries. To address this challenge, we have implemented a manual phased optimization approach along with multiple iterative rounds to alleviate the problem.
The main technical points for implementing an SQL optimizer using Egg have been outlined above. The overall workflow for the refactored query engine using Egg is illustrated in the diagram below:
In addition to the optimizer itself, we also utilize Egg's Analysis functionality for necessary analysis and processing of query plans during the Binding and Executor generation stages, both before and after optimization. The entire expression and query plan heavily rely on Egg's data structures for representation.
Evaluation and Analysis
RisingLight conducts benchmark tests using the classic TPC-H benchmark in the OLAP domain. In this context, we selected Q5 to evaluate the performance of the new optimizer. Q5 involves joining 6 tables, which poses a significant challenge for the optimizer. Our optimizer took 39ms for optimization and 1s for execution. In comparison, DuckDB only took 5ms for optimization and 15ms for execution. This indicates that our prototype system still has a considerable gap to catch up with industry-leading performance.
Certainly, let's break down the specific effects and contributions of each optimization:
It's evident that without optimization, this query would be practically infeasible. Predicate pushdown significantly reduces the amount of data, hash join further reduces computational complexity, and finally, projection pushdown and column pruning (not mentioned in this text) eliminate unused data.
In conclusion, let's summarize the benefits and limitations of Egg:
Egg's biggest advantage lies in introducing a domain-specific language (DSL) for describing optimization rules in a simple, intuitive, and elegant manner, reducing the complexity of rule writing to an incredible extent. Coupled with the Rust language, it also has the capability to describe complex logic. Interestingly, even though "Rewrite Everything In Rust!" has been promoted at this conference, in this case, it's more like "Rewrite Rust in Lisp." This illustrates that while Rust is an excellent language, it's not a one-size-fits-all solution. When working on programming language related tasks, using a dedicated language is often more suitable.
Because of its simplicity, Egg is well-suited for rapidly prototyping systems. I went from learning about Egg to fully implementing the optimizer in just one week, with the code totaling just over a thousand lines. I believe it's a great fit for educational projects like RisingLight. However, it's important to note that I spent an additional two weeks integrating it into RisingLight's pipeline. Transitioning an existing project to Egg would require a thorough overhaul.
Now, let's discuss some of the issues with Egg, which is why we are not currently considering using it in RisingWave:
Firstly, Egg is inherently a CBO framework, and it is not particularly friendly to scenarios that primarily require pure RBO. This is evident in the fact that you need to provide a cost function to obtain the final optimization results. Egg does not provide a straightforward method for removing the original expression after applying a rule. While some workarounds are possible, and it is theoretically feasible to implement a dedicated RBO solution independently of its Runner, there is room for improvement in this aspect.
Secondly, Egg lacks heuristic search capabilities. This makes it susceptible to the problem of combinatorial explosion when dealing with complex queries, rendering it unable to provide the optimal results within a limited timeframe. Of course, we have mentioned that multiple iterative rounds can be used to alleviate this problem.
Another somewhat tricky aspect is that Egg has essentially a dynamic type system. Note that all types of nodes are defined within the same enum, which means that, from Egg's perspective, combinations of various node types are considered valid. We cannot, for instance, dictate that children of expression nodes can only be other expressions. This leads to challenges in debugging when rules are written incorrectly, such as when parameter order is reversed.
Lastly, I'd like to share some recent advancements in the field. Egg is a creation of a programming languages research group at the University of Washington. This year, they have developed a follow-up project to Egg called "egglog." As the name suggests, it combines E-graph and Datalog, creating a rather sophisticated fusion. However, Egglog has evolved into a standalone language rather than a Rust library. Consequently, it might be challenging to integrate Egglog into Rust projects. Nonetheless, it remains an area worth further exploration and experimentation.
CONCLUSION
This article provided insights into the Egg program optimization framework and discussed our experience in using it to implement an SQL optimizer. If you are interested in these topics, you are encouraged to explore further details on RisingLight or even embark on your own journey to implement an SQL optimizer through the sql-optimizer-labs repository.
About RisingWave Labs
RisingWave is an open-source distributed SQL database for stream processing. It is designed to reduce the complexity and cost of building real-time applications. RisingWave offers users a PostgreSQL-like experience specifically tailored for distributed stream processing.
Official Website: https://www.risingwave.com/
Documentation: https://docs.risingwave.com/docs/current/intro/
Top comments (0)