Ken Woon

Posted on Apr 26, 2023

How I Learned Lex and Yacc with Apache AGE Functions

#postgres #apacheage #lex #yacc

Introduction

As a developer for the open source Apache AGE project, I have been learning to program using Lex and Yacc in order to create functionalities for the extension. Lex and Yacc are powerful tools that can help us generate parsers and scanners for our programming language. However, these tools can be difficult to learn and use without some guidance.

In the AGE Command Line Interface (CLI) project, we aim to simplify querying in Apache AGE by allowing users to code in Cypher without wrapping it inside a SQL query. In order to do this, we are using Lex and Yacc programs to convert Cypher clauses into functions recognizable by the psql CLI.

In this blog post, we'll take a closer look at how I used Apache Age functions, specifically the MATCH function to learn Lex and Yacc coding and start building my own program.

Language Processing

In order to understand Lex/Yacc, let's first talk about Language Processing as a whole. Language processing is a crucial component in the world of software development, and it involves several stages to transform source code into executable programs. The process typically involves four key stages: lexical analysis, syntax analysis, semantic analysis and code generation, and target code generation.

Lexical analysis involves breaking down the source code into a series of tokens. These tokens are then analyzed to identify their respective classes, such as identifiers, keywords, and operators. This stage is performed by a tool called a lexer or scanner, which uses a set of rules specified in a formal language like Lex to identify and extract the tokens.

Syntax analysis is the next step, which involves parsing the token stream to determine whether the code conforms to the grammar of the programming language. This stage is carried out by a parser, which uses a formal language like Yacc to generate a parse tree, representing the syntactic structure of the program.

Semantic analysis and code generation is the stage where the parse tree is analyzed to determine the meaning of the program and generate corresponding machine code. This stage involves a series of transformations on the parse tree, including type checking, scope resolution, and code generation.

Target code generation is the next stage, which involves generating machine code that can be executed on a particular hardware platform. This stage is typically carried out by a code generator, which translates the machine-independent code into machine-specific instructions.

We will be focusing on the first two steps, which are managed by writing Lex and Yacc files.

Compilation and Execution

In this blog post, I will be using Flex and Bison instead, which are basically just improved versions of Lex and Yacc. The bison and flex commands in the examples below can interchangeable with yacc and lex.

Before compiling, make sure both {filename}.y and {filename}.l exists in the same directory, with {filename} to be whatever name you desire. First run the command below in a terminal window.

bison -d {filename}.y

This creates a {filename}.tab.c parser file and the -d option generates the {filename}.tab.h header file, which may be referenced in the Lex file. Then, run the command below to compile the Lex file.

flex match.l

This generates a file called lex.yy.c. With the two generated C files, run the following command to compile them into an executable C program.

gcc lex.yy.c {filename}.tab.c -o {filename}

This generates an executable C with a filename you have specified after the -o option, which tells the compiler to output the file into the specified filename.

./{filename}

Run the command above to execute your newly created program and you have successfully built and executed your program!

For example, a complete compilation and execution for the file 'match.l' and 'match.y' will look like:

yacc -d match.y
flex match.l
gcc lex.yy.c match.tab.c -o match
./match

If any case that an error occurs, there are likely issues with either the Lex or Yacc file, which will be specified by the error message, along with the problematic line and character.

Lex Structure

The structure of Lex files will look something like the following:

%{
DECLARATIONS
%}

%%
PATTERNS    ACTIONS;
%%

FUNCTIONS

The first part is enclosed in %{ ... %} and includes C declarations such as #include.

The section enclosed with %% ... %% represents the pattern-action pairs that allows the parser to recognize specific tokens that are inputted by the user, and make an action or multiple actions based on the input.

The last section contains C functions that support the language processing. These functions may be called by the actions from the previous section.

Yacc Structure

The structure of Yacc files is similar to that of Lex files and will look something like the following:

%{
DECLARATIONS
%}

DEFINITIONS

%%
GRAMMARS : RULES;    {ACTIONS}
%%

FUNCTIONS

The first part is enclosed in %{ ... %} and includes C declarations such as #include.

The next section is for yacc definitions that begins with the % symbol, which consists of the %start, %token, %union, and %types definitions.

The section enclosed with %% ... %% represents grammars that are a set of productions. The left-hand side which are grammars is followed by :, and a ; to finish the set of rules or | to add more rules, and a right-hand side with actions that are enclosed in { ... }.

The last section contains C functions that support the language processing. These functions may be called by the actions from the previous section.

Working with Lex

In this example, we will try to replicate the 'MATCH' Cypher command commonly used in AGE.

In the first part, we will include the 'match.tab.h' header that will be generated after compiling the 'match.y' file. We will also create a variable yyerror with type void which the parser will call whenever a syntactical error occurs.

%{
#include "match.tab.h"

void yyerror(char* s);
%}

The next section is defined with patterns and their actions. The first pattern [ \t\n] defines whitespaces, tab spaces, and new lines, and tells the compiler to ignore these input. The MATCH and RETURN tokens are intuitive and simply returns MATCH and RETURN to the compiler. The following symbols are instead returned as key words that will be used in the Yacc file. The exit token will return the exit_command, and any string of characters starting with a alphabetic character followed by any alphabetic or numeric character will be recognized as a VAR. Anything else that are inputted by the user will cause a syntactical error.

%%
[ \t\n]                 ;
"MATCH"                 { return MATCH; }
"RETURN"                { return RETURN; }
"-"                     { return DASH; }
"("                     { return LP; }
")"                     { return RP; }
"["                     { return LB; }
"]"                     { return RB; }
"{"                     { return LC; }
"}"                     { return RC; }
"<"                     { return LT; }
">"                     { return GT; }
";"                     { return yytext[0]; }
"exit"                  { return exit_command; }
[a-zA-Z][a-zA-Z0-9]*    { return VAR; }
.                       { ECHO; yyerror("unexpected character"); }
%%

There is a function in the final part of the Lex file called yywrap(), which is required if multiple input files will be used.

int yywrap(void) {
    return 1;
}

Working with Yacc

In the first part, we will include the standard io and lib headers, as well as the 'match.tab.h' header that will be generated after compiling this file. We will also create a variable yyerror with type void which the parser will call whenever a syntactical error occurs.

%{
#include <stdio.h>
#include <stdlib.h>
#include "match.tab.h"

void yyerror(char* s);
%}

We will then have some Yacc definitions. %union will allow us to specify the different types that the analyzer can return, which are enclosed in { ... }. %start indicates the default production grammar to look for at the start of the program. %token describes the expected tokens that the analyzer will be receiving as the specified type. The <str> after %token stores the tokens into the union member 'str'.

%union { char *str; }
%start query
%token <str> MATCH RETURN DASH LP RP LB RB LC RC LT GT VAR exit_command

There will be 4 production grammars in total, with the query grammar being the default. This grammar will consist of the statement grammar, or the exit_command, which will exit the program. The statement grammar can either be the match_clause, the return_clause, or both together. The match_clause can have grammars that look like one from the list:

MATCH (u)
MATCH (u)-[r]-(v)
MATCH (u)<-[r]-(v)
MATCH (u)-[r]->(v)

The return_clause will only allow us to return one variable as of now, which will be fixed in the future.

%%
query : statement ';'                                                   { ; }
      | exit_command ';'                                                { printf("Exiting program...\n"); exit(EXIT_SUCCESS); }
      ;

statement : match_clause                                                { ; }
          | return_clause                                               { ; }
          | match_clause return_clause                                  { ; }

match_clause : MATCH LP VAR RP                                          { ; }
             | MATCH LP VAR RP DASH LB VAR RB DASH LP VAR RP            { ; }
             | MATCH LP VAR RP LT DASH LB VAR RB DASH LP VAR RP         { ; }
             | MATCH LP VAR RP DASH LB VAR RB DASH GT LP VAR RP         { ; }
             ;

return_clause : RETURN VAR                                              { ; }
              ;
%%

There are two functions in the final part of the Yacc file, which are the main() and yyerror() functions. The main() function simply returns yyparse(), which is a Yacc function that iterates through the grammar to continuously read the input and produce the corresponding action until manually exited or an error is encountered. The yyerror() is called whenever a syntax error occurs, which simply prints out the error.

int main(void) {
    return yyparse();
}

void yyerror(char *s) {
    fprintf(stderr, "%s\n", s);
}

Program Demo

After compiling and executing the program, you can try typing several different inputs to test if the program is working correctly.

Trying any of these from the list with interchangeable variable name will pass as successfully parsed.

MATCH (u);
MATCH (u)-[r]-(v);
MATCH (u)<-[r]-(v);
MATCH (u)-[r]->(v);
RETURN u;
MATCH (u) RETURN u;
MATCH (u)-[r]-(v) RETURN u;
MATCH (u)<-[r]-(v) RETURN u;
MATCH (u)-[r]->(v) RETURN u;
exit;

If you try any other combinations of recognized and unrecognized symbols, the program will spit an error message and exit the program.

As of now, there are no actions set to the production grammars, which will not show the users any output through successful parses, but will be updated in the future to make the program more functional.

Conclusion

Learning Lex and Yacc programming using Apache Age functions can be an exciting and rewarding journey for any programmer interested in language processing and compiler development. With the help of Apache Age functions, I have learnt how to create my own program and build a compiler that can transform source code into executable programs.

By mastering the various stages of language processing, including lexical analysis, syntax analysis, semantic analysis, code generation, and code improvement, you can gain a deep understanding of how programming languages work and develop skills that are highly sought after in the software industry.

I have used this tutorial on Lex and this tutorial on Flex to learn the basics of Lex and Flex programming.

Top comments (1)

Carla Sanches • May 10 '23

Great tutorial! I used it to recap the content about Flex/Bison and start the project, and it was really helpful!

DEV Community

How I Learned Lex and Yacc with Apache AGE Functions

Introduction

Language Processing

Compilation and Execution

Lex Structure

Yacc Structure

Working with Lex

Working with Yacc

Program Demo

Conclusion

Top comments (1)

Read next

Part 2: Implementing Vector Search with OpenAI

Deploying a Rails API-Only App with Postgres using Kamal 2

VACUUM In Postgres Demystified

Database Setup, JWT Implementation & API