Generating a parse tree from a shell grammar

#c #shell #ast #parser

As a student at 42 one of the first projects you have to accomplish by yourself is the reimplementation of a shell. There are many things to do to have a working shell so today I will only focus on the parse tree.

A parse tree is what represents the syntactic structure of our commands and will allow us to easily perform those commands typed by the user.

So we want to be able to translate something like this:
ls | cat -e
Into a shell job that we will be able to execute as a whole command.

The first step is to transform the text input into tokens. That way we will have 4 tokens that represents our text input in a more easier way to process.

Each token has a value (the text input) and a type (if it is an operator or a command argument):

ls -> WORD
| -> PIPE
cat -> WORD
-e -> WORD

Once parsed, we expect to have something that could be rendered like this:

              pipe_sequence
              /          \
simple_command      simple_command    
      |              /          \
   cmd_name       cmd_name     cmd_suffix
      |              |             |
     'ls'          'cat'        cmd_word
                                   |
                                 '-e'

So the second step is to come up with a grammar. A grammar is a set of rules to be followed by the parser to generate the parse tree.

This is a simple version of the grammar of sh:

pipe_sequence    : simple_command
                 | pipe_sequence '|' simple_command
                 ;
cmd_suffix       : cmd_word
                 | cmd_word cmd_word
                 ;
simple_command   : cmd_name
                 | cmd_name cmd_suffix
                 ;
cmd_name         : WORD                   
                 ;
cmd_word         : WORD
                 ;

What this grammar tells us is that:

A pipe_sequence is a simple_command followed by 0 or more simple_command
A cmd_suffix is one or more cmd_word
A simple_command is a cmd_name followed or not by a cmd_suffix
A cmd_name or cmd_word is a WORD token

Each rule will be implemented as a function.

Let's start with cmd_word

/*
** struct s_parser is a data type containing informations about the current parser
** (like the tokens list)
*/
struct s_cmd_word    *cmd_word(struct s_parser *p)
{
    struct s_cmd_word    *w = NULL;

    /*
    ** p->tokens is a linked list containing all the tokens
    ** p->tokens->type is an enum corresponding to the token's type
    */
    if (p->tokens->type == T_WORD)
    {
        w = malloc(sizeof(struct s_cmd_word));
        w->data = strdup(t->data);
        p->tokens = p->tokens->next
        return (w);        
    }
    else
        return (NULL);
}

Really simple so far. If the current token is a WORD then we return a struct s_cmd_word * containing the actual data. But if the current token is not a word (it could be an operator) we return NULL. And we should not forget to advance to the next token once we used it.

What about pipe_sequence?

struct s_pipe_sequence    *pipe_sequence(struct s_parser *p)
{
    struct s_pipe_sequence    *ps;
    struct s_simple_command   *sc;

    ps = malloc(sizeof(struct s_pipe_sequence));
    while (sc = simple_command(p))
    {
        /* This function pushes a new element into a linked list */
        ft_lstpush(sc, &ps->simple_commands);
        if (p->tokens->type == T_PIPE)
            p->tokens = p->tokens->next;
    }
    if (ps->sc)
        return (ps);
    free(ps);
    return (NULL);
}