DEV Community

Oguzhan Yagci
Oguzhan Yagci

Posted on • Updated on

Generating a parse tree from a shell grammar

As a student at 42 one of the first projects you have to accomplish by yourself is the reimplementation of a shell. There are many things to do to have a working shell so today I will only focus on the parse tree.

A parse tree is what represents the syntactic structure of our commands and will allow us to easily perform those commands typed by the user.

So we want to be able to translate something like this:
ls | cat -e
Into a shell job that we will be able to execute as a whole command.

The first step is to transform the text input into tokens. That way we will have 4 tokens that represents our text input in a more easier way to process.

Each token has a value (the text input) and a type (if it is an operator or a command argument):

ls -> WORD
| -> PIPE
cat -> WORD
-e -> WORD
Enter fullscreen mode Exit fullscreen mode

Once parsed, we expect to have something that could be rendered like this:

              /          \
simple_command      simple_command    
      |              /          \
   cmd_name       cmd_name     cmd_suffix
      |              |             |
     'ls'          'cat'        cmd_word
Enter fullscreen mode Exit fullscreen mode

So the second step is to come up with a grammar. A grammar is a set of rules to be followed by the parser to generate the parse tree.

This is a simple version of the grammar of sh:

pipe_sequence    : simple_command
                 | pipe_sequence '|' simple_command
cmd_suffix       : cmd_word
                 | cmd_word cmd_word
simple_command   : cmd_name
                 | cmd_name cmd_suffix
cmd_name         : WORD                   
cmd_word         : WORD
Enter fullscreen mode Exit fullscreen mode

What this grammar tells us is that:

  • A pipe_sequence is a simple_command followed by 0 or more simple_command
  • A cmd_suffix is one or more cmd_word
  • A simple_command is a cmd_name followed or not by a cmd_suffix
  • A cmd_name or cmd_word is a WORD token

Each rule will be implemented as a function.

Let's start with cmd_word

** struct s_parser is a data type containing informations about the current parser
** (like the tokens list)
struct s_cmd_word    *cmd_word(struct s_parser *p)
    struct s_cmd_word    *w = NULL;

    ** p->tokens is a linked list containing all the tokens
    ** p->tokens->type is an enum corresponding to the token's type
    if (p->tokens->type == T_WORD)
        w = malloc(sizeof(struct s_cmd_word));
        w->data = strdup(t->data);
        p->tokens = p->tokens->next
        return (w);        
        return (NULL);
Enter fullscreen mode Exit fullscreen mode

Really simple so far. If the current token is a WORD then we return a struct s_cmd_word * containing the actual data. But if the current token is not a word (it could be an operator) we return NULL. And we should not forget to advance to the next token once we used it.

What about pipe_sequence?

struct s_pipe_sequence    *pipe_sequence(struct s_parser *p)
    struct s_pipe_sequence    *ps;
    struct s_simple_command   *sc;

    ps = malloc(sizeof(struct s_pipe_sequence));
    while (sc = simple_command(p))
        /* This function pushes a new element into a linked list */
        ft_lstpush(sc, &ps->simple_commands);
        if (p->tokens->type == T_PIPE)
            p->tokens = p->tokens->next;
    if (ps->sc)
        return (ps);
    return (NULL);
Enter fullscreen mode Exit fullscreen mode

The other functions should be as easy to write as the above one.

At the end you should have a parse tree based on the grammar above.

Of course this is a very simplified version of the actual shell grammar but really this is all it takes to be able to generate by hand a parse tree.

If you want to implement a shell you should first visit this website. It's THE place to read first to understand how a shell like sh behaves.

Thanks for reading :)

Top comments (0)