What I learned contributing to Rust-Analyzer

Coenen Benjamin

What is rust-analyzer and what is its goal:

Rust-analyzer is a tool that improves the developer experience when writing Rust. It plugs into different code editors to help you write your code. Examples of features rust-analyzer gives you: completions, goto definition, assists to refactor your code into something more idiomatic, help with trait resolution, and so on. Rust already had a tool of this kind, RLS, which works, but after the 2018 Rust survey the Rust team identified several challenges, and improving the IDE experience was one of them: RLS was singled out for its instability and inaccuracy. So the Rust team decided to create a new working group, called rls-2.0, to rewrite an IDE compiler from scratch and discover the best architecture for accuracy, performance and stability. They chose a rewrite because, after inspecting the RLS codebase, they concluded its architecture was not a good foundation for first-class IDE support in the long term. Rust-analyzer is the fruit of this working group: it has a different architecture than RLS but keeps the Language Server Protocol.

About the Language Server Protocol:

The Language Server Protocol (LSP) is used between a tool (the client, usually a code editor) and a language smartness provider (the server) to integrate features like autocomplete, go to definition, find all references and the like into the tool. The LSP was created by Microsoft to define a common language for programming language analyzers to speak. The LSP ecosystem is now growing fast: for example, Go (which already had a big tooling ecosystem) recently decided to switch to an LSP model and rewrote its tooling from scratch as a new tool called gopls. One of the main reasons for adopting an LSP is to concentrate effort on a single codebase that provides features for every code editor. With this solution we have only one engine and multiple clients for different editors.
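To make this concrete, here is roughly what a "go to definition" request looks like on the wire. This is a hand-written illustration, not captured from a real session; the exact fields are defined by the LSP specification, and line/character positions are zero-based:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "textDocument/definition",
  "params": {
    "textDocument": { "uri": "file:///project/src/main.rs" },
    "position": { "line": 4, "character": 11 }
  }
}
```

The server answers with the location(s) where the symbol at that position is defined, and any editor that speaks LSP can use the same server for this.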

LSP vs NO LSP

What were my contributions:

As always, I try to contribute to projects I use in my daily routine, so that I feel the impact of my contributions and understand the use cases and the project itself. I use rust-analyzer in VSCode every day when I write Rust code, and I had an issue with it: a kind of false positive reported by rust-analyzer. Let me explain it briefly. In Rust we have feature attributes to enable conditional compilation, with the syntax #[cfg(feature = "myfeaturename")], and to disable it with #[cfg(not(feature = "myfeaturename"))], which means: if the feature "myfeaturename" is disabled, then don't compile this part. Here is an example snippet of code:

struct MyStruct {
    my_val: usize,
    #[cfg(feature = "foo")]
    bar: bool,
}

impl MyStruct {
    #[cfg(feature = "foo")]
    pub(crate) fn new(my_val: usize, bar: bool) -> Self {
        Self { my_val, bar }
    }


    #[cfg(not(feature = "foo"))]
    pub(crate) fn new(my_val: usize, _bar: bool) -> Self {
        Self { my_val } // Error here to tell me the "bar" field was missing
    }
}
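As a reminder, the foo feature used in this snippet has to be declared in the crate's Cargo.toml, roughly like this (a minimal sketch, not the manifest from the article):

```toml
[features]
# Optional feature; disabled by default, enabled with `--features foo`.
foo = []
```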

Here the bar field only exists when the feature foo is enabled. In my second new method the feature foo is disabled, so I can just create my struct without the bar field. But rust-analyzer gave me an error anyway. At that point I decided it could be an opportunity to contribute to the project and learn a lot of new things. At the same time I also made other pull requests to improve auto-completions and assists. For example, I added pre-selection of the most relevant completion when possible:

autocomplete example

I'm not used to writing this kind of code, which parses source code, creates an abstract syntax tree and finds errors in that source code. I learned a lot about the compiler's grammar and vocabulary, and about all the internal machinery and how it works from a macro point of view. I think it's worth sharing this with you to help you understand how it works and see the big picture of the project. DISCLAIMER: I won't explain how I fixed the issue; the goal here is to explain what I learned from the rust-analyzer codebase.

From source code to assists:

Let's dig into the strategies used by rust-analyzer (RA) to parse your source code and assist you in writing Rust. It goes from the crate entity all the way down to a single expression in a source file. The RA developers decided not to deal with the filesystem directly: they work with a virtual file system instead, for several reasons. They don't want to perform any IO during analysis, and they don't want to juggle file paths inside the RA codebase, mainly for performance reasons. So every source file is represented by a generated identifier rather than its file path.
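A minimal sketch of that idea, with hypothetical names (these are not rust-analyzer's actual types): a path is registered once and exchanged for a small copyable id, and the rest of the code only ever sees the id.

```rust
use std::collections::HashMap;

// Hypothetical virtual file system: source files are referred to by a small
// generated id instead of their path.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct FileId(u32);

#[derive(Default)]
struct Vfs {
    path_to_id: HashMap<String, FileId>,
    contents: Vec<String>, // indexed by FileId
}

impl Vfs {
    // Register a file (or update it if already known), returning its stable id.
    fn alloc(&mut self, path: &str, text: &str) -> FileId {
        if let Some(&id) = self.path_to_id.get(path) {
            self.contents[id.0 as usize] = text.to_string();
            return id;
        }
        let id = FileId(self.contents.len() as u32);
        self.path_to_id.insert(path.to_string(), id);
        self.contents.push(text.to_string());
        id
    }

    fn text(&self, id: FileId) -> &str {
        &self.contents[id.0 as usize]
    }
}

fn main() {
    let mut vfs = Vfs::default();
    let id = vfs.alloc("src/main.rs", "fn main() {}");
    assert_eq!(vfs.text(id), "fn main() {}");
    println!("{:?}", id);
}
```

The id stays stable across edits to the file, which is exactly what makes it a good key for the analysis layers above.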

If we look at the steps to parse source code, here is a simplified overview:

                                      Memoized thanks to Salsa crate
                             +----------------------------------------------+

+--------------------+       +------------------+      +--------------------+      +-----------------------------+
|                    |       |                  |      |                    |      |                             |
|                    |       |                  |      |                    |      |                             |
|                    |       |     Abstract     |      |     High Level     |      |  Computations on HIR types  |
|    Source code     +------->      Syntax      +------>    Intermediate    +------>   to give feedbacks to      |
|                    |       |       Tree       |      |   Representation   |      |      your code editor       |
|                    |       |                  |      |                    |      |                             |
|                    |       |      (AST)       |      |       (HIR)        |      |                             |
+--------------------+       +------------------+      +--------------------+      +-----------------------------+

+----------------------------------------------->
               Using Rowan crate

The Abstract Syntax Tree (AST) generated here by the Rowan crate has several particularities. It is lossless, so it keeps all the whitespace, comments, parentheses and so on, and it doesn't break on bad syntax: the whole AST is generated except for the erroneous part. It won't stop when it finds a syntax error, which is an obvious requirement for a code editor while you are typing, but not a strong requirement for a basic AST used for compilation. The tree is also fully untyped: what distinguishes the different kinds of nodes is just the kind field, which is an enum. Here is an example of the AST generated for three lines of code:

fn foo() { // here is a comment
    1 + 1
}
SOURCE_FILE@[0; 43)
  FN_DEF@[0; 43)
    FN_KW@[0; 2) "fn"
    WHITESPACE@[2; 3) " "
    NAME@[3; 6)
      IDENT@[3; 6) "foo"
    PARAM_LIST@[6; 8)
      L_PAREN@[6; 7) "("
      R_PAREN@[7; 8) ")"
    WHITESPACE@[8; 9) " "
    BLOCK_EXPR@[9; 43)
      BLOCK@[9; 43)
        L_CURLY@[9; 10) "{"
        WHITESPACE@[10; 11) " "
        COMMENT@[11; 31) "// here is a comment"
        WHITESPACE@[31; 36) "\n    "
        BIN_EXPR@[36; 41)
          LITERAL@[36; 37)
            INT_NUMBER@[36; 37) "1"
          WHITESPACE@[37; 38) " "
          PLUS@[38; 39) "+"
          WHITESPACE@[39; 40) " "
          LITERAL@[40; 41)
            INT_NUMBER@[40; 41) "1"
        WHITESPACE@[41; 42) "\n"
        R_CURLY@[42; 43) "}"

----------------------------------------------------------
// format is like: KIND@[start_offset; end_offset) "value"
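To make "lossless and fully untyped" concrete, here is a toy version of such a tree (a simplified sketch; the real Rowan implementation uses green/red trees, offsets and interned tokens): every node has the same shape, and only the kind enum says what it is.

```rust
// Only the `kind` enum distinguishes node types; the structure is uniform.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyntaxKind {
    BinExpr,
    Literal,
    Plus,
    Whitespace,
}

enum Element {
    Node(Node),
    Token(SyntaxKind, String),
}

struct Node {
    kind: SyntaxKind,
    children: Vec<Element>,
}

// Because the tree is lossless (whitespace and comments are kept as tokens),
// concatenating the tokens yields back the exact original source text.
fn text(node: &Node) -> String {
    node.children
        .iter()
        .map(|c| match c {
            Element::Node(n) => text(n),
            Element::Token(_, t) => t.clone(),
        })
        .collect()
}

fn main() {
    // `1 + 1` as a BIN_EXPR node, whitespace tokens included.
    let expr = Node {
        kind: SyntaxKind::BinExpr,
        children: vec![
            Element::Token(SyntaxKind::Literal, "1".into()),
            Element::Token(SyntaxKind::Whitespace, " ".into()),
            Element::Token(SyntaxKind::Plus, "+".into()),
            Element::Token(SyntaxKind::Whitespace, " ".into()),
            Element::Token(SyntaxKind::Literal, "1".into()),
        ],
    };
    assert_eq!(expr.kind, SyntaxKind::BinExpr);
    assert_eq!(text(&expr), "1 + 1");
    println!("{}", text(&expr));
}
```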

Syntax trees are, by design, devoid of any semantic information; a syntax tree contains exactly the same amount of information as the source text it represents. It is therefore very hard to compute anything on it, and rust-analyzer doesn't really manipulate the AST data as is. Instead it "lowers" the tree to concrete Rust types that carry more semantic information, like "what is the type of this thing?", "what is the meaning of this name?", etc.

The first step is to create a simple wrapper around the raw AST node (which is loaded lazily), so it's very easy to map this "lowered" type back to the original one. To achieve this, RA takes the raw AST node, checks its kind value to know what kind of node it is, and then creates the right type by casting it. Since these new types are just wrappers with a few methods attached, they are generated by codegen. The last step of this lowering is to transform these ast::* types into more specific Rust types (hir::*) with better, hand-written methods to interact with. At this point we have what is called a High-Level Intermediate Representation (HIR), which lets us work with Rust types and is compiler friendly. Here is an example for a function definition:

// Example of an autogenerated ast::* type (1st step of lowering)
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct FnDef {
    pub(crate) syntax: SyntaxNode,
}
impl AstNode for FnDef {
    fn can_cast(kind: SyntaxKind) -> bool { kind == FN_DEF }
    fn cast(syntax: SyntaxNode) -> Option<Self> {
        if Self::can_cast(syntax.kind()) {
            Some(Self { syntax })
        } else {
            None
        }
    }
    fn syntax(&self) -> &SyntaxNode { &self.syntax }
}

// Example of hir::* type (2nd step of lowering)
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Function {
    pub(crate) id: FunctionId, // We will see later where this identifier comes from
}

impl Function {
    // more usable functions to interact with in the RA codebase
    pub fn params(self, db: &dyn HirDatabase) -> Vec<TypeRef> {
        db.function_data(self.id).params.clone()
    }
    // ...  
} 
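The cast pattern from the first snippet can be exercised end to end like this (a self-contained toy, with simplified stand-ins for rowan's SyntaxNode and SyntaxKind): casting succeeds only when the untyped node's kind matches the wrapper.

```rust
// Toy reproduction of "cast an untyped node to a typed ast::* wrapper".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyntaxKind {
    FnDef,
    StructDef,
}

#[derive(Debug, Clone)]
struct SyntaxNode {
    kind: SyntaxKind,
}

impl SyntaxNode {
    fn kind(&self) -> SyntaxKind {
        self.kind
    }
}

trait AstNode: Sized {
    fn can_cast(kind: SyntaxKind) -> bool;
    fn cast(syntax: SyntaxNode) -> Option<Self>;
}

struct FnDef {
    syntax: SyntaxNode, // the wrapper keeps the raw node to map back to it
}

impl AstNode for FnDef {
    fn can_cast(kind: SyntaxKind) -> bool {
        kind == SyntaxKind::FnDef
    }
    fn cast(syntax: SyntaxNode) -> Option<Self> {
        if Self::can_cast(syntax.kind()) {
            Some(FnDef { syntax })
        } else {
            None
        }
    }
}

fn main() {
    let node = SyntaxNode { kind: SyntaxKind::FnDef };
    assert!(FnDef::cast(node).is_some());

    let other = SyntaxNode { kind: SyntaxKind::StructDef };
    assert!(FnDef::cast(other).is_none());
    println!("cast pattern works");
}
```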

Caching and incremental computation with salsa:

Last but not least.

All this work could be very CPU- and time-consuming without optimizations. The main idea behind these optimizations is to have a kind of "on-demand" compiler. All these hir::* types work on top of a kind of database. The magic sauce is a crate called Salsa, developed by some members of the rustc team. Each node has a deterministic identifier that lets us find the node again later through the database.

Salsa works with a set of queries, which are just pure functions taking parameters, to create and save nodes. Every time you need to access a specific element in the RA codebase, you can call these functions/queries on the "salsa database" with the element's id (for example, for a function id: struct FunctionId(salsa::InternId)). The element id is an identifier (think of it as a non-zero unsigned integer) generated by Salsa, mainly to be used as an input parameter for the different functions; for memory usage it's better to pass a simple identifier than a complex data structure when you don't need the details of the data itself. In Salsa and rust-analyzer this is called interning, and it's not so far from the Entity Component System pattern often used in game development.
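Interning can be sketched in a few lines (a toy version, not salsa's actual implementation): a value is stored once and exchanged for a small id, and equal values always get the same id back.

```rust
use std::collections::HashMap;

// Toy interner: a value is stored once and handed out as a small id
// (in the spirit of salsa's InternId); equal values get the same id.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct FunctionId(u32);

#[derive(Default)]
struct Interner {
    map: HashMap<String, FunctionId>, // value -> id
    values: Vec<String>,              // id -> value
}

impl Interner {
    fn intern(&mut self, data: &str) -> FunctionId {
        if let Some(&id) = self.map.get(data) {
            return id; // already interned: same data, same id
        }
        let id = FunctionId(self.values.len() as u32);
        self.map.insert(data.to_string(), id);
        self.values.push(data.to_string());
        id
    }

    // The reverse direction: get the full data back from the cheap id.
    fn lookup(&self, id: FunctionId) -> &str {
        &self.values[id.0 as usize]
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("crate::MyStruct::new");
    let b = interner.intern("crate::MyStruct::new");
    assert_eq!(a, b); // interning is deterministic
    assert_eq!(interner.lookup(a), "crate::MyStruct::new");
    println!("{:?}", a);
}
```

The ids are Copy and tiny, so they can be passed around (and used as query keys) without dragging the underlying data along.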

Almost all higher-level functions in rust-analyzer take the database as a parameter so they can query it. The Salsa crate makes it possible not to recompute every part of the tree on every change, enabling incremental computation. If you manage your query dependencies well with Salsa, it won't go further down the tree when an element's result is the same as before; this acts as a circuit breaker so the whole tree isn't recomputed on every change. To know whether something has changed, Salsa keeps an internal record of revisions: when a change occurs it increments the revision, and since Salsa knows which queries a result depends on, it only has to iterate over them and check the revision number. Here is a very light example, inspired by an article written by Niko Matsakis, to explain Salsa's behavior.

db.set_source_text("a.rs", "fn main() {}") // I set a source file a.rs
db.set_source_text("b.rs", "fn main() {}") // I set a source file b.rs
// this query is used to generate the AST for the entire codebase
db.whole_program_ast() // will execute (never invoked before):
//   - db.manifest() // Read the source file previously set as input
//   - db.ast("a.rs") // A query/function to create the ast from the source file a.rs
//   - db.ast("b.rs") // A query/function to create the ast from the source file b.rs

db.set_source_text("a.rs", "fn main() { }") // here I mock a change in my a.rs file, I added a whitespace
db.whole_program_ast() // can we reuse the result?
//   - iterate over the deps list:
//     - manifest(): this has not changed (checked with a revision number)
//     - ast(a.rs):
//       - "have you changed since the last revision"?
//       - look at its own input, determine they have changed
//       - re-execute and produce a new ast
//       - compare the old and new ast -- see that they are the same
//       - "no, I have not changed"
//     - ast(b.rs)
//       - no input has changed since last revision
//       - therefore still valid
//   - the value is still valid, so we recomputed only 1 step instead of 2 in this case
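The revision-checking behavior above can be sketched as a self-contained toy (this is my simplification, not salsa's real machinery; in particular, salsa also compares old and new results for early cutoff, which is omitted here): inputs carry a revision stamp, and a derived query re-runs only when its input changed since the cached result was produced.

```rust
use std::collections::HashMap;

// Toy "salsa-like" database with revision-based invalidation.
struct Db {
    revision: u64,
    sources: HashMap<String, (String, u64)>,   // file -> (text, changed_at)
    ast_cache: HashMap<String, (String, u64)>, // file -> (ast, verified_at)
    recomputes: u32,                           // how many times we really parsed
}

impl Db {
    fn new() -> Self {
        Db { revision: 0, sources: HashMap::new(), ast_cache: HashMap::new(), recomputes: 0 }
    }

    // Setting an input bumps the global revision and stamps the file.
    fn set_source_text(&mut self, file: &str, text: &str) {
        self.revision += 1;
        self.sources.insert(file.to_string(), (text.to_string(), self.revision));
    }

    // Derived query: "parse" a file. Re-runs only if the input changed
    // since the cached result was last verified.
    fn ast(&mut self, file: &str) -> String {
        let (text, changed_at) = self.sources[file].clone();
        if let Some((ast, verified_at)) = self.ast_cache.get(file) {
            if *verified_at >= changed_at {
                return ast.clone(); // cache hit: input unchanged
            }
        }
        self.recomputes += 1;
        let ast = format!("AST({})", text.trim()); // stand-in for real parsing
        self.ast_cache.insert(file.to_string(), (ast.clone(), self.revision));
        ast
    }
}

fn main() {
    let mut db = Db::new();
    db.set_source_text("a.rs", "fn main() {}");
    db.set_source_text("b.rs", "fn main() {}");
    db.ast("a.rs");
    db.ast("b.rs");
    assert_eq!(db.recomputes, 2); // both parsed once

    db.set_source_text("a.rs", "fn main() { }"); // only a.rs changed
    db.ast("a.rs"); // recomputed: its input revision advanced
    db.ast("b.rs"); // cache hit: b.rs input unchanged
    assert_eq!(db.recomputes, 3);
    println!("recomputes = {}", db.recomputes);
}
```

Only the changed file was re-parsed; the other query result was reused just by comparing revision stamps, which is the core of the incremental behavior described above.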

This was just an overview of how rust-analyzer works, without any focus on the optimizations already made. You now know what's under the hood when you write Rust code in your editor and receive feedback from RA. There is still so much to say about rust-analyzer internals, but it doesn't really fit in a single article; maybe we will dig into more specific parts in future blog posts. It could be a wonderful journey.


References

https://ferrous-systems.com/blog/rust-analyzer-2019/

https://www.youtube.com/playlist?list=PL85XCvVPmGQho7MZkdW-wtPtuJcFpzycE

https://langserver.org/

https://github.com/rust-analyzer/rust-analyzer/blob/master/docs/dev/syntax.md

https://docs.microsoft.com/en-ie/archive/blogs/ericlippert/persistence-facades-and-roslyns-red-green-trees

https://www.youtube.com/watch?v=_muY4HjSqVw

https://gist.github.com/nikomatsakis/5b119a71465549b61743e8739a369b5e

https://joshleeb.com/posts/rust-ecs-thoughts/

https://github.com/nikomatsakis/salsa-rfcs/blob/02e38d7b164f66a45b1797589feb7fe3ee6b564f/RFC0002-Intern-Queries.md

https://youtu.be/N6b44kMS6OM
