Removing comments from code-based data source

#programming #ai #csharp #machinelearning


Introduction

I am experimenting with training an LLM on a codebase. My goal is to build a foundational model from which I can derive different generative AIs, each focused on a task such as Code Review or High-Level Documentation. I needed to start from a known good source, even if it is a little small for my ultimate goal, so that I can tell whether I am going in the right direction. I started by setting up the training process for a BERT-style LLM. I chose BERT because I believe it is best suited to building an understanding of its source material.

Getting to work

Microsoft publishes a dataset called LCC_csharp under its username. I started with it because the codebase I ultimately want to work with is also written in C#, but I quickly found a significant issue with the data.

// See the LICENSE file in the project root for more information. // // using System; using System.Runtime.InteropServices; using Windows.Foundation; #pragma warning disable 436 // Redefining types from Windows.Foundation namespace Windows.UI.Xaml.Media.Media3D

Where would you put the line breaks for the code snippet above?

The structured C# files had been flattened onto a single line. That alone isn't much of an issue, but it makes it hard to tell where a comment ends and the code begins: with the line breaks gone, nothing marks the end of a // comment. I wasn't going to sit there and figure this out by hand, so I decided to program my way out of the problem.

The Execution

My plan was to use Roslyn, the .NET compiler platform developed by one of the .NET teams at Microsoft, which exposes APIs for static analysis of C# code. Roslyn has a concept known as Trivia: the parts of the source text that carry no meaning for the compiler, such as whitespace and comments. Once the code is parsed into a Syntax Tree, it becomes a structured document and different parts of the file can be queried and modified easily.
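To make the Trivia idea concrete, here is a minimal sketch that parses a string and lists every piece of trivia Roslyn attaches to it. The class and method names (`TriviaDemo`, `DescribeTrivia`) are hypothetical and not part of my pipeline; the only assumption is that the Microsoft.CodeAnalysis.CSharp NuGet package is referenced.

```csharp
// Requires the Microsoft.CodeAnalysis.CSharp NuGet package.
using System.Collections.Generic;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

public static class TriviaDemo
{
    // Hypothetical helper: returns one "Kind: text" entry per piece of trivia.
    public static List<string> DescribeTrivia(string source)
    {
        // Parse the raw text into a syntax tree.
        SyntaxTree tree = CSharpSyntaxTree.ParseText(source);
        var lines = new List<string>();

        // DescendantTrivia walks comments, whitespace and line breaks
        // attached to every token in the tree.
        foreach (SyntaxTrivia trivia in tree.GetRoot().DescendantTrivia())
        {
            lines.Add($"{trivia.Kind()}: {trivia.ToString().Trim()}");
        }

        return lines;
    }
}
```

Calling `TriviaDemo.DescribeTrivia("// header\nusing System; /* inline */ class C { }")` should list the two comments alongside the whitespace and end-of-line trivia, which shows why filtering by kind is needed before removing anything.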

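// Assumes `tree` is a SyntaxTree, e.g. from CSharpSyntaxTree.ParseText(source).
// Needs: using System.Linq; using Microsoft.CodeAnalysis; using Microsoft.CodeAnalysis.CSharp;
// Note: /** ... */ doc comments have their own kind (MultiLineDocumentationCommentTrivia), not matched here.
var commentTrivia = from t in tree.GetRoot().DescendantTrivia()
    where t.IsKind(SyntaxKind.SingleLineCommentTrivia) ||
          t.IsKind(SyntaxKind.MultiLineCommentTrivia) ||
          t.IsKind(SyntaxKind.SingleLineDocumentationCommentTrivia)
    select t;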

The query above finds the major kinds of comment Trivia in the document. Next I remove those items and normalize the whitespace so that the code takes on a more natural shape. I had intended to save the result as is, line breaks and all, but quickly found that this corrupted the Parquet format the data was originally stored in. After a lot of trial and error, I settled on removing all of the line breaks, effectively putting the code back on one line. This time, without the comments, it reads as one long line of code rather than a document where comments and code intermingle with no clear break between them.

using System;using System.Runtime.InteropServices;using Windows.Foundation;#pragma warning disable 436 // Redefining types from Windows.Foundationnamespace Windows.UI.Xaml.Media.Media3D

Not perfect, but a lot better than before.
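The removal steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not my exact code: the names `CommentStripper` and `StripComments` are hypothetical, the extra `MultiLineDocumentationCommentTrivia` check is an addition, and line breaks are joined with a single space here rather than removed outright.

```csharp
// Requires the Microsoft.CodeAnalysis.CSharp NuGet package.
using System.Linq;
using System.Text.RegularExpressions;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

public static class CommentStripper
{
    // Hypothetical helper: strips comments and returns the code on one line.
    public static string StripComments(string source)
    {
        SyntaxNode root = CSharpSyntaxTree.ParseText(source).GetRoot();

        // Collect comment trivia. The last kind is an assumption added
        // for this sketch; the query in the article checks only three.
        var comments = root.DescendantTrivia().Where(t =>
            t.IsKind(SyntaxKind.SingleLineCommentTrivia) ||
            t.IsKind(SyntaxKind.MultiLineCommentTrivia) ||
            t.IsKind(SyntaxKind.SingleLineDocumentationCommentTrivia) ||
            t.IsKind(SyntaxKind.MultiLineDocumentationCommentTrivia));

        // Replacing a trivia with the default (empty) trivia removes it.
        SyntaxNode stripped = root.ReplaceTrivia(
            comments, (original, rewritten) => default(SyntaxTrivia));

        // Re-indent and re-space the tree into a conventional layout.
        string normalized = stripped.NormalizeWhitespace().ToFullString();

        // Collapse every line break (and surrounding whitespace) into a
        // single space so the record stays on one line.
        return Regex.Replace(normalized, @"\s*[\r\n]+\s*", " ").Trim();
    }
}
```

Two caveats with this approach: a preprocessor directive such as #pragma must sit on its own line to stay valid C#, and a blanket regex will also rewrite line breaks inside verbatim string literals. For training data that may be acceptable, but the one-line output should not be expected to compile in every case.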

Conclusion

Now the codebase can be tokenized from scratch and should contain only meaningful code. Comments are incredibly important for us humans to fully understand a piece of code, but I feel it's more important for a foundational LLM to be able to generate good code. From there I can build Q&A, code-completion and documentation LLMs that fine-tune the base weights for their individual tasks. Ideally I can then merge these into a Mixture of Experts model that is good at a variety of tasks, with every expert trained, or at least fine-tuned, on the specific codebase.
