Premature and overdue modularisation problems
When you begin building your project, there may be a strong temptation to modularise it right from the start in order to save costs down the line. However, I believe that premature modularisation not only drains your resources but can also hinder the long-term success of your project. At the early stages, your product vision may not be fully formed, and it can undergo significant changes. The module boundaries that you establish initially can quickly become outdated as the project evolves through numerous iterations and achieves success.
However, as your team grows and features accumulate, the tech debt of an unmodularised codebase can send a chill down your spine. You begin to notice an increase in git conflicts and heisenbugs caused by a lack of clear separation of concerns within your application. Eventually, you reach a point where you declare "enough is enough" and decide to modularise your monolith. But the question remains: where do you begin unraveling this tangled mess? You don't want to stop developers from pumping out new features for your project, so you need to pinpoint the most impactful areas that can be extracted with minimal effort. But how can you achieve this without spending a lot of time delving into the void your codebase has become?
Introducing Lobzik: The Modularisation Toolkit
Having been tasked with modularising the codebase of my work project, a whopping 200kloc monolith, I embarked on a quest to find a way to reason about modularising this chonky boy. Being an enthusiast of graphs, I was interested in the network of dependencies within the monolith. Soon enough, I discovered that this network could serve as good input for community detection algorithms, which could reveal structures that looked like modules. After weeks of experimentation, I devised a way to extract the dependency graph, selected the most suitable community detection methods, and came up with a few tricks to get the best results.
These findings led to the birth of my pet project: the Lobzik Gradle Plugin. Rather than relying on complex GUI graph toolkits like Gephi or spinning up Jupyter Notebooks filled with NetworkX Python code, you can integrate my tool directly into your build pipeline. Lobzik provides guidance, pointing you towards a good path for modularising your project. However, the tool takes some knowledge to operate, so let this article serve as your guide to using it correctly.
Applying Lobzik to the ProtonMail Android App
For the reference project, I've chosen the ProtonMail Android App, which is one of the largest open-source Android apps that has not been modularised yet. With over 50kloc in the main module, it truly represents a monolith that is worth modularising.
cloc app/src/main --include-lang=Kotlin,Java
github.com/AlDanial/cloc v 1.96 T=0.39 s (2388.4 files/s, 249735.1 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Kotlin                         766           8867          16675          52105
Java                           156           2371           3381          13006
-------------------------------------------------------------------------------
SUM:                           922          11238          20056          65111
-------------------------------------------------------------------------------
Setting up Lobzik
To start using Lobzik, we need to apply the xyz.mishkun.lobzik plugin in the root build.gradle.kts file:
plugins {
    // ...
    id("xyz.mishkun.lobzik") version "0.6.0"
}
Then, we can set up the basic configuration in the same build.gradle.kts file as shown below:
lobzik {
    monolithModule.set(":app")
    packagePrefix.set("ch.protonmail.android")
    variantName.set("betaDebug")
}
Here, we set the name of our monolith module (notice the ":" in the module name!), the name of the variant we will be analyzing, and the package prefix of our classes. With this configuration, only the code in packages starting with ch.protonmail.android inside the :app module will be checked, using the betaDebug variant. This is crucial for the tool to work, because we don't want library dependencies and the Kotlin standard library cluttering our dependency graph.
Running Lobzik for the first time
Now that we are all set up, we can run Lobzik for the first time using the command:
./gradlew lobzikReport
If everything was set up correctly, you will find the report at build/reports/lobzik/analysis/report.html in your project root. Now let's take a closer look at how to interpret this report.
Interpreting the Lobzik Report
The Lobzik report consists of four sections:
- Core Candidates
- Monolith Modules Table
- Module Graphs
- Whole Graph
Monolith Modules Table
The first thing that catches our eye is the Monolith Modules Table. It lists all of the modules detected by Lobzik. They can be sorted by several metrics: conductance, cut and monolithCut.
The conductance score is the core metric of this part of the report, as it indicates the benefit-to-effort ratio of extracting modules. A lower score is preferable, with a score of 0 indicating that extracting the module requires virtually no effort since it has no dependencies on other modules.
The cut and monolithCut scores show how many dependencies must be broken to extract the module. They help refine the estimate of how much effort the extraction will take.
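To make these metrics concrete, here is a minimal Python sketch of cut and conductance on a toy dependency graph. It assumes the textbook definitions (cut is the number of edges crossing the module boundary, and conductance is the cut divided by the smaller of the two sides' total degree) and treats dependencies as undirected. The article does not spell out Lobzik's exact formulas, so treat this as an illustration rather than the plugin's implementation; the class names are made up.

```python
# Hypothetical class-level dependencies (from, to); not taken from ProtonMail.
EDGES = [
    ("LoginActivity", "LoginViewModel"),
    ("LoginViewModel", "AuthRepository"),
    ("AuthRepository", "ApiClient"),
    ("MailboxActivity", "MailboxViewModel"),
    ("MailboxViewModel", "MessageRepository"),
    ("MessageRepository", "ApiClient"),
    ("MailboxActivity", "LoginActivity"),
]


def cut_size(edges, module):
    """Number of dependencies with exactly one endpoint inside the module."""
    return sum((a in module) != (b in module) for a, b in edges)


def volume(edges, nodes):
    """Sum of degrees of the given nodes, treating dependencies as undirected."""
    return sum((a in nodes) + (b in nodes) for a, b in edges)


def conductance(edges, module):
    """Cut divided by the smaller volume of the module and its complement."""
    all_nodes = {n for edge in edges for n in edge}
    rest = all_nodes - module
    denom = min(volume(edges, module), volume(edges, rest))
    return cut_size(edges, module) / denom if denom else 0.0


login = {"LoginActivity", "LoginViewModel", "AuthRepository"}
print(cut_size(EDGES, login))
print(round(conductance(EDGES, login), 3))
```

A candidate module with few boundary-crossing dependencies relative to its internal connectivity gets a low conductance, which is why a lower score suggests a cheaper extraction.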
The names of the modules are automatically generated from their classes using the TF-IDF method. Clicking on a module name will take us to the detailed report in the Module Graphs section.
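As a rough illustration of TF-IDF naming, the sketch below splits class names into words and picks the word that is frequent inside a module but rare across modules. The tokenisation, the scoring variant, and the sample classes are all assumptions made for this example; the article does not describe how Lobzik tokenises or weights class names.

```python
import math
import re
from collections import Counter


def tokens(class_name):
    """Naively split a CamelCase class name into lowercase words."""
    return [t.lower() for t in re.findall(r"[A-Z][a-z0-9]*", class_name)]


def module_names(modules, top=1):
    """Name each module by its highest-scoring TF-IDF terms."""
    docs = {
        m: Counter(t for c in classes for t in tokens(c))
        for m, classes in modules.items()
    }
    n = len(docs)
    # Document frequency: in how many modules each term appears.
    df = Counter(term for counts in docs.values() for term in counts)
    names = {}
    for m, counts in docs.items():
        total = sum(counts.values())
        score = {t: (c / total) * math.log(n / df[t]) for t, c in counts.items()}
        # Highest score first; ties are broken alphabetically.
        best = sorted(score, key=lambda t: (-score[t], t))[:top]
        names[m] = "-".join(best)
    return names


# Hypothetical modules, for illustration only.
MODULES = {
    0: ["LoginActivity", "LoginViewModel", "LoginRepository"],
    1: ["MailboxActivity", "MailboxViewModel", "MailboxRepository"],
}
print(module_names(MODULES))
```

Words such as "activity" or "repository" appear in every module, so their inverse document frequency is zero and they never win the naming contest.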
Module Graphs
This section contains per-module detailed reports, each presenting three subsections:
- Dependency graph of the module and its neighbourhood
- List of all classes belonging to this module
- List of dependencies that need to be broken to extract this module
Whole Graph
This section at the bottom of the report shows the dependency graph between modules, which can help identify modules that are relatively easy to extract because they have fewer dependencies on the rest of the project.
The "Star" problem
A careful reader may notice that I have omitted the first section of the report, called Core Candidates. This section is collapsed under a spoiler, but it plays a crucial role in enhancing the report's effectiveness. To fully comprehend its value, let's explore what I refer to as the "star" problem.
Let's consider a scenario where we have a class in ListUtils.kt that contains various list utilities. This class is heavily used throughout our codebase, resulting in numerous connections to other nodes in the network. Due to the high degree of connections, our community detection algorithm of choice, the Louvain method, may mistakenly identify this class as the core of a large community. It's important to note that community detection algorithms were initially designed for social networks, where such hubs represent a significant community led by an outstanding individual.
However, for a codebase modularisation problem this class should be extracted to the core modules. By doing so, we can reveal a better modularisation path for the rest of the code, as depicted in the image above. To help visualise the benefits of extracting such classes, Lobzik offers the ignoredClasses configuration parameter, which accepts a list of regexes of class names that should be excluded from the analysis.
lobzik {
    // ...
    ignoredClasses.addAll("^ListUtils$")
}
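Before committing a pattern to the build file, it can help to preview which class names it would catch. The sketch below does that in Python under one loud assumption: each pattern must match the whole simple class name. The article does not say whether Lobzik matches fully or partially, or whether it matches simple or fully qualified names, so check the plugin's behaviour before relying on this. The class list is hypothetical.

```python
import re


def ignored(class_names, patterns):
    """Return the class names that at least one pattern matches in full."""
    compiled = [re.compile(p) for p in patterns]
    return [c for c in class_names if any(r.fullmatch(c) for r in compiled)]


# Hypothetical simple class names, for illustration only.
CLASSES = [
    "ListUtils",
    "ListUtilsTest",
    "MessageUtils",
    "Message",
    "MessageDetailsActivity",
]

print(ignored(CLASSES, ["^ListUtils$"]))
print(ignored(CLASSES, [".*Utils$", "^Message$"]))
```

Anchored patterns such as "^Message$" keep the exclusion narrow, while a loose pattern can silently remove feature classes from the analysis.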
But how do we identify such classes, you may ask? These classes are commonly found in the parts of the code responsible for Dependency Injection (DI) and Navigation, as they serve as the glue that connects otherwise loosely coupled feature code. It is a good idea to ignore your Application class and well-known utility classes too. But can we automatically identify more core classes once we have eliminated the obvious ones? This is where the Core Candidates section of the report becomes valuable.
Core Candidates
The Core Candidates section presents a table of the classes that rank above the 95th percentile by the Degree or Authority metric. Reviewing this list thoroughly can help identify the classes that should be excluded from the report. In the case of ProtonMail, the following classes might be considered for elimination:
lobzik {
    // ...
    ignoredClasses.addAll(
        ".*UserManager$",
        ".*Constants$",
        ".*ProtonMailApiManager$",
        ".*Util.*",
        ".*ProtonMailApplication$",
        ".*ResponseBody$",
        "Base.*",
        ".*Module",
        "^Message$",
        "^User$",
        "^ProtonMailApi$"
    )
}
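To show how a degree-based shortlist like the Core Candidates table can be produced, here is a minimal Python sketch that keeps the classes whose degree exceeds the 95th percentile. It uses the nearest-rank percentile and undirected degree only; the Authority metric (from the HITS algorithm) is left out, and both the threshold rule and the toy graph are assumptions, not Lobzik's actual code.

```python
import math
from collections import Counter


def core_candidates(edges, percentile=95):
    """Classes whose degree is strictly above the given percentile."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    ranked = sorted(degree.values())
    # Nearest-rank percentile: the value at position ceil(p% of n), 1-based.
    rank = max(1, math.ceil(percentile * len(ranked) / 100))
    threshold = ranked[rank - 1]
    return sorted(c for c, d in degree.items() if d > threshold)


# Hypothetical graph: 20 feature classes chained together, all using one hub.
EDGES = [(f"Feature{i}", "ListUtils") for i in range(20)]
EDGES += [(f"Feature{i}", f"Feature{i + 1}") for i in range(19)]

print(core_candidates(EDGES))
```

In this toy graph the feature classes have a degree of 2 or 3 while the utility class has a degree of 20, so it is the only one that stands out.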
By eliminating these classes from the report, we improve the quality of the detected partition, measured by the modularity score, from 0.586 to 0.684. A great improvement! Now we can use the Lobzik report to start extracting each of the 24 detected modules one by one.
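For readers unfamiliar with the modularity score, the sketch below computes the standard Newman modularity of a partition in plain Python. It treats dependencies as undirected and unweighted, which is an assumption; the article does not state which variant Lobzik reports. The graph is a toy example of two triangles joined by a single edge.

```python
def modularity(edges, communities):
    """Newman modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    label = {node: i for i, nodes in enumerate(communities) for node in nodes}
    intra = [0] * len(communities)   # edges fully inside each community
    degree = [0] * len(communities)  # sum of node degrees per community
    for a, b in edges:
        degree[label[a]] += 1
        degree[label[b]] += 1
        if label[a] == label[b]:
            intra[label[a]] += 1
    return sum(
        intra[i] / m - (degree[i] / (2 * m)) ** 2
        for i in range(len(communities))
    )


# Hypothetical graph: two tightly knit groups linked by one dependency.
EDGES = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("D", "E"), ("E", "F"), ("D", "F"),
    ("C", "D"),
]

print(round(modularity(EDGES, [{"A", "B", "C"}, {"D", "E", "F"}]), 4))
print(round(modularity(EDGES, [{"A", "B", "C", "D", "E", "F"}]), 4))
```

Splitting the graph along its natural seam scores higher than lumping everything into one community, which is the sense in which a higher modularity indicates a cleaner set of detected modules.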
Conclusion
You can find a fork of the ProtonMail client with Lobzik integrated on my GitHub. I hope you will enjoy using Lobzik for modularising your codebase. I encourage you to try it, and don't hesitate to submit any issues to the project's GitHub.