Previously in the series, we saw that AI tools have limitations. We also
discussed how to maximize our productivity by using them to write code - without
shooting ourselves in the foot.
In this post, we'll cover one more piece of the puzzle: how they can (or can't)
help us to refactor.
This post was heavily influenced by Thoughtworks' excellent podcast episode
Refactoring with AI.
The Goal of Refactoring
If we want to boil it down, refactoring has a simple goal: to make it easier to
change the code.
People usually equate this with improving code quality. Most of the time,
this is indeed what we want to achieve. Higher-quality code is easier to
understand, reason about, and, as a result, modify.
However, there are cases where we decrease code quality during refactoring.
Sometimes, we need to restructure the code so that we can introduce new behavior
later. For example, suppose we have a Dog class:
class Dog {
    String communicate() {
        return "woof";
    }

    String move() {
        return "run";
    }
}
We realize that we want to have other types of pets, so we introduce a Pet
superclass:
abstract class Pet {
    abstract String communicate();

    abstract String move();
}

class Dog extends Pet {
    @Override
    String communicate() {
        return "woof";
    }

    @Override
    String move() {
        return "run";
    }
}
The code got unnecessarily complicated - for the current feature set. However,
now it's much easier to introduce cats to the codebase:
class Cat extends Pet {
    @Override
    String communicate() {
        return "meow";
    }

    @Override
    String move() {
        return "climb";
    }
}
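To make the payoff concrete, here's a minimal, hypothetical usage sketch (the PetShelter class below is not part of the original example): client code that depends only on the Pet abstraction keeps working unchanged as new pet types are added.

import java.util.List;

// Hypothetical client code that depends only on the Pet abstraction.
class PetShelter {
    // Adding a new Pet subclass later only changes this list, not the loop below.
    private final List<Pet> pets = List.of(new Dog(), new Cat());

    void describePets() {
        for (Pet pet : pets) {
            System.out.println(pet.communicate() + " / " + pet.move());
        }
    }
}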
In the rest of the article, we'll focus on code quality improvement because
that's the primary motivation for refactoring.
Measuring Code Quality
We want to quantify code quality objectively. Otherwise, it'd be hard to tell
whether a refactoring actually improved it.
CodeScene has a metric called code health. Adam
Tornhill, the CTO and founder of the company, explains it in the following way:
The interesting thing with code quality is that there's not a single metric
that can capture a multifaceted concept like that. I mean, no matter what
metric you come up with, there's always a way around it, or there's always a
counter case. What we've been working on for the past six to seven years, is
to develop the code health metric.

The idea with code health is that, instead of looking at a single metric, you
look at a bunch of metrics that complement each other. What we did was that,
we looked at 25 metrics that we know from research that they correlate with an
increased challenge in understanding the code. The code becomes harder to
understand.

What we do is, basically, we take these metrics, there are 25 of them, stick
them as probes into the code, and pull them out and see what did we find.
Then, you can always, weigh these different metrics together, and you can
categorize code as being either healthy, or unhealthy. My experience is
definitely that when code is unhealthy, when it's of poor quality, it's never
a single thing. It's always a combination of factors.
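CodeScene's actual formula isn't public, and the episode doesn't list the 25 metrics, so the following is only a rough sketch of the idea Adam describes: several complementary probes, weighted together and mapped to a category. The probe names, weights, and thresholds are invented for illustration.

import java.util.Map;

// Rough sketch of combining complementary "probes" into one health category.
// The weighting scheme and thresholds are made up; this is NOT CodeScene's metric.
class CodeHealthSketch {

    enum Health { HEALTHY, WARNING, UNHEALTHY }

    // Each probe reports a penalty between 0 (no issue found) and 1 (severe issue).
    static Health assess(Map<String, Double> probePenalties, Map<String, Double> probeWeights) {
        double weightedPenalty = 0;
        double totalWeight = 0;
        for (var probe : probePenalties.entrySet()) {
            double weight = probeWeights.getOrDefault(probe.getKey(), 1.0);
            weightedPenalty += weight * probe.getValue();
            totalWeight += weight;
        }
        double score = totalWeight == 0 ? 1.0 : 1.0 - weightedPenalty / totalWeight;
        if (score > 0.8) {
            return Health.HEALTHY;
        }
        if (score > 0.5) {
            return Health.WARNING;
        }
        return Health.UNHEALTHY;
    }
}

A probe here could stand for something like deeply nested logic or low cohesion, scaled to a penalty; the point is that no single probe decides the verdict on its own.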
Improving Code Quality With AI
Now that we can measure code quality, we can benchmark AI tools' capabilities.
Fortunately, Adam did the heavy lifting for us:
Basically, the challenge we were looking at was that, we have a tool that is
capable of identifying bad code, prioritizing it, but then of course, you need
to act on that data. You need to do something with the code, improve it. This
is a really hard task, so we thought that maybe generative AI can help us with
that. What we started out with was a data lake with more than 100,000 examples
of poor code. We also had a ground truth, because we had unit tests that
covered all these code samples.

We knew that the code does, at least, what the test says it does. What we then
did was that we benchmarked a bunch of AI services like OpenAI, Google,
LLaMA from Facebook, and instructed them to refactor the code. What we found
was quite dramatic, in 30% of the cases, the AI failed to improve the
code. Its code health didn't improve, it just wrote the code in a different
way, but the biggest drop off was when it came to correctness, because in
two-thirds of the cases, the AI actually broke the tests, meaning, it's not
the refactoring, it has actually changed the behavior. I find it a little bit
depressing that in two-thirds of the cases, the AI won't be able to refactor
the code.
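In other words, each benchmark attempt boils down to two checks: did the tests still pass, and did code health improve? Here's a hypothetical sketch of that classification (the enum and method are my own illustration, not CodeScene's benchmark code):

// Hypothetical classification of a single AI refactoring attempt, mirroring
// the two checks from the benchmark: correctness first, then code health.
class RefactoringBenchmark {

    enum Outcome { BROKE_BEHAVIOR, NO_IMPROVEMENT, GENUINE_IMPROVEMENT }

    static Outcome classify(boolean testsStillPass, double healthBefore, double healthAfter) {
        if (!testsStillPass) {
            // Not a refactoring at all: the observable behavior changed.
            return Outcome.BROKE_BEHAVIOR;
        }
        if (healthAfter <= healthBefore) {
            // The code was merely rewritten, not improved.
            return Outcome.NO_IMPROVEMENT;
        }
        return Outcome.GENUINE_IMPROVEMENT;
    }
}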
What does that mean for us? That we can't use AI to refactor the code?
Fortunately, that's not the case. Even though we can't unquestionably trust AI
to improve our code, we can utilize it in a controlled fashion:
- Detection: Static code analyzers are very efficient but have limited capabilities. They identify bad practices as patterns, and some problems are hard to capture that way. AI tools can deal with more complex cases - especially if they work with the abstract syntax tree instead of the code itself (see the sketch after this list).
- Suggestions: Once we detect a problem, we can suggest solutions. Generative AI can shine in that, too.
- Localized refactoring: If we restrict the scope of the refactoring, tools have to work with a much less complex problem space. Less complexity means less room for errors - namely, breaking the tests.
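As an example of the detection idea, here is a small sketch that inspects the abstract syntax tree rather than raw text. The post doesn't name any particular tooling; this assumes the open-source JavaParser library and an invented "too many statements" threshold.

import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.body.MethodDeclaration;

// Sketch: detect long methods by walking the AST instead of matching text patterns.
class LongMethodDetector {

    // Invented threshold; real analyzers combine many such signals.
    private static final int MAX_STATEMENTS = 20;

    static void report(String sourceCode) {
        CompilationUnit unit = StaticJavaParser.parse(sourceCode);
        for (MethodDeclaration method : unit.findAll(MethodDeclaration.class)) {
            int statements = method.getBody()
                    .map(body -> body.getStatements().size())
                    .orElse(0);
            if (statements > MAX_STATEMENTS) {
                // A real setup could hand this structural context to an AI tool
                // and ask for a targeted extract-method suggestion.
                System.out.println("Refactoring candidate: " + method.getNameAsString()
                        + " (" + statements + " statements)");
            }
        }
    }
}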
What these techniques have in common is that we keep control over what is
happening in the codebase - which is crucial with today's AI tools.
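One simple way to keep that control - hinted at by the ground-truth tests in Adam's benchmark - is to accept an AI-suggested rewrite only when it behaves exactly like the original on existing checks. The guard below is a hypothetical illustration, not part of any specific tool, and it complements rather than replaces the full test suite.

import java.util.List;
import java.util.function.Function;

// Hypothetical guard: keep the AI's output only if it matches the original
// implementation on known inputs; the real test suite still has the final word.
class RefactoringGuard {

    static <I, O> boolean behavesTheSame(Function<I, O> original,
                                         Function<I, O> refactored,
                                         List<I> sampleInputs) {
        for (I input : sampleInputs) {
            if (!original.apply(input).equals(refactored.apply(input))) {
                return false; // behavior changed: reject the suggestion
            }
        }
        return true; // looks safe, but still run the full test suite
    }
}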
Conclusion
The tools that are available today can't reliably improve code quality - or even
ensure correctness. For these reasons, tests and manual supervision are still
essential when we use AI tools.
We should take Adam's results with a grain of salt, though. The tools he tried
weren't optimized for refactoring. We saw that code quality and correctness are
quantifiable through code health and the number of failing tests. Once AI tool
creators optimize their products toward these metrics, we can expect them to
improve - significantly. I can't wait to see it.