loading...

That’s not very Data Warehouse - an argument against elegant SQL code

ronsoak profile image ronsoak ・4 min read

In my role I build one off reports for clients.

Recently I had my code peer-reviewed (as we do for all code) and it got the green light.

Great! Two thumbs up... it does what it set out to do and won’t lie to the customer.

However, that peer review came with some advice. My methodology wasn’t very ‘data warehouse’ and wasn’t a methodology that gets used in advanced applications.

Now I’m not here to write a scathing (and cowardly) response to that person’s unsolicited advice. I’m here to challenge the culture where that advice came from.

See I have trained four different people SQL in the past few years and every single one of them has naturally, for some unknown reason, gravitated towards the exact same fallacy, that; the more joins, aggregations, functions, CTE’s and sub-queries they can fit into the least amount of code the better a coder they are. They believe elegant code is the best code.

This is what they meant when they critiqued my methodology, it wasn’t elegant, and I think that’s a load of BS.

I’m not sure how everyone ends up in the same mindset, but I know why elegant code is wrong.

Simply put, I don’t write my SQL scripts for me, I write them for my tester and/or for the member of my team who has to read it later (for whatever reason).

To do that I keep it simple.

I write lots of notes and break the script out into bite sized chunks attempting to only do one or two things at a time.

I do this so another person can easily understand what the script is doing, for a tester this makes it easy to follow and the test points obvious. And if another coder has been tasked with enhancing that code or wants to extract out chunks for their own purposes they can do so easily. They don’t have to reverse engineer a nuclear warhead just to understand.

Sometimes I’ve had code so elegant that a simple addition required hours of retooling.

Yes there are scenarios where adding an extra column in step 1 benefits step 7 and I have no issue with that provided its properly noted, however when you can add bits to step 1 that benefit step 3, 8, 9, and 11 this becomes too complicated for a third party to quickly understand, and if your tester struggles to understand your code, the risk of them missing something increases.

The other thing I do is reduce complexity by not doing too many joins, unions or CTE’s. Every time you do something like this you increase or decrease your row count, do too many and you lose the ability to say with 100% certainty that you are loosing or gaining rows with accuracy, especially when dealing with large data sets.

A good tester will try verify that every join is doing exactly what its set out to do, by making six joins to a source dataset means your tester has to do a lot more heavy lifting to make sure its doing what its meant to.

For all you know, a bad join has lost you a hundred thousand rows but that may not be apparent to you by doing a simple count(*) at the end, and the complexity of your code has just decreased the chances your tester will pick up on that.

That is how I code, it benefits my team and my testers and while we will always find ways to improve I’m not going to deviate from the above points just because it doesn’t look elegant or in hindsight it could have been half the length. You can always look back on code and make it better, however just because you can re-write code, doesn’t necessarily mean you should, that’s certainly not a valid point for my team, our goal is get to the results faster, safer.

Not to mention: elegant code isn’t significant.

Code should be accurate and not needlessly slow. Elegance is subjective, human subjective, fundamentally a SQL compiler doesn’t care whether it gets asked to do 12 easy statements in a row or whether you have smushed all 12 steps into an ‘elegant’ mega query. It doesn’t care.

And so that’s what I wanted to challenge. Write code that works for your teams goals, don’t strive to make your code elegant, don’t re-write it to make it smaller if it’s going to make a testers job harder, and don’t write it for you. You already know what your doing and why, but no one else can read your mind, they can only read the script and the ticket, if it doesn’t help them understand then that’s not their failure it’s yours.

Posted on by:

ronsoak profile

ronsoak

@ronsoak

Data Analysis Team Lead at Xero in Wellington NZ. Dev tag moderator and passionate about space! All views expressed here are my own.

Discussion

markdown guide
 

Code should be first and foremost readable. If it is readable then it can be fixed by someone, it can be changed by someone, it can be optimised by someone.

Unless you are working with an extremely resource limited environment, then there is no reason why your code should not be Dev first.

Great post.

 

I've worked doing SQL for back end operational systems AND data warehousing. SQL is literally the one constant in my 20 years of professional work.

Whoever said that about your code clearly missed the mark. Unless you were doing something horribly inefficient, making it readable and traceable is actual "elegance". Doubly so if you're doing analysis that might have to be demonstrated to someone else, or anything that gets reused down the road (dashboards, data integration, etc.) or has to be repeatable.

Good post and that person has the cowboy mentality that makes me think they have a fiefdom to protect via obscurity. Fail.

 

I love to read SQL articles from you. More of this please.

 

That nails one of the problems I have with SQL. It often - not always - puts performance at odds with readability.

Part of it is the need for set-based operations. The logic we want to implement applies to individual records. If the record contains this, do that. But once we transform that intent into set-based queries the original logic is often difficult or impossible to discern.

I know that this is a worst-case scenario and it doesn't always have to be this bad, but I've seen SQL where I know for a certainty that no one in a week would be able to read it and understand what it did or why. The developers often couldn't keep track while they were writing it. It was unmaintainable in the most literal sense because the next developer was 100% certain to give up and rewrite it.

It was compounded by the need to use Azure SQL Data Warehouse which deprived us of whatever tools T-SQL offered to break up code into testable units.

That one scenario led me to write this blog post. The point was that 1,000 lines of untestable and unreadable code will be bug-ridden and unmaintainable. We can say "It's SQL, that's how we write it," but the code won't somehow be nice to it because it's SQL and we had no choice. It will punish us without mercy.

 

Great first post ... welcome to Dev.to :D