loading...
Cover image for The assembly line, code, and Big O.

The assembly line, code, and Big O.

yelluw profile image Pablo Rivera ・3 min read

I'm currently refactoring a really messy codebase at work. It has made me think about the reasons about why it's messy. Last night, while watching an episode of Stranger Things, I realized why it was so awful.

Good code works like an assembly line

The reason this codebase is bad is simple: there is no clear step by step process of the state of the data flowing through it. There many things happening at once. The density of each line of code is very high. Global state is also used and abused. Understanding the status of the data at a given point in time is difficult. It requires detective work. Lots of it.

If we compare the old code to the refactor, the one thing that stands out is that the data clearly flows from one point to another. You can clearly understand how the input is transformed into the output. Just like how assembly lines work.

Wikipedia defines an assembly line as:

... a manufacturing process in which parts (usually interchangeable parts) are added as the semi-finished assembly moves from workstation to workstation where the parts are added in sequence until the final assembly is produced...

Using that definition we can make a connection that good code mimics an assembly line. It breaks down processes into steps (workstations) and things get added or removed in sequence until the output is produced. I don't know about you, but DRY, dependency injection, and a whole lot of other design patterns immediately came to mind.

An assembly line aims to for a constant O(1)

The other thing that annoys me about this codebase is the fact that every step is, at best, O(n). Accessing items on a hash table uses loops. Strings are built by exploding other strings. Just downright inefficient. There wasn't any thought of getting things done with the least amount of work. Good code is lazy. This one is as overworked as a fast food worker (I was one).

In contrast, the refactored code always aims for O(1). It doesn't always achieve it, but a lot of thought has been given to the data structures used. You see, the trick to performance is to always think about what data structure works best for this particular process. You also need to think in terms of the content of the data structures themselves. This messy code maintains every piece of data on global state. The refactored code keeps things local.

What does state have to do with performance? Let's use the assembly line example again. Imagine that you are building a car, and it's time to install the doors. Instead of the doors being next to you in your workstation, you have to walk all the way to the back of the factory to get them. That's not all. When you get there the doors are not sorted by color or side. You have to manually go through each door to find the one that fits. That's global state. Having one door for each side of the car in the correct color at the correct moment in the assembly process greatly improves efficiency. Your job is now simply to bolt the doors. Six bolts in total. One O(1) operation.

When I say that an assembly line aims for a constant O(1), what I mean is that each step is broken down to the point where the operations do not require a lot of processing. Each of the previous steps in the assembly line breaks down everything in smaller steps and state is always kept local.

Software engineering and manufacturing

I know there are a lot of people who are against the term software engineering. I understand why. There really isn't a by-the-books approach to it, yet. The industry is still too young for anything to be highly formalized. However, we can learn a lot from manufacturing. Studying how assembly lines work these days will provide a lot of insight. I suggest you look into it.

Posted on by:

yelluw profile

Pablo Rivera

@yelluw

I build awesome stuff at Yelluw. My blog --> pablojuan.com

Discussion

pic
Editor guide
 

I know there are a lot of people who are against the term software engineering. I understand why. There really isn't a by-the-books approach to it, yet. The industry is still too young for anything to be highly formalized.

I assure you, that's not the reason. It has nothing to do with age, and everything to do with the nature of programming itself. The "engineering", "assembly line", and "modular" (a.k.a. "LEGOLAND") premise has been proposed countless times over the past 30 years, and summarily proven wrong every single time. No criticism to you, of course, because it feels like it should work that way...but it doesn't hold water in practice, no matter how much research and management gusto we throw behind it.

This isn't engineering because programming isn't science. You need to read my article, The Cake Is A Lie, and then go get a copy of Dreaming in Code by Scott Rosenberg for the full explanation.

Don't misunderstand me, there may well be lessons we can draw from engineering and the assembly line. It doesn't nullify some of your ideas here. It's just that the analogy falls apart when you go beyond a few lessons learned.

 

Having worked at manufacturing companies, I have to agree with you on this one. The difference will become readily apparent if you have the misfortune to be directly managed by manufacturing oriented engineering management rather than a buffer layer of management "spear catchers" who know software.

In one case, a new operations manager asked why we couldn't just add a second shift of programmers to insure that we meet our project target dates. Others have come up with similar bad ideas.

On the flip side, I had to make them understand how the software development process was different. This often proved a challenge because they want to look at classic project management metrics like "lines of code" and such. Even worse, they think programming a complex web app isn't any different from them recording an Excel macro or doing a calculation and a chart in Matlab.

 

If it helps, I developed the Quantified Task Management (QTM) standard specifically to address management's need/desire for realistic programming metrics.

 

I appreciate your answer. It is refreshing to engage about this topic in an open way.

This point I was trying to convey with the post is that we could approach writing programs with the intent of keeping things simple and efficient. The comparison to the assembly line is to guide the point towards the efficiency of doing one thing fast and right over doing multiple things at once.

I don't dare think software should be built in an assembly line.

 

I can appreciate that.

Of course, whether we should do one thing at a time or multiple things at a time also depends on the nature of the program. One change may necessitate seven others, two of which have to be done concurrently to avoid breaking the software. And then you have the complexity of multiple people working on the code at the same time; one person can be adding feature A, and another adding feature B, and a third fixing bug X. They don't think they're working in the same area (or maybe they do), but they wind up slamming into each other because A needs C and B needs D (which is incompatible with C) and the fix for X blocks both C and D.

Software is messy. :P

But yes, all that aside, I agree that we should (as much as possible) focus on doing one thing right instead of eighteen things half-right.

Now, there is a time when a code base can become TOO DRY (and DRY spaghetti is still spaghetti). Just take a look at the source for libstdc++ (compare and contrast it with LLVM's libc++.)

If we try to make our code too modular, or modularize it too early, it can become harder to maintain, feature-bloat quickly, and take until doomsday to release.

Let's take, for example, my C++ testing library, Goldilocks (part of PawLIB). I wrote it with some interfacing functions, but I hardcoded the shell for it in another application that was using Goldilocks. I needed it functional quickly, so that was NOT the time to build the robust shell.

Then, last week, I knew I needed to set up another project to use Goldilocks as well. Instead of C/Ping the hardcoded shell, I moved that into PawLIB proper, and added some abstraction. I got it functioning well...which, as I think you pointed out, is priority one.

Next, I realized I wanted to simplify the Goldilocks shell commands. While I was at it, I wanted to cut down on repetition, so I moved some common argument validation code into a separate function, and wrote a function for each major shell command. It worked, yay, moving on.

My eventual goal is to actually abstract to the level of writing Blueshell, which will allow an end-developer to just specify the commands, their argument data, and their function callbacks within a Blueshell instance; the shell is automatically created, with all the features you'd expect. That is a level of abstraction a ways down the road, as it will take quite a bit of work to achieve. I'm not there yet, but GoldilocksShell has moved me a few inches closer. I just add one abstraction and feature at a time, until I achieve that eventual modular structure.

So, that's kinda what you're describing, in my mind, but there's an important lesson to be learned for other readers. Well built code isn't always modular in nature. Unlike an assembly line, we can't start with a chassis and add piece by piece to make the final car. Instead, we start with a golf cart someone put together in a hurry (the critical and unavoidable proof-of-concept), and we have to use our assembly line process to convert it into a sports car.

 
 

I think you may be confusing the issues of global state and efficiency. Whether something is "global" rarely impacts its algorithmic complexity. The structure of the data, not where it's located, is what impacts the complexity.

The comparison and use of O(n) also feels incorrect. Though efficiency and modularity may be related, I think you may be confusing the topics. You may also be using Big-O in a few places where Big-Theta, or just the time complexity function itself is desired.

Am I being picky? Yes, I am. But if you're looking "for anything to be highly formalized" (your words), this is an area that is. O(n) has a specific meaning and it should be used correctly if formalization is the goal.