Fine-tuning an old data-processing pipeline

#devops #rambling #datascience

For work, I've built a data processing pipeline that generates prediction models from data. There are a few steps in there, namely a few optimization routines that we put the data through to obtain those prediction models, but the processing itself is still triggered mostly by hand. I have a script for each step, and shell scripts to batch-trigger each step multiple times for different datasets, but I have yet to link them all together via one function call.

One reason this one-click-to-run-entire-pipeline structure hasn't been built yet is that we have historically needed to iterate within one of the steps rather than across all of them. We don't frequently obtain new datasets, so it was often the case that step 1 would be completed and we'd iterate in the later steps. Only when we got new data would we go back to step 1.
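To make the idea concrete, here is a minimal sketch of what a single entry point could look like. Everything in it is hypothetical: the step names (`load_data`, `optimize`, `fit_model`), the toy computations, and the `start_at` parameter are stand-ins, not the real pipeline. The `start_at` argument is there because of the habit described above: most of the time only the later steps need re-running.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]
Step = Tuple[str, Callable[[State], State]]


def run_pipeline(steps: List[Step], state: State,
                 start_at: Optional[str] = None) -> State:
    """Run the steps in order, optionally skipping everything before `start_at`."""
    names = [name for name, _ in steps]
    if start_at is not None and start_at not in names:
        raise ValueError(f"unknown step: {start_at!r}")
    first = 0 if start_at is None else names.index(start_at)

    completed: List[str] = []
    for name, func in steps[first:]:
        state = func(state)       # each step returns an updated state
        completed.append(name)
    return {**state, "completed": completed}


# Toy stand-ins for the real step scripts (hypothetical).
def load_data(state: State) -> State:
    return {**state, "data": [1.0, 2.0, 3.0]}


def optimize(state: State) -> State:
    data = state["data"]
    return {**state, "scale": sum(data) / len(data)}


def fit_model(state: State) -> State:
    return {**state, "model": {"scale": state["scale"]}}


STEPS: List[Step] = [
    ("load_data", load_data),
    ("optimize", optimize),
    ("fit_model", fit_model),
]

# Full run, e.g. when a new dataset arrives.
full = run_pipeline(STEPS, {})

# Re-run only the later steps on data that step 1 already produced.
later = run_pipeline(STEPS, {"data": [2.0, 4.0]}, start_at="optimize")
```

In the real thing each callable would probably wrap one of the existing step scripts, so the shell scripts for batch-triggering could stay as they are.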

One issue with not having a structure to run the entire pipeline is that it becomes mentally taxing to remember which datasets and which parameters we used to fit the current set of models. For each step, we do record this information in a .json file, but because the records are spread across multiple places, it is hard to know for sure all the parameters involved for a given set of models.
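One way to ease this, sketched below under assumptions, would be to gather the per-step .json records into a single manifest per set of models. The function names, the step names, and the idea of one record file per step are all hypothetical; the real records may be laid out differently.

```python
import json
from pathlib import Path
from typing import Dict


def collect_run_manifest(record_paths: Dict[str, Path]) -> Dict[str, object]:
    """Read each step's JSON record and nest it under the step's name."""
    manifest: Dict[str, object] = {}
    for step_name, path in record_paths.items():
        with open(path, encoding="utf-8") as fh:
            manifest[step_name] = json.load(fh)
    return manifest


def write_run_manifest(manifest: Dict[str, object], out_path: Path) -> None:
    """Write the combined manifest as one readable JSON file."""
    out_path.write_text(
        json.dumps(manifest, indent=2, sort_keys=True),
        encoding="utf-8",
    )
```

With something like this, answering "which parameters produced these models?" would mean opening one file instead of several.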

Is this really a priority though? Perhaps it is, now that we don't have much else to do. What would be especially impressive is to integrate it with GitLab. I believe the trigger for that could be a new pull request (a merge request, in GitLab's terms) that gets merged into the develop or main branches. Then it'd process everything from step 1 to step n and make all the plots that we usually use to analyze the models' performance. That would be really sweet.

Perhaps I can start by learning how to trigger a script upon a pushed commit.
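As a first rough sketch of the script side, a GitLab CI job could call a small Python entry point that checks which branch the push landed on. `CI_COMMIT_BRANCH` is one of GitLab's predefined CI variables for branch pipelines; the rest (the function name, the set of trigger branches, the printed messages) is assumed for illustration, and the job definition in `.gitlab-ci.yml` is not shown here.

```python
import os
from typing import Mapping, Optional

# Branches whose pushes should trigger the full run (taken from the plan above).
TRIGGER_BRANCHES = {"develop", "main"}


def should_run_full_pipeline(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the CI job is running for one of the trigger branches."""
    env = os.environ if env is None else env
    return env.get("CI_COMMIT_BRANCH", "") in TRIGGER_BRANCHES


if __name__ == "__main__":
    if should_run_full_pipeline():
        print("Trigger branch detected: run steps 1 to n and make the plots.")
    else:
        print("Not a trigger branch: skipping the full pipeline.")
```

Since a merged merge request ends up as a push to the target branch, checking the branch name should cover the "merged into develop or main" case, though GitLab can also do this filtering in the CI configuration itself.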

