Discussion on: Consider SQL when writing your next processing pipeline

View post

If you want to check exactly what data is being output by any given processing stage of a SQL pipeline, you can simply pull out those results into some relevant SELECT query.

Ok. You have a 1200-line stored procedure that returns garbage. What exactly do you do to figure out what the "relevant SELECT query" is?

Oh, I forgot: the stored procedure also modifies some tables, because why not? That means you can't easily re-run it without affecting data, which can create heisenbugs. Good luck! :D

BenBirt • Jun 28 '19

Yup, this would be a tricky situation! To be honest, my advice would be:

only run SQL that is checked in to some source control somewhere - this rules out stored procedures, at least unless you have some release process that you trust that pushes your code to a stored procedure
figuring out the relevant SELECT is a matter of (a) having the original SQL be modular enough that you can easily do so, and (b) a reasonably standard debugging procedure (look at the output, if it's wrong then look one step before the output, etc)
it's bad practice to be running any form of processing pipeline that modifies underlying schemas, at least unless the modification is eventually consistent, for exactly the reason you mention