Cross posted from my blog Analyze the Data not the Drivel
Microsoft's SSIS (SQL Server Integration Services)
is a ubiquitous ETL (Extract Transform and Load) tool. Despite its
widespread use, SSIS is not loved! At best, it's tolerated for its
undeniable utility, but SSIS's "utility" comes with a host of hideous
warts with the ugliest being its file handling.
You would think that an ETL tool would excel at file manipulation; how
else would you "extract" data? But, for reasons lost to posterity, SSIS
totally screws the file handling pooch. Now there are "sanctioned" ways
to unscrew the pooch. Use "script tasks" is a common retort. With script
tasks, you code your file handling bits in something like C# and then
use SSIS to compile and incorporate the C# parts into the larger ETL
process. Script tasks work but they're a pain to debug and maintain.
There are other pooch unscrewers. Many code everything in SQL stored procedures, others build GAC dlls, some use command tasks
to shell out to file utilities. Sadly, the abundance of SSIS file
handling work-a-arounds just affirms that SSIS blows hard here!
I could bitch about SSIS file handling for fortnights but, bitching does not solve problems. Luckily, there is an easy way to use SSIS's good bits without getting sucked into the script task swamp.
SSIS typically calls helper processes like script tasks, but if you
flip things around and call SSIS from helper processes you can easily
leverage the strengths of both SSIS and helpers.
The following Python snippet, taken from a recent project, illustrates
this technique.
all_cube_zips = etl.daily_zip_files(working_dirs=rtp.working_dirs,
runtime_parms=rtp.runtime_parms)
for cnt, zipfile in enumerate(all_cube_zips):
etl.clear_working_dirs(working_dirs=rtp.working_dirs,
runtime_parms=rtp.runtime_parms,
name_prefix="DailyRetails")
dailyretail_xml_files = etl.unzip_prefix_xml(name_prefix="DailyRetails",
daily_zip_file=zipfile,
working_dirs=rtp.working_dirs,
runtime_parms=rtp.runtime_parms)
tsv_files = dcx.write_dailyretail_tsvs(xml_files=dailyretail_xml_files,
daily_zip_file=os.path.basename(zipfile),
column_defaults=rcd.DailyRetails_column_defaults,
working_dirs=rtp.working_dirs,
runtime_parms=rtp.runtime_parms,
append_only=False)
os.system(r"\\Shares\DailyCode\bats\SelectReload.bat")
This Python program unzips hundreds of XML files to a working directory and then "flattens" them to a simple TAB delimited text file. This is typical ETL stuff. After writing the TAB delimited file Python runs a simple batch script to execute an SSIS package. The batch script is called with:
os.system(r"\\Shares\DailyCode\bats\SelectReload.bat")
The following script lines from SelectReload.bat
show how to set up and run a package with dtexec.exe
.
rem change to your dtexec directory
cd "c:\Program Files\Microsoft SQL Server\150\DTS\Binn"
rem execute an SSIS package - use fully qualified paths
dtexec /Project "\\Shares\DailyCode\ispacs\flat.ispac" /Package Reloads.dtsx
if %ERRORLEVEL% NEQ 0 goto Error18
Here, all the irritating file hacking is done with Python leaving
streamlined ETL jobs for SSIS. Both processes are simplified. Both are easy to test and debug. With software, it's always wise to try simple first.
As a final warning, before you adopt this approach be aware that you need full "Administrative Rights" to run dtexec.com
like this. Many anally retentive IT outfits will simply outlaw such renegade scripting. If you're incarcerated in Administrative Rights jail I'd suggest taking advantage of the current labor shortage and find another job. Squabbling over administrative rights is a demeaning waste of IT talent that can no longer be tolerated.
Top comments (0)