Oh well, the matter still stands... you should learn Python.
The truth is that it's fast becoming a staple skill here in the United Counties of Actionable Insights.
See, it's not just for engineers and scientists, and this isn't another 'SQL is dead, learn this instead' piece. It's more that data analysis is becoming more than just where clauses and group bys.
No longer are the oppressed forced to load data into tables and set the right timezone, no longer do the downtrodden need to rely on Excel or Tableau to visualize data, NO LONGER DO THE WEAK NEED TO REMEMBER TO ADD A SEMICOLON AFTER EVERY STATEMENT AND A COMMA BETWEEN COLUMNS.....
Okay... here's why:
Unstructured data is getting more common, and while some teams do a good job at smushing it into a format that is consumable for a SQL database, that's not always possible or necessary. Other solutions are required if we are to provide results back to the business quickly with minimal overhead.
BI is also moving away from just a suite of reports to a suite of products: products that are part of a CI/CD pipeline and developed using languages other than SQL. While some of them will be built in Java or C, you can build a lot of things with Python.
And while it may seem like Data Science is sprinting in the opposite direction from BI, it's really not going to be long until Data Science is seen as a part of BI. Expect every self-respecting Data Platform to have at least one AI/ML/NN model that sits alongside the other models in the platform. At the moment AI/ML is primarily being written in Python, and there's no guarantee you'll be able to use a Data Science model using SQL.
And finally, BI has mostly been the process of looking at present and historic data and using those findings to guess at what you can do in the future. Data analysis is trending more and more towards statistically sound predictive analysis, and SQL currently isn't that good at that sort of thing: it reads tables, not algorithms. There's no guarantee SQL will be able to leverage future predictive analysis, or be the best tool for it.
The day of the spreadsheet is nearly over. Excel caps out at just over a million rows per sheet (1,048,576), and in the world of big data a million records could be thirty minutes of events. Excel is not the ad hoc analysis tool of the future, Python is. Crunching a couple of million rows of data in Python using Pandas is stupid easy: you can load in as much data as your RAM can take with no setup overhead, and if you're crunching too much data Python allows you to batch process it or randomly sample it, all with a few lines of code.
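To give a feel for the "few lines of code" part, here's a rough sketch of batching and sampling with Pandas. The file name events.csv and the amount column are made up for the example, and the script writes its own small file so it runs on its own; a real file would just be a lot bigger.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a small stand-in file so the sketch is self-contained.
# "events.csv" and its columns are made up for illustration.
csv_path = Path(tempfile.gettempdir()) / "events.csv"
pd.DataFrame({"event_id": range(1000), "amount": [1.5] * 1000}).to_csv(
    csv_path, index=False
)

# Batch processing: read the file 250 rows at a time instead of all at once.
total = 0.0
rows_seen = 0
for chunk in pd.read_csv(csv_path, chunksize=250):
    total += chunk["amount"].sum()
    rows_seen += len(chunk)

# Random sampling: keep 10% of the rows (seeded so the sample is repeatable).
sample = pd.read_csv(csv_path).sample(frac=0.1, random_state=42)

print(rows_seen, total, len(sample))
```

The chunked loop only ever holds 250 rows in memory at a time, which is the trick that lets you get through files bigger than your RAM.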
BI is as much about the back end as it is the front end. You can use Python as part of your ETL process, you can automate tasks, monitor platforms or even build better capabilities.
For example, Airflow is a data pipeline tool that is configured in Python, and you can use it to move data between systems.
In my team we've used Python to read our SQL code and produce test scripts (article incoming).
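To be clear, this isn't our actual tooling, just a toy sketch of the general idea: read some SQL, work out which tables it touches, and spit out a basic check for each one. The query, the table names and both helper functions are invented for the example, and the regex is deliberately naive (it won't cope with subqueries, CTEs or comments).

```python
import re

# A made-up query standing in for real project SQL.
SQL = """
SELECT o.order_id, c.customer_name
FROM sales.orders o
JOIN sales.customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= '2020-01-01'
"""


def find_tables(sql: str) -> list[str]:
    """Return the names that follow FROM or JOIN, in order, without repeats."""
    names = re.findall(
        r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", sql, flags=re.IGNORECASE
    )
    return list(dict.fromkeys(names))


def make_row_count_tests(sql: str) -> list[str]:
    """Emit one row-count query per table the SQL reads from."""
    return [f"SELECT COUNT(*) AS row_count FROM {t};" for t in find_tables(sql)]


tests = make_row_count_tests(SQL)
for test in tests:
    print(test)
```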
One of our scientists needed a data dump off of one of our internal systems and our platform team didn't have resource spare to get that data through our traditional channels, so they used Python to ping the API and directly import it in (don't worry it was above board).
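I don't have their script to hand, but the shape of that kind of job is roughly this. The URL and the payload are hypothetical, and the sketch assumes the API hands back a JSON list of objects; the sample payload is there so it runs without touching a network.

```python
import json
from urllib.request import urlopen

import pandas as pd

# Hypothetical endpoint; swap in the real URL and whatever auth it needs.
API_URL = "https://example.internal/api/readings"


def fetch_records(url: str) -> list[dict]:
    """Call the API and return the decoded JSON (assumed to be a list of objects)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def records_to_frame(records: list[dict]) -> pd.DataFrame:
    """Flatten a list of (possibly nested) JSON objects into a table."""
    return pd.json_normalize(records)


# A made-up payload in the shape the sketch assumes.
sample_payload = [
    {"id": 1, "sensor": {"name": "a1", "unit": "C"}, "value": 21.5},
    {"id": 2, "sensor": {"name": "b7", "unit": "C"}, "value": 19.0},
]
df = records_to_frame(sample_payload)
print(df)
```

Against a real endpoint you would call records_to_frame(fetch_records(API_URL)) instead of using the sample payload. Nested fields come out as dotted column names such as sensor.name.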
Python has been described as 'the second best coding language for everything' and it really does so many things effortlessly. Setting up a local web server to host a web app takes only a handful of lines of code using Flask. We really are in the future.
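Here's roughly what the smallest Flask app looks like, assuming Flask is installed. The route and the message are placeholders.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def index():
    # Whatever this function returns is sent back to the browser.
    return "Hello from Flask"
```

Adding app.run() at the bottom of the file starts Flask's built-in development server on your machine. That server is meant for local tinkering, not production.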
It's because of the above reasons that Python should be the next thing you learn in your data career. It's going to offer you a more flexible and feature-rich way to analyse data or improve the way you work than any other tool in your existing arsenal.
Well, Python can feel overwhelming to learn because it can do almost anything, so we'll just focus on analyzing data with Python.
You do this using Pandas and Jupyter.
Pandas is a library you import into Python, and it brings with it the functionality to hold data in virtual tables (called DataFrames) and analyse it.
Notable things you can do in Pandas:
- Import data from a CSV / API / Parquet / or the clipboard (love that one)
- Select, transform, join, group, aggregate just like SQL
- Describe - tell Pandas to look at a data set and describe it to you (the method is called describe()) and it will run away and come back with all sorts of summary statistics about your data set: counts, means, mins, maxes, quartiles, the works!
- Pivot data (management love pivots)
- Graph your data (using Matplotlib)
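Here's a quick tour of most of that list in one go. The sales and managers tables are invented for the example; the Pandas calls themselves (loc, merge, groupby, describe, pivot_table) are the standard ones.

```python
import pandas as pd

# Made-up data, purely for illustration.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 150, 80, 120],
})
managers = pd.DataFrame({"region": ["North", "South"], "manager": ["Ana", "Bo"]})

# Select and filter (like SELECT ... WHERE revenue > 90)
big = sales.loc[sales["revenue"] > 90, ["region", "revenue"]]

# Join (like LEFT JOIN ... ON region)
joined = sales.merge(managers, on="region", how="left")

# Group and aggregate (like GROUP BY region with SUM)
by_region = sales.groupby("region")["revenue"].sum()

# Summary statistics: count, mean, std, min, quartiles, max
summary = sales["revenue"].describe()

# Pivot: one row per region, one column per quarter
pivot = sales.pivot_table(
    index="region", columns="quarter", values="revenue", aggfunc="sum"
)

print(by_region, summary, pivot, sep="\n\n")
```

With Matplotlib installed, something like by_region.plot(kind="bar") turns the grouped result into a chart.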
Jupyter is the software you should use. It takes the form of a living document and allows you to present text and code in a chronological format.
Why is this important? Unlike SQL, Python won't show its results unless you ask it to, and traditional code environments will print Python's output in a terminal. Jupyter is the best tool for learning on: you can write code and execute it in blocks, and as you learn you can keep adding blocks while still being able to see the earlier ones.
So of course there are a million YouTube videos and interactive code camps out there for you to pick from.
The video that helped me most was by this guy, Keith Galli, maybe because he seems genuinely interested in showing you Pandas and not growing his brand....
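A tiny illustration of the "won't show its results unless you ask" point, with made-up numbers:

```python
orders = [120, 80, 45]
total = sum(orders)

total         # in a plain script this line displays nothing
print(total)  # this is how you ask for the result
```

In a Jupyter cell, the value of the last expression is displayed automatically, so ending a cell with total on its own is enough to see it.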