
Data Analyst? TIME TO LEARN PYTHON!!!🐍🐼

ronsoak ・5 min read

Do you work with data?
If so...
You need to learn the language of the snake...
No, not that snake language!...
Python.

After all...
It is the year of the Snake....
oh...
I'm being informed that it is in fact NOT the year of the snake...

Oh well, the point still stands... you should learn Python.

That means... yes, this is yet another learn Python article.

But if it wasn't good advice people wouldn't keep saying it...

The truth is that it's fast becoming a staple skill here in the United Counties of Actionable Insights.

See, it's not just for engineers and scientists, and this isn't another 'SQL is dead, learn this instead' piece. It's more that data analysis is becoming more than just WHERE clauses and GROUP BYs.

No longer are the oppressed forced to load data into tables and set the right timezone. No longer do the downtrodden need to rely on Excel or Tableau to visualize data. NO LONGER DO THE WEAK NEED TO REMEMBER TO ADD A SEMICOLON AFTER EVERY STATEMENT AND A COMMA BETWEEN COLUMNS.....

Okay... here's why:

BI is evolving

BI is rapidly moving away from just using data that fits neatly into relational databases.

Unstructured data is getting more common, and while some teams do a good job at smushing it into a format that is consumable for a SQL database, that's not always possible or necessary. Other solutions are required if we are to provide results back to the business quickly with minimal overhead.

BI is also moving away from just a suite of reports to a suite of products: products that are part of a CI/CD pipeline and developed using languages other than SQL. While some of them will be built in Java or C, you can build a lot of things with Python.

And while it may seem like Data Science is sprinting in the opposite direction from BI, it's really not going to be long until Data Science is seen as a part of BI. Expect every self-respecting Data Platform to have at least one AI/ML/NN model that sits alongside the other models in the platform. At the moment AI/ML is primarily written in Python, and there's no guarantee you'll be able to use a Data Science model from SQL.

And finally, BI has mostly been the process of looking at present and historic data and using those findings to guess at what you can do in the future. Data analysis is trending more and more towards statistically sound predictive analysis, and SQL currently isn't that good at that sort of thing: it reads tables, not algorithms. There's no guarantee SQL will be able to leverage future predictive analysis, or be the best tool for it.

Big Data is getting Bigger

Question. What do people do now when they need to analyse data outside of a data warehouse? They use spreadsheets.

However, the day of the spreadsheet is nearly over. Excel caps out at just over a million rows (1,048,576 per sheet), and in the world of big data a million records could be thirty minutes of events. Excel is not the ad hoc analysis tool of the future, Python is. Crunching a couple of million rows of data in Python using Pandas is stupid easy: you can load in as much data as your RAM can take, and if you're crunching too much data Python allows you to batch process it or randomly sample it, all with a few lines of code.
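
To make "a few lines of code" a bit more concrete, here is a minimal sketch of both ideas: batch processing with the chunksize argument of read_csv, and random sampling with DataFrame.sample. The data is synthetic and built in memory purely for illustration (the column names user_id and amount are assumptions, not from any real system); with a real export you would pass a file path instead.

```python
import io

import numpy as np
import pandas as pd

# Build a synthetic "events" CSV in memory so the example is self-contained.
# In practice you would pass a file path to pd.read_csv instead.
rng = np.random.default_rng(seed=0)
n_rows = 100_000
events = pd.DataFrame({
    "user_id": rng.integers(1, 1_000, size=n_rows),
    "amount": rng.random(n_rows).round(2),
})
csv_text = events.to_csv(index=False)

# Option 1: batch processing. chunksize makes read_csv yield DataFrames
# of at most 25,000 rows each, so only one chunk is in memory at a time.
total_rows = 0
total_amount = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=25_000):
    total_rows += len(chunk)
    total_amount += chunk["amount"].sum()

# Option 2: random sampling. Load the data, then keep a 1% random sample.
full = pd.read_csv(io.StringIO(csv_text))
sample = full.sample(frac=0.01, random_state=0)

print(total_rows, round(total_amount, 2), len(sample))
```

Chunking suits totals and counts that can be accumulated piece by piece; sampling suits quick exploratory looks where an approximate picture is good enough.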

It can also help behind the scenes

BI is as much about the back end as it is the front end. You can use Python as part of your ETL process, you can automate tasks, monitor platforms or even build better capabilities.
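
As a rough illustration of Python in an ETL role, here is a small extract-transform-load sketch using only the standard library. Everything in it is hypothetical: the inlined CSV, the orders table and the cleaning rules are made up for the example, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or API.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,not_a_number
3,Carol,7.25
"""

def extract(text):
    """Read CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Tidy customer names and drop rows whose amount is not numeric."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip bad records rather than failing the whole load
        clean.append((int(row["order_id"]), row["customer"].strip().title(), amount))
    return clean

def load(rows, conn):
    """Insert the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping extract, transform and load as separate functions makes each step easy to test on its own, which is handy once the pipeline grows.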

For example, Airflow is a data pipeline tool whose pipelines are defined in Python, and you can use it to move data between systems.

In my team we've used Python to read our SQL code and produce test scripts (article incoming).

One of our scientists needed a data dump from one of our internal systems, and our platform team didn't have the resources to spare to get that data through our traditional channels. So they used Python to call the API and import the data directly (don't worry, it was above board).
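
The original script isn't shown here, so the following is only a sketch of what "call an API and pull the data in" can look like with the standard library. The URL in the comment, the payload shape and both function names are assumptions made for illustration; a real internal API will need its own authentication and paging.

```python
import csv
import io
import json
import urllib.request

def fetch_json(url):
    """GET a URL and decode the JSON body. The URL is supplied by the caller."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))

def records_to_csv(records):
    """Turn a list of flat JSON objects into CSV text with a header row."""
    if not records:
        return ""
    # Use the union of keys, in first-seen order, so ragged records still fit.
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Stand-in for what fetch_json("https://internal.example/api/records") might return.
sample_payload = json.loads('[{"id": 1, "name": "a"}, {"id": 2, "name": "b", "tag": "x"}]')
csv_text = records_to_csv(sample_payload)
print(csv_text)
```

The fetch step is kept separate from the conversion step so the conversion can be tried out on a canned payload without touching the network.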

It really is the tool of the future

Python has been described as 'the second best coding language for everything' and it really does so many things effortlessly. Setting up a local web server to host a web app takes just a handful of lines of code using Flask. We really are in the future.

It's because of the above reasons that Python should be the next thing you learn in your data career. It's going to offer you a more flexible and feature-rich way to analyse data or improve the way you work than any other tool in your existing arsenal.

So how / what do I learn

Well, Python can feel overwhelming to learn because it can do almost anything. However, we'll just focus on analyzing data with Python.

You do this using Pandas and Jupyter.

Pandas

Pandas is a library you import into Python, and it brings with it the functionality to hold data in virtual tables (DataFrames) and analyse it.

Pandas Home Page

Notable things you can do in Pandas:

  • Import data from a CSV, an API, Parquet, or the clipboard (love that one)
  • Select, transform, join, group, aggregate just like SQL
  • DESCRIBE - call describe() on a data set and pandas will run away and tell you all sorts of summary information about it: counts, means, mins, maxes, quartiles, the works!
  • Pivot data (management love pivots)
  • Graph your data (using Matplotlib)
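
Here is a short sketch that walks through most of that list on a tiny, made-up data set (the sales and managers tables and their columns are hypothetical). It reads CSV text, filters, joins, groups, summarises and pivots, with the rough SQL equivalent noted in the comments.

```python
import io

import pandas as pd

# Hypothetical data, inlined as CSV text so the example is self-contained.
sales_csv = """region,product,units,price
North,widget,10,2.5
North,gadget,3,10.0
South,widget,7,2.5
South,gadget,5,10.0
"""
managers_csv = """region,manager
North,Ana
South,Ben
"""

sales = pd.read_csv(io.StringIO(sales_csv))
managers = pd.read_csv(io.StringIO(managers_csv))

# Transform: add a derived column (like a computed column in a SELECT).
sales["revenue"] = sales["units"] * sales["price"]

# Select / filter: like WHERE units > 4.
big_orders = sales[sales["units"] > 4]

# Join: like a LEFT JOIN on region.
joined = sales.merge(managers, on="region", how="left")

# Group and aggregate: like GROUP BY region with SUM(revenue).
revenue_by_region = joined.groupby("region")["revenue"].sum()

# Summary statistics: count, mean, std, min, quartiles, max.
summary = sales["units"].describe()

# Pivot: regions as rows, products as columns, units as values.
pivot = sales.pivot_table(index="region", columns="product", values="units", aggfunc="sum")

print(revenue_by_region)
print(summary)
print(pivot)
```

Graphing is left out to keep the sketch dependency-light; with Matplotlib installed, calling .plot() on a DataFrame or Series such as revenue_by_region is the usual starting point.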

Jupyter Notebook

Jupyter is the software you should use. It takes the form of a living document and allows you to present text and code in a chronological format.

Why is this important? Unlike SQL, Python won't show its results unless you ask it to, and traditional code environments will output results in a terminal. Jupyter is the best tool for learning, as you can write and execute code in blocks, then grow your code block by block as you learn while still being able to see earlier blocks.

Tutorials

So of course there are a million YouTube videos and interactive code camps out there for you to pick up.

The video that best helped me was by this guy, Keith Galli. Maybe that was because he seems genuinely interested in showing you Pandas and not growing his brand....


Who am I?

You should read....

Posted by: ronsoak (@ronsoak)

Data Analysis Team Lead at Xero in Wellington NZ. Dev tag moderator and passionate about space! All views expressed here are my own.

Discussion

 

I am already exploring this domain professionally, as I've seen the light you allude to. Python truly is powerful. If you want an example... there was a dashboard I ran monthly, until the data stopped existing, that took 2.5 days to do by hand. I eventually automated 99.7% of it with Python, reducing the run time to... 10 minutes! Talk about huge time savings.

 

That’s great news. If you haven’t already, could we tempt you to write an article about that? Sounds like an epic win!!!

 

Given that I haven't even touched SciPy yet for the most part and I've already seen huge wins from Python, let alone written here about it... I suppose I should do that, then!

It sounds like a great first article idea! We always need more writers! Tag me in it when you do!

 

So many gifs are quite distracting.

 

Exactly what I was thinking

 

As another old SQL head that has seen the Python light I fully concur with this article.

Also I don't know how long it took you to find those gifs but they are GREAT.

 

Cheers Nathan, it took a long time to find the Voldemort one.