I currently work for a great company called Xero, doing all sorts of fun data things. When you're reading my articles, understand that my words are my own: I'm not speaking on behalf of my employer, and if I'm talking about something negative in the field, it may not be indicative of Xero. I've worked many interesting roles and I read a lot about my field.
There are a boatload of articles across the blogosphere outlining all the amazing reasons you should switch to a career as an analyst. However, as someone who has been an applications analyst, an information analyst, an operations analyst, a data analyst and now a Data Analyst Team Lead, I feel obliged to outline the stuff those articles seem to consistently gloss over.
Now the goal isn’t to scare you away. Due to the lack of formal education around data, a vast majority of those working in data have come from another profession which makes it a very welcoming field for those making the switch, as we’ve all been there. But it’s important you make the switch fully informed.
Data and databases in this form have been around for nearly half a century; the Structured Query Language (SQL) first appeared in 1974! However, this hasn't brought with it 50 years of advancement per se. The field has moved along, and some key people have worked very hard to get it here, but compared to other fields in tech we appear quite immature.
Compare us to software development, which in less time has built entire coding frameworks to enforce standard ways of coding, and even end-to-end CI/CD pipelines with automated testing.
Data on the other hand....
The actual SQL language differs depending on the database you have, and you need to remember those differences yourself and adapt; no one has made a universal SQL interpreter yet.
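To make the dialect point concrete, here's a tiny sketch using Python's built-in sqlite3 module (the table and values are invented for illustration). The "give me the top N rows" query alone is written differently across engines: SQLite, PostgreSQL and MySQL use `LIMIT`, while T-SQL (SQL Server) uses `TOP`.

```python
import sqlite3

# Hypothetical data: a tiny sales table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 10.0), (2, 25.0), (3, 5.0)])

# Works on SQLite (and PostgreSQL/MySQL):
top_two = conn.execute(
    "SELECT id, amount FROM sales ORDER BY amount DESC LIMIT 2"
).fetchall()

# The T-SQL equivalent would be:
#   SELECT TOP 2 id, amount FROM sales ORDER BY amount DESC
# Run the LIMIT version against SQL Server and you get a syntax error --
# exactly the kind of difference you end up carrying around in your head.
```

Neither version is "wrong"; they just don't travel between engines, which is the whole problem.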
Then, depending on the version of the database, some functionality may be missing, and not all databases get an upgrade. I remember one role where I was responsible for two T-SQL 2003 DBs and one 2009... in 2017. Sometimes I would come across the perfect solution on Stack Overflow only to find that the code didn't work on those older versions.
Best practices also differ depending on the underlying technology. A common example is the shortcut known as a CTE (common table expression). In databases where the memory is centralised, CTEs can be very quick to execute, and while not considered best practice in production code, they're very commonly used in day-to-day ad hoc queries. On databases that run on distributed processing with multiple memory banks (i.e. the job is shared across nodes), CTEs can run horribly and should be avoided at all costs.
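For readers who haven't met one, here's a minimal, made-up CTE example, again run through sqlite3. On a centralised-memory engine like SQLite this is a convenient shortcut; on some distributed engines the same pattern can force the intermediate result to be recomputed or shipped between nodes, which is why the advice differs per platform.

```python
import sqlite3

# Hypothetical orders table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("a", 100.0), ("a", 50.0), ("b", 20.0)])

# The CTE ("totals") names an intermediate result so the outer
# query reads cleanly -- a shortcut, not a stored table.
big_spenders = conn.execute("""
    WITH totals AS (
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
    )
    SELECT customer FROM totals WHERE total > 60
""").fetchall()
```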
The lack of coding frameworks isn't for a lack of trying. The key difference between the data field and the software development field is that the data is completely different in every single company, which makes coding highly context dependent, and that's why we don't have as much opportunity to standardise the code going in or to automate what's coming out.
This problem of highly contextual data goes beyond just the SQL language itself; some analysis is done in SAP, Python, SPSS, and R, and the problem remains the same in each of those tools.
We're also a volatile field. No one has really settled on a best practice for very long, and you'll see a lot of migrations in your time as companies ditch now-obsolete hardware or methodologies for the new shiny, only to do the same again within 2-5 years. I've witnessed six migrations in seven years.
We'll get there eventually, but until we do, data analysis will continue to be very manual, inconsistent, and messy, which just slows everything down. Sharing code is hard because everyone codes differently, the databases are different, and the underlying data is contextual. Code that works on one platform won't work on another, and testing someone's code often comes down to 'best efforts' at the time. It's not an exact science, and as other fields get better at what they do, it leaves us looking a bit immature by comparison.
In the same vein as the above point, your technical ability will be limited by what systems you use and what's being asked of you. For many, a career as an analyst is a responsive thing: you'll provide what you're being tasked to do.
99% of what my customers need is in our SQL database, which means my skills in other languages are not being challenged. I know Python, have dabbled in Google Analytics, and I'm familiar with doing analysis on JSON, but I'm forever losing my edge, as all I need to do my current role is SQL.
In some roles this can limit your ability to grow out of your role and into the next, and we don't all have the luxury of dedicating time outside of work. You can potentially find yourself stunted in a role, unable to apply for other roles because you don't meet their requirements, while your current role can't grow you in that direction.
Your customer has an influence on this too. If all that's been requested of you is ‘simple’ stuff you again won’t have an opportunity to grow. A rising tide will lift all boats and it may be necessary to educate your customer base (or wherever your workload comes from) in order for you to be stretched mentally in the role.
Some people may end up in what we call 'report farm' roles, where you just pump out reports for a demanding user base. These can be some of the most unrewarding roles, and sometimes the least technical (i.e. just spreadsheets). I would of course caution you to try and avoid these roles, but they are hard to distinguish up front.
This isn't a job where you hide in a back room alone for eight hours a day chugging coffee. Nine times out of ten your job is to give other people reports. You will need to talk to those people and often before, during, and after you’ve delivered a report.
Unless your customer has been an analyst in a previous life, they will have no context around how long something should take or whether it's feasible. They will almost always assume it's easy and will want it by tomorrow. They will also not be very data literate, and I've found expectations will most likely be in the polar opposite place from where you need them to be.
They’ll be overly concerned about accuracy when they don’t need to be and will want you to ‘wing it’ and ‘just find a pattern’ in the most nebulous of data sets. They will never really 100% understand what they are asking for and you’ll quickly learn the difference between deliberate scope creep and scope creep as a result of the customer starting to understand what they really need at the later stages of the analysis.
They will be rude and demanding at times, uninterested in answering your questions; 'just do it,' they'll say, and sometimes they just won't understand the problem. Your job, every step of the way, is to be nice and hold their hand. You'll quickly discover that the only way to navigate those scenarios is to proactively manage your customer. It's very hard work, but believe me, leaving your customer in the dark and not managing them will cause more work in the long run and a deteriorating stakeholder relationship.
I nearly always hire staff with good stakeholder management skills over those with technical ability. I can teach you SQL more easily than I can teach you to be a customer service superstar.
This job is actually nigh impossible to prepare for if you're not already in the field. All the coding academies and Medium articles will have you believe that learning SQL and how to make a graph is 90% of the job: they'll have you do some left joins and a group by, and you're ready for your first day.
Sadly that's not the case.
Knowing SQL is 20% of the job. It's like how knowing how to drive is only a small part of being a taxi or Uber driver. SQL teaches you how to navigate the company's data; learning the data, much like a taxi driver learning the city, is most of the job. When someone says they want sales figures for a certain region, but only for customers that have done X activity, you need to know where all that data is housed, how it functions, and how to join it together. And if you don't know, you're going to have to figure it out.
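A sketch of that exact request, with invented tables and names: the sales figures, the region, and the "X activity" each live in a different table, and the real work is knowing how they relate.

```python
import sqlite3

# Hypothetical schema: region lives on the customer, sales and activity
# are recorded separately, keyed by customer id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, region TEXT);
    CREATE TABLE sales     (customer_id INTEGER, amount REAL);
    CREATE TABLE activity  (customer_id INTEGER, event TEXT);
    INSERT INTO customers VALUES (1, 'north'), (2, 'north'), (3, 'south');
    INSERT INTO sales     VALUES (1, 100.0), (2, 40.0), (3, 70.0);
    INSERT INTO activity  VALUES (1, 'signed_up_to_x'), (3, 'signed_up_to_x');
""")

# "Sales for the north region, but only for customers who did X."
north_x_sales = conn.execute("""
    SELECT SUM(s.amount)
    FROM sales s
    JOIN customers c ON c.id = s.customer_id
    JOIN activity  a ON a.customer_id = c.id
    WHERE c.region = 'north' AND a.event = 'signed_up_to_x'
""").fetchone()[0]
```

The joins here are trivial because I invented the schema to make them so; in a real company, figuring out which keys actually relate these tables is the hard part.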
The job is also largely complex problem solving. Figuring out how to smush together, and in what order, multiple data sets to get your customer's result turns you into a mini inventor. There is a high likelihood that sometimes you will need to match up a combination of data that no one in history has ever matched before. There won't be an answers page for you to turn to at the end, and again, no online course can truly teach you that, as the complexity of the problem solving needed will be dictated by your company's unique data.
You also can't be prepared for big data. Nothing online can prepare you for joining a billion rows of data with another hundred million. Much of being an analyst in a big data space (not all roles are big data) is doing some light data engineering, and knowing how to move such large data sets around in the safest, quickest, most efficient manner is nothing I've seen an online data course teach.
In 2020 it's actually uncommon for companies to be sitting on high quality data. Undocumented and inconsistent data is commonplace across the field, and made infinitely worse by 'big data'. By the time most companies realised they needed to understand the fidelity of their own data, they were already years of data down the line, and going back to retrospectively fix that data would take a lot of work.
You've also got the reality that these data sets weren't built to be joined or analysed; they were built for an entirely different purpose, and your data warehouse has just made a copy. There are scenarios where joining two datasets together is impossible, or requires an enormous amount of manual work, something your customer won't think about when they make the request.
You'll forever find issues with data, and the bigger it is, the more issues are hiding from you. You'll find legacy data recorded in ways no longer used, and entire days or months of missing data because there was a known or unknown issue at that time. Some data will only reflect the current state and is missing history, while some historic data will be difficult to accurately parse. Columns won't always be named in a descriptive or consistent manner, and tables will be named according to the passing fancy of the day. And don't get me started on dates. I'm lucky enough to live in a data environment mostly in UTC, but I've worked in environments where data was recorded in multiple timezones, and that's not easy to work with.
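The timezone trap is worth a tiny sketch (the timestamps and the UTC+12 offset are made up). Two systems log the same moment, one in UTC and one in local time; compare the raw values naively and they look twelve hours apart, but normalise both to timezone-aware datetimes and they agree.

```python
from datetime import datetime, timezone, timedelta

# The same real-world moment, recorded by two hypothetical systems:
# one stamps UTC, the other stamps local time at UTC+12.
utc_event   = datetime(2020, 6, 1, 2, 30, tzinfo=timezone.utc)
local_event = datetime(2020, 6, 1, 14, 30, tzinfo=timezone(timedelta(hours=12)))

# Aware datetimes compare by instant, so these are equal.
same_moment = utc_event == local_event

# Strip the timezone info (what naive comparison effectively does) and
# the "same" event appears to be twelve hours later in one system.
naive_gap = local_event.replace(tzinfo=None) - utc_event.replace(tzinfo=None)
```

Join event data across such systems on the naive timestamps and everything silently lands in the wrong hour, day, or month.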
If you're lucky, those datasets will be documented right down to the column. However, most companies are deathly allergic to documentation, and what you'll most likely find is a bit of documentation (most likely created by a frustrated analyst), with the rest of the information you need stored in various people's heads. In one scenario I had to do reporting on an abandoned feature that wasn't documented, and the only person left in the company who knew how it worked was one of the founders, so I had to wrestle my way into their diary and ask them to cast their memory back nine years.
As data and companies grow so will the spread of data. In my early days it was common to run reports directly off the production environments, now you risk slowing down the product and causing customer complaints if you do so.
So in most enterprise scenarios that data is streamed or replicated elsewhere, into a data warehouse. You now have a primary (source) and secondary (replicated) data set scenario, and this can cause issues.
If the source data is changed after it is copied, some environments don't capture that, and with every day that passes your secondary drifts further from the source and gets more inaccurate. That can be a problem if you're trying to do accurate financial reporting, or to gauge the impact on customers after an issue.
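A toy reconciliation check illustrates the drift (both tables and the "correction" are invented). A naive row-count comparison passes, but comparing a simple aggregate catches the post-copy update the replica never received.

```python
import sqlite3

# Hypothetical primary and replica tables. The primary had id 1 corrected
# upstream after the replica was copied, so the copies have diverged.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE primary_sales (id INTEGER, amount REAL);
    CREATE TABLE replica_sales (id INTEGER, amount REAL);
    INSERT INTO primary_sales VALUES (1, 120.0), (2, 40.0);
    INSERT INTO replica_sales VALUES (1, 100.0), (2, 40.0);
""")

# Row counts agree, so a lazy "is the replica up to date?" check passes.
counts_match = (
    conn.execute("SELECT COUNT(*) FROM primary_sales").fetchone()
    == conn.execute("SELECT COUNT(*) FROM replica_sales").fetchone()
)

# An aggregate over the values exposes the silent divergence.
sums_match = (
    conn.execute("SELECT SUM(amount) FROM primary_sales").fetchone()
    == conn.execute("SELECT SUM(amount) FROM replica_sales").fetchone()
)
```

Real reconciliation jobs are more sophisticated, but the principle is the same: you cannot assume the secondary matches the source just because it exists.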
Some primary datasets don't retain history, often overwriting themselves in the interest of speed and saving space. This again can cause issues, as the two sets will no longer be in line with each other.
Then within your own data warehouse you may have a different primary and secondary relationship.
Models, which are best described as shortcuts, are a good example of this.
Say the most commonly requested customer information is actually spread across 12 different tables. Rather than join those 12 tables together over and over again, you might create a model: a single table holding those values in one convenient place. This is a very simplified example of a model, but now within your own data warehouse you have primary and secondary data sets of the same data. In simple models this shouldn't cause issues. However, if data was transformed for easier reading or to align with business logic, you may hit scenarios where you have to decide whether the model, and how it was built, suits your needs, and you may have to parse modelled and un-modelled data together. Again, neither was built to work with the other, no one could have predicted your exact scenario, and an easy solution isn't promised to you.
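A toy version of such a model, with two tables standing in for the twelve (all names invented): the commonly requested fields are materialised into one wide table from the join, and that table is now a secondary copy of the same data living inside your own warehouse.

```python
import sqlite3

# Hypothetical source tables: core customer info and their plan.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE plans     (customer_id INTEGER, plan TEXT);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO plans     VALUES (1, 'premium'), (2, 'free');

    -- The "model": one wide table so no one repeats the join every time.
    CREATE TABLE customer_model AS
    SELECT c.id, c.name, p.plan
    FROM customers c
    JOIN plans p ON p.customer_id = c.id;
""")

model_rows = conn.execute(
    "SELECT id, name, plan FROM customer_model ORDER BY id"
).fetchall()
```

The convenience is real, but so is the duplication: the moment the source tables change, or the model applies a transformation, the two copies can disagree, and deciding which one suits your request becomes part of the job.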
You are always going to be tackling the issues that come from having the same data split into different buckets, and it's on you to get familiar with their strengths and pitfalls so you understand which ones to use.
Completely stepping around the fact that most data isn’t documented properly or of a good enough quality, what you’ll struggle with next is what exact documentation is needed and how many different types you'll need to juggle and discover.
Metadata is the descriptive information about the smallest data point, i.e. what data goes into this column, and, if it's secondary data, what transformations have been applied to it. In other words: why does this column exist, and what goes into it?
Business logic is the company's rules applied over the top of the data. Say you're a company with a couple of million customers, but some of those customers are test accounts, free accounts, or press accounts; internally, the company will have an understanding of which types of customers should always be excluded when referring to 'customers'. This understanding should normally be shared across every department so that everyone is on the same page, and it will of course result in some common SQL code back at your end when your customer says 'only valid customers please.' While business logic should be company-wide, it more often than not ends up being department-specific, so marketing will have their own logic you'll have to remember, and so will finance. The bigger the company, the harder it is to get unified business logic, and sometimes the right business logic can't be expressed in the data at all, or takes nine times as long, which is something else you'll bear the responsibility for solving.
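Here's what 'only valid customers please' tends to become in practice, with an invented table and account types: the business rule turns into a reusable filter clause.

```python
import sqlite3

# Hypothetical customer table with a mix of account types.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, account_type TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [
    (1, "paying"), (2, "test"), (3, "press"), (4, "paying"), (5, "free"),
])

# One department's definition of a "valid" customer. Another department
# may well maintain a different list -- which is the whole problem.
VALID_FILTER = "account_type NOT IN ('test', 'free', 'press')"

valid_count = conn.execute(
    f"SELECT COUNT(*) FROM customers WHERE {VALID_FILTER}"
).fetchone()[0]
```

The SQL is trivial; the hard part is knowing the rule exists, whose version of it applies, and remembering to apply it every single time.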
Operational logic is more about how the data changes when it's interacted with. Sometimes a single action in your product will change nine different things across seven tables; you won't catch or understand that with metadata or business logic alone, and it's often hard-baked into the code of your application. Say a customer misses a payment and your terms and conditions say that after 12 days of non-payment the account is deleted. That entire process will change data in all sorts of places at different times over those 12 days, and those interactions are hard to figure out by watching the data, as the movements will be hidden within the noise. Most of the time you will need to be told by whoever created the operational logic what is happening at what times.
Best practice is similar to business logic, but really it's more your own internal logic that you and other analysts discover over time, or build up to make up for poor data quality: like excluding data from a certain ID because that's a testing ID, or excluding data past a certain threshold because it's less reliable there. Over time you'll accumulate a lot of best practice.
At all times you'll need to rely on the above, which in most cases may be missing, so over time you'll build up your own understanding, but you'll still be missing most of it.
Most of an analyst's job is navigating through the uncertainty of the data caused by the fact that you lack all of the above.
There's a lot that can go wrong and it can from time to time go wrong all at once. There will always be ‘that’ report an analyst can remember. The one that kept going wrong.
This can be extremely frustrating, and I've seen it push some people out of the field. It can feel like the cards are stacked against you: if the data's bad, it's your job to work around that; if the data's missing, it's your job to explore every avenue; if no one knows what to do or where to look, it's your job to figure it out. If the customer isn't data literate, it's your job to up-skill them; if your customer comes to you with a short deadline, it's your job to sprint to the finish line; if the database is slow, it's your job to make sure that doesn't impact the customer's delivery date; and if they think of something last minute, it's your job to just add that 'one more thing'. If there is a problem, nine times out of ten you will be the one bearing the brunt of it. It is sadly a role where the power dynamic is uneven by default: you can't account for everything, because the spectrum of issues is very, very broad and you're too busy doing reports to fix them all.
I'm still in data after all these years, and it's a career I like regardless of the above; in fact, what I like is tackling the above issues head on, but that's not everyone's cup of tea. People enjoy different things, and I hope that if you do make the switch you quickly find the thing in data that makes you happy. But if you are considering a career in data, keep the above in mind: it's not all rainbows and high-paying data science roles, whatever the Medium articles would have you believe!
Reminder: All views expressed here are my own and do not represent my employer, Xero.