How important is stats knowledge in software development?


I'm not necessarily asking about the hyper-statistics-oriented tangents of our craft, like data science, but about general software development.

I find with myself I sort of take for granted that I have a pretty solid aptitude in stats and it lets me think through problems with those concepts in mind. But I'm wondering what other folks think. What's your understanding of stats, and how important is it as a skill?

 

As with everything, you don't need it per se, but the more tools you have, the less your problems look like nails.

That said, there is some very basic understanding of stats that everyone should know in order to not end up drawing stupid conclusions:

  • You need a large enough sample to do stats (A/B testing doesn't mean anything on 10 people)
  • Correlation does not imply causation, aka before attributing a cause to a problem, please do your research and don't just pin the problem on something because you noticed a correlation
  • Average is really not a valid aggregation function in most cases (Digital Ocean metrics I'm staring right at you)

Those are really the points I'd like everybody to know, the rest is really something you can pick up as you need.
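To put a rough number on the sample-size point, here is a minimal sketch. The conversion figures are made up, and the normal approximation it uses is itself crude for tiny samples, so treat it as an illustration only.

```python
import math

def proportion_interval(successes, n, z=1.96):
    """Approximate 95% interval for a conversion rate (normal approximation)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1] because a rate cannot leave that range.
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical A/B variant converting at 30% in both cases.
small = proportion_interval(3, 10)         # 10 visitors
large = proportion_interval(3000, 10000)   # 10,000 visitors

print(f"n=10:    {small[0]:.2f} to {small[1]:.2f}")
print(f"n=10000: {large[0]:.3f} to {large[1]:.3f}")
```

With 10 people the plausible range covers more than half of all possible rates, which is why the result tells you next to nothing.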

For the record I have an Engineer's degree in France which more or less includes a Bachelor's degree in statistics.

 

Average is really not a valid aggregation function in most cases (Digital Ocean metrics I'm staring right at you)

Great point. In fact, this demonstration from Autodesk shows that several common aggregate functions can easily be tricked.

(here all of the plots have the same summary stats, despite having very different shapes)

The lesson? Always plot your data!
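A tiny version of the same effect, as a sketch with two made-up datasets (not the ones from the demonstration above): they share a mean and a population standard deviation, yet one is clustered around the middle and the other has nothing in the middle at all.

```python
import statistics
from collections import Counter

# Two invented datasets chosen so that their summary stats coincide.
unimodal = [2, 4, 4, 4, 5, 5, 7, 9]
bimodal = [3, 3, 3, 3, 7, 7, 7, 7]

for name, data in (("unimodal", unimodal), ("bimodal", bimodal)):
    print(name, "mean:", statistics.mean(data), "stdev:", statistics.pstdev(data))
    # Poor man's plot: one row per distinct value, one '#' per occurrence.
    for value, count in sorted(Counter(data).items()):
        print(f"  {value:2d} {'#' * count}")
```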

 

Average is rarely a good metric when the distribution is not normal, which is very often the case. The median, which is even simpler to calculate, is a better estimator. Traders on Wall Street seem to know this, because it's one of their preferred indicators ;)
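A quick sketch of the difference on skewed data. The response times below are invented: nine ordinary requests and one that hung.

```python
import statistics

# Hypothetical response times in milliseconds; the last one is an outlier.
response_ms = [100, 105, 98, 110, 102, 97, 103, 101, 99, 10_000]

mean_ms = statistics.mean(response_ms)
median_ms = statistics.median(response_ms)

print(f"mean:   {mean_ms} ms")    # dragged up by the single outlier
print(f"median: {median_ms} ms")  # still describes a typical request
```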

If I follow you, since all distributions are normal you just need the average to understand your data? Because it seems to me that the normal function takes two parameters (average and standard deviation). With just one of those parameters you can reach the conclusion that your data is maybe centered on that average if the distribution happens to be normal.
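To illustrate the two-parameter point, a sketch using `statistics.NormalDist` from the Python standard library (3.8+). The two services and the 150 ms threshold are hypothetical; both have the same average but very different spreads.

```python
from statistics import NormalDist

# Same mean, different standard deviation (made-up numbers, in ms).
steady = NormalDist(mu=100, sigma=5)
erratic = NormalDist(mu=100, sigma=50)

threshold = 150
p_steady = 1 - steady.cdf(threshold)    # chance of exceeding the threshold
p_erratic = 1 - erratic.cdf(threshold)

print(f"steady:  {p_steady:.4%} of requests above {threshold} ms")
print(f"erratic: {p_erratic:.4%} of requests above {threshold} ms")
```

Knowing only that both average 100 ms would hide the fact that one of them is slow for a large share of requests.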

Now suppose that instead of considering that traders on Wall Street use averages we look at the fact that they all wear navy socks. What does this tell us? That Wall Street traders wear navy socks and possibly other people too.

While your business might be dependent on knowing that it is possible that some people wear navy socks, I highly doubt that it yields any practical value.

All statistics are here to answer a specific question. If there is no question you might as well put 42 everywhere.

To give an example, you could ask yourself if your website is lagging. Then the answer would be to look at the 99th percentile of response generation time and to make sure that all but the slowest 1% of requests finish within an acceptable amount of time.
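As a sketch of that check, here is a nearest-rank percentile over invented response times. The `percentile` helper is written for this example; monitoring tools may use a different interpolation rule and give slightly different numbers.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times in ms: mostly fast, with a slow tail.
response_ms = [120] * 980 + [900] * 15 + [4000] * 5

p99 = percentile(response_ms, 99)
average = sum(response_ms) / len(response_ms)

print(f"average: {average:.1f} ms")
print(f"p99:     {p99} ms")
```

The average looks healthy here while the 99th percentile shows that some users are waiting close to a second.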

And very often, the answer is not an average. It's something specific to the business question you're asking. I don't doubt that Wall Street uses averages but the real question is to know what they do with it.

"All distributions follow normal"?!!! absolutely NOT, that's why AVERAGE IS NOT USUALLY A GOOD METRIC, or at least you should consider multiple samples, like in a method called "Statistical Process Control" (I was a specialist in that in the past)

So traders (I was a semi-professional trading analyst for a bank 20 years ago) do NOT USE AVERAGES for assessing what they call horizontal support and resistance; they use averages only for moving indicators.

It's funny people don't even read before answering ;)

 

I would think for most folks, it isn't all that important. I only have a Stats 101 level knowledge. In the years I have been doing software development, I have not needed to do any real statistics. Except for when I am doing side projects with computer graphics, I hardly use math past super basic equations at all.

I'm curious how stats plays into your thought process, Ben.

 

Even forgetting software development, some basic knowledge of statistics is insanely useful as a person. There are a lot of bad things in life you can avoid by just having some basic knowledge of statistics.

In actual software development, it's useful because of how it encourages you to think more than anything else, especially when dealing with debugging. Being able to reason properly about how much impact a bug actually has, as well as deducing logical reproduction conditions from dozens of disjoint bug reports, is an insanely useful skill, and it's much easier if you approach such things from a statistics perspective.

Personally though, I've found my knowledge (however limited) of set theory and graph theory to be far more useful for most other aspects of development than my knowledge of stats, even factoring in that 'way you think' bit.

 

As a front-end dev, I wind up needing to understand charts a lot more frequently than I expected. We've got multiple monitoring tools outputting tons of data like New Relic Browser for front-end logs/metrics, as well as backend logs/metrics.

For any given bug, I need to be able to answer via metrics/dashboards:

  • Has it happened before?
  • How many people is it affecting?
  • How often are they running into it?
  • Who's had it the worst?

Some of the tools we use, like New Relic Insights or Splunk, provide convenience functions to make it easy to answer those questions, but you still need to think about what data you currently have, what you want to know, and how to munge that data in order to display what you think you might want.
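Stripped of any particular tool, the four questions above boil down to a few counts. A rough sketch in plain Python; the event list and user names are invented, and this is not how New Relic or Splunk expose the data.

```python
from collections import Counter
from datetime import date

# Hypothetical error events for one bug: (day, user_id).
events = [
    (date(2019, 3, 1), "alice"),
    (date(2019, 3, 1), "bob"),
    (date(2019, 3, 2), "alice"),
    (date(2019, 3, 4), "alice"),
    (date(2019, 3, 4), "carol"),
]

first_seen = min(day for day, _ in events)             # has it happened before?
per_user = Counter(user for _, user in events)
affected_users = len(per_user)                         # how many people?
per_day = Counter(day for day, _ in events)            # how often?
worst_user, worst_count = per_user.most_common(1)[0]   # who's had it the worst?

print("first seen:", first_seen)
print("affected users:", affected_users)
print("events per day:", dict(per_day))
print("worst hit:", worst_user, worst_count)
```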

For what it's worth, I got a D in my stats class in college, which I absolutely hated, but I love using these charts/graphing tools for understanding logs.

 

As mentioned before, the ability to understand statistics is essential to understanding the world.

One addition is that our industry has removed the veil that is usually between engineers and customers. You don't necessarily need stats to create functionality, but if you want to conceptualize a problem and make progress on a solution, you need stats.

As someone who has to think about the product and how it is evaluated, I need to design experiments and drive metrics. I need to understand team performance, to decide if I should automate something, to figure out the ROI on decisions, the demographics of features, etc.

 

I took every stats class my college offered. Right now, since I do mostly web-based stuff, it isn't that helpful, but at my first job I was the go-between for the data science and web teams. I did a lot of implementation of the data scientists' ideas for automating their processes. It was incredibly helpful then! I think in general it's so important to emphasize and utilize the intersections of your skills: domain knowledge on top of programming knowledge helps you communicate and understand the needs of the users/clients better.

 

When I dabbled in some machine learning as part of my job, I quickly found that I was out of my depth.

I could handle the implementation side and could make some reasoned decisions. But, when it really came down to it... I didn't know if what I was doing was right or wrong.

I didn't know how to properly justify whether my code changes were improving or damaging the output model.

When I sat down with someone with some stats knowledge, they were able to shed a lot of light on the direction I needed to go and provided a ton of tips while they were there.

It was pretty clear then that stats knowledge would have been extremely useful to me.

Outside of that experience, I haven't felt like I have been heavily affected by my lack of stats knowledge.

So the answer from me is: it depends.

 

As a Data Scientist and Software Developer I kind of live in the middle ground. I can say that calculus is a far more important subject for Software Development than stats. When you start to transition to more analytic jobs, you start seeing more statistics.

When people ask me for math content, my first thought is always: start with calculus, then branch off from there.

 

Most people think that stats reduces to calculus; more important is probability, which is related to the scientific method.

 

I find with myself I sort of take for granted that I have a pretty solid aptitude in stats and it lets me think through problems with those concepts in mind.

Until you said that, I didn't realize I do the same. I think I have a lot of people I need to apologize to for getting frustrated with them. My most recent transgression was explaining how to integrate A/B testing into existing architecture, and being met with blank stares.

 

Math ~ is important to software development in general. You don't have to be a math wiz, but being curious about the basic concepts will help greatly.

Applying that to statistics, for example: understanding concepts like standard deviation and margin of error will help you when dealing with bug reduction efforts or when working out which part of the software is bringing the pain to the project. If this rings a bell as "applying the Pareto principle to software engineering", you are welcome.
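A Pareto-style look at where the pain comes from can be as small as the sketch below. The module names and bug counts are made up for illustration.

```python
from collections import Counter

# Hypothetical bug tracker export: one entry per bug, tagged with its module.
bug_modules = ["checkout"] * 12 + ["search"] * 5 + ["profile"] * 2 + ["emails"] * 1

counts = Counter(bug_modules)
total = sum(counts.values())

cumulative = []
running = 0
for module, n in counts.most_common():
    running += n
    cumulative.append((module, n, running / total))
    print(f"{module:10s} {n:3d} bugs  cumulative {running / total:.0%}")
```

In this invented data a single module accounts for well over half of the bugs, which is the kind of skew the Pareto principle describes.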

Another application of statistics in daily programming life comes when trying to understand logs. For example, how frequently 500 errors happen, and at which time of day, may reveal that you need to either refactor your code or scale. The same applies to predicting whether an IP address is trying to initiate some illicit activity. The thing is, you have to understand the rates at which particular events happen in your system and draw conclusions from them. That is stats, my friends!
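For the 500-errors-by-time-of-day case, a minimal sketch. The log format (ISO timestamp, status, path separated by spaces) and the log lines themselves are assumptions for the example; real access logs will need their own parsing.

```python
from collections import Counter
from datetime import datetime

# Hypothetical access log lines: "<ISO timestamp> <status> <path>".
log_lines = [
    "2019-03-15T02:14:09 500 /api/orders",
    "2019-03-15T09:01:44 200 /api/orders",
    "2019-03-15T09:03:10 500 /api/orders",
    "2019-03-15T09:17:52 500 /api/cart",
    "2019-03-15T09:40:03 500 /api/orders",
    "2019-03-15T14:22:31 200 /api/cart",
]

errors_by_hour = Counter()
for line in log_lines:
    timestamp, status, _path = line.split()
    if status == "500":
        errors_by_hour[datetime.fromisoformat(timestamp).hour] += 1

# Text histogram: one '#' per 500 error in that hour.
for hour in sorted(errors_by_hour):
    print(f"{hour:02d}:00  {'#' * errors_by_hour[hour]}")
```

A cluster in one hour, as in this made-up data, is the sort of pattern that hints at load rather than a bug that fires at random.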

A better understanding of logarithms, exponents, summation and combinatorics helps big time with understanding and doing basic performance analysis of algorithms in everyday development. Web or not.
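As one small illustration of why logarithms matter here, the sketch below compares roughly how many steps a linear scan and a binary search need in the worst case. The step counts are back-of-the-envelope estimates, not measurements.

```python
import math

def linear_search_steps(n):
    """Worst case for scanning an unsorted list: look at every item."""
    return n

def binary_search_steps(n):
    """Rough worst case for binary search on a sorted list: about log2(n) halvings."""
    return math.ceil(math.log2(n))

for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"n={n:>13,}  linear={linear_search_steps(n):>13,}  binary={binary_search_steps(n)}")
```

Growing the input a million times over only adds about twenty steps to the binary search.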

Relational databases are built around set theory. If you did well in it, chances are you will have extra tools (and won't struggle) to understand, for example, how LEFT/RIGHT/INNER JOINs work using just Venn diagrams. The whole of relational algebra is dedicated to this, and it is math.

Functional programming, popular these days thanks to the JavaScript frenzy, is built around lambda calculus. The better you understand lambda calculus, the better your functional programs will become.
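The Venn-diagram intuition can be sketched with Python dicts and set operations on their keys. The tables are made up, and this assumes one row per key; real joins work on rows and can produce duplicates when keys repeat, so the picture is a simplification.

```python
# Hypothetical tables keyed by user_id (one row per key).
users = {1: "ana", 2: "ben", 3: "chloe"}     # user_id -> name
orders = {2: "book", 3: "lamp", 4: "desk"}   # user_id -> item

# INNER JOIN: only keys in the intersection of both tables.
inner = {k: (users[k], orders[k]) for k in users.keys() & orders.keys()}

# LEFT JOIN: every user, with None where there is no matching order.
left = {k: (users[k], orders.get(k)) for k in users}

# RIGHT JOIN: every order, with None where there is no matching user.
right = {k: (users.get(k), orders[k]) for k in orders}

print("inner:", inner)
print("left: ", left)
print("right:", right)
```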

I will leave it here with this story of mine ~ The worst advice I received from a friend of mine who did computer science when I was just getting started was: You will not need math after you graduate. The more I progress in my career, the more I regret not taking my math classes seriously. I restarted reading anything math related after finding out 10 years later!

Do not get me wrong, you will not need to be a math wiz to be successful with a software development career, but understanding some maths principles will make your life easier, or make you a more effective software developer.

 

MVC for statistician: M=Variable under Observation, V=Projection of M on several Viewpoints, Controller=holds the transformation rules

Apart from the concepts, I don't use stats that much because of the huge heteroscedasticity (~unpredictable variance) in this field ;)

Nevertheless, as I used to be a specialist in Deming/Shewhart Statistical Process Control (see my review of one of my favorite books amazon.com/Statistical-Viewpoint-Q...), it is less the numbers themselves that matter than the spirit: stats are not an end in themselves. What matters is to improve the capability of the system, so Deming always advocated visual tools for visualizing flow in order to improve it and so reduce that dreaded heteroscedasticity.

But heteroscedasticity is not so awful, depending on the context: if you're working on an innovative project or if you're learning, it is expected and wanted, otherwise it means you don't advance by leapfrogging. But if you're doing a usual project and it's that chaotic, then you should worry and look for the root causes ;)

In any case, even in a learning and/or innovative context, it's all about finding causes and effects. For an artist like Leonardo da Vinci, that's how you perfect a great art. He said many people do things too quickly without trying to discover the fundamental laws behind them; that's the reason he himself studied science, to understand the science of water or light and be able to paint them magnificently.

Deming calls this search for causes and effects SOPK (System of Profound Knowledge), and stats methods like SPC are just a means to SOPK, not the other way round, and not just for making stats about how many defects you have detected in testing. As a matter of fact, he hated this "after-the-fact" mentality that is so widespread in traditional industry; I guess he would be horrified by the software industry too.

Scrum is based on PDCA according to Jeff Sutherland, the inventor of Scrum himself, and PDCA (Plan Do CHECK Act) is supposedly from Deming. What he is unaware of is that Deming (the one who taught Toyota, as the first Toyota CEO says in a tribute on their website) hated the PDCA spread by Toyota so much that he created another acronym, PDSA (Plan Do STUDY Act), once again to assert that it's about a scientific process, not about checking numbers.

Another important concept Deming introduced to the Japanese was visual thinking, for example for visualizing the flow. That's where Kanban originated from; it could be something visual other than just a board, if it helps you to DESIGN and to MODEL ;)

So in software development, the mentality should be the same: to perfect one's craft, one should also look for superior principles instead of just stats or even ...Design Patterns (though they can be very useful for OOP, that's not the point). They should not be the ultimate causes of good design; according to Alan Kay, they may somehow be holding back the software industry. So one should reflect on all our current software practices (not so good, though many think they are), like the fascination with pure stats (measurements without acting upon them) and other habits and hypes, without understanding what's really behind them: are they really so good, or is there something better/simpler ;)

 

Lots of "It depends". A basic understanding is generally pretty helpful. I find it useful when needing to provide metrics or analysis for proposals, postmortems, and some automation, but usually just understanding the high level ideas and being able to use them as encapsulated by a library is all I need.

 

I've never done the stats, charts, and stuff, but I've used them and read them a lot, and without stats, it felt like working blind, or like I was working from scratch.

 

Just as important as the knowledge of your stats in 'Super Mario RPG: Legend of the Seven Stars'.

 

I found that sentential logic, predicate logic and the like really help me in problem-solving and even debugging.


Ben Halpern
A Canadian software developer who thinks he’s funny.
