How to lie with statistics

#engineeringmonday #statistics

Introduction

"People can come up with statistics to prove anything, 40% of all people know that" - Homer Simpson. Statistics have long been used to "sensationalize, inflate, confuse, and oversimplify" collected data and information. We blindly trust any sentence starting with “Studies have shown” without ever checking the study or the data behind it. I mean, why would we? The people doing the studies know the subject much better than we do, so why should we question their expertise? In 1954 Darrell Huff published the book How to Lie with Statistics, in which he warned us about some of the methods researchers and reporters use to present false conclusions.

The Sample with Built-in Bias

Sampling is a technique for selecting a subset of a whole population. Today, data sets are often too big to analyse in full, which is why we work with a smaller sample that is still large enough to be representative. There are other sampling methods, but the most basic one is Simple Random Sampling (SRS). In SRS each member of the population has the same probability of being included in the sample we analyse. Huff says 'River cannot rise above its source. It is equally true that the result of a sampling study is no better than the sample it is based on'. Not using a proper sampling technique biases the whole study, and starting from a false premise we can "prove" any conclusion to be true. When choosing a sample we should always ask ourselves 'Does every name or thing in the whole population have an equal chance to be in the sample?'.
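To make the difference concrete, here is a minimal sketch in Python. The population is made up (10,000 values drawn from a normal distribution), and the "biased" sample is a hypothetical stand-in for any procedure that only reaches part of the population; neither comes from Huff's book.

```python
import random
import statistics


def simple_random_sample(population, k, seed=None):
    """Draw k members without replacement; every member is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, k)


# Made-up population for illustration only.
rng = random.Random(0)
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Simple random sample: each member has the same chance of being picked.
srs = simple_random_sample(population, 500, seed=1)

# Biased sample: only the 500 largest values can ever be picked.
biased = sorted(population)[-500:]

print("population mean:", round(statistics.mean(population), 2))
print("SRS mean:       ", round(statistics.mean(srs), 2))
print("biased mean:    ", round(statistics.mean(biased), 2))
```

The random sample should land close to the population mean, while the biased one lands far above it, no matter how carefully we analyse it afterwards.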

The Well-chosen Average

When a report says the average of some data is X, we usually think of the arithmetic mean, or just the mean, which is calculated by adding up all the numbers and dividing the sum by how many numbers there are (the mean of 1, 3, 4, 6, 9, 9, 20 is 7.43). Other common types of average are the median and the mode. The median is the value separating the higher half from the lower half (the median of 1, 3, 4, 6, 9, 9, 20 is 6). The mode is probably the least used, and it is the most common value in the data set (the mode of 1, 3, 4, 6, 9, 9, 20 is 9). A problem arises when we are not told which type of average was used in the research. For example, if we are told the average salary in some company is $40,000, we can only speculate about what that actually means. Maybe there are 9 employees with a $10,000 salary and one with $310,000. The mean in this case is $40,000, but it describes nobody's actual pay: the median and the mode are both $10,000, which is what a typical employee earns, so they are better suited here.
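The three averages are easy to check with Python's standard `statistics` module. This is only a small sketch using the numbers from the paragraph above; the variable names are my own.

```python
import statistics

values = [1, 3, 4, 6, 9, 9, 20]

mean_value = statistics.mean(values)      # 52 / 7
median_value = statistics.median(values)  # middle value of the sorted list
mode_value = statistics.mode(values)      # most frequent value

print(round(mean_value, 2))  # 7.43
print(median_value)          # 6
print(mode_value)            # 9

# The salary example: nine employees at $10,000 and one at $310,000.
salaries = [10_000] * 9 + [310_000]

print(statistics.mean(salaries))    # 40000
print(statistics.median(salaries))  # 10000.0
print(statistics.mode(salaries))    # 10000
```

One outlier is enough to pull the mean to four times what almost everyone earns, while the median and the mode do not move.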

The Little Figures That Are Not There

In his book Huff warns us about research that uses a statistically inadequate sample size. As an example, he uses a toothpaste brand whose users report 23% fewer cavities. Upon further reading we could discover that the sample consisted of 12 users, which is far too few. With such a small sample, sooner or later a test group will by the operation of chance show a big improvement worthy of a headline, says Huff. "The importance of using a small group is this: With a large group any difference produced by chance is likely to be a small one and unworthy of big type. A two-percent-improvement claim is not going to sell much tooth-paste". The same effect can be observed by tossing a coin 10 times. We all know a coin toss has a fifty-fifty chance of coming up heads or tails, but maybe the coin will, by chance, come up heads eight out of ten times in our experiment. That might mean our coin is unbalanced, but most likely it means we weren’t patient enough. In other words, if we toss a coin, let's say, a thousand times, we are almost guaranteed to end up with a result very close to half heads. We can conclude that "Only when there is a substantial number of trials involved is the law of averages a useful description or prediction." - Huff.
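The coin example can be checked with a few lines of Python. This is a rough sketch, assuming a perfectly fair coin; the function names are hypothetical helpers, and the simulated share will vary with the seed.

```python
import math
import random


def prob_at_least_heads(k, n):
    """Exact probability of k or more heads in n tosses of a fair coin."""
    favourable = sum(math.comb(n, i) for i in range(k, n + 1))
    return favourable / 2**n


def heads_share(n, rng):
    """Simulate n fair tosses and return the fraction that came up heads."""
    return sum(rng.random() < 0.5 for _ in range(n)) / n


# At least 8 heads in 10 tosses: 56 outcomes out of 1024, about 5.5%.
p_small = prob_at_least_heads(8, 10)

# The same 80% share in 1000 tosses is practically impossible.
p_large = prob_at_least_heads(800, 1000)

print(f"P(>= 8 heads in 10 tosses)     = {p_small:.4f}")
print(f"P(>= 800 heads in 1000 tosses) = {p_large:.3e}")

rng = random.Random(42)
share_large = heads_share(1000, rng)
print(f"simulated share of heads in 1000 tosses: {share_large:.3f}")
```

So a lopsided result in ten tosses turns up roughly once in every eighteen experiments with a fair coin, which is often enough to find a "headline" if we repeat a small test a few times.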

There are many more ways to lie with statistics, but these are the most common ones and I hope you will remember them. If you want to learn more about this topic, I highly recommend the book mentioned in the introduction of this article.
