yaswanthteja

Posted on Aug 25, 2022

What is Distribution in Statistics?

#statistics #tutorial #python #datascience

What is Distribution?

The distribution of a statistical dataset is the spread of the data, which shows all possible values or intervals of the data and how often they occur.

A distribution is simply a collection of data or scores on a variable. Usually, these scores are arranged in order from smallest to largest and then they can be presented graphically.

A probability distribution provides a parameterized mathematical function that gives the probability of any individual observation from the sample space.

What is Data?

Data is a collection of information (numbers, words, measurements, observations) about facts, figures and statistics collected together for analysis.

Common Data Types
Before we jump to the explanation of distributions, let's see what kinds of data we can encounter. The data can be discrete or continuous.

Discrete Data, as the name suggests, can take only specified values. For example, when you roll a die, the possible outcomes are 1, 2, 3, 4, 5 or 6 and not 1.5 or 2.45.

Continuous Data can take any value within a given range. The range may be finite or infinite. Examples are a girl's weight or height, or the length of a road. The weight of a girl can be any value, such as 54 kg, 54.5 kg, or 54.5436 kg.

Example: Distribution of Categorical Data (True/False, Yes/No): It shows the number or percentage of individuals in each group.

How to Visualize Categorical Data: Bar Plot, Pie Chart and Pareto Chart.

Distribution of Numerical Data (Height, Weight and Salary): First, the data is sorted in order and grouped based on similarity. It is then represented in graphs and charts to examine the amount of variance in the data.

How to Visualize Numerical Data: Histogram, Line Plot and Scatter Plot.

Variables
A variable is a property that can take on any value.
e.g. height = {151, 142, 156, ...}

Two types of variables

1. Quantitative variables (measured numerically)
Ordinal – Position in Race and Date
Interval – Temperature in Celsius, Year of Birth
Ratio – Height, Age, Weight

2. Qualitative/Categorical variables
Nominal – Brand-name, Zip-code and Gender
Ordinal – Grades, Star Reviews

Why are distributions important?
Sampling distributions are important in statistics because we usually collect only a sample and use it to estimate the parameters of the population distribution. Hence the distribution is necessary to make inferences about the overall population.

For example, the most common measures of how samples differ from each other are the standard deviation and the standard error of the mean.
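As a rough illustration of these two measures, the sketch below computes the sample standard deviation and the standard error of the mean with NumPy. The height values are made up for the example (the first three echo the variable example above); they are not real measurements.

```python
import numpy as np

# Hypothetical sample of heights in cm (illustrative values only)
heights = np.array([151, 142, 156, 160, 148, 153, 158, 145])

sample_mean = heights.mean()
# ddof=1 gives the sample standard deviation (divides by n - 1)
sample_sd = heights.std(ddof=1)
# Standard error of the mean: how much the sample mean is expected
# to vary from one sample to the next
sem = sample_sd / np.sqrt(len(heights))

print(f"mean = {sample_mean:.2f}, SD = {sample_sd:.2f}, SEM = {sem:.2f}")
```

The standard error is always smaller than the standard deviation for samples with more than one observation, because it is the standard deviation divided by the square root of the sample size.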

Difference between Frequency and Probability Distribution

Frequency Distribution:
- The number of times each numerical value occurs.
- It records how often an event occurs. It is based on actual observations.

Probability Distribution:
- A list of the probabilities associated with each possible numerical value.
- It records how likely an event is to occur. It is based on theoretical assumptions about what should happen.
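To make the difference concrete, here is a minimal sketch that simulates rolls of a fair die and puts the observed frequencies next to the theoretical probabilities. The number of rolls and the random seed are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)    # fixed seed so the run is repeatable
rolls = rng.integers(1, 7, size=6000)  # simulate 6000 rolls of a fair die

faces = np.arange(1, 7)
# Frequency distribution: how often each face was actually observed
frequencies = np.array([(rolls == face).sum() for face in faces])
relative_frequencies = frequencies / len(rolls)
# Probability distribution: what theory says should happen for a fair die
probabilities = np.full(6, 1 / 6)

for face, freq, rel, prob in zip(faces, frequencies, relative_frequencies, probabilities):
    print(f"face {face}: observed {freq} times ({rel:.3f}), theoretical {prob:.3f}")
```

The observed relative frequencies should be close to 1/6 but will not match it exactly, which is the gap between what was observed and what theory predicts.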

Types of Distributions

Bernoulli Distribution
Uniform Distribution
Binomial Distribution
Normal Distribution
Poisson Distribution
Exponential Distribution

Bernoulli Distribution

A special case of the binomial distribution. It is a discrete probability distribution with exactly two possible outcomes, 1 (Success) and 0 (Failure), and a single trial.

Example: In cricket, a coin toss leads to either winning or losing the toss. There is no intermediate result. The occurrence of a head denotes success, and the occurrence of a tail denotes failure.

The two probabilities need not be equal. For example, the probability of success (1) could be 0.4 and the probability of failure (0) 0.6.

A famous example is the coin flip, in which we could call either side a success. The probability of success is 0.5. This would lead to the following graph:

import matplotlib.pyplot as plt
from scipy.stats import bernoulli

# probability of flipping a coin 1 time
p = 0.5
bernoulli_variable = bernoulli(p)

# draw one vertical line per outcome, with height equal to its probability
fig, ax = plt.subplots(1, 1)
x = [0, 1]
ax.vlines(x, 0, bernoulli_variable.pmf(x), label='probability')
ax.legend(loc='best', frameon=False)
plt.show()

But a 50/50 split is not a requirement of the Bernoulli distribution. Another example of the Bernoulli distribution is the probability of throwing a dart in the bull's eye. It's either in there, or it isn't, so this makes it a 2-outcome situation. For a bad darts player, the probability of success could be 0.1, giving the following distribution:

Returning to the coin flip, the occurrence of a head denotes success, and the occurrence of a tail denotes failure.
For a fair coin, the probability of getting a head = 0.5 = the probability of getting a tail, since both outcomes are equally likely.

The probability mass function is given by:

$$P(X = x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}$$

It can also be written as

$$P(X = x) = \begin{cases} 1 - p, & x = 0 \\ p, & x = 1 \end{cases}$$

The probabilities of success and failure need not be equal, like the result of a fight between me and Undertaker. He is pretty much certain to win. So in this case the probability of my success is 0.15 while the probability of my failure is 0.85.

Here, the probability of success (p) is not the same as the probability of failure. So, the chart below shows the Bernoulli Distribution of our fight.

Here, the probability of success = 0.15 and the probability of failure = 0.85. The expected value is exactly what it sounds like. If I punch you, I may expect you to punch me back. Basically, the expected value of any distribution is the mean of the distribution. The expected value of a random variable X from a Bernoulli distribution is found as follows:

E(X) = 1*p + 0*(1-p) = p

The variance of a random variable from a Bernoulli distribution is:

V(X) = E(X²) – [E(X)]² = p – p² = p(1-p)
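The two formulas above can be checked numerically. The sketch below uses scipy.stats.bernoulli with p = 0.15, the success probability from the fight example; the variable name fight is only for illustration.

```python
from scipy.stats import bernoulli

p = 0.15  # probability of success from the fight example
fight = bernoulli(p)

print(fight.pmf(1))  # probability of success, p
print(fight.pmf(0))  # probability of failure, 1 - p
print(fight.mean())  # E(X) = p
print(fight.var())   # V(X) = p(1-p)
```

With p = 0.15 the mean is 0.15 and the variance is 0.15 * 0.85 = 0.1275.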

There are many examples of the Bernoulli distribution, such as whether it is going to rain tomorrow (rain denotes success and no rain denotes failure), or winning (success) versus losing (failure) a game.

Normal Distribution

It is also known as the Gaussian distribution, and it is a symmetric distribution. It is a type of continuous probability distribution that is symmetric about the mean. The majority of the observations cluster around the central peak.

It is a bell-shaped curve.

Examples: Performance appraisal, Height, BP, measurement error and IQ scores follow a normal distribution.

Mean = Median = Mode

The standard normal distribution is a normal distribution with µ = 0 and σ = 1.

Basic Properties:

The normal distribution always runs between –∞ and +∞.
Zero skewness: the distribution is symmetrical about the mean.
Zero excess kurtosis.
About 68% of the values are within 1 SD of the mean.
About 95% of the values are within 2 SD of the mean.
About 99.7% of the values are within 3 SD of the mean.

Normal Distribution in Python

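import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# evaluate the standard normal PDF on a grid from -5 to 5 and plot it
x = np.linspace(-5, 5, 1000)
plt.plot(x, norm.pdf(x))
plt.show()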

Human IQ is also a very famous example of the normal distribution, where the average is 100 and the standard deviation is 15. Most people are of average intelligence, some are a bit smarter or a bit less smart, and few are very intelligent or very unintelligent.
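As a quick check of the 68% / 95% / 99.7% rule on the IQ example, the sketch below builds a normal distribution with mean 100 and standard deviation 15 using scipy.stats.norm. The threshold of 130 is just an illustrative choice (two standard deviations above the mean).

```python
from scipy.stats import norm

iq = norm(loc=100, scale=15)  # mean 100, standard deviation 15

# Share of people within 1, 2 and 3 standard deviations of the mean
within = {k: iq.cdf(100 + 15 * k) - iq.cdf(100 - 15 * k) for k in (1, 2, 3)}
for k, share in within.items():
    print(f"within {k} SD: {share:.4f}")

# Share of people with an IQ above 130 (more than 2 SD above the mean)
above_130 = iq.sf(130)
print(f"IQ above 130: {above_130:.4f}")
```

Here cdf gives the probability of a value at or below a point, and sf (the survival function) gives the probability of a value above it.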

A normal distribution has the following characteristics:

The mean, median and mode of the distribution coincide.
The curve of the distribution is bell-shaped and symmetrical about the line x=μ.
The total area under the curve is 1.
Exactly half of the values are to the left of the center and the other half to the right.

A normal distribution is quite different from a binomial distribution: the former is continuous and the latter is discrete. However, as the number of trials approaches infinity, their shapes become quite similar.
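The claim that the binomial shape approaches the normal shape as the number of trials grows can be sketched numerically. The example below compares the binomial PMF with a normal PDF that has the same mean and standard deviation; the choice of p = 0.5 and of the three trial counts is an assumption made for the illustration.

```python
import numpy as np
from scipy.stats import binom, norm

p = 0.5
max_gaps = {}
for n in (10, 100, 1000):
    x = np.arange(0, n + 1)
    mu = n * p                        # mean of the binomial
    sigma = np.sqrt(n * p * (1 - p))  # standard deviation of the binomial
    # Largest gap between the binomial PMF and the matching normal PDF
    max_gaps[n] = np.max(np.abs(binom.pmf(x, n, p) - norm.pdf(x, mu, sigma)))
    print(n, max_gaps[n])
```

The largest gap gets smaller as n increases, which is what "the shapes become similar" means in practice.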

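The PDF of a random variable X following a normal distribution is given by:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$$

The mean and variance of a random variable X which is said to be normally distributed are given by: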

Mean -> E(X) = µ

Variance -> Var(X) = σ^2

Here, µ (mean) and σ (standard deviation) are the parameters.
The graph of a random variable X ~ N (µ, σ) is shown below.

A standard normal distribution is defined as the distribution with mean 0 and standard deviation 1. For such a case, the PDF becomes:

$$f(x) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{x^2}{2}}, \quad -\infty < x < \infty$$

Binomial Distribution

The most widely known discrete probability distribution. It has been used for hundreds of years.

Assumptions:

The experiment involves n identical trials.
Each trial has only two possible outcomes – success or failure.
Each trial is independent of the previous trials.
The terms p and q remain constant throughout the experiment, where p is the probability of getting a success on any one trial and q = (1 – p) is the probability of getting a failure on any one trial.

Binomial Distribution in Python

The two parameters for the Binomial distribution are the number of experiments and the probability of success. A basic example of flipping a coin ten times would have the number of experiments equal to 10 and the probability of success equal to 0.5. This gives the following probability for each number of successes out of 10:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

# Probability of outcomes of 'number of successes' when throwing 10 coins
n = 10
p = 0.5
x = np.arange(0, n + 1)  # possible numbers of successes: 0, 1, ..., 10

fig, ax = plt.subplots(1, 1)
binomial_variable = binom(n, p)
ax.vlines(x, 0, binomial_variable.pmf(x), label='probability')
ax.legend(loc='best', frameon=False)
plt.show()

Another example of the Binomial distribution would be the number of traffic jams you get into in a given week, knowing that the probability of getting into a traffic jam on any given day is 0.2. This is a repetition of 1 Bernoulli yes/no variable over 5 working days, so the parameters are: the number of experiments is 5 and the probability of success is 0.2. The outcome graph below shows that it is most likely to have 1 traffic jam, then 0, and then 2, 3, 4, and 5 respectively.

The mathematical representation of the binomial distribution is given by:

$$P(X = x) = \binom{n}{x} p^x q^{n-x}, \quad x = 0, 1, \dots, n$$

When the probability of success does not equal the probability of failure, the graph of the binomial distribution is skewed.

When the probability of success equals the probability of failure (p = 0.5), the graph of the binomial distribution is symmetric.

The mean and variance of a binomial distribution are given by:

Mean -> µ = n*p

Variance -> Var(X) = n*p*q
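The mean and variance formulas can be checked against the traffic jam example (5 working days, probability 0.2 per day). This is a small sketch using scipy.stats.binom; the variable name jams is only for illustration.

```python
from scipy.stats import binom

n, p = 5, 0.2  # 5 working days, probability 0.2 of a jam on any one day
q = 1 - p
jams = binom(n, p)

# Probability of each possible number of traffic jams in the week
for k in range(n + 1):
    print(f"P({k} traffic jams) = {jams.pmf(k):.4f}")

print(jams.mean(), n * p)     # mean: n*p
print(jams.var(), n * p * q)  # variance: n*p*q
```

The printed probabilities also confirm the ordering described for this example: 1 traffic jam is the most likely outcome, followed by 0 and then 2.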

Poisson Distribution

It is the discrete probability distribution of the number of times an event is likely to occur within a specified period of time. It is used for independent events which occur at a constant rate within a given interval of time.

The occurrences in each interval can range from zero to infinity (0 to ∞).

Examples:

The number of black cars in a random sample of 50 cars
The number of cars arriving at a car wash during a 20-minute time interval

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

# Probability of number of events, eg people coming into your store per unit time
lam = 4 # lambda per unit time
x = np.arange(0, 16)  # event counts to plot: 0, 1, ..., 15

fig, ax = plt.subplots(1, 1)
poisson_variable = poisson(lam)
ax.vlines(x, 0, poisson_variable.pmf(x), label='probability')
ax.legend(loc='best', frameon=False)
plt.show()

Other examples of Poisson events could be the number of cars passing at a certain location. Also, almost anything that has a count per unit time could be considered for a Poisson distribution.

A distribution is called a Poisson distribution when the following assumptions are valid:

Any successful event should not influence the outcome of another successful event.
The probability of success depends only on the length of the interval: any two intervals of equal length have the same probability of success.
The probability of success in an interval approaches zero as the interval becomes smaller.

Now, if any distribution validates the above assumptions then it is a Poisson distribution. Some notations used in Poisson distribution are:

λ is the rate at which an event occurs,
t is the length of a time interval,
And X is the number of events in that time interval.
Here, X is called a Poisson Random Variable and the probability distribution of X is called Poisson distribution.

Let µ denote the mean number of events in an interval of length t. Then, µ = λ*t.

The PMF of X following a Poisson distribution is given by:

$$P(X = x) = \frac{e^{-\mu} \mu^x}{x!}, \quad x = 0, 1, 2, \dots$$

The mean µ is the parameter of this distribution. µ is also defined as λ times the length of the interval. The graph of a Poisson distribution is shown below:

The graph shown below illustrates the shift in the curve due to increase in mean.

It is perceptible that as the mean increases, the curve shifts to the right.

The mean and variance of X following a Poisson distribution:

Mean -> E(X) = µ
Variance -> Var(X) = µ
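As a sanity check, the sketch below writes the Poisson PMF out by hand in its standard form, e^(-µ) µ^x / x!, and compares it with scipy.stats.poisson. The rate of 4 matches the Python example above; the interval length t = 1 and the helper name poisson_pmf are assumptions for the illustration.

```python
from math import exp, factorial
from scipy.stats import poisson

lam = 4  # rate: events per unit time (same value as the example above)
t = 1    # length of the interval, in the same time units (assumed)
mu = lam * t

def poisson_pmf(x, mu):
    """PMF written out by hand: e^(-mu) * mu^x / x!"""
    return exp(-mu) * mu**x / factorial(x)

events = poisson(mu)
print(poisson_pmf(3, mu), events.pmf(3))  # the two values should agree
print(events.mean(), events.var())        # both equal mu
```

The mean and variance coming out equal is a distinctive property of the Poisson distribution.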

Uniform Distribution

It is a continuous distribution, also called the rectangular distribution. It describes an experiment where an outcome lies between certain boundaries.

When you roll a fair die, the outcomes are 1 to 6. These outcomes are all equally likely, and that is the basis of a uniform distribution. Unlike the Bernoulli distribution, all n possible outcomes of a uniform distribution are equally likely. (The die is the discrete version of this idea; the density below describes the continuous case.)

A variable X is said to be uniformly distributed if the density function is:

$$f(x) = \frac{1}{b-a}, \quad a \le x \le b$$

The graph of a uniform distribution is flat between a and b, so its shape is rectangular, which is why the uniform distribution is also called the rectangular distribution.

For a Uniform Distribution, a and b are the parameters.

The number of bouquets sold daily at a flower shop is uniformly distributed with a maximum of 40 and a minimum of 10.

Let’s try calculating the probability that the daily sales will fall between 15 and 30.

The probability that daily sales will fall between 15 and 30 is (30-15)*(1/(40-10)) = 0.5

Similarly, the probability that daily sales are greater than 20 is (40-20)*(1/(40-10)) = 0.667

The mean and variance of X following a uniform distribution are:

Mean -> E(X) = (a+b)/2

Variance -> V(X) = (b-a)²/12
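The flower shop numbers and the mean and variance formulas can be reproduced with scipy.stats.uniform. Note that SciPy describes the interval by its start (loc) and its width (scale) rather than by a and b directly; the variable names below are only for illustration.

```python
from scipy.stats import uniform

a, b = 10, 40  # minimum and maximum daily sales
# SciPy parameterizes the uniform distribution by loc = a and scale = b - a
sales = uniform(loc=a, scale=b - a)

p_between_15_and_30 = sales.cdf(30) - sales.cdf(15)
p_greater_than_20 = sales.sf(20)  # sf is 1 - cdf

print(p_between_15_and_30, p_greater_than_20)
print(sales.mean(), (a + b) / 2)        # mean: (a+b)/2
print(sales.var(), (b - a) ** 2 / 12)   # variance: (b-a)²/12
```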

The standard uniform density has parameters a = 0 and b = 1, so the PDF for standard uniform density is given by:

$$f(x) = 1, \quad 0 \le x \le 1$$

Gamma Distribution

It deals with continuous variables that take on a wide range of positive values, such as individual call times. Using a gamma distribution function, we can model probabilities across any range of possible values. It has two parameters: a shape parameter (α) and a scale parameter (β).

In terms of waiting times, the shape parameter is k, the number of events to wait for, and the scale parameter is 1 / λ, where λ is the rate parameter of the exponential distribution.

As an example, you can think of an attraction park that can only launch an attraction when it is full, let’s say, 10 people. If they have an event rate of 4 customers coming in every 2 minutes on average, they could describe the waiting time for launching the attraction using a Gamma distribution.

Examples:

The amount of rainfall accumulated in a reservoir.
The size of loan defaults and the aggregate amount of insurance claims


import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gamma

# Probability of time before 10 customers coming in
lam = 4 # event rate of customers coming in per unit time
n_simulated = 10000000

# shape = 10 customers to wait for, scale = 1 / event rate
random_waiting_times = gamma(10, scale = 1 / lam).rvs(n_simulated)
pd.Series(random_waiting_times).hist(bins = 20)
plt.show()
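Instead of simulating, the same distribution can also be queried directly. The sketch below uses the same shape (10 customers) and rate (4 per unit time) as the simulation; the cutoff of 3 time units is an arbitrary value chosen for the illustration.

```python
from scipy.stats import gamma

k = 10   # number of customers to wait for (shape parameter)
lam = 4  # customers arriving per unit time (rate parameter)
waiting_time = gamma(k, scale=1 / lam)

print(waiting_time.mean())  # average waiting time, k / lam
print(waiting_time.var())   # variance of the waiting time, k / lam**2
print(waiting_time.cdf(3))  # probability that 10 customers arrive within 3 time units
```

The analytic mean of 2.5 time units should sit close to the center of the simulated histogram.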

Exponential Distribution

It is concerned with the amount of time until some specific event occurs.

The Exponential distribution is related to the Poisson distribution. Where the Poisson distribution describes the number of events per unit time, the exponential distribution describes the waiting time between events.

It takes the same parameter as the Poisson distribution: the event rate. In some cases, however (among others in Python's SciPy), people prefer to use the parameter 1 / event rate, known as the scale.

Example:

The amount of time until an earthquake occurs has an exponential distribution.
The length of business telephone calls.
The length of time a car battery lasts.
The amount of money customers spend on one trip to the supermarket follows an exponential distribution. There are more people who spend small amounts of money and fewer people who spend large amounts of money.
The exponential distribution is widely used in the field of reliability.

Note: Reliability deals with the amount of time a product lasts.

import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import expon

# Probability of time between customers coming in
lam = 4 # event rate of customers coming in per unit time
n_simulated = 10000000

# SciPy uses scale = 1 / event rate
random_waiting_times = expon(scale = 1 / lam).rvs(n_simulated)
pd.Series(random_waiting_times).hist(bins = 20)
plt.show()

You should read the x-axis as a fraction of unit time. Suppose the unit time is 15 minutes. If 4 people arrive per 15 minutes, the average wait for each new person is 0.25, or 25% of this unit time. 25% of 15 minutes is 3.75 minutes.
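The conversion from unit time to minutes can be sketched with scipy.stats.expon. The 15-minute unit is the assumption stated in the paragraph above, and the variable names are only for illustration.

```python
from scipy.stats import expon

lam = 4            # customers per unit time
unit_minutes = 15  # assumed length of one unit of time, in minutes
wait = expon(scale=1 / lam)

mean_wait = wait.mean()
print(mean_wait, mean_wait * unit_minutes)  # mean wait in unit time and in minutes
print(wait.median() * unit_minutes)         # half of all waits are shorter than this
print(wait.sf(0.5))                         # probability of waiting more than half a unit
```

The median wait is shorter than the mean wait because the exponential distribution is skewed: most waits are short, and a few long waits pull the average up.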
