
Anurag Verma

Probability for data science

Day 10 of 100 Days Data Science Bootcamp from noob to expert.

Recap Day 9

Yesterday we studied statistics in Python in detail.

Let's Start

Probability

Probability is the measure of the likelihood of an event occurring. It is a number between 0 and 1, with 0 indicating that an event will never happen and 1 indicating that an event will always happen. For example, the probability of flipping a fair coin and getting heads is 0.5 because heads is one of two equally likely outcomes.

Example:

The probability of rolling a 6 on a fair die is 1/6 because there is only 1 favorable outcome (rolling a 6) out of 6 possible outcomes (rolling a 1, 2, 3, 4, 5, or 6).

# Calculation of probability of rolling a 6 on a fair die
p = 1/6
print(p)

0.16666666666666666
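
The same "favorable outcomes divided by possible outcomes" rule can be applied to other events on the die. The following is a minimal sketch, assuming a fair six-sided die; the helper name `probability` is made up for illustration, and `Fraction` from the standard library is used only to keep the results exact.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (all outcomes equally likely)
sample_space = [1, 2, 3, 4, 5, 6]

def probability(event):
    """Return P(event) as favorable outcomes / total outcomes.

    `event` is a function that returns True for a favorable outcome.
    (Hypothetical helper, defined here for illustration only.)
    """
    favorable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favorable), len(sample_space))

p_six = probability(lambda x: x == 6)       # rolling a 6
p_even = probability(lambda x: x % 2 == 0)  # rolling an even number
p_seven = probability(lambda x: x == 7)     # impossible on this die

print(p_six, p_even, p_seven)  # prints: 1/6 1/2 0
```

The impossible event comes out as 0, which matches the lower end of the 0 to 1 scale described earlier.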

Random Variable

A random variable is a variable whose value depends on the outcome of a random event. For example, the number of heads obtained in 10 coin flips is a random variable because it can take on different values (0, 1, 2, ..., 10) depending on how the flips land.

Example: The code below simulates 10 coin flips. The number of heads among them is one observed value of that random variable.

# Creating a list of outcomes for a coin flip
outcomes = ['heads', 'tails']

# Using numpy's random.choice to simulate a coin flip 10 times
import numpy as np
np.random.seed(0)
results = np.random.choice(outcomes, size=10, replace=True)
print(results)

['heads' 'tails' 'tails' 'heads' 'tails' 'tails' 'tails' 'tails' 'tails'
 'tails']
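
To see the random variable take on different values, the 10-flip experiment can be repeated many times. This is a rough sketch using only Python's standard library rather than numpy; the function name `heads_in_flips`, the seed, and the 10,000 repetitions are arbitrary choices made for illustration.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

def heads_in_flips(n_flips):
    """One observation of the random variable: heads in n_flips fair flips."""
    # Each flip is 1 (heads) or 0 (tails) with equal probability
    return sum(random.choice([0, 1]) for _ in range(n_flips))

# Observe the random variable 10,000 times
observations = [heads_in_flips(10) for _ in range(10_000)]

# How often each value occurred, from fewest heads to most
counts = Counter(observations)
for value in sorted(counts):
    print(value, counts[value])

mean_heads = sum(observations) / len(observations)
print("average number of heads:", mean_heads)
```

Values near 5 should show up most often, while 0 and 10 heads should be rare.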

Calculating Probability

When all outcomes are equally likely, the probability of an event is the number of favorable outcomes divided by the total number of possible outcomes. For example, for a fair coin, heads is 1 favorable outcome out of 2 possible outcomes (heads or tails), so the probability of heads is 1/2.

Example: The same idea gives an estimate of a probability from data. In the 10 simulated flips above, heads came up 2 times, so the estimated probability of heads is 2/10 = 0.2. An estimate from so few flips can be far from the theoretical value of 0.5.

# Counting the number of heads in the simulated coin flip results
num_heads = sum(results == 'heads')

# Calculating the probability of getting heads
p = num_heads/len(results)
print(p)

0.2
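
An estimate based on 10 flips is noisy. The sketch below, which assumes a fair coin and uses the standard library's `random` module, repeats the estimate with more flips to show it settling near 0.5. The function name `estimate_heads_probability` and the flip counts are illustrative choices, not part of the original example.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def estimate_heads_probability(n_flips):
    """Estimate P(heads) as (number of heads) / (number of flips)."""
    # random.random() < 0.5 is True with probability 0.5, standing in for heads
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# Repeat the estimate with increasing numbers of flips
estimates = {n: estimate_heads_probability(n) for n in (10, 100, 1_000, 100_000)}
for n, p_hat in estimates.items():
    print(n, p_hat)
```

With 100,000 flips the estimate should land within about 0.01 of 0.5, while the 10-flip estimate can easily be off by 0.2 or more.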

Binomial Distribution

The binomial distribution is a probability distribution that describes the number of successes in a fixed number of independent trials, each with the same probability of success. For example, if we were to flip a coin 10 times, the binomial distribution would describe the probability of getting a certain number of heads in those 10 flips. In Python, scipy's binom.pmf calculates the probability of a specific number of successes in a fixed number of trials (the counterpart of R's "dbinom").

Example: The probability of getting exactly 4 heads in 10 flips of a fair coin.

# Using scipy's binom.pmf to calculate the probability of getting 4 heads in 10 coin flips
from scipy.stats import binom
p = binom.pmf(4, 10, 0.5)
print(p)

0.2050781249999999
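
The value returned by binom.pmf can be cross-checked against the binomial formula P(X = k) = C(n, k) * p^k * (1 - p)^(n - k). The sketch below uses only `math.comb` from the standard library (Python 3.8+); `binomial_pmf` is a hypothetical helper written for this check, not a scipy function.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# 4 heads in 10 fair flips: C(10, 4) = 210 ways out of 2**10 = 1024 sequences
p_four_heads = binomial_pmf(4, 10, 0.5)
print(p_four_heads)

# Probabilities over all possible head counts (0 to 10) add up to 1
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
print(total)
```

The first value is 210/1024, which agrees with the scipy result above up to floating-point rounding.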

Continuous Random Variable

A continuous random variable is a random variable that can take on any value within a given range, rather than just discrete values. For example, the height of a person is a continuous random variable because it can take on any value within a certain range (e.g. between 1 and 7 feet).

Example: The code below simulates 100 heights from a normal distribution with a mean of 5 feet and a standard deviation of 1 foot, then plots them as a histogram.

# Generating a random sample of heights using numpy's random.normal
np.random.seed(0)
heights = np.random.normal(loc=5, scale=1, size=100)

# Plotting the distribution of heights using matplotlib
import matplotlib.pyplot as plt
plt.hist(heights, bins=20)
plt.xlabel('Height (feet)')
plt.ylabel('Count')
plt.show()


[Figure: histogram of the 100 simulated heights]
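
Because a continuous random variable can take any value in a range, probabilities are attached to intervals rather than to single exact values. The sketch below assumes the same model as the simulation (mean 5 feet, standard deviation 1 foot) and uses `statistics.NormalDist` from the standard library (Python 3.8+) instead of numpy; the interval widths are arbitrary.

```python
from statistics import NormalDist

# Assumed model, matching the simulation: mean 5 feet, standard deviation 1 foot
height = NormalDist(mu=5, sigma=1)

# P(4.5 < height < 5.5) is the area under the density curve between the bounds
p_between = height.cdf(5.5) - height.cdf(4.5)
print(round(p_between, 4))

# Shrinking the interval around 5 feet drives the probability toward 0
for width in (1.0, 0.1, 0.001):
    p = height.cdf(5 + width / 2) - height.cdf(5 - width / 2)
    print(width, p)
```

As the interval narrows, the probability shrinks toward zero, which is why the probability of any single exact height is 0 for a continuous variable.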

Central Limit Theorem:

The Central Limit Theorem states that the distribution of the mean of a large number of independent, identically distributed random variables with finite variance will be approximately normal, regardless of the distribution of the individual random variables. For example, if we take the average of 100 coin flips (counting heads as 1 and tails as 0), the Central Limit Theorem tells us that this average is approximately normally distributed, even though each individual flip can only be 0 or 1.

Example: The code below simulates 1000 sets of 100 coin flips, computes the average of each set, and compares the histogram of those averages with a fitted normal curve.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Generating 1000 sets of 100 coin flips using numpy's random.choice
np.random.seed(0)
outcomes = [0, 1]
flips = np.random.choice(outcomes, size=(1000, 100), replace=True)
averages = flips.mean(axis=1)

# Plotting the distribution of averages using matplotlib
mu, std = norm.fit(averages)
plt.hist(averages, bins=20, density=True, alpha=0.6, color='blue', label='Sample Means')
x = np.linspace(0, 1, 100)
plt.plot(x, norm.pdf(x, mu, std), 'r-', lw=2, label='Normal Distribution')
plt.xlabel('Proportion of Heads')
plt.ylabel('Density')
plt.legend()
plt.show()


[Figure: histogram of the 1000 sample means with the fitted normal curve overlaid]
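
The theorem also says where the normal curve sits: sample means centre on the population mean, with a standard deviation of sigma / sqrt(n). The sketch below checks this numerically for die rolls instead of coin flips, using only the standard library. The sample size of 50 and the 2,000 repetitions are arbitrary choices, and the figures 3.5 and 35/12 are the mean and variance of a single fair die.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50     # rolls averaged together in each sample
NUM_SAMPLES = 2_000  # number of sample means to collect

def sample_mean():
    """Average of SAMPLE_SIZE rolls of a fair six-sided die."""
    rolls = [random.randint(1, 6) for _ in range(SAMPLE_SIZE)]
    return sum(rolls) / SAMPLE_SIZE

means = [sample_mean() for _ in range(NUM_SAMPLES)]

# Theory for one fair die: mean 3.5, variance 35/12
expected_mean = 3.5
expected_std = (35 / 12) ** 0.5 / SAMPLE_SIZE ** 0.5

observed_mean = statistics.mean(means)
observed_std = statistics.stdev(means)

# For a normal distribution, about 68% of values lie within one standard deviation
within_one_std = sum(abs(m - expected_mean) <= expected_std for m in means) / NUM_SAMPLES

print(observed_mean, expected_mean)
print(observed_std, expected_std)
print(within_one_std)
```

A single die roll is uniform on 1 to 6, which looks nothing like a bell curve, yet the observed mean and spread of the averages should land close to the theoretical values.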

Normal Distribution:

The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is symmetric around the mean. It is commonly used to model real-world data, such as test scores or blood pressure levels. In Python, scipy's norm.pdf calculates the probability density at a specific value within a normal distribution (the counterpart of R's "dnorm"). A density is not itself a probability; probabilities correspond to areas under the density curve.

Example: We can use the normal distribution to model test scores, with a mean of 75 and a standard deviation of 10.

# Using scipy's norm.pdf to calculate the probability density of a score of 80 in a normal distribution with mu=75 and std=10
from scipy.stats import norm
p = norm.pdf(80, 75, 10)
print(p)

0.03520653267642995
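
The density value above is the height of the curve at a score of 80, not the chance of scoring 80. Probabilities come from the cumulative distribution. The sketch below assumes the same test-score model (mean 75, standard deviation 10) and uses `statistics.NormalDist` from the standard library (Python 3.8+) in place of scipy.

```python
from statistics import NormalDist

# Assumed model from the example: mean 75, standard deviation 10
scores = NormalDist(mu=75, sigma=10)

# Height of the density curve at 80 (same quantity that norm.pdf returns)
print(scores.pdf(80))

# Probability of scoring above 80: area under the curve to the right of 80
p_above_80 = 1 - scores.cdf(80)
print(round(p_above_80, 4))

# Probability of landing within 1, 2, and 3 standard deviations of the mean
p_within = {k: scores.cdf(75 + k * 10) - scores.cdf(75 - k * 10) for k in (1, 2, 3)}
for k, p in p_within.items():
    print(k, round(p, 4))
```

The last three values should come out near 0.68, 0.95, and 0.997, the familiar rule of thumb for normally distributed data.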

Z-scores:

Z-scores are used to standardize a value within a normal distribution, allowing for comparison between different data sets. A z-score is calculated by subtracting the mean of the distribution from a specific value and dividing by the standard deviation. In Python, scipy's stats.zscore standardizes every value in a data set using that data set's own mean and standard deviation (similar to R's "scale").

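Example: scipy's stats.zscore function calculates the z-score of every value in a data set. The code below simulates 100 test scores from the distribution in the example above and prints the z-score of the first simulated score.

import numpy as np
from scipy.stats import zscore

# Simulating 100 test scores with mean 75 and standard deviation 10
scores = np.random.normal(75, 10, 100)

# Calculating the z-score of the first simulated score, relative to the sample's own mean and standard deviation
z = zscore(scores)[0]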
print(z)

-0.3000028431476816
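
For a single value with a known mean and standard deviation, the z-score formula can be applied directly. This is a small sketch: `z_score` is a hypothetical helper, and the second exam (mean 600, standard deviation 100) is an invented example used only to show a comparison across different scales.

```python
def z_score(value, mean, std):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std

# Test score of 80 when scores have mean 75 and standard deviation 10
z_test = z_score(80, 75, 10)
print(z_test)  # 0.5

# Hypothetical second exam on a different scale: mean 600, standard deviation 100
z_other = z_score(680, 600, 100)
print(z_other)  # 0.8
```

Although 80 and 680 are on different scales, their z-scores are directly comparable: the second result sits further above its own mean.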




Summary:

This article gives an overview of the key probability concepts used in machine learning and data science. It begins by defining probability and random variables, then shows how to calculate probabilities and introduces the binomial distribution. It also covers continuous random variables and the Central Limit Theorem, and closes with the normal distribution and z-scores. Each concept is illustrated with a Python example.
