Data and Sampling Distributions- II

#machinelearning #datascience #python #beginners

At the end of Part-I, we talked about how to estimate the standard error of a statistic. In this part, we continue that discussion with the bootstrap and confidence intervals.

The Bootstrap

One easy and effective way to estimate the sampling distribution of a statistic, or of model parameters, is to draw additional samples, with replacement, from the sample itself and recalculate the statistic or model for each resample. This procedure is called the bootstrap, and it does not necessarily involve any assumptions about the data or the sample statistic being normally distributed.

Conceptually, you can imagine the bootstrap as replicating the original sample thousands or millions of times so that you have a hypothetical population that embodies all the knowledge from your original sample.

In practice, it is not necessary to actually replicate the sample a huge number of times. We simply replace each observation after each draw; that is, we sample with replacement.

The algorithm for bootstrap resampling of the mean, for a sample of size n, is as follows:

1. Draw a sample value, record it, and then replace it.
2. Repeat n times.
3. Record the mean of the n resampled values.
4. Repeat steps 1-3 R times.
5. Use the R results to:
   a. Calculate their standard deviation (this estimates the standard error of the sample mean).
   b. Produce a boxplot or histogram.
   c. Find a confidence interval.
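
The steps above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library so it runs without the wine dataset; the function name `bootstrap_means`, the toy sample, and the seed are assumptions made for the example, not part of the original analysis.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples, seed=None):
    """Return the means of n_resamples bootstrap resamples of sample."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_resamples):  # repeat the resampling R times
        # Draw n values with replacement from the original sample
        resample = rng.choices(sample, k=n)
        # Record the mean of the n resampled values
        means.append(statistics.mean(resample))
    return means

# Hypothetical toy sample, used only for illustration
toy_sample = [12.0, 15.5, 9.8, 22.1, 18.4, 11.3, 14.7, 20.0, 16.2, 13.9]
boot_means = bootstrap_means(toy_sample, n_resamples=1000, seed=42)

# The standard deviation of the resampled means estimates
# the standard error of the sample mean
standard_error = statistics.stdev(boot_means)
print('Bootstrap standard error of the mean:', round(standard_error, 3))
```

The list `boot_means` can also be passed to a histogram or boxplot routine, or sorted to read off percentiles for a confidence interval.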

The number of bootstrap iterations, R, is set somewhat arbitrarily. The more iterations you run, the more accurate the estimate of the standard error.
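
One way to see the effect of R is to repeat the whole bootstrap several times and look at how much the standard error estimate itself varies from run to run. The sketch below does this for a small and a large R; the sample, the seed, the two values of R, and the helper names `bootstrap_se` and `spread_of_estimates` are all assumptions chosen for illustration.

```python
import random
import statistics

def bootstrap_se(sample, n_resamples, rng):
    """Bootstrap estimate of the standard error of the mean."""
    n = len(sample)
    means = [statistics.fmean(rng.choices(sample, k=n))
             for _ in range(n_resamples)]
    return statistics.stdev(means)

rng = random.Random(0)
# Hypothetical sample: 50 draws from a uniform distribution on [0, 100]
sample = [rng.uniform(0, 100) for _ in range(50)]

def spread_of_estimates(n_resamples, n_repeats=20):
    """Standard deviation of the SE estimate across repeated bootstrap runs."""
    estimates = [bootstrap_se(sample, n_resamples, rng)
                 for _ in range(n_repeats)]
    return statistics.stdev(estimates)

spread_small_r = spread_of_estimates(n_resamples=50)
spread_large_r = spread_of_estimates(n_resamples=1000)
print('Spread of SE estimates with R=50:  ', round(spread_small_r, 3))
print('Spread of SE estimates with R=1000:', round(spread_large_r, 3))
```

With the larger R, the repeated estimates should cluster much more tightly, which is what "more accurate" means here.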

Continuing with the Red Wine Quality dataset from Part-I, we take Total Sulfur Dioxide as the feature of interest and use the bootstrap to estimate the bias and the standard error.

import pandas as pd
from sklearn.utils import resample

# `data` is the Red Wine Quality DataFrame loaded in Part-I
so2 = data['total sulfur dioxide']

boot_sample = 1000   # size of each bootstrap resample
n_iterations = 1000  # number of bootstrap iterations (R)

results = []
for _ in range(n_iterations):
    # Draw a resample with replacement and record its mean
    sample = resample(so2, replace=True, n_samples=boot_sample)
    results.append(sample.mean())

results = pd.Series(results)

print('Bootstrap Statistics:')
print('Original Population Size : ', so2.shape[0])
print('Bootstrap Sample Size : ', boot_sample)
# Note: this line reports the median of the original data,
# while the bias and standard error below refer to the mean
print('Original: ', so2.median())
print('Bias: ', results.mean() - so2.mean())
print('Standard Error: ', results.std())

#Output:
Bootstrap Statistics:
Original Population Size :  1599
Bootstrap Sample Size :  1000
Original:  38.0
Bias:  -0.016345870231326387
Standard Error:  1.071951943585676

The bootstrap can be used with multivariate data, where the rows are sampled as units.
A model might then be run on the bootstrapped data, for example, to estimate the stability (variability) of model parameters, or to improve predictive power.
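With classification and regression trees (CART, also called decision trees), running multiple trees on bootstrap samples and then averaging their predictions (or, with classification, taking a majority vote) generally performs better than using a single tree. This process is called bagging (short for bootstrap aggregating), and it is the idea behind Random Forest.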
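
As an illustration of resampling rows as units to judge the stability of a model parameter, the sketch below bootstraps the slope of a simple least-squares line. The data are synthetic (y = 2x plus noise), and the helper `fit_slope`, the seed, and the number of resamples are assumptions made for this example only.

```python
import random
import statistics

def fit_slope(rows):
    """Least-squares slope of y on x for a list of (x, y) rows."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(1)
# Hypothetical two-column data: y = 2x + Gaussian noise
rows = [(x, 2.0 * x + rng.gauss(0, 1.0)) for x in range(1, 41)]

slopes = []
for _ in range(500):
    # Resample whole rows so each x stays paired with its y
    boot_rows = rng.choices(rows, k=len(rows))
    slopes.append(fit_slope(boot_rows))

original_slope = fit_slope(rows)
slope_se = statistics.stdev(slopes)
print('Slope on original data:', round(original_slope, 3))
print('Bootstrap SE of slope: ', round(slope_se, 3))
```

A small standard error relative to the slope suggests the fitted parameter is stable under resampling; bagging goes one step further and averages the predictions of the models fitted to each resample.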

As we can see, the bootstrap is used extensively in Machine Learning.

Confidence Intervals

The idea behind a confidence interval is uncertainty. We usually report a single point estimate, and it is tempting to put more faith in that number than it deserves; presenting a range of values instead is one way to counteract this tendency.

Confidence intervals always come with a coverage level, expressed as a (high) percentage, say 90% or 95%.
One way to think of a 90% confidence interval is as follows: it is the interval that encloses the central 90% of the bootstrap sampling distribution of a sample statistic.
More generally, an x% confidence interval around a sample estimate should, on average, contain similar sample estimates x% of the time (when a similar sampling procedure is followed).

Bootstrap is a general tool that can be used to generate confidence intervals for most statistics, or model parameters.

The percentage associated with the confidence interval is termed the level of confidence. The higher the level of confidence, the wider the interval.
Also, the smaller the sample, the wider the interval (i.e., the greater the uncertainty).
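
Both effects can be checked with a small percentile-bootstrap experiment. The following is a sketch under stated assumptions: the data are synthetic draws from a normal distribution, the percentile is taken by a simple nearest-index rule, and the helper names `bootstrap_means` and `percentile_interval` are hypothetical.

```python
import random
import statistics

def bootstrap_means(sample, rng, n_resamples=1000):
    """Sorted means of n_resamples bootstrap resamples of sample."""
    n = len(sample)
    return sorted(statistics.fmean(rng.choices(sample, k=n))
                  for _ in range(n_resamples))

def percentile_interval(sorted_means, level):
    """Central `level`% interval of the sorted bootstrap means."""
    alpha = 100 - level
    last = len(sorted_means) - 1
    lower = sorted_means[round(alpha / 2 / 100 * last)]
    upper = sorted_means[round((100 - alpha / 2) / 100 * last)]
    return lower, upper

rng = random.Random(7)
# Hypothetical data: normal with mean 50 and standard deviation 10
values = [rng.gauss(50, 10) for _ in range(1000)]
large_sample = values          # n = 1000
small_sample = values[:50]     # n = 50

means_large = bootstrap_means(large_sample, rng)
means_small = bootstrap_means(small_sample, rng)

low90, high90 = percentile_interval(means_large, 90)
low99, high99 = percentile_interval(means_large, 99)
low90_small, high90_small = percentile_interval(means_small, 90)

print('90% CI, n=1000: width', round(high90 - low90, 3))
print('99% CI, n=1000: width', round(high99 - low99, 3))
print('90% CI, n=50:   width', round(high90_small - low90_small, 3))
```

The 99% interval should come out wider than the 90% interval on the same data, and the 90% interval for the 50-observation sample should be wider than the one for the full 1000 observations.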

For a data scientist, a confidence interval is a tool that can be used to get an idea of how variable a sample result might be.

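import numpy as np

# Create a dataset of 1000 values drawn uniformly from [0.5, 1.0)
# (np.random.rand samples from a uniform, not a normal, distribution)
dataset = 0.5 + np.random.rand(1000) * 0.5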
# bootstrap
scores = list()
for _ in range(100):
    # bootstrap sample
    indices = np.random.randint(0, 1000, 1000)
    sample = dataset[indices]
    # calculate and store statistic
    statistic = np.mean(sample)
    scores.append(statistic)
print('50th percentile (median) = %.3f' % np.median(scores))
# calculate 95% confidence intervals (100 - alpha)
alpha = 5.0
# calculate lower percentile (e.g. 2.5)
lower_p = alpha / 2.0
# retrieve observation at lower percentile
lower = max(0.0, np.percentile(scores, lower_p))
print('%.1fth percentile = %.3f' % (lower_p, lower))
# calculate upper percentile (e.g. 97.5)
upper_p = (100 - alpha) + (alpha / 2.0)
# retrieve observation at upper percentile
upper = min(1.0, np.percentile(scores, upper_p))
print('%.1fth percentile = %.3f' % (upper_p, upper))

In this article, we covered two major concepts: the bootstrap and confidence intervals. Both are widely used in Data Science for a variety of applications.

Fin