Statistics can explain many things, such as DNA testing, the factors associated with diseases like cancer or heart disease, or the idiocy of playing the lottery. It is present everywhere in day-to-day life, from batting averages in cricket to US presidential election polls, and from weather forecast probabilities to data science and machine learning. Statistics is the branch of mathematics that deals with the collection, organization, analysis, interpretation, and presentation of data.
Machine learning, one of the most sought-after technologies today, is essentially the application of statistics to help computers make decisions based on repeatable patterns found in data.
In this post, we will look at the basics of statistics, namely the mean, median, mode, quartiles, and standard deviation, and compute each of them with Python.
The mean is the average of a set of numbers: we add the numbers and divide the sum by the number of items. The code is given below.
    a = [11, 21, 34, 22, 27, 11, 23, 21]
    mean = sum(a) / len(a)
    print(mean)
We can also calculate the mean using numpy. The code is as follows.
    import numpy as np
    mean = np.mean(a)
    print(mean)
The median is the middle term of a sorted array. For an odd number of elements it is the middle term, and for an even number of elements it is the average of the two middle terms. The implementation is given below; the array is the same as before.
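    def median(nums):
        nums = sorted(nums)  # the definition only holds for sorted data
        mid = len(nums) // 2
        if len(nums) % 2 == 0:
            # true division keeps the .5 when the two middle terms differ
            return (nums[mid - 1] + nums[mid]) / 2
        else:
            return nums[mid]
    print(median(a))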
The numpy code for finding the median is as follows.
    import numpy as np
    print(np.median(a))
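To see why the sorting step matters, the sketch below runs the same middle-of-the-list calculation on the raw array and on a sorted copy, then compares both with `statistics.median` from Python's standard library. The helper name `middle_value` is made up for this illustration.

```python
import statistics

a = [11, 21, 34, 22, 27, 11, 23, 21]

def middle_value(nums):
    """Average of the middle term(s) of nums, taken in the order given."""
    mid = len(nums) // 2
    if len(nums) % 2 == 0:
        return (nums[mid - 1] + nums[mid]) / 2
    return nums[mid]

unsorted_result = middle_value(a)        # uses 22 and 27, the middle of the raw list
sorted_result = middle_value(sorted(a))  # uses 21 and 22, the true middle values
reference = statistics.median(a)         # the standard library sorts internally

print(unsorted_result, sorted_result, reference)
```

Only the sorted version agrees with the library result, so any hand-written median should sort its input or clearly require sorted data.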
The mode is the element with the highest frequency in a list, that is, the element that occurs the most times. A Python implementation to find the mode is given below.
    from collections import Counter
    counts = Counter(a)
    highest = max(counts.values())  # computed once, not once per element
    mode = [k for k, v in counts.items() if v == highest]
    print(mode)
SciPy provides a function to find the mode of an array or list of elements. One drawback of this function is that it returns only one value even if the data is multimodal.
    from scipy import stats
    print(stats.mode(a))
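For multimodal data, Python 3.8 and later also offer `statistics.multimode` in the standard library, which returns every value that shares the highest count. A minimal sketch using the same array:

```python
import statistics

a = [11, 21, 34, 22, 27, 11, 23, 21]

# multimode returns all values tied for the highest frequency,
# in the order they first appear in the data.
modes = statistics.multimode(a)
print(modes)
```

Here both 11 and 21 occur twice, so both are reported.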
The quartiles divide the data into four parts. The first part runs from the start to the first quartile (Q1), the second from Q1 to the second quartile (Q2), the third from Q2 to Q3, and the fourth from Q3 to the end. The data must be sorted in order to find the quartiles. The code for finding the quartiles is given below; the median function is the one defined in the median section above.
    def quartiles(nums):
        nums = sorted(nums)
        half = len(nums) // 2
        Q1 = median(nums[:half])  # median of the lower half
        Q2 = median(nums)
        if len(nums) % 2 == 0:
            Q3 = median(nums[half:])
        else:
            Q3 = median(nums[half + 1:])  # skip the middle element
        return Q1, Q2, Q3
    print(quartiles(a))
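Quartiles do not have a single agreed definition, so different tools can return different values for the same data. As a rough comparison, the sketch below puts the median-of-halves approach used above next to `statistics.quantiles` from the standard library (Python 3.8+), with both of its `method` options. The function name `quartiles_by_halves` is made up for this example.

```python
import statistics

a = [11, 21, 34, 22, 27, 11, 23, 21]

def quartiles_by_halves(nums):
    """Q1 and Q3 are the medians of the lower and upper halves."""
    nums = sorted(nums)
    half = len(nums) // 2
    lower = nums[:half]
    # For an odd count, leave the middle element out of both halves.
    upper = nums[half:] if len(nums) % 2 == 0 else nums[half + 1:]
    return (statistics.median(lower),
            statistics.median(nums),
            statistics.median(upper))

by_halves = quartiles_by_halves(a)
exclusive = statistics.quantiles(a, n=4)  # default method is "exclusive"
inclusive = statistics.quantiles(a, n=4, method="inclusive")

print(by_halves)
print(exclusive)
print(inclusive)
```

All three agree on Q2 (21.5) but differ on Q1 and Q3, so it is worth stating which convention is in use when reporting quartiles.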
Standard deviation is a measure of the dispersion, or spread, of data. It is the square root of the variance, which is the mean of the squared deviations from the mean. A simple Python implementation is given below.
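    n = len(a)
    mu = sum(a) / n  # compute the mean once instead of once per element
    std = (sum((x - mu) ** 2 for x in a) / n) ** 0.5  # divides by n (population form)
    print(std)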
The numpy function for the standard deviation is used as follows.
    import numpy as np
    print(np.std(a))
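The implementations above divide by n, which gives the population standard deviation. When the data is a sample drawn from a larger population, it is common to divide by n - 1 instead. A small sketch of both, using `statistics.pstdev` and `statistics.stdev` from the standard library:

```python
import statistics

a = [11, 21, 34, 22, 27, 11, 23, 21]

population_std = statistics.pstdev(a)  # divides the squared deviations by n
sample_std = statistics.stdev(a)       # divides by n - 1

print(population_std, sample_std)
```

With numpy, the same choice is made through the `ddof` argument: `np.std(a)` uses `ddof=0` by default, and `np.std(a, ddof=1)` gives the sample version.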