Roberto B.

Posted on Jan 30, 2022 • Updated on Feb 1, 2022

Statistics with PHP

#php #statistics #math #package

I was playing with FIT files.
A FIT file collects a lot of information about your sport activities: it tracks your heart rate, speed, cadence, power, etc.
I needed to apply some statistical functions to better understand the numbers and my sport activity performance.
I collected some functions like mean, mode, median, range, quantiles, first quartile (or 25th percentile), third quartile (or 75th percentile), frequency table (cumulative, relative), standard deviation (population and sample), variance (population and sample), etc.
So I decided to write a PHP package that collects all these functions.
This package is inspired by the Python statistics module: I needed something similar to the Python module, but in my favorite language, PHP.

Installation

You can install the package via composer:

composer require hi-folks/statistics

Source code

Source code is here:
https://github.com/Hi-Folks/statistics

Usage

You have some classes:

Stat class with static methods for basic (descriptive) statistics functions like mean, median...
Freq class with static methods for generating frequency tables
Statistics class that relies on the Stat and Freq classes and lets the developer instantiate a Statistics object to store data and manage it.

The package is fully tested with Pest, an elegant PHP testing framework. I have just released version 0.1.4, with full coverage (100%).

Stat class

This class provides methods for calculating mathematical statistics of numeric data.
Stat class has methods to calculate an average or typical value from a population or sample like:

mean(): arithmetic mean or "average" of data;
median(): median or "middle value" of data;
medianLow(): low median of data;
medianHigh(): high median of data;
mode(): single mode (most common value) of discrete or nominal data;
multimode(): list of modes (most common values) of discrete or nominal data;
quantiles(): cut points dividing the range of a probability distribution into continuous intervals with equal probabilities;
thirdQuartile(): third quartile, the value below which 75 percent of the data falls;
firstQuartile(): first quartile, the value below which 25 percent of the data falls;
pstdev(): Population standard deviation
stdev(): Sample standard deviation
pvariance(): variance for a population
variance(): variance for a sample
geometricMean(): geometric mean
harmonicMean(): harmonic mean

Stat::mean( array $data )

Return the sample arithmetic mean of the array $data.
The arithmetic mean is the sum of the data divided by the number of data points. It is commonly called “the average”, although it is only one of many mathematical averages. It is a measure of the central location of the data.

use HiFolks\Statistics\Stat;
$mean = Stat::mean([1, 2, 3, 4, 4]);
// 2.8
$mean = Stat::mean([-1.0, 2.5, 3.25, 5.75]);
// 2.625
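If you are curious about what happens under the hood, the definition above fits in a few lines of plain PHP. This is only a minimal sketch: arithmeticMeanSketch() is a hypothetical helper written for this post, not the code used inside the package.

```php
<?php
// Hypothetical helper for illustration, not part of hi-folks/statistics.
function arithmeticMeanSketch(array $data): float
{
    if ($data === []) {
        throw new InvalidArgumentException('The data must not be empty.');
    }
    // Sum of the data divided by the number of data points.
    return array_sum($data) / count($data);
}

echo arithmeticMeanSketch([1, 2, 3, 4, 4]), PHP_EOL;         // 2.8
echo arithmeticMeanSketch([-1.0, 2.5, 3.25, 5.75]), PHP_EOL; // 2.625
```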

Stat::geometricMean( array $data )

The geometric mean indicates the central tendency or typical value of the data using the product of the values (as opposed to the arithmetic mean which uses their sum).

use HiFolks\Statistics\Stat;
$mean = Stat::geometricMean([54, 24, 36], 1);
// 36.0
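To see where 36.0 comes from: 54 * 24 * 36 = 46656, and the cube root of 46656 is 36. The sketch below computes the same thing with logarithms, which avoids overflowing on long lists. geometricMeanSketch() is a hypothetical helper, and the use of logarithms is my assumption for this example, not necessarily what the package does internally.

```php
<?php
// Hypothetical helper for illustration, not part of hi-folks/statistics.
function geometricMeanSketch(array $data, ?int $round = null): float
{
    if ($data === []) {
        throw new InvalidArgumentException('The data must not be empty.');
    }
    $logSum = 0.0;
    foreach ($data as $value) {
        if ($value <= 0) {
            throw new InvalidArgumentException('All values must be positive.');
        }
        $logSum += log($value);
    }
    // exp(mean(log(x))) equals the n-th root of the product of the values.
    $mean = exp($logSum / count($data));

    return $round === null ? $mean : round($mean, $round);
}

echo geometricMeanSketch([54, 24, 36], 1), PHP_EOL; // 36
```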

Stat::harmonicMean( array $data )

The harmonic mean is the reciprocal of the arithmetic mean() of the reciprocals of the data. For example, the harmonic mean of three values a, b and c will be equivalent to 3/(1/a + 1/b + 1/c). If one of the values is zero, the result will be zero.

use HiFolks\Statistics\Stat;
$mean = Stat::harmonicMean([40, 60], null, 1);
// 48.0

You can also calculate the weighted harmonic mean.
Suppose a car travels at 40 km/h for 5 km and, when traffic clears, speeds up to 60 km/h for the remaining 30 km of the journey. What is the average speed?

use HiFolks\Statistics\Stat;
Stat::harmonicMean([40, 60], [5, 30], 1);
// 56.0

where:

40, 60: the elements
5, 30: the weights for each element (the first weight belongs to the first element, the second weight to the second element)
1: the number of decimal places to round the result to
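The car example can be checked by hand: the weighted harmonic mean is the sum of the weights divided by the sum of each weight over its value, so (5 + 30) / (5/40 + 30/60) = 35 / 0.625 = 56. Here is a minimal sketch of that formula. harmonicMeanSketch() is a hypothetical helper that only mirrors the argument order used above; it is not the package's implementation.

```php
<?php
// Hypothetical helper for illustration, not part of hi-folks/statistics.
function harmonicMeanSketch(array $data, ?array $weights = null, ?int $round = null): float
{
    $values = array_values($data);
    if ($values === []) {
        throw new InvalidArgumentException('The data must not be empty.');
    }
    // Without explicit weights, every value counts once.
    $weights = $weights === null
        ? array_fill(0, count($values), 1)
        : array_values($weights);
    if (count($weights) !== count($values)) {
        throw new InvalidArgumentException('One weight per value is required.');
    }

    $weightedReciprocals = 0.0;
    foreach ($values as $i => $value) {
        if ($value < 0) {
            throw new InvalidArgumentException('Negative values are not supported.');
        }
        if ($value == 0) {
            return 0.0; // a zero value forces the harmonic mean to zero
        }
        $weightedReciprocals += $weights[$i] / $value;
    }
    // Sum of the weights divided by the weighted sum of the reciprocals.
    $mean = array_sum($weights) / $weightedReciprocals;

    return $round === null ? $mean : round($mean, $round);
}

echo harmonicMeanSketch([40, 60], null, 1), PHP_EOL;    // 48
echo harmonicMeanSketch([40, 60], [5, 30], 1), PHP_EOL; // 56
```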

Stat::median( array $data )

Return the median (middle value) of numeric data, using the common “mean of middle two” method.

use HiFolks\Statistics\Stat;
$median = Stat::median([1, 3, 5]);
// 3
$median = Stat::median([1, 3, 5, 7]);
// 4

Stat::medianLow( array $data )

Return the low median of numeric data.
The low median is always a member of the data set. When the number of data points is odd, the middle value is returned. When it is even, the smaller of the two middle values is returned.

use HiFolks\Statistics\Stat;
$median = Stat::medianLow([1, 3, 5]);
// 3
$median = Stat::medianLow([1, 3, 5, 7]);
// 3

Stat::medianHigh( array $data )

Return the high median of data.
The high median is always a member of the data set. When the number of data points is odd, the middle value is returned. When it is even, the larger of the two middle values is returned.

use HiFolks\Statistics\Stat;
$median = Stat::medianHigh([1, 3, 5]);
// 3
$median = Stat::medianHigh([1, 3, 5, 7]);
// 5
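The three medians differ only in which element of the sorted data they pick. A minimal sketch of the idea follows; the function names are hypothetical helpers written for this post, not the package's code.

```php
<?php
// Hypothetical helpers for illustration, not part of hi-folks/statistics.
function sortedValuesSketch(array $data): array
{
    if ($data === []) {
        throw new InvalidArgumentException('The data must not be empty.');
    }
    $values = array_values($data);
    sort($values);

    return $values;
}

function medianLowSketch(array $data)
{
    $values = sortedValuesSketch($data);
    // For an even count, this index is the smaller of the two middle values.
    return $values[intdiv(count($values) - 1, 2)];
}

function medianHighSketch(array $data)
{
    $values = sortedValuesSketch($data);
    // For an even count, this index is the larger of the two middle values.
    return $values[intdiv(count($values), 2)];
}

function medianSketch(array $data)
{
    // "Mean of middle two": with an odd count both medians are the same element.
    return (medianLowSketch($data) + medianHighSketch($data)) / 2;
}

echo medianSketch([1, 3, 5, 7]), PHP_EOL;     // 4
echo medianLowSketch([1, 3, 5, 7]), PHP_EOL;  // 3
echo medianHighSketch([1, 3, 5, 7]), PHP_EOL; // 5
```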

Stat::quantiles( array $data, $n=4, $round=null )

Divide data into n continuous intervals with equal probability. Returns a list of n - 1 cut points separating the intervals.
Set n to 4 for quartiles (the default). Set n to 10 for deciles. Set n to 100 for percentiles, which gives the 99 cut points that separate the data into 100 equal-sized groups.

use HiFolks\Statistics\Stat;
$quantiles = Stat::quantiles([98, 90, 70,18,92,92,55,83,45,95,88]);
// [ 55.0, 88.0, 92.0 ]
$quantiles = Stat::quantiles([105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,103, 107, 101, 81, 109, 104], 10);
// [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
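There are several ways to compute cut points. The sketch below uses the "exclusive" interpolation method, which is the default of the Python statistics module; I am assuming that method here only for illustration. quantilesSketch() is a hypothetical helper, not the package's implementation, and it reproduces the quartiles of the first example above.

```php
<?php
// Hypothetical helper for illustration, not part of hi-folks/statistics.
// Assumption: "exclusive" method, as in the default of Python's statistics.quantiles().
function quantilesSketch(array $data, int $n = 4, ?int $round = null): array
{
    if ($n < 1) {
        throw new InvalidArgumentException('n must be at least 1.');
    }
    if (count($data) < 2) {
        throw new InvalidArgumentException('At least two data points are required.');
    }
    $values = array_values($data);
    sort($values);
    $count = count($values);
    $m = $count + 1;

    $cutPoints = [];
    for ($i = 1; $i < $n; $i++) {
        // Position of the i-th cut point, clamped to the valid index range.
        $j = min(max(intdiv($i * $m, $n), 1), $count - 1);
        // How far the cut point lies between the two neighbouring values.
        $delta = $i * $m - $j * $n;
        $cut = (float) (($values[$j - 1] * ($n - $delta) + $values[$j] * $delta) / $n);
        $cutPoints[] = $round === null ? $cut : round($cut, $round);
    }

    return $cutPoints;
}

echo implode(', ', quantilesSketch([98, 90, 70, 18, 92, 92, 55, 83, 45, 95, 88])), PHP_EOL;
// 55, 88, 92
```

With this method, the first and third quartiles are simply the first and last of the three cut points returned for n = 4.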

Stat::firstQuartile( array $data, $round=null )

The lower quartile, or first quartile (Q1), is the value under which 25% of data points are found when they are arranged in increasing order.

use HiFolks\Statistics\Stat;
$percentile = Stat::firstQuartile([98, 90, 70,18,92,92,55,83,45,95,88]);
// 55.0

Stat::thirdQuartile( array $data, $round=null )

The upper quartile, or third quartile (Q3), is the value under which 75% of data points are found when arranged in increasing order.

use HiFolks\Statistics\Stat;
$percentile = Stat::thirdQuartile([98, 90, 70,18,92,92,55,83,45,95,88]);
// 92.0

Stat::pstdev( array $data )

Return the Population Standard Deviation, a measure of the amount of variation or dispersion of a set of values.
A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

use HiFolks\Statistics\Stat;
$stdev = Stat::pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]);
// 0.986893273527251
$stdev = Stat::pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75], 4);
// 0.9869

Stat::stdev( array $data )

Return the Sample Standard Deviation, a measure of the amount of variation or dispersion of a set of values.
A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

use HiFolks\Statistics\Stat;
$stdev = Stat::stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75]);
// 1.0810874155219827
$stdev = Stat::stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75], 4);
// 1.0811

Stat::variance( array $data )

Variance is a measure of dispersion of data points from the mean.
Low variance indicates that data points are generally similar and do not vary widely from the mean.
High variance indicates that data values have greater variability and are more widely dispersed from the mean.

To calculate the variance of a sample:

use HiFolks\Statistics\Stat;
$variance = Stat::variance([2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]);
// 1.3720238095238095

If you need to calculate the variance of the whole population and not just of a sample, use the pvariance method:

use HiFolks\Statistics\Stat;
$variance = Stat::pvariance([0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]);
// 1.25
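The population and sample versions differ only in the divisor: the sum of squared deviations from the mean is divided by n for a population and by n - 1 for a sample, and the standard deviation is the square root of the variance. A minimal sketch, using hypothetical helpers that are not the package's code:

```php
<?php
// Hypothetical helpers for illustration, not part of hi-folks/statistics.
function sumOfSquaredDeviationsSketch(array $data): float
{
    $mean = array_sum($data) / count($data);
    $sum = 0.0;
    foreach ($data as $value) {
        $sum += ($value - $mean) ** 2;
    }

    return $sum;
}

function pvarianceSketch(array $data): float
{
    if (count($data) < 1) {
        throw new InvalidArgumentException('At least one data point is required.');
    }
    // Population variance: divide by n.
    return sumOfSquaredDeviationsSketch($data) / count($data);
}

function varianceSketch(array $data): float
{
    if (count($data) < 2) {
        throw new InvalidArgumentException('At least two data points are required.');
    }
    // Sample variance: divide by n - 1.
    return sumOfSquaredDeviationsSketch($data) / (count($data) - 1);
}

function pstdevSketch(array $data): float
{
    return sqrt(pvarianceSketch($data));
}

function stdevSketch(array $data): float
{
    return sqrt(varianceSketch($data));
}

echo pvarianceSketch([0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]), PHP_EOL; // 1.25
```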

Freq class

With the Statistics package you can calculate a frequency table.
A frequency table is the list of the frequency of various outcomes in a sample.
Each entry in the table contains the frequency or count of the occurrences of values within a particular group or interval.

Freq::frequencies( array $data )

use HiFolks\Statistics\Freq;

$fruits = ['🍈', '🍈', '🍈', '🍉','🍉','🍉','🍉','🍉','🍌'];
$freqTable = Freq::frequencies($fruits);
print_r($freqTable);

You can see the frequency table as an array:

Array
(
    [🍈] => 3
    [🍉] => 5
    [🍌] => 1
)

Freq::relativeFrequencies( array $data )

You can retrieve the frequency table in relative format (percentage):

$freqTable = Freq::relativeFrequencies($fruits, 2);
print_r($freqTable);

You can see the frequency table as an array with the percentage of occurrences for each value:

Array
(
    [🍈] => 33.33
    [🍉] => 55.56
    [🍌] => 11.11
)
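Both tables can be approximated with PHP's built-in array_count_values(). This is a minimal sketch with hypothetical helper names, not the package's implementation; note that array_count_values() only accepts string and integer values.

```php
<?php
// Hypothetical helpers for illustration, not part of hi-folks/statistics.
function frequenciesSketch(array $data): array
{
    // Count how many times each value occurs, then order the table by value.
    $table = array_count_values($data);
    ksort($table);

    return $table;
}

function relativeFrequenciesSketch(array $data, ?int $round = null): array
{
    $total = count($data);
    $table = [];
    foreach (frequenciesSketch($data) as $value => $count) {
        // Share of the total, expressed as a percentage.
        $percentage = $count / $total * 100;
        $table[$value] = $round === null ? $percentage : round($percentage, $round);
    }

    return $table;
}

$fruits = ['🍈', '🍈', '🍈', '🍉', '🍉', '🍉', '🍉', '🍉', '🍌'];
print_r(frequenciesSketch($fruits));
print_r(relativeFrequenciesSketch($fruits, 2));
```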

Statistics class

$stat = HiFolks\Statistics\Statistics::make(
    [3,5,4,7,5,2]
);
echo $stat->valuesToString(5) . PHP_EOL;
// 2,3,4,5,5
echo "Mean              : " . $stat->mean() . PHP_EOL;
// Mean              : 4.3333333333333
echo "Count             : " . $stat->count() . PHP_EOL;
// Count             : 6
echo "Median            : " . $stat->median() . PHP_EOL;
// Median            : 4.5
echo "First Quartile  : " . $stat->firstQuartile() . PHP_EOL;
// First Quartile  : 2.5
echo "Third Quartile : " . $stat->thirdQuartile() . PHP_EOL;
// Third Quartile : 5
echo "Mode              : " . $stat->mode() . PHP_EOL;
// Mode              : 5

Calculate Frequency Table

The Statistics package has some methods for generating frequency tables:

frequencies(): a frequency is the number of times a value of the data occurs;
relativeFrequencies(): a relative frequency is the ratio (fraction or proportion) of the number of times a value of the data occurs in the set of all outcomes to the total number of outcomes;
cumulativeFrequencies(): the accumulation of the previous frequencies;
cumulativeRelativeFrequencies(): the accumulation of the previous relative frequencies.

use HiFolks\Statistics\Statistics;

$s = Statistics::make(
    [98, 90, 70,18,92,92,55,83,45,95,88,76]
);
$a = $s->frequencies();
print_r($a);
/*
Array
(
    [18] => 1
    [45] => 1
    [55] => 1
    [70] => 1
    [76] => 1
    [83] => 1
    [88] => 1
    [90] => 1
    [92] => 2
    [95] => 1
    [98] => 1
)
 */

$a = $s->relativeFrequencies();
print_r($a);
/*
Array
(
    [18] => 8.3333333333333
    [45] => 8.3333333333333
    [55] => 8.3333333333333
    [70] => 8.3333333333333
    [76] => 8.3333333333333
    [83] => 8.3333333333333
    [88] => 8.3333333333333
    [90] => 8.3333333333333
    [92] => 16.666666666667
    [95] => 8.3333333333333
    [98] => 8.3333333333333
)
 */
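The two cumulative variants are running totals of the tables above. A minimal sketch of the idea, assuming the table is ordered by value and the relative values are percentages (as in the output above); the function names are hypothetical helpers, not the package's code:

```php
<?php
// Hypothetical helpers for illustration, not part of hi-folks/statistics.
function cumulativeFrequenciesSketch(array $data): array
{
    $table = array_count_values($data);
    ksort($table);

    $running = 0;
    $cumulative = [];
    foreach ($table as $value => $count) {
        // Each entry is its own count plus all the counts before it.
        $running += $count;
        $cumulative[$value] = $running;
    }

    return $cumulative;
}

function cumulativeRelativeFrequenciesSketch(array $data): array
{
    $total = count($data);
    $cumulative = [];
    foreach (cumulativeFrequenciesSketch($data) as $value => $running) {
        // Running total expressed as a percentage of all data points.
        $cumulative[$value] = $running / $total * 100;
    }

    return $cumulative;
}

$data = [98, 90, 70, 18, 92, 92, 55, 83, 45, 95, 88, 76];
print_r(cumulativeFrequenciesSketch($data));
print_r(cumulativeRelativeFrequenciesSketch($data));
```

The last entry of the cumulative table is always the number of data points (12 here), and the last entry of the cumulative relative table is always 100.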

What's next

If you have suggestions to improve the code, or you want to add new functions or request a new feature, feel free to open a new issue here: https://github.com/Hi-Folks/statistics/issues

Todo list:

I'm going to implement and add classes/methods for:

[ ] covariance and correlation
[ ] normal distributions

Follow me on Twitter: https://twitter.com/RmeetsH

Top comments (2)

InvalidLenni • Feb 1 '22 • Edited

Exactly what I needed in such kind. Thanks! :)

Roberto B. • Feb 1 '22 • Edited

Happy to hear that.
Feel free to suggest the implementation of new methods / functions.
It is under development, so I'm going to add new things like covariance, correlation, normal distribution... etc...
