I would like to start this site as a small onboarding journey where I'll be keeping my notes in a format that can make them useful for others.
So this first post series will be all about data analysis: we will start with the very fundamentals of statistics and move on to more advanced topics such as supervised and unsupervised learning, as well as deep learning.
Intro to descriptive statistics
Data is literally everything we can keep track of and measure: it could be a list of the cars you see on the street, your school records, the price you pay for your favorite coffee recorded every day of the year, or all the usernames and emails in the whole Netflix database. So everything we can measure, that is data.
And data by itself is useless; if we record data, it is because we want to do something with it. A complex example might be a hedge fund investor who always has an eye on the stock market, accessing huge amounts of data: names and stock prices of many companies, currency exchange rates, and lots of previous buy/sell operations made by other investors. By using that data, a hedge fund analyst can evaluate the state of the market as it relates to their interests and perhaps try to make some predictions.
That example is probably what comes to mind when we think about a data analyst or someone who works with large amounts of data, but the fact is that we apply data analysis on a daily basis. For example, when we go out to buy something, we may quickly scan through all of the items in the store that meet the conditions we need and then select the one with the lowest price. We may do that without thinking, but deep down, we are applying data analysis techniques!
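Just to make that shopping intuition concrete, here is a toy sketch in Python (the items and their prices are invented purely for illustration):
items = [
    {"name": "coffee A", "organic": True, "price": 4.50},
    {"name": "coffee B", "organic": False, "price": 2.90},
    {"name": "coffee C", "organic": True, "price": 3.75},
]
# Keep only the items that satisfy our condition (here: organic)
candidates = [item for item in items if item["organic"]]
# Among those, select the one with the lowest price
cheapest = min(candidates, key=lambda item: item["price"])
print(cheapest["name"])
>>> coffee C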
In general, when we talk about data, we distinguish between qualitative and quantitative data.
Qualitative data? It represents a quality, and we won't do any numerical operations with this kind of data. Think of data such as a car model, a coffee brand or a company name.
Quantitative data? For example, the number of cars. Looking at this kind of data, we can see that it can be continuous or discrete. Continuous data can take any value within a range, including negative or decimal values: think of prices, temperatures or weights. Discrete data can only take separate, countable values, typically whole numbers: you can count 3 or 4 cars, but never 3.7 cars. There can be a bit of confusion here, since any finite set of recorded measurements may look discrete on paper, but what matters is the set of values the quantity itself could take.
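To get a feel for the difference, here is a small sketch (all values invented) showing which operations make sense for each type:
from collections import Counter
# Qualitative data: categories we can count, but not average
car_models = ["sedan", "suv", "sedan", "coupe", "sedan"]
print(Counter(car_models))
>>> Counter({'sedan': 3, 'suv': 1, 'coupe': 1})
# Quantitative, discrete: only whole-number counts are possible
cars_per_hour = [3, 7, 5, 2]
# Quantitative, continuous: any value within a range is possible
coffee_prices = [2.50, 3.15, 2.80]
print(round(sum(coffee_prices) / len(coffee_prices), 2))
>>> 2.82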
Measures of central tendency
And we can perform all kinds of operations on data. Some of them may make zero sense, but some others are pretty well known and widely used! Everyone starts with the measures of central tendency: by using them, we can get a pretty general overview of what kind of data we have.
As we want to end up doing some data science, the best we can do is start getting used to the tools, so we'll illustrate our calculations using Python and some popular data science/math libraries such as numpy and scipy.
MEAN
The mean is useful for obtaining a general overview of the data. It is obtained by adding up all the values of the list (we can call that list a dataset for now) and then dividing by the total length of the list. It is useful when we need a result that takes the whole set into account: for example, the mean is used when your teachers calculate your final marks, as every exam counts. Perhaps some exams are more important than others, but we'll dig deeper into that later in this series.
The main problem with the mean, though, is its sensitivity to extreme values. If we have a list of numbers such as 1, 2, 3, 2, 3, 2, 99999, the mean comes out close to 14287, and that number does not represent our list, which is mostly made up of numbers between 1 and 3. We use other measures, such as the median or the mode, when we have problems like this one. By the way, that large number we just saw is called an outlier! And we must be aware of them.
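We can quickly check that effect with numpy, using the list above:
import numpy as np
arr = [1, 2, 3, 2, 3, 2, 99999]
# The mean gets dragged far away from the typical values...
print("mean:", round(np.mean(arr), 2))
>>> mean: 14287.43
# ...while the median stays representative of most of the data
print("median:", np.median(arr))
>>> median: 2.0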
And here is the basic usage: the following example calculates the mean of a Python list using numpy.
import numpy as np
arr = [20, 2, 7, 1, 34]
print("mean of arr : ", np.mean(arr))
>>> 12.8
MEDIAN
We can obtain it by ordering the values from smallest to greatest; the median is then the value that sits right in the centre. If we cannot extract a single centre (i.e. we have an even number of items), the median is calculated as the mean of the two central values.
The median works particularly well if we have a dataset containing values that are way different from the others, i.e. outliers!
The median can be calculated as follows using numpy:
import numpy as np
arr = [20, 2, 7, 1, 999,-30]
print("mean of arr : ", np.median(arr))
>>> 7.0
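If we want to see the even-length rule in action, here is the same median computed by hand (a sketch, not how you'd normally do it):
arr = [20, 2, 7, 1, 999, -30]
ordered = sorted(arr)  # [-30, 1, 2, 7, 20, 999]
middle = len(ordered) // 2
if len(ordered) % 2 == 0:
    # Even number of items: average the two central values (2 and 7)
    median = (ordered[middle - 1] + ordered[middle]) / 2
else:
    # Odd number of items: take the single central value
    median = ordered[middle]
print(median)
>>> 4.5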
MODE
The mode is the most common value of the dataset, that is, the value that occurs most often. For example, imagine we are searching for a new apartment to rent: the mean apartment price of an area may give a good idea, but we can also be interested in the mode, since a lot of apartments have prices at round numbers (e.g. it's easy to find 1500 USD or 1800 USD, but rarer to find 1443 USD or 1221 USD), so the mode tells us the most typical asking price.
In this case numpy is not able to calculate the mode as easily, but scipy can, so we'll use that.
from scipy import stats
arr = [20, 2, 2, 7, 1, 34]
print("mean of arr : ", stats.mode(arr))
>>> 2
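If you prefer sticking to the standard library, the same result can be obtained with collections.Counter:
from collections import Counter
arr = [20, 2, 2, 7, 1, 34]
# Count how often each value appears, then take the most frequent one
value, count = Counter(arr).most_common(1)[0]
print("mode of arr:", value)
>>> mode of arr: 2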
MIN
It is simply the item that has the minimum value along the selected dimension, e.g. the item with the lowest price.
The MIN can be calculated using numpy as follows:
import numpy as np
arr = [20, 2, 7, 1, 34]
print("mean of arr : ", np.min(arr))
>>> 1
MAX
It is the element that has the maximum value, e.g. the biggest house.
The MAX can be calculated using numpy as follows:
import numpy as np
arr = [20, 2, 7, 1, 34]
print("mean of arr : ", np.max(arr))
>>> 20
In the next part we'll look at the measures of spread, and we'll start to see some charts!