daud99

2. Understanding/Exploring dataset (Part 1)

Before getting started with the actual coding, let's set up our environment.

As mentioned in the previous blog, after downloading the dataset we will extract the downloaded zip file GeneratedLabelledFlows.zip. Once we have all the files, we will upload them to Google Drive, in my case to /content/gdrive/My Drive/project/dataset/original. Once that's done, we will create a new notebook and connect it with Google Drive.

from google.colab import drive
drive.mount('/content/gdrive')

This step may ask you to grant permission to access your Google Drive.

Hurray! We are now successfully connected to Google Drive, which means we can easily create, edit, or delete files from Google Colab just as we would on our PC using Jupyter Notebook.

Getting an idea of data

We will use pandas to create a DataFrame in order to get an idea of what our data looks like. You can choose any file; I'm going with Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv.

dataset_path = '/content/gdrive/My Drive/project/dataset/'
import pandas as pd
df = pd.read_csv(dataset_path+'original/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv')
df.head()

head() returns the first five rows of the data frame by default.

[Screenshot: output of df.head()]

We can see that we have a total of 85 fields/columns in our dataset.
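You can also confirm the row and column counts programmatically. A minimal sketch on a hypothetical toy frame (with the real df, shape[1] would report 85):

```python
import pandas as pd

# Toy frame standing in for the real CSV; column names are illustrative only.
toy = pd.DataFrame({"Flow Duration": [1, 2], "Label": ["BENIGN", "DDoS"]})
print(toy.shape)         # (rows, columns) -> (2, 2)
print(len(toy.columns))  # number of fields -> 2
```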

Combining all dataset files into one Pandas Data Frame

In order to merge all the files into one data frame, we need to make sure all the files have the same columns. Remember! The same columns, not just the same number of columns.

We will create a list of dataframes. Each entry in the list corresponds to the dataframe for the respective CSV file in the dataset.

import os

all_files = [dataset_path+'original/'+each_file for each_file in os.listdir(dataset_path+'original/')]
all_dfs = [pd.read_csv(each_file, encoding='cp1252') for each_file in all_files]

Now that we have a list of dataframes, we will check whether all of them have the same columns or not.

import numpy as np

total_columns = all_dfs[0].columns
# Record, per file, whether its columns match the first file's columns
all_same_column = np.array([df.columns.equals(total_columns) for df in all_dfs])
for (index, same) in enumerate(all_same_column):
  if not same:
    print(f"This {all_files[index]} doesn't have the same columns")

If all the files have the same columns, we will proceed, which will be the case here. Otherwise, we would have to perform further processing first. Finally, we will merge all the dataframes into one single data frame.

if np.all(all_same_column):
  print("All files have same columns")
  # keep=False drops every copy of a duplicated row, not just the repeats
  merge_df = pd.concat(all_dfs).drop_duplicates(keep=False)
  merge_df.reset_index(drop=True, inplace=True)
  print("Total Data Shape: " + str(merge_df.shape))
else:
  print("Not all files have the same columns")

1. merge_df.info()

We will use merge_df.info() to find the number of rows and columns and the datatype of each field in the data frame.

[Screenshot: output of merge_df.info()]

Here, object indicates a datatype other than integer, for instance boolean or string. However, to find the exact type you need access to an actual value, on which you can use Python's built-in type().
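A small hedged sketch of this on a hypothetical mini-frame: the string column shows up as dtype object, and type() on a concrete value reveals what is actually stored:

```python
import pandas as pd

# Hypothetical mini-frame: column "b" will be reported as dtype object.
df_small = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(df_small.dtypes)
# Grab one concrete value and ask Python for its exact type:
print(type(df_small["b"].iloc[0]))  # <class 'str'>
```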

2. merge_df.describe()

The pandas DataFrame.describe() function is used to get a descriptive-statistics summary of a given dataframe.

Descriptive statistics involves describing, summarizing, and organizing the data so that it can be easily understood.

[Screenshot: output of merge_df.describe()]

The main goal is not to print these descriptive statistics but to interpret/understand them, so that we can utilize them in our further processing of the data.
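To make the interpretation concrete, here is a hedged sketch on a hypothetical single-column frame containing one extreme value; the statistics discussed below can all be read off its describe() output:

```python
import pandas as pd

# Hypothetical column with one extreme value (100.0).
df_toy = pd.DataFrame({"pkts": [1.0, 2.0, 3.0, 4.0, 100.0]})
summary = df_toy.describe()
print(summary)
# The mean (22.0) sits far above the 50% row / median (3.0):
# a first hint that an outlier is dragging the mean upward.
```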

  1. Count simply tells us the number of non-null values present for each field. Normally, all of the columns should have the same count. If some columns have a different count, it should ring a bell: note them down, since that usually means missing values.

  2. Mean is the most basic measure of the distribution/dataset. However, there is one problem with the mean (average): it is highly influenced by outliers. To get your mind around this concept, imagine that ten guys are sitting on bar stools in a middle-class drinking establishment in Seattle; each of these guys earns $35,000 a year, which makes the mean annual income for the group $35,000. Bill Gates walks into the bar. Let's assume for the sake of the example that Bill Gates has an annual income of $1 billion. When Bill sits down on the eleventh bar stool, the mean annual income for the bar patrons rises to about $91 million. If I were to describe the patrons of this bar as having an average annual income of $91 million, the statement would be both statistically correct and grossly misleading. This isn't a bar where multimillionaires hang out; it's a bar where a bunch of guys with relatively low incomes happen to be sitting next to Bill Gates.

  3. Min & Max are self-explanatory, as the names suggest.

  4. Std stands for Standard Deviation. The standard deviation is the descriptive statistic that allows us to assign a single number to the dispersion (spread) around the mean. For instance, a feature with a standard deviation of 0 may not be useful: a 0 STD indicates that all the values of the feature column are the same. The STD also tells you how big the swings in the data are relative to the mean. For many typical distributions of data, a high proportion of the observations lie within one standard deviation of the mean (meaning that they are in the range from one standard deviation below the mean to one standard deviation above the mean). To illustrate with a simple example, the mean height for American adult men is 5 feet 10 inches and the standard deviation is roughly 3 inches, so a high proportion of adult men are between 5 feet 7 inches and 6 feet 1 inch. Far fewer observations lie two standard deviations from the mean, and fewer still lie three or four standard deviations away.

  5. 50%, hmm, this one is special. You may wonder: we are talking about descriptive statistics, but where is the median? Here it comes: you can simply think of 50% as the median. The median is the point that divides a distribution in half, meaning that half of the observations lie above the median and half lie below. (If there is an even number of observations, the median is the midpoint between the two middle observations.) The good thing is that the median doesn't get influenced by outliers. So, one notable interpretation: if the mean and median are really close to each other, then there are not many outliers influencing the mean.

  6. 25% builds on this: as we've already discussed, the median divides a distribution in half, but the distribution can be further divided into quarters, or quartiles. The first quartile consists of the bottom 25 percent of the observations, and this is what 25% indicates here. Put simply, it is the value below which 25 percent of the instances/examples in the dataset fall.

  7. 75% is the third quartile: the value below which 75 percent of the instances/examples in the dataset fall, leaving only the top 25 percent of the dataset above it.
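The statistics above can all be computed directly. A hedged sketch with toy numbers (the bar-stool example again, not the real dataset) tying count, mean, median, std, and the quartiles together:

```python
import numpy as np

# Ten $35,000 earners plus Bill Gates (assumed $1 billion).
incomes = np.array([35_000.0] * 10 + [1_000_000_000.0])

print(len(incomes))        # count -> 11
print(incomes.mean())      # mean  -> ~90.9 million, dragged up by the outlier
print(np.median(incomes))  # 50%   -> 35000.0, unmoved by the outlier
print(incomes.min(), incomes.max())

# A zero-spread feature: std of 0 means every value is identical.
constant = np.array([5.0, 5.0, 5.0])
print(constant.std())      # 0.0

# Quartiles of the numbers 1..100.
q1, q2, q3 = np.percentile(np.arange(1, 101), [25, 50, 75])
print(q1, q2, q3)
```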

What if we observe a relatively high value for the 25% or 75% percentile?

One possible interpretation is that there are extreme outliers in the first or last quarter of the dataset, respectively.

How do the mean and standard deviation help us understand the data?

Once we know the mean and standard deviation for any collection of data, we have some serious intellectual traction. For example, suppose I tell you that the mean score on the SAT math test is 500 with a standard deviation of 100. As with height, the bulk of students taking the test will be within one standard deviation of the mean, or between 400 and 600. How many students do you think score 720 or higher? Probably not very many, since that is more than two standard deviations above the mean.
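The SAT reasoning above can be checked with a quick simulation. This sketch assumes scores are normally distributed with mean 500 and standard deviation 100 (an idealization, not real SAT data):

```python
import numpy as np

# Simulated SAT math scores: normal with mean 500, std 100.
rng = np.random.default_rng(42)
scores = rng.normal(500, 100, 100_000)

within_one_std = ((scores >= 400) & (scores <= 600)).mean()
above_720 = (scores >= 720).mean()
print(within_one_std)  # roughly 0.68: the bulk of students
print(above_720)       # roughly 0.014: very few score 720 or higher
```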

Even after all this talk, these numbers may still seem quite daunting! Right?

Don't worry we will make them talk in the next blog.
David Out.
